WO2008143701A1 - Diagnosing intermittent faults - Google Patents

Publication number
WO2008143701A1
Authority
WO
WIPO (PCT)
Prior art keywords
behavior
intermittent
faults
component
model
Prior art date
Application number
PCT/US2007/085527
Other languages
French (fr)
Inventor
Johan De Kleer
Original Assignee
Palo Alto Research Center Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Palo Alto Research Center Incorporated
Priority to JP2010509324A (patent JP5140722B2)
Priority to EP07864789A (patent EP2153325B1)
Publication of WO2008143701A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2257 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using expert systems

Definitions

  • Model-based diagnosis involves model-based testing in which test cases are derived in whole or in part from a model that describes some, usually functional, aspects of the system under test.
  • the model is usually an abstract, partial representation of the desired behavior of the system under test.
  • the test cases derived from this model are functional tests on the same level of abstraction as the model.
  • Model-based diagnosis is diagnostic and system-directed. Particularly, it starts with the observed misbehavior and works back toward the underlying components that may be broken.
  • Model-based diagnosis may be employed in a variety of arenas, including detecting faulty system behavior, identifying faulty components, repairing of the system, and reconfiguring of the system.
  • Other areas to which MBD may be applied include debugging cognitive models, designing experiments to build improved models of gene pathways, troubleshooting power grids, troubleshooting manufacturing lines, identifying faults in spacecraft, airplanes, and debugging programs, among other uses.
  • a method and system for diagnosing any combination of persistent and intermittent faults. The behavior of a system under test is obtained by measuring or probing the system at a particular location(s).
  • the predicted behavior of a modeled system corresponding to the system under test is investigated by drawing inferences based on at least conditional probabilities, prior observations and component models. The predictions are compared to their corresponding points in the system under test. A determination is made whether a conflict exists between the measured behavior and the predicted behavior, and the posterior probabilities are adjusted to more and more accurately reflect the actual fault(s) in the system under test.
  • the conflicts or deviations between the obtained predicted behavior and the actual behavior are used to obtain a final result such as isolating the component(s) of the system causing the faults.
  • Figure 1 illustrates the main inputs and outputs of a model-based diagnosis engine
  • Figure 2 illustrates a diagram directed to the concepts of a system and method which efficiently diagnoses intermittent faults in simple and complex systems
  • Figure 3 illustrates a full adder, and this circuit computes the binary sum of c_i (carry in), a and b; q is the least significant bit of the result and c_o (carry out) the high order bit
  • Figure 5 is a Table showing the probes required to isolate a failing O1.
  • Figure 6 is a Table illustrating a set of diagnostic situations to evaluate diagnostic cost (DC) for the full adder of Figure 3, where X represents irrelevant;
  • Figure 7 is an exemplary And-gate circuit
  • Figure 9 is an exemplary two-buffer circuit
  • Figure 10 illustrates a graph plotting the cost of isolating a single intermittent buffer, versus the number of buffers
  • Figure 11 shows the average diagnostic cost of the present concepts when used in connection with well known benchmark circuits ISCAS-85;
  • Figure 12 is directed to one embodiment of a diagnostic device incorporating the concepts of the present application.
  • Figure 13 is directed to an embodiment of an embedded diagnostic device.
  • Figure 1 characterizes the main inputs and outputs of a model-based, component-based diagnosis (MBD) engine architecture 10.
  • MBD model-based diagnosis
  • component topology e.g., the schematic for analog circuits
  • component models e.g., resistors obey ohm's law
  • observations e.g., the voltage across resistor R6 is 4 volts
  • model-based diagnosis (MBD) engine 14 computes diagnoses 16 which explain all the observations 12c. Observations inconsistent with expectations guide the discovery of diagnoses.
  • the model-based diagnosis (MBD) engine can find no diagnoses it signals a failure 18.
  • a system consists of a set of components.
  • a faulty component is one which is physically degraded such that it will not always function correctly.
  • a faulted resistor may no longer conduct the expected current when a specified voltage is applied across it.
  • a worn roller in a printer may no longer grip the paper consistently thereby causing intermittent paper jams. In the case of a worn roller, it usually operates correctly but will infrequently slip and cause a paper jam.
  • the present concept associates two probabilities with each potentially intermittent component (1) the probability the actual component deviates from its design such that it may exhibit a malfunction, and (2) the conditional probability the faulted component malfunctions when observed.
  • the probability of a roller being worn might be 10^-5 while the probability of a worn roller actually malfunctioning might be .01.
  • the device may contain any number of faults.
  • the approach both computes the posterior probabilities after observations are made as well as additional probes needed to further isolate the fault(s) in the device.
  • the dynamic behavior of a system is not modeled over time. Instead, the system is viewed as a sequence of observation events. For example, a piece of paper is fed into a printer and may be observed to jam, or a test vector is applied to a combinational digital circuit and its output signals are subsequently observed.
  • An intermittently faulted component is one whose output(s) are not a function of its inputs.
  • Intermittency can arise from at least two sources. If the system can be modeled at a more detailed (i.e., abstraction) level, apparent intermittency can disappear. For example, two wires may be shorted, making it look like a gate is intermittent, when in fact there is an unmodeled and unwanted bridge connection.
  • the second type of intermittency arises from some stochastic physical process which is intermittent at any level of detail. This discussion focuses on this second type of intermittency.
  • Figure 2 presents an overview diagram directed to the concepts of a system and method which efficiently diagnoses intermittent faults in simple and complex systems.
  • components are provided with conditional probabilities 24.
  • Each component of the system or device under test has associated with it two probabilities, (1 ) the probability the actual component deviates from its design such that it may exhibit a malfunction, and (2) (if potentially intermittent) the conditional probability the failed component malfunctions when observed.
  • step 26 observations of the "real world" system having potential faults, and the corresponding model of that system are observed. These "real world" observations may in one instance be obtained by probing the circuit via a probing device such as known in the art (e.g., a volt meter, current tester, logic analyzer, high-speed camera).
  • observations that differ from the prediction yield conflicts in step 28. These conflicts are used in step 30 to compute a diagnosis.
  • the computed diagnoses 30, which identify deviations of actual behavior from the model's behavior, are then used to isolate the faulty component(s).
  • the posterior probabilities associated with the components of the model are adjusted so they may continue to accept the possibility that a subsequent measurement may produce a different or more acceptable value, thus avoiding irreconcilable inconsistencies.
  • step 34 an inquiry is made as to whether sufficient observations (i.e., measurements and/or probing) have taken place.
  • This "ENOUGH?" inquiry may be directed to something as simple as the number of probing or measurement operations which have taken place, or it may be inquiring if a probability threshold value has been reached so the most probable diagnosis can be accepted as a true diagnosis. For example, if it is inferred that a resistor (e.g., R4) is .90 likely to be the faulty component of a system, and if the inquiry in step 34 has a threshold value that anything over a likelihood of .80 is sufficient to be acceptable as a final diagnosis, then in this case, the process would move to step 36, and a final diagnosis would be reached.
  • step 34 determines whether the number of measurements or probes has reached the determined maximum, or whether the probability is still not greater than the determined threshold.
  • if not enough observations have been made, the process moves to step 38, where the next measurement or probing location is determined (in the case of intermittents, this may repeat the same probe at the next sample). In one example, this would simply mean moving a probing device (e.g., from the input of a device A to the output of a device B). Once the new measurement or probing location is determined in step 38, and the actual measurements have been obtained, the process moves back to the input of step 28, and the process is repeated.
  • Also shown in Figure 2 is a database 40.
  • the system shown in flow diagram 20 may be implemented in a computing device, either particularly built for MBD operations to diagnose intermittent faults, or any other general computing device. Such devices will have storage elements such as defined by database 40.
  • block 44 is a learning block particularly related to learning and providing accurate values for one of the associated conditional probabilities of step 44, and in particular the conditional probability g(c) that a component is behaving correctly when it is faulted. While such information in one embodiment may be generally obtained from the manufacturer or from previous experience with that component in similar systems, such information can vary widely. Therefore module or step 44 can update this conditional probability information.
  • step 32 by computing or recalculating the probabilities is expanded upon in Sections 2.2, 3.0 and 10.0 of the following discussion.
  • GDE probability framework includes having the behavior of components expressed as constraints, predicate calculus, or as conventional rules. GDE can use an Assumption-Based Truth Maintenance System (ATMS) to record all the conclusions it draws from these models and the observations.
  • ATMS Assumption-Based Truth Maintenance System
  • GDE computes a probability for each candidate diagnosis. Given component failure probabilities and assuming components fail independently, GDE assigns a prior probability to each diagnosis. As observations accumulate the (posterior) probabilities of diagnoses shift according to Bayes' rule.
  • GDE uses the consistency-based definition of diagnosis (as opposed to the abductive definition), applying Bayes' rule drives up the posterior probabilities of those diagnoses that entail the observation compared with those that are just consistent with it. As a consequence, its use of Bayes' rule to update posterior probabilities results in GDE exhibiting a synthesis of the properties of consistency-based diagnosis and of abduction-based diagnosis.
  • To determine what is actually wrong with a system usually requires obtaining additional measurements. In one embodiment, GDE performs sequential diagnosis by choosing the best measurement to make next.
  • An Assumption-based Truth Maintenance System (ATMS) and Hybrid Truth Maintenance System (HTMS) framework involves a propositional inference engine designed to simplify the construction of problem solvers that search complex search spaces efficiently.
  • the ATMS represents problem states with assumptions, which correspond to primary binary choices, and nodes, which correspond to propositions whose truth is dependent on the truth of the assumptions.
  • Dependency relationships among assumptions and nodes are determined by a domain-specific problem solver such as a conventional inference engine.
  • the problem solver presents these relationships to the ATMS as clauses justifications.
  • the ATMS determines which combinations of assumptions are consistent and identifies the conclusions to which they lead.
  • the ATMS is conventionally utilized by combining it with a conventional inference engine appropriate to the problem being solved.
  • the extension includes a propositional reasoner and an interface that receives calls from the inference engine, passes them to the propositional reasoner, and returns results to the inference engine.
  • Definition 2 A system is a triple (SD, COMPS, OBS) where:
  • SD, the system description, is a set of first-order sentences.
  • COMPS, the system components, is a finite set of constants.
  • OBS, a set of observations, is a set of first-order sentences.
  • AB(x) represents that the component x is ABnormal (faulted).
  • a diagnosis is a sentence describing one possible state of the system, where this state is an assignment of the status normal or abnormal to each system component.
  • An AB-clause is a disjunction of AB-literals containing no complementary pair of AB-literals.
  • p_{t-1}(D) is determined by the preceding measurements or prior probabilities of failure.
  • p(c) is the prior probability that component c is faulted.
  • two probabilities are associated with each component: (1 ) p(c) represents the prior probability that a component is intermittently faulted, and (2) (for potentially intermittent components) g(c) represents the conditional probability that component c is behaving correctly when it is faulted.
  • row 0 are the prior probabilities of the components intermittently failing
  • row 1 are the probabilities conditioned on knowing the system has a fault
  • p([A1]) will continue to drop (i.e., as more time instants of measurement in rows 0-12 are made).
  • An intelligent probing strategy might switch to observing x at time (row) 4.
  • y is insensitive to any error at [O1], no erroneous value would ever be observed.
  • S is the set of diagnostic situations
  • p(s) is the probability of a particular diagnostic situation
  • D is the device
  • A is the algorithm used to select probes
  • c(D,s,A) is the cost of diagnosing device D in diagnostic context s with algorithm A.
  • DC can be exactly calculated.
  • DC must be estimated by, for example, with S consisting of pairs (f,v) where f is the fault and v is a set of device input-output pairs which cause D to manifest the symptom.
  • f O1 output 1
  • Table 64 of Figure 6 is a set of situations which can be used to evaluate expected diagnostic cost for the circuit of Figure 3. Each line lists the faulty component, its faulty output, an input vector, and an output observation which is sensitive to the fault (i.e., if the component produces the correct value, it would change this output value).
  • Table 60 of Figure 4 suggests a very simple probing strategy from which we can compute an upper bound on optimal diagnostic cost (DC).
  • the simplest strategy is to pick the lowest posterior probability component and repeatedly measure its output until either a fault is observed or its posterior probability drops below some desired threshold.
  • the number of samples needed depends on the acceptable misdiagnosis threshold e. Diagnosis stops when one diagnosis has been found with posterior probability p > 1 - e.
  • To compute the upper bound we: (1) take no advantage of the internal structure of the circuit; (2) presume that every measurement can exonerate only one component; (3) are maximally unlucky and never witness another incorrect output.
  • n is the number of components c_i, derived from the priors, and o_i is the number of samples of the output of c_i.
  • g(c) typically varies widely. Therefore, it is useful to learn the g(c) over the diagnostic session instead of initially presuming it has some specific value.
  • Estimating g(c) requires significant additional machinery which is only described in the single fault case. An estimate of g(c) is made by counting the number of samples in which c is observed to be functioning correctly or incorrectly. Then G(c) is defined to be the number of times c could have been working correctly, and B(c) as the number of times c has been observed working incorrectly.
  • Or-gate O1 cannot be behaving improperly because its inputs are both 0, and its observed output is 0.
  • And-gate A1 cannot be behaving improperly because its inputs are both 0, and its output must be 0, as O1 is behaving correctly and its output was observed to be 0.
  • And-gate A2 cannot be faulted. However, there is no evidence as to whether X1 is behaving improperly or not, because if X1 were behaving improperly, its output would be 1, but that cannot affect the observation because And-gate A2's other input is 0. X2 cannot causally affect c_o.
  • the final column (E) is the entropy of the distribution, a measure of how closely the correct diagnosis has been isolated.
  • the Bayes' rule update checks are performed within the ATMS which respects the time assumptions. For the purpose of generating candidates these time assumptions are invisible to GDE.
  • the Bayes' rule checks are performed within the ATMS which respects the time assumptions. In this way, the same efficient GDE/Sherlock mechanisms can be exploited for diagnosing intermittent faults.
  • the ATMS representation of a conflict specifies that at least one of its AB-literals indicates a component which is producing a faulty output at time instant i.
  • every inference includes a time assumption.
  • Figure 10 illustrates a graph 100 which plots diagnostic cost (obtained from one simulation) of isolating a single intermittent buffer in a sequence of buffers using the minimum entropy strategy.
  • the graph illustrates that, as is the case with persistent faults, groups of components are simultaneously exonerated so DC grows roughly logarithmically with circuit size. Note that the minimum entropy strategy is much better than the simple approach of Section 4.
  • a heuristic may be employed to ensure measuring the best variable which is directly adjacent to some remaining suspect component.
  • the first column is the ISCAS-85 circuit name.
  • the second column (components) is the number of distinct gates of the circuit. The remaining two columns are computed by hypothesizing each faulty component with a 1 switched to 0, and a 0 switched to a 1. For each fault, a single (constant) test-vector is found which gives rise to an observable symptom for one of the outputs. For each row, for each component, two simulations are performed. The next column (non /-cost) is the diagnostic cost to isolate the faulty non-intermittent component.
  • diagnosis is generalized to be an assignment of modes to each component.
  • diagnosis [O1] for the Full-Adder is represented as [U(O1)] (and G modes are not included).
  • the approach of this application can be used to distinguish between an inverter stuck-at-1 and an inverter being faulted, e.g., [SA1(X)] vs. [U(X)].
  • Probes 134 are designed to be positioned in operative association with a device under test 136.
  • Body may include an input 138 and an output 140.
  • the input 138 can include an alphanumeric keypad, stylus, voice, or other input design or interface known to input data or instructions.
  • Output 140 may be any sort of display to display the results of a diagnostic investigation.
  • Body 132 may also include a secondary set of inputs 142, wherein information detected by probes 134 is automatically input into diagnostic device 130.
  • body 132 includes computational capabilities including at least a processor 144 and memory 146, which permits the processing of software code, including code incorporating the concepts described herein.
  • diagnostic device or system 130 may include output 148, for connection to an output device 150 to permit the printing of hardcopies of output reports regarding the results of the diagnostic investigation.
  • Figure 12 may not include probes 134, but rather the diagnostics may be undertaken on computer software operating on the diagnostic device, or associated with another device having computational capabilities.
  • the diagnostic device or system 160 is itself embedded as part of a larger overall system 162 which may, for example, include components 164-170 shown in operative connection with each other via solid lines 172.
  • the diagnostic device or system 160 in turn is shown to be in operative connection with the components 164-170 via dotted lines 174.
  • Figure 13 is a high level example of a system in which a diagnostic device or system according to the present application may be used.
  • the purpose of the diagnostic device or system 160 is to identify faults in the overall system 162 and initiate repairs without any human intervention. Examples of such overall systems would be reprographic equipment, automobiles, spacecraft, and airplanes, among others.
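The flow of Figure 2 described in the list above (observe, adjust probabilities, ask "ENOUGH?", pick the next probe) can be sketched as a short loop. This is a minimal illustration rather than the claimed method: `observe`, `update` and `choose_probe` are hypothetical callables standing in for the probing device, the conflict detection and probability adjustment, and the probe selection, and the default threshold of 0.8 simply mirrors the R4 example.

```python
def run_diagnosis(posteriors, observe, update, choose_probe,
                  threshold=0.8, max_probes=20):
    """Sequential diagnosis loop in the spirit of Figure 2.

    posteriors   -- dict: component name -> probability it is the faulty one
    observe      -- hypothetical callable: probe location -> measured value
    update       -- hypothetical callable: (posteriors, probe, value) -> new posteriors
    choose_probe -- hypothetical callable: posteriors -> next probe location
    """
    for probes_used in range(1, max_probes + 1):
        probe = choose_probe(posteriors)               # step 38: where to measure
        value = observe(probe)                         # step 26: real-world observation
        posteriors = update(posteriors, probe, value)  # conflicts and new probabilities
        best = max(posteriors, key=posteriors.get)
        if posteriors[best] > threshold:               # step 34: the "ENOUGH?" inquiry
            return best, probes_used                   # step 36: accept the diagnosis
    return None, max_probes                            # probe budget exhausted


# Usage with stub callables: one measurement lifts R4 to 0.90, above the 0.80 threshold.
result = run_diagnosis(
    {"R4": 0.5, "R6": 0.5},
    observe=lambda probe: 0,
    update=lambda p, probe, value: {"R4": 0.9, "R6": 0.1},
    choose_probe=lambda p: "R4",
)
```

Stopping on either a probe budget or a probability threshold follows the two examples the text gives for the "ENOUGH?" inquiry.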

Abstract

A method and system for diagnosing any combination of persistent and intermittent faults. The behavior of a system under test is obtained by measuring or probing the system at a particular location(s). The predicted behavior of a modeled system corresponding to the system under test is investigated by drawing inferences based on at least conditional probabilities, prior observations and component models. The predictions are compared to their corresponding points in the system under test. A determination is made whether a conflict exists between the measured behavior and the predicted behavior, and the conditional probabilities are adjusted to more and more accurately reflect the actual fault(s) in the system under test. The conflicts or deviations between the obtained predicted behavior and the actual behavior are used to isolate the components of the system causing the faults.

Description

DIAGNOSING INTERMITTENT FAULTS
[0001] This application claims priority to U.S. Provisional Application No. 60/931,524, filed May 24, 2007, entitled "Diagnosing Intermittent Faults" by Johan de Kleer; and U.S. Provisional Application No. 60/931,526, filed May 24, 2007, entitled "Troubleshooting Temporal Behavior in 'Combinational' Circuits" by Johan de Kleer, the disclosures of which are hereby incorporated by reference in their entireties.
BACKGROUND
[0002] Model-based diagnosis (MBD) involves model-based testing in which test cases are derived in whole or in part from a model that describes some, usually functional, aspects of the system under test. The model is usually an abstract, partial representation of the desired behavior of the system under test. The test cases derived from this model are functional tests on the same level of abstraction as the model.
[0003] Model-based diagnosis is diagnostic and system-directed. Particularly, it starts with the observed misbehavior and works back toward the underlying components that may be broken.
[0004] Model-based diagnosis may be employed in a variety of arenas, including detecting faulty system behavior, identifying faulty components, repairing of the system, and reconfiguring of the system. Other areas to which MBD may be applied, include debugging cognitive models, designing experiments to build improved models of gene pathways, troubleshooting power grids, troubleshooting manufacturing lines, identifying faults in spacecraft, airplanes, and debugging programs, among other uses.
[0005] However, an issue related to the diagnosis of systems using MBD as well as other testing approaches, such as "ad hoc" hand-coded rules, machine learning of patterns, D-algorithm searching, and analytical redundancy relationships, among others, is the inability to accurately diagnose intermittent faults.
[0006] Experience with diagnosis of automotive systems and reprographic machines shows that intermittent faults are among the most challenging kinds of faults to isolate. Such systems raise many modeling complexities. The approach to isolating these intermittent faults is presented in the context of logic systems. However, it is to be appreciated the present concepts may be employed in many other environments.
[0007] Presently, the main approach used to attempt to find intermittent faults is to stress the system under test, in an attempt to convert the intermittent faults to persistent failures, and then diagnose those failures. However, these approaches tend to lead to increased failures of the overall system.
[0008] Therefore, it is desirable to employ a system and method which finds or isolates intermittent faults but does not stress the system under test, and thereby avoids increased system failures.
INCORPORATION BY REFERENCE
[0009] U.S. Application Serial No. XX/XXX,XXX (Attorney Docket 20070320-US-NP/XERZ201587), filed XXXXXXX, entitled "TROUBLESHOOTING TEMPORAL BEHAVIOR IN 'COMBINATIONAL' CIRCUITS", by Johan de Kleer; and U.S. Application Serial No. 11/925,444 (Attorney Docket 20070258-US-NP/XERZ201588), filed October 30, 2007, entitled "DYNAMIC DOMAIN ABSTRACTION THROUGH META-ANALYSIS", by Johan de Kleer.
BRIEF DESCRIPTION
[0010] A method and system for diagnosing any combination of persistent and intermittent faults. The behavior of a system under test is obtained by measuring or probing the system at a particular location(s). The predicted behavior of a modeled system corresponding to the system under test is investigated by drawing inferences based on at least conditional probabilities, prior observations and component models. The predictions are compared to their corresponding points in the system under test. A determination is made whether a conflict exists between the measured behavior and the predicted behavior, and the posterior probabilities are adjusted to more and more accurately reflect the actual fault(s) in the system under test. The conflicts or deviations between the obtained predicted behavior and the actual behavior are used to obtain a final result such as isolating the component(s) of the system causing the faults.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] Figure 1 illustrates the main inputs and outputs of a model-based diagnosis engine;
[0012] Figure 2 illustrates a diagram directed to the concepts of a system and method which efficiently diagnoses intermittent faults in simple and complex systems;
[0013] Figure 3 illustrates a full adder, and this circuit computes the binary sum of c_i (carry in), a and b; q is the least significant bit of the result and c_o (carry out) the high order bit;
[0014] Figure 4 is a table illustrating probabilities of component failure over time, where row 0 are the prior probabilities of the components intermittently failing, row 1 are the probabilities conditioned on knowing the system has a fault, and row 2 are the probabilities conditioned on c_o = 1 and so on;
[0015] Figure 5 is a Table showing the probes required to isolate a failing O1;
[0016] Figure 6 is a Table illustrating a set of diagnostic situations to evaluate diagnostic cost (DC) for the full adder of Figure 3, where X represents irrelevant;
[0017] Figure 7 is an exemplary And-gate circuit;
[0018] Figure 8 is a Table showing the result of probing strategy using learning, with A1 failing at i=11 and the final column being the entropy of the distribution, a measure of how closely the correct diagnosis has been isolated;
[0019] Figure 9 is an exemplary two-buffer circuit;
[0020] Figure 10 illustrates a graph plotting the cost of isolating a single intermittent buffer, versus the number of buffers;
[0021] Figure 11 shows the average diagnostic cost of the present concepts when used in connection with well known benchmark circuits ISCAS-85;
[0022] Figure 12 is directed to one embodiment of a diagnostic device incorporating the concepts of the present application; and
[0023] Figure 13 is directed to an embodiment of an embedded diagnostic device.
DETAILED DESCRIPTION
[0024] Figure 1 characterizes the main inputs and outputs of a model-based, component-based diagnosis (MBD) engine architecture 10. Given the component topology (e.g., the schematic for analog circuits) 12a, component models (e.g., resistors obey Ohm's law) 12b and observations (e.g., the voltage across resistor R6 is 4 volts) 12c, model-based diagnosis (MBD) engine 14 computes diagnoses 16 which explain all the observations 12c. Observations inconsistent with expectations guide the discovery of diagnoses. When the model-based diagnosis (MBD) engine can find no diagnoses, it signals a failure 18.
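The three inputs of Figure 1 (topology, component models, observations) and the diagnoses output can be illustrated with a brute-force, consistency-based sketch. Everything specific here is an assumption made for illustration: the wiring is a textbook full adder (not necessarily that of Figure 3), the wire names `x`, `y`, `z` are hypothetical, and exhaustive enumeration is used in place of the GDE/ATMS machinery the document relies on. A gate assumed abnormal may output either value.

```python
from itertools import combinations, product

# Component models and topology: name -> (behavior, input wires, output wire).
# Assumed textbook full-adder wiring, listed in topological order.
GATES = {
    "X1": (lambda a, b: a ^ b, ("a", "b"), "x"),
    "X2": (lambda a, b: a ^ b, ("x", "ci"), "q"),
    "A1": (lambda a, b: a & b, ("a", "b"), "y"),
    "A2": (lambda a, b: a & b, ("x", "ci"), "z"),
    "O1": (lambda a, b: a | b, ("y", "z"), "co"),
}
ORDER = list(GATES)


def consistent(abnormal, observations):
    """True if some outputs of the abnormal gates explain every observation."""
    abnormal = list(abnormal)
    for free in product((0, 1), repeat=len(abnormal)):
        forced = dict(zip(abnormal, free))
        # All three primary inputs must be present in the observations.
        wires = {w: observations[w] for w in ("a", "b", "ci")}
        explained = True
        for name in ORDER:
            behavior, inputs, output = GATES[name]
            if name in forced:
                value = forced[name]  # abnormal gate: output unconstrained
            else:
                value = behavior(*(wires[w] for w in inputs))
            if output in observations and observations[output] != value:
                explained = False
                break
            wires[output] = value
        if explained:
            return True
    return False


def minimal_diagnoses(observations):
    """Smallest sets of abnormal components consistent with the observations."""
    found = []
    for size in range(len(ORDER) + 1):
        for candidate in combinations(ORDER, size):
            if any(set(d) <= set(candidate) for d in found):
                continue  # a subset already explains the observations
            if consistent(candidate, observations):
                found.append(candidate)
    return found


# All inputs are 0 but the carry out is observed as 1: a symptom.
diagnoses = minimal_diagnoses({"a": 0, "b": 0, "ci": 0, "co": 1})
```

With no symptom, the only minimal diagnosis is the empty set, meaning every component may be working; this toy does not model the case where no diagnosis exists at all (failure 18).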
[0025] Existing model-based diagnosis (MBD) approaches presume the system being diagnosed behaves non-intermittently and therefore analyze the behavior of the system under test over a small number (often one) of time instants. The following discussion shows how existing approaches to model-based diagnosis can be extended to diagnose intermittent failures as they manifest themselves over numbers of time instants. In addition, the following discloses where to insert probe points to best distinguish among the intermittent faults that best explain the symptoms created by the fault and isolate the fault at minimum expected cost.
[0026] The notion of intermittency is a hard-to-define concept, so it is first described here intuitively before being defined more formally. A system consists of a set of components. A faulty component is one which is physically degraded such that it will not always function correctly. For example, a faulted resistor may no longer conduct the expected current when a specified voltage is applied across it. A worn roller in a printer may no longer grip the paper consistently, thereby causing intermittent paper jams. In the case of a worn roller, it usually operates correctly but will infrequently slip and cause a paper jam. Therefore the present concept associates two probabilities with each potentially intermittent component: (1) the probability the actual component deviates from its design such that it may exhibit a malfunction, and (2) the conditional probability the faulted component malfunctions when observed. For example, the probability of a roller being worn might be 10^-5 while the probability of a worn roller actually malfunctioning might be .01. In this document we show how to troubleshoot devices containing any combination of intermittent or persistent faults. The device may contain any number of faults. The approach computes both the posterior probabilities after observations are made and the additional probes needed to further isolate the fault(s) in the device.
[0027] The dynamic behavior of a system is not modeled over time. Instead, the system is viewed as a sequence of observation events. For example, a piece of paper is fed into a printer and may be observed to jam, or a test vector is applied to a combinational digital circuit and its output signals are subsequently observed.
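To see how the two probabilities attached to the worn roller interact over a sequence of observation events, the following sketch applies Bayes' rule to a single component. It is an illustration under stated assumptions, not the document's update procedure: observation events are treated as independent, a healthy component is assumed never to malfunction, and the function and parameter names are hypothetical.

```python
def posterior_faulted(prior, p_malfunction, good, bad=0):
    """Posterior probability that one component is faulted.

    prior         -- probability the component deviates from its design
    p_malfunction -- probability a faulted component malfunctions when observed
    good, bad     -- counts of correct and incorrect observation events
    """
    # Likelihood of the observed sequence if the component is faulted.
    like_faulted = (1.0 - p_malfunction) ** good * p_malfunction ** bad
    # Assumption: a healthy component always behaves correctly.
    like_healthy = 1.0 if bad == 0 else 0.0
    numerator = prior * like_faulted
    return numerator / (numerator + (1.0 - prior) * like_healthy)


# Worn-roller numbers from the text: prior 10^-5, malfunction probability .01.
after_clean_run = posterior_faulted(1e-5, 0.01, good=100)
after_one_jam = posterior_faulted(1e-5, 0.01, good=100, bad=1)
```

Correct observations lower the posterior only slowly, because a worn roller usually works; under these assumptions a single observed malfunction is decisive.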
[0028] Definition 1 An intermittently faulted component is one whose output(s) are not a function of its inputs.
[0029] Intermittency can arise from at least two sources. First, if the system can be modeled at a more detailed (i.e., less abstract) level, apparent intermittency can disappear. For example, two wires may be shorted, making it look like a gate is intermittent, when in fact there is an unmodeled and unwanted bridge connection. The second type of intermittency arises from some stochastic physical process which is intermittent at any level of detail. This discussion focuses on this second type of intermittency.
1 Overview of Flow Process and System for Diagnosing Intermittent Faults
[0030] Turning now more particularly to the concepts of the present application, it is known that intermittent faults in complex systems are very hard to pinpoint, in part because faults and symptoms can appear and disappear without any observable pattern. Such faults are notoriously difficult for technicians to isolate, and therefore pose a problem for almost all approaches to diagnosis. The present concepts propose an approach that is grounded in model-based diagnosis (MBD), which works from a model of a system (e.g., a circuit schematic and behavior models of components), and utilizes deviations of actual behavior from the model's behavior to isolate the faulty component(s). At each stage, recalculated conditional probabilities associated with possible fault causes guide the next stage of observation and measurement.
[0031] Thus, the present concepts focus on diagnosing intermittent faults by adjusting component probabilities so that they continue to accept the possibility that a subsequent measurement will produce a different value, thus avoiding irreconcilable inconsistencies.
[0032] Turning to Figure 2, presented is an overview diagram directed to the concepts of a system and method which efficiently diagnoses intermittent faults in simple and complex systems. In the model-based diagnosis flow diagram 20 of Figure 2, components are provided with conditional probabilities 24. Each component of the system or device under test has associated with it two probabilities: (1) the probability the actual component deviates from its design such that it may exhibit a malfunction, and (2) (if potentially intermittent) the conditional probability the failed component malfunctions when observed. In step 26, the "real world" system having potential faults and the corresponding model of that system are observed. These "real world" observations may in one instance be obtained by probing the circuit via a probing device such as known in the art (e.g., a volt meter, current tester, logic analyzer, high-speed camera).
[0033] Next, with regard to the MBD model, knowing certain inputs, predictions of the model are made using inferences as done in conventional MBD analysis. As part of this process, observations that differ from the prediction yield conflicts, step 28. These conflicts are used in step 30 to compute a diagnosis. The computed diagnosis of step 30, which identifies deviations of actual behavior from the model's behavior, is then used to isolate the faulty component(s). In step 32, the posterior probabilities associated with the components of the model are adjusted so they may continue to accept the possibility that a subsequent measurement may produce a different or more acceptable value, thus avoiding irreconcilable inconsistencies.
[0034] More particularly, in step 34, an inquiry is made as to whether sufficient observations (i.e., measurements and/or probing) have taken place. This "ENOUGH?" inquiry may be directed to something as simple as the number of probing or measurement operations which have taken place, or it may inquire whether a probability threshold value has been reached so that the most probable diagnosis can be accepted as a true diagnosis. For example, if it is inferred that a resistor (e.g., R4) is 0.90 likely to be the faulty component of a system, and the inquiry in step 34 treats any likelihood over 0.80 as sufficient for a final diagnosis, then the process would move to step 36, and a final diagnosis would be reached.
[0035] On the other hand, if in step 34 the maximum number of measurements or probes has not been reached, or the probability is not greater than the determined threshold, the process moves to step 38, where the next measurement or probing location is determined (in the case of intermittents, this may repeat the same probe at a next sample).
In one example, this would simply mean moving a probing device (e.g., from the input of a device A to the output of a device B). Once the new measurement or probing location is determined in step 38, and the actual measurements have been obtained, the process moves back to the input of step 28, and the process is repeated.
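The "ENOUGH?" test of step 34 can be sketched as a small predicate. This is only an illustrative sketch: the function name `enough`, the `max_probes` limit, and the example posterior values for R1 and R2 are assumptions made here, while the 0.90 likelihood for R4 and the 0.80 threshold come from the example above.

```python
def enough(posteriors, probes_made, max_probes=100, threshold=0.80):
    """Hypothetical step-34 test: stop when the probe budget is exhausted
    or the most probable diagnosis exceeds the acceptance threshold."""
    best = max(posteriors, key=posteriors.get)  # most probable diagnosis
    stop = probes_made >= max_probes or posteriors[best] > threshold
    return stop, best


# R4 at 0.90 exceeds the 0.80 threshold, so diagnosis would end (step 36).
stop, best = enough({"R4": 0.90, "R1": 0.06, "R2": 0.04}, probes_made=3)
print(stop, best)
```

If neither condition holds, the caller would proceed to step 38 and choose the next probing location.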
[0036] Also shown in Figure 2 is a database 40. It is to be appreciated the system shown in flow diagram 20 may be implemented in a computing device, either particularly built for MBD operations to diagnose intermittent faults, or any other general computing device. Such devices will have storage elements such as defined by database 40. Arrow 42 is intended to emphasize that the information used and generated by the various modules or steps in flow diagram 20 can be stored at memory locations within database 40.
[0037] In the same regard, block 44 is a learning block particularly related to learning and providing accurate values for one of the associated conditional probabilities of step 24, and in particular the conditional probability g(c) that a component is behaving correctly when it is faulted. While such information in one embodiment may be generally obtained from the manufacturer or previous experience with that component in similar systems, such information can vary widely. Therefore module or step 44 can update this conditional probability information.
[0038] As can be seen from the drawings, the concepts of step 32 (computing or recalculating the probabilities) are expanded upon in Sections 2.2, 3 and 10 of the following discussion.
[0039] The concept of computing a "best" placement for a next measurement is discussed in Sections 4 and 8 of the following discussion. Further, details on learning the second associated probability are expanded upon in Section 5 of the following discussion.
2 GDE probability framework
[0040] The GDE framework includes having the behavior of components expressed as constraints, predicate calculus, or as conventional rules. GDE can use an Assumption-Based Truth Maintenance System (ATMS) to record all the conclusions it draws from these models and the observations.
[0041] GDE computes a probability for each candidate diagnosis. Given component failure probabilities and assuming components fail independently, GDE assigns a prior probability to each diagnosis. As observations accumulate, the (posterior) probabilities of diagnoses shift according to Bayes' rule. Candidate diagnoses which are logically eliminated by the evidence receive probability 0. Although GDE uses the consistency-based definition of diagnosis (as opposed to the abductive definition), applying Bayes' rule drives up the posterior probabilities of those diagnoses that entail the observation compared with those that are just consistent with it. As a consequence, its use of Bayes' rule to update posterior probabilities results in GDE exhibiting a synthesis of the properties of consistency-based diagnosis and of abduction-based diagnosis.
[0042] To determine what is actually wrong with a system usually requires obtaining additional measurements. In one embodiment, GDE performs sequential diagnosis by choosing the best measurement to make next. It commonly uses a one-step look-ahead function based on minimum entropy (e.g., a myopic minimum entropy strategy). Multistep lookahead will choose better measurements but is typically computationally impractical. GDE proposes a set of measurements to make which will, on average, require a minimum number of measurements to localize the correct diagnosis.
[0043] An Assumption-based Truth Maintenance System (ATMS) and Hybrid Truth Maintenance System (HTMS) framework involves a propositional inference engine designed to simplify the construction of problem solvers that search complex search spaces efficiently. The ATMS represents problem states with assumptions, which correspond to primary binary choices, and nodes, which correspond to propositions whose truth is dependent on the truth of the assumptions. Dependency relationships among assumptions and nodes are determined by a domain-specific problem solver such as a conventional inference engine. The problem solver presents these relationships to the ATMS as clauses or justifications. The ATMS determines which combinations of assumptions are consistent and identifies the conclusions to which they lead.
[0044] The ATMS is conventionally utilized by combining it with a conventional inference engine appropriate to the problem being solved. The extension includes a propositional reasoner and an interface that receives calls from the inference engine, passes them to the propositional reasoner, and returns results to the inference engine.
[0045] Definition 2 A system is a triple (SD, COMPS, OBS) where:
1. SD, the system description, is a set of first-order sentences.
2. COMPS, the system components, is a finite set of constants.
3. OBS, a set of observations, is a set of first-order sentences.
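[0046] Definition 3 Given two sets of components $C_p$ and $C_n$, define $\mathcal{D}(C_p, C_n)$ to be the conjunction:
$$\mathcal{D}(C_p, C_n) = \Big[\bigwedge_{c \in C_p} AB(c)\Big] \wedge \Big[\bigwedge_{c \in C_n} \neg AB(c)\Big]$$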
[0047] where AB(x) represents that the component x is ABnormal (faulted).
[0048] A diagnosis is a sentence describing one possible state of the system, where this state is an assignment of the status normal or abnormal to each system component.
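[0049] Definition 4 Let $\Delta \subseteq \mathrm{COMPS}$. A diagnosis for (SD, COMPS, OBS) is $\mathcal{D}(\Delta, \mathrm{COMPS} - \Delta)$ such that the following is satisfiable:
$$\mathrm{SD} \cup \mathrm{OBS} \cup \{\mathcal{D}(\Delta, \mathrm{COMPS} - \Delta)\}$$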
[0050] Definition 5 An AB-literal is $AB(c)$ or $\neg AB(c)$ for some $c \in \mathrm{COMPS}$.
[0051] Definition 6 An AB-clause is a disjunction of AB-literals containing no complementary pair of AB-literals.
[0052] Definition 7 A conflict of (SD, COMPS, OBS) is an AB-clause entailed by $\mathrm{SD} \cup \mathrm{OBS}$.
[0053] For most of this document we assume 'weak' fault models, also called the ignorance of abnormal behavior assumption: no fault models are presumed.
Section 10, Extension to multiple persistent or intermittent faults, shows how this assumption can be relaxed.
2.1 Representing time
[0054] Time is expressed easily in the preceding formalism. For example, the model of an inverter is often written as:
$$\mathrm{INVERTER}(x) \rightarrow \big[\neg AB(x) \rightarrow [\,in(x, t) = 0 \equiv out(x, t) = 1\,]\big].$$
[0055] When ambiguous, this document represents the value v of variable x at time t as $T(x = v, t)$. Time is a sequence of instants $t_0, t_1, \ldots$. The probability of X at time t is represented as $p_t(X)$.
2.2 Updating diagnosis probabilities
[0056] Components are assumed to fail independently. Therefore, the prior probability that a particular diagnosis $\mathcal{D}(C_p, C_n)$ is correct is:
$$p_t(\mathcal{D}) = \prod_{c \in C_p} p(c) \prod_{c \in C_n} (1 - p(c)) \qquad (1)$$
[0057] where p(c) is the prior probability that component c is faulted.
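As a minimal sketch of equation (1), the prior of a diagnosis is the product of p(c) over the faulted components and 1 - p(c) over the rest. The function name `prior_of_diagnosis` and the prior value 0.01 are assumptions chosen for illustration; the component names follow the full adder of Figure 3.

```python
from math import prod


def prior_of_diagnosis(faulted, components, p):
    """Equation (1): multiply p(c) for faulted components and
    (1 - p(c)) for components assumed to be working."""
    return prod(p[c] if c in faulted else 1.0 - p[c] for c in components)


components = ["X1", "X2", "A1", "A2", "O1"]
p = {c: 0.01 for c in components}  # hypothetical, equal priors

no_fault = prior_of_diagnosis(set(), components, p)
single_o1 = prior_of_diagnosis({"O1"}, components, p)
double = prior_of_diagnosis({"A1", "O1"}, components, p)
print(no_fault, single_o1, double)
```

With small independent priors, each additional faulted component lowers the prior of a diagnosis by roughly a factor of p(c).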
[0058] The posterior probability of a diagnosis $\mathcal{D}$ after an observation that x has value v at time t is given by Bayes' rule:
$$p_t(\mathcal{D} \mid x = v) = \frac{p_t(x = v \mid \mathcal{D})\, p_{t-1}(\mathcal{D})}{p_t(x = v)}$$
[0059] $p_{t-1}(\mathcal{D})$ is determined by the preceding measurements or the prior probabilities of failure. The denominator $p_t(x = v)$ is a normalizing term that is identical for all $\mathcal{D}$ and thus need not be computed directly. The only term remaining to be evaluated in the equation is $p_t(x = v \mid \mathcal{D})$:
$p_t(x = v \mid \mathcal{D}) = 0$ if $\mathcal{D}$, SD, OBS, $T(x = v, t)$ are inconsistent; else
$p_t(x = v \mid \mathcal{D}) = 1$ if $T(x = v, t)$ follows from $\mathcal{D}$, SD, OBS.
[0060] If neither holds,
$$p_t(x_i = v_k \mid \mathcal{D}) = \varepsilon_{ik}.$$
[0061] Various ε-policies, which are known in the art, are possible, and a different ε can be chosen for each variable $x_i$ and value $v_k$. Typically, $\varepsilon_{ik} = 1/m$.
This corresponds to the intuition that if x ranges over m possible values, then each possible value is equally likely. In digital circuits m = 2 and thus ε = 0.5.
[0062] In the conventional framework, observations that differ from predictions yield conflicts, which are then used to compute diagnoses. Consider the full adder digital circuit 50 of Figure 3, which computes the binary sum of $c_i$ (carry in), a and b; q is the least significant bit of the result and $c_o$ (carry out) the high order bit. Suppose all the inputs ($c_i$, a, b) to the circuit are 0, and $c_o$ is measured to be 1. This yields one minimal conflict:
$$AB(A1) \vee AB(A2) \vee AB(O1).$$
[0063] For brevity we shall write all diagnoses as [f], where f are the faulted components. The single fault diagnoses are thus: [A1], [A2], [O1].
3 Extensions to the conventional framework to support intermittent faults
[0064] In the conventional framework, p(c) is the prior probability that component c is faulted. In the new framework, two probabilities are associated with each component: (1) p(c) represents the prior probability that a component is intermittently faulted, and (2) (for potentially intermittent components) g(c) represents the conditional probability that component c is behaving correctly when it is faulted.
[0065] In the intermittent case, the same sequential diagnosis model and Bayes' rule update apply. However, a more sophisticated ε-policy is needed. Note that an ε-policy applies only when a particular diagnosis neither predicts $x_i = v_k$ nor is inconsistent with it. Consider the case where there is only a single fault. As we assume all inputs are given, the only reason that a diagnosis could not predict a value for $x_i$ is when the faulted component causally affects $x_i$. Consider the single fault diagnosis [c]. If c is faulted, then g(c) is the probability that it is producing a correct output. There are only two possible cases: (1) c is outputting a correct value with probability g(c), and (2) c is producing a faulty value with probability 1 - g(c). Therefore, if $x_i = v_k$ follows from $\mathcal{D}(\{\}, \mathrm{COMPS})$, SD, OBS, and ignoring conflicts:
[0066] $$p_t(x_i = v_k \mid [c]) = g(c).$$
Otherwise,
[0067] $$p_t(x_i = v_k \mid [c]) = 1 - g(c).$$
These two equations are equivalent to the following sum, in which each term contributes only when its condition holds (no need to ignore conflicts):
$$p_t(x_i = v_k \mid \mathcal{D}) = \big[\,g(c) \text{ if } \mathcal{D}, o(c), \mathrm{OBS}, \mathrm{SD} \vdash T(x_i = v_k, t)\,\big] + \big[\,1 - g(c) \text{ if } \mathcal{D}, \bar{o}(c), \mathrm{OBS}, \mathrm{SD} \vdash T(x_i = v_k, t)\,\big]$$
[0068] Here, o(c) is the correct model for component c, and $\bar{o}(c)$ the incorrect (output negated) model.
[0069] For example, observing $c_o$ = 1 in Figure 3 results in three single faults: [O1], [A1], [A2]. If all components are equally likely to fail a priori, then p([A1]) = p([A2]) = p([O1]) = 1/3. Suppose it is next observed that y = 0. In the conventional framework this has no consequence, as the observation is the same as the prediction. However, in the intermittent case, such 'good' observations provide significant diagnostic information. Table 60 of Figure 4 illustrates the results if $c_o$ is first observed to be faulty, g(c) = 0.9 for all components, O1 is the actual fault, and y is observed continuously. In Table 60, row 0 gives the prior probabilities of the components intermittently failing, row 1 gives the probabilities conditioned on knowing the system has a fault, row 2 gives the probabilities conditioned on $c_o$ = 1, and so on. As sampling continues (i.e., as more time instants of measurement in rows 0-12 are made), p([A1]) will continue to drop. An intelligent probing strategy might switch to observing x at time (row) 4. As y is insensitive to any error at [O1], no erroneous value would ever be observed.
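The effect of repeated 'good' observations can be sketched with a plain Bayes' rule update over the three single-fault diagnoses. This is an illustrative sketch, not the patented implementation: the function name `bayes_update` is hypothetical, and it assumes (as the example suggests) that y is influenced only by A1, so [A2] and [O1] predict y = 0 with likelihood 1 while [A1] explains a correct y with likelihood g = 0.9.

```python
def bayes_update(posterior, likelihood):
    """One Bayes' rule step: multiply by the likelihood, then normalize."""
    unnorm = {d: posterior[d] * likelihood[d] for d in posterior}
    total = sum(unnorm.values())
    return {d: v / total for d, v in unnorm.items()}


g = 0.9
# Single-fault posteriors after the symptom c_o = 1 (equal priors).
post = {"A1": 1 / 3, "A2": 1 / 3, "O1": 1 / 3}
# Likelihood of observing the correct value y = 0 under each diagnosis
# (assumption: only A1 can influence y).
likelihood_y0 = {"A1": g, "A2": 1.0, "O1": 1.0}

history = []
for _ in range(10):
    post = bayes_update(post, likelihood_y0)
    history.append(post["A1"])
print(history[0], history[-1])
```

Under these assumptions p([A1]) after n good samples is $g^n / (g^n + 2)$, so it falls steadily but never reaches 0, while [A2] and [O1] remain equally likely because y carries no information that separates them.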
[0070] Failing components closer to the input are easier to isolate as there are fewer confounding effects of intermediate components through which the signals must propagate before reaching a suspect component. Table 62 of Figure 5 presents a sequence of probes (i) to isolate a fault in O1. Although O1 can never be identified as failing with probability 1, the table shows that repeated observations drive O1's probability asymptotically to 1 (e.g., 0.998).
[0071] It is important to note that the single fault assumption does not require an additional inference rule. The Bayes' rule update equation does all the necessary work. For example, when $c_o$ = 1 is observed, the single fault diagnoses [X1] and [X2] both predict $c_o$ = 0 and thus are both eliminated from consideration by the update equation. As an implementation detail, single faults can be computed efficiently as the intersection of all conflicts found in all the samples.
4 A simplistic probing strategy for single faults
[0072] Consider again the example of Figure 3 and Table 60 of Figure 4. Repeatedly measuring y = 0 will monotonically drive down the posterior probability of A1 being faulted. Choosing probes which drive down the posterior probability of component faults is at the heart of effective isolation of intermittent faults.
[0073] A probing strategy that is better at isolating some faults might perform worse on others. To compare proposed strategies fairly, we use their expected diagnostic cost:
$$DC(D, A) = \sum_{s \in S} p(s)\, c(D, s, A)$$
[0074] where S is the set of diagnostic situations, p(s) is the probability of a particular diagnostic situation, D is the device, A is the algorithm used to select probes, and c(D,s,A) is the cost of diagnosing device D in diagnostic situation s with algorithm A. In limited cases, DC can be calculated exactly. Generally, DC must be estimated, for example with S consisting of pairs (f,v), where f is the fault and v is a set of device input-output pairs which cause D to manifest the symptom. For example, the situation illustrated in the previous section corresponds to the entry in Table 64 of Figure 6 where f = "O1 output 1" and v = ($c_i$=0, a=0, b=0, $c_o$=1). Table 64 of Figure 6 is a set of situations which can be used to evaluate the expected diagnostic cost for the circuit of Figure 3. Each line lists the faulty component, its faulty output, an input vector, and an output observation which is sensitive to the fault (i.e., if the component produced the correct value, this output value would change).
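The expected diagnostic cost is a probability-weighted average of per-situation costs, which can be sketched as follows. The function name `expected_diagnostic_cost` and the three (probability, cost) pairs are hypothetical values invented for illustration; they are not taken from Table 64.

```python
def expected_diagnostic_cost(situations):
    """DC = sum over situations s of p(s) * c(D, s, A).
    `situations` is an iterable of (probability, cost) pairs."""
    return sum(p * c for p, c in situations)


# Hypothetical situations: probabilities sum to 1, costs are probe counts.
situations = [(0.5, 4), (0.25, 10), (0.25, 20)]
dc = expected_diagnostic_cost(situations)
print(dc)
```

In practice each cost would come from running probe-selection algorithm A on a simulated situation (f, v), so DC is usually an empirical estimate rather than an exact value.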
[0075] Table 60 of Figure 4, therefore, suggests a very simple probing strategy from which we can compute an upper bound on optimal diagnostic cost (DC). The simplest strategy is to pick the lowest posterior probability component and repeatedly measure its output until either a fault is observed or its posterior probability drops below some desired threshold.
[0076] The number of samples needed depends on the acceptable misdiagnosis threshold e. Diagnosis stops when one diagnosis has been found with posterior probability p > 1 - e. To compute the upper bound we: (1) take no advantage of the internal structure of the circuit; (2) presume that every measurement can exonerate only one component; and (3) are maximally unlucky and never witness another incorrect output.
[0077] Let n be the number of components $c_i$,
Figure imgf000015_0001
derived from the priors, and $o_i$ the number of samples of the output of $c_i$. To shorten the mathematics, Bayes' rule is written as:
$$p_t([c_i]) = \alpha\, p_t(x = v \mid [c_i])\, p_{t-1}([c_i]),$$
where $\alpha$ is chosen such that the posterior probabilities sum to 1. Notice that $p_t(x = v \mid [c_i]) = g(c_i)$ when the output of component $c_i$ is observed. After t samples:
$$p_t([c_i]) = \alpha\, g(c_i)^{o_i}\, p_0([c_i]),$$
where,
and,
Figure imgf000015_0002
[0078] Splitting the misdiagnosis threshold evenly across all components, $o_i$ is picked such that:
[0079] It is then necessary to solve for $o_i$ in:
Figure imgf000015_0003
[0080] If $o_i$ is sufficiently large, then $\alpha \approx 1 - p([c_i])$, so it is possible to solve for $o_i$ in:
Figure imgf000015_0004
[0081] In the case of the full adder example all priors are equal, g = 0.9, and e = 0.1. So based on the preceding equation, $o_i$ = 28.1. Hence, a worst case strategy is to measure each point 29 times. However, the last probe need not be made since the strategy would have failed to identify a fault earlier, and hence, because the device has a fault, the final probe point is at the fault. There is no need to measure the non-symptom output because that cannot provide information on any single fault. Therefore, the upper bound is 58. This worst case is corroborated in Table 62 of Figure 5, where it takes 56 probes to isolate O1 (e.g., above 0.900). Faults in A1 and A2 would be detected with far fewer probes.
[0082] Rarely can we compute a bound on a particular diagnostic algorithm so neatly. In most cases the diagnostic cost (DC) of a particular algorithm can only be evaluated empirically. Applying the previous DC algorithm to the vectors of Table 64 of Figure 6, and assuming all faults are equally likely, g = 0.9 and e = 0.1, results in an expected cost of DC = 22.2 with an observed error of 0.05. This DC is far smaller than the bound of 58 calculated earlier, because faults in A1 and A2 are isolated in far fewer probes and DC is an average. As the manifestation of intermittent faults is inherently random, there will always be a chance of misdiagnosis. The observed error of 0.05 is better than the desired 0.1. In general, the observed error will be close to the theoretical error, but it will vary because of specific characteristics of the circuit being analyzed and the sequence of random numbers generated in the simulation.
5 Learning g(c)
[0083] Although the prior probability of component failure can be estimated by the manufacturer or from previous experience with that component in similar systems, g(c) typically varies widely. Therefore, it is useful to learn g(c) over the diagnostic session instead of initially presuming it has some specific value.
[0084] Estimating g(c) requires significant additional machinery, which is described only for the single fault case. An estimate of g(c) is made by counting the number of samples in which c is observed to be functioning correctly or incorrectly. G(c) is defined to be the number of times c could have been working correctly, and B(c) the number of times c has been observed working incorrectly. Corroboratory measurements which c cannot influence are ignored (conflicting measurements exonerate any component which cannot influence them, in which case g(c) is no longer relevant). g(c) is estimated as follows (if either G(c) or B(c) is 0, g(c) = 0.5):
$$g(c) = \frac{G(c)}{G(c) + B(c)}$$
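The counting estimate of g(c) can be sketched in a few lines. The function name `estimate_g` is hypothetical; the default of 0.5 when either counter is 0 follows the rule stated above.

```python
def estimate_g(good_count, bad_count):
    """Estimate g(c) from the counters G(c) and B(c).
    Falls back to 0.5 when either counter is still 0."""
    if good_count == 0 or bad_count == 0:
        return 0.5
    return good_count / (good_count + bad_count)


print(estimate_g(0, 0))  # no evidence yet
print(estimate_g(9, 1))  # nine possibly-correct samples, one faulty sample
```

Because the estimate starts at 0.5 and only moves once both counters are nonzero, early posteriors can differ noticeably from a run that fixes g(c) in advance.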
[0085] The situations in which c is working incorrectly are straightforward to detect: they are simply the situations in which the Bayes' update equation utilizes $p_t(x_i = v_k \mid [c]) = 1 - g(c)$. The cases in which c is working correctly require additional inferential machinery. Consider again the example of circuit 50 of Figure 3, where all the inputs are 0 and the expected $c_o$ = 0 is observed. Or-gate O1 cannot be behaving improperly because its inputs are both 0 and its observed output is 0. And-gate A1 cannot be behaving improperly because its inputs are both 0, and its output must be 0, as O1 is behaving correctly and its output was observed to be 0. Analogously, And-gate A2 cannot be faulted. However, there is no evidence as to whether X1 is behaving improperly or not, because if X1 were behaving improperly, its output would be 1, but that cannot affect the observation because And-gate A2's other input is 0. X2 cannot causally affect $c_o$. To summarize, if $c_o$ = 0 is observed at an instant in which all the inputs are 0, it has been learned that A1, A2 and O1 cannot be misbehaving alone, and nothing is learned about the faultedness of X1 and X2. Hence, G(A1), G(A2) and G(O1) are incremented, and all other counters are left unchanged. The net consequence will be that the diagnostician may have to take (slightly) more samples to eliminate A1, A2 or O1 as faulted.
[0086] Consider now the example circuit 70 of Figure 7. At t = 0, the inputs are both 0, and the output is observed to be z = 0. Neither the single fault A nor B can influence z = 0; therefore the counters for A and B are left unchanged. However, C can influence the output and therefore G(C) is incremented. More formally, in response to an observation $x_i = v_k$, G(c) is incremented if a different value for $x_i$ logically follows from the negated model $\bar{o}(c)$; that is, c causally influences the observation.
[0087] Table 80 of Figure 8 shows the results of a probing strategy (for circuit 50 of Figure 3), with A1 failing at g = 0.9. The final column (E) is the entropy of the distribution, a measure of how closely the correct diagnosis has been isolated. Table 80 shows the sequence of probes used to isolate a fault in A1. A1's fault is manifest at i = 11. Notice that the conditional probabilities shift differently as compared to the non-learning case of Table 60 of Figure 4. As g(c) = 0.5 is assumed without any evidence, measuring x = 0 immediately drops A2's probability to 0.5 compared to the other two diagnoses. When using learning, the initial g(c) values need to be chosen with some care as they significantly influence diagnosis.
6 Hybrid Truth Maintenance System (HTMS) implementation considerations
[0088] The approach has been fully implemented on the GDE/HTMS architecture. Section 9 reports the performance of the approach on various benchmarks.
[0089] Recall that the Bayes' rule update equation requires two repeated checks:
$$\mathcal{D}, o(c), \mathrm{OBS}, \mathrm{SD} \vdash T(x_i = v_k, t),$$
$$\mathcal{D}, \bar{o}(c), \mathrm{OBS}, \mathrm{SD} \vdash T(x_i = v_k, t).$$
[0090] In the GDE approach each conflict is represented by a positive clause of ATMS assumptions corresponding to the AB-literals. Unfortunately, this representation of conflicts makes it difficult to evaluate these two checks through efficient ATMS operations. The full adder example illustrates the problem very simply (all inputs 0, $c_o$ = 1 observed once). If the conflict,
$$AB(A1) \vee AB(A2) \vee AB(O1),$$
were represented directly as an ATMS clause, then no value for $c_o$ could be inferred at any later samples. If the g(c)'s were different for the three gates it can be useful to measure $c_o$. In this implementation, every GDE conflict is represented as an ATMS clause which includes the time instant(s) at which it was noticed. Thus the conflict for Figure 3 is represented:
$$(t \neq 2) \vee AB(A1) \vee AB(A2) \vee AB(O1).$$
[0091] The Bayes' rule update checks are performed within the ATMS, which respects the time assumptions. For the purpose of generating candidates these time assumptions are invisible to GDE. In this way, the same efficient GDE/Sherlock mechanisms can be exploited for diagnosing intermittent faults. Hence, the ATMS representation of a conflict specifies that at least one of its AB-literals indicates a component which is producing a faulty output at time instant i. Likewise every inference includes a time assumption. GDE represents prime implicates of the form:
$$(t \neq i) \vee AB(c_1) \vee \cdots \vee AB(c_n) \vee x = v$$
as:
$$\langle x = v, \{\{(t = i), \neg AB(c_1), \ldots, \neg AB(c_n)\}\}\rangle.$$
[0092] The update tests required by both the Bayes' rule update and the checks to decide whether G(c) should be incremented can often be significantly sped up. Consider the case where there is only one prime implicate of this form (except for the time assumption) for x = v. Given that there is no smaller set of $\neg AB$s which support x = v, and that an AB component necessarily behaves incorrectly under its 1 - g case, it must be the situation that if x = v then any AB changed from negative to positive will result in $x \neq v$. In our example, there is exactly one prime implicate which supports $c_o$ = 0:
$$(t \neq 0) \vee AB(A1) \vee AB(A2) \vee AB(O1) \vee c_o = 0.$$
[0093] Therefore, the single faults [O1], [A1] and [A2] cannot explain the observation. When there is more than one such prime implicate, we must intersect all the antecedent AB sets. This is similar to a Clark-completion inference. Consider again the example circuit 70 of Figure 7 where z = 0 is observed. GDE finds two prime implicates which support z = 0 (written here as implications):
$$(t = 0) \wedge \neg AB(A) \wedge \neg AB(C) \rightarrow z = 0$$
$$(t = 0) \wedge \neg AB(B) \wedge \neg AB(C) \rightarrow z = 0$$
[0094] Under the single fault assumption, the intersection of antecedents of the implications must be behaving correctly. Therefore, G(C) is incremented.
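A minimal sketch of this intersection step, assuming each prime implicate supporting the observation is given simply as the set of components whose $\neg AB$ assumptions appear in its antecedent. The function name `components_to_credit` and this set-based encoding are illustrative assumptions, not the ATMS data structures themselves.

```python
def components_to_credit(antecedent_sets):
    """Intersect the antecedent component sets of all prime implicates
    supporting an observation; under the single fault assumption every
    component in the intersection must have behaved correctly."""
    sets = [set(s) for s in antecedent_sets]
    return set.intersection(*sets) if sets else set()


# Circuit 70: z = 0 is supported by {A, C} and by {B, C}.
G = {"A": 0, "B": 0, "C": 0}
for c in components_to_credit([{"A", "C"}, {"B", "C"}]):
    G[c] += 1
print(G)

# Full adder: a single implicate supports c_o = 0, so all three gates are credited.
print(components_to_credit([{"A1", "A2", "O1"}]))
```

With a single supporting implicate the intersection is just that implicate's antecedent set, which matches the faster special case described above.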
7 Exploring Diagnostic cost
[0095] Consider the very simple two-buffer circuit 90 of Figure 9. Assume A and B are equally likely to be faulted and g = 0.9. As it is assumed that the input is given and that a faulty output was observed, there is only one measurement point, the output of A, that can provide any useful information.
[0096] In this simple case there is no choice of measurement point. First, consider the case where B is intermittent. The output of A will always be correct. Following the same line of development as Section 4 (A simplistic probing strategy for single faults), after n measurements:
$$p_n([A]) = \frac{g^n}{1 + g^n}.$$
For the probability of misdiagnosis to be less than e:
$$\frac{g^n}{1 + g^n} < e.$$
Solving:
$$n > \frac{\log e - \log(1 - e)}{\log g}.$$
As $1 - e \approx 1$:
$$n \approx \frac{\log e}{\log g}.$$
With g = 0.9 and e = 0.01, n = 43.7. Now consider the case where A is faulted. Until the fault is manifest, A will be producing a good value and the results will look the same as in the case where B is faulted. Roughly, the fault will be manifest within about 4.5 samples. Thus, the expected cost of diagnosing circuit 90 of Figure 9 is roughly 24.1, the average of the two equally likely cases. Our implementation obtains a similar result. Again it is seen that it is considerably more expensive to isolate intermittents further from the input(s).
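The figures quoted for the two-buffer circuit can be checked numerically. This sketch assumes the approximation $n \approx \log e / \log g$ for the case where B is faulted and takes the 4.5-sample figure for the case where A is faulted directly from the text; the function name `samples_to_exonerate` is hypothetical.

```python
import math


def samples_to_exonerate(g, e):
    """Approximate number of correct samples of A's output needed before
    the misdiagnosis probability drops below e (n ~ log e / log g)."""
    return math.log(e) / math.log(g)


n_b_faulted = samples_to_exonerate(g=0.9, e=0.01)  # B is the faulty buffer
n_a_faulted = 4.5                                   # value quoted in the text
expected_cost = (n_b_faulted + n_a_faulted) / 2     # A and B equally likely
print(round(n_b_faulted, 1), round(expected_cost, 1))
```

Averaging the two equally likely cases reproduces the quoted cost of roughly 24.1, which supports reading the expected cost as a simple mean over the two fault hypotheses.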
8 Probing strategy
[0097] To successfully isolate the faulty intermittent component, the diagnoser must propose measurements. In conventional model-based diagnosis, (following the myopic strategy, as is known in the art) the best probe to make next is the one which minimizes,
$$H = -\sum_{\mathcal{D} \in \mathrm{DIAGNOSES}} p(\mathcal{D}) \log p(\mathcal{D}). \qquad (4)$$
[0098] Even though the definition of p(c) is different than in conventional model-based diagnosis, we still want to probe those variables which maximize the likelihood of distinguishing among the possible intermittent faults. The goal is the same — to lower the posterior probability of failure for all but one of the components. A myopic minimum entropy approach will guide probing to this goal.
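The entropy of equation (4) can be sketched as below. The function name `entropy` is hypothetical, and the natural logarithm is an assumption made here since the base is not stated; zero-probability diagnoses are skipped because they contribute nothing to the sum.

```python
import math


def entropy(posteriors):
    """Equation (4): H = -sum p(D) log p(D) over candidate diagnoses.
    Natural log is assumed; terms with p = 0 are skipped."""
    return -sum(p * math.log(p) for p in posteriors if p > 0)


uniform = entropy([1 / 3, 1 / 3, 1 / 3])   # three equally likely single faults
peaked = entropy([0.998, 0.001, 0.001])    # one diagnosis nearly isolated
print(uniform, peaked)
```

A myopic strategy would evaluate, for each candidate probe, the expected value of this quantity over the probe's possible outcomes and select the probe with the smallest expectation.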
[0099] Figure 10 illustrates a graph 100 which plots the diagnostic cost (obtained from one simulation) of isolating a single intermittent buffer in a sequence of buffers using the minimum entropy strategy. The graph illustrates that, as is the case with persistent faults, groups of components are simultaneously exonerated, so DC grows roughly logarithmically with circuit size. Note that the minimum entropy strategy is much better than the simple approach of Section 4.
[00100] If inputs vary, a slightly different approach is needed. For all but the simplest devices, it is computationally impractical to find the variable to measure next which provides the maximum expected information gain over all possible system inputs. Instead, a heuristic may be employed to ensure measuring the best variable which is directly adjacent to some remaining suspect component. This avoids the possible suboptimal situation where the particular new inputs make the proposed measurement insensitive to any of the faults (e.g., suppose the symptomatic value propagated through a known good And-gate; if the other input of this And-gate were 1 when the fault was first observed and 0 in later inputs, measuring the output of that And-gate would be useless for those input vectors, and it would take many samples to isolate the faulty component).
9 Benchmarks
[00101] This section presents results of an implementation of the present concepts as applied to some standard benchmarks in model-based diagnosis. These examples are drawn from a widely available combinatorial logic test suite from ISCAS-85.
[00102] Turning to Table 110 of Figure 11, the first column (circuit) is the ISCAS-85 circuit name. The second column (components) is the number of distinct gates of the circuit. The remaining columns are computed by hypothesizing each faulty component with a 1 switched to a 0, and a 0 switched to a 1. For each fault, a single (constant) test-vector is found which gives rise to an observable symptom for one of the outputs. For each row, for each component, two simulations are performed. The next column (non-i cost) is the diagnostic cost to isolate the faulty non-intermittent component. The next column (i cost) is the diagnostic cost to isolate the intermittent component with misdiagnosis probability p < 1 - e, with e = 0.1, g = 0.9 and all components equally likely a priori. The final column (error) is the actual error rate observed in the simulation. We expect the actual error rate to approximate e = 0.1.
[00103] A review of Figure 11, which lists the average cost of diagnosing circuits in the ISCAS-85 benchmark, shows that, not surprisingly, the number of samples needed to isolate intermittent faults is far greater than the number needed to identify non-intermittent faults.
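The kind of simulation behind such cost and error columns can be sketched on a toy example. The sketch below does not use the ISCAS-85 circuits. It assumes a chain of n buffers with one faulty buffer, a uniform prior, greedy minimum entropy probing, and a stopping rule that halts once some component's posterior exceeds 1 - e; that stopping rule is one reading of the text, not a confirmed detail. All function names are hypothetical.

```python
import math
import random

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0.0)

def update(p, k, wrong, g):
    # Likelihood for buffers upstream of the probe (0..k) and downstream of it.
    up, down = ((1.0 - g), 0.0) if wrong else (g, 1.0)
    joint = [(up if i <= k else down) * x for i, x in enumerate(p)]
    z = sum(joint)
    return z, ([j / z for j in joint] if z > 0.0 else None)

def best_probe(p, g):
    def cost(k):
        outcomes = (update(p, k, w, g) for w in (True, False))
        return sum(z * entropy(post) for z, post in outcomes if post is not None)
    return min(range(len(p)), key=cost)

def diagnose(n, faulty, g, e, rng, max_probes=10_000):
    """Probe until one posterior exceeds 1 - e; return (probe count, guess)."""
    p = [1.0 / n] * n
    for cost in range(1, max_probes + 1):
        k = best_probe(p, g)
        # The faulty buffer misbehaves with probability 1 - g when observed.
        wrong = faulty <= k and rng.random() >= g
        _, p = update(p, k, wrong, g)
        guess = max(range(n), key=p.__getitem__)
        if p[guess] > 1.0 - e:
            return cost, guess
    return max_probes, None  # give up after a predetermined number of probes

def average_cost(n, g, e, runs, seed=0):
    """Mean probe count and observed error rate over simulated faults."""
    rng = random.Random(seed)
    total, errors = 0, 0
    for _ in range(runs):
        faulty = rng.randrange(n)
        cost, guess = diagnose(n, faulty, g, e, rng)
        total += cost
        errors += guess != faulty
    return total / runs, errors / runs
```

Calling `average_cost(8, 0.0, 0.1, 50)` models a persistent fault and, under these assumptions, always needs three probes for eight buffers. Calling `average_cost(8, 0.9, 0.1, 50)` models the intermittent case and needs many more probes on average, which mirrors the qualitative gap between the two cost columns.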
[00104] The generalization to multiple intermittent faults is a direct generalization of the e-policy of Section 3 (Extensions to the conventional framework to support intermittent faults). Let $M(C_p, C_n, S, r, t)$ be the predicate, where $S \subseteq C_p \subseteq \mathrm{COMPS}$, which holds if r holds at time t when all components in S are outputting an incorrect value. More formally,
$$M(C_p, C_n, S, r, t) \equiv [C_n],\ \mathrm{OBS},\ \mathrm{SD},\ \bigwedge_{c \in S} o(c),\ \bigwedge_{c \in C_p - S} \neg o(c) \vdash T(r, t).$$
We can now define the observation function:
$$P_t(x_i = v_k \mid D(C_p, C_n)) = \sum_{S \subseteq C_p \,\wedge\, M(C_p, C_n, S, r, t)} \ \prod_{c \in S} (1 - g(c)) \prod_{c \in C_p - S} g(c).$$
r can represent any observation, but in all the examples in this discussion r is $x_i = v_k$. The computation required is exponential in the number of faults, not the number of components, so $P_t(x_i = v_k \mid D(C_p, C_n))$ can be evaluated efficiently. These can be evaluated directly with the ATMS/HTMS framework.
10 Extension to multiple persistent or intermittent faults
[00105] A persistent fault in component c can be modeled as an intermittent fault with g(c) = 0.0. As the probing strategy is the same for non-intermittent and intermittent faults, no change is needed to the algorithms (except g(c) is more difficult to learn) to diagnose multiple simultaneous faults of both types. Modeling persistent faults as g(c) = 0.0 has two drawbacks: (1) it is a "strong" fault model (i.e., the component always behaves incorrectly if it is faulted), and (2) it cannot distinguish between intermittent and non-intermittent faults in a single component. Therefore we introduce fault modes as is known in the literature. An inverter might fail with its output stuck at 1, or its output stuck at 0, or it might intermittently produce the wrong output:
$$\mathrm{INVERTER}(x) \rightarrow OK(x) \vee SA0(x) \vee SA1(x) \vee U(x),$$
where OK(x) corresponds to the earlier $\neg AB(x)$,
$$OK(x) \rightarrow [in(x, t) = 0 \equiv out(x, t) = 1], \quad SA0(x) \rightarrow [out(x) = 0], \quad SA1(x) \rightarrow [out(x) = 1],$$
and U(x) is treated as an intermittent fault (just as AB(x) was earlier). The definition of diagnosis is generalized to be an assignment of modes to each component. For example, the diagnosis [O1] for the Full-Adder is represented as [U(O1)] (and G modes are not included). The approach of this application can be used to distinguish between an inverter stuck-at-1 and an inverter being faulted, e.g., [SA1(x)] vs. [U(x)].
11 Related Work
[00106] There has been relatively little work in the model-based diagnosis community on intermittent faults. The approach presented here exploits the notion of exoneration, that is, ruling out components as failing when they are observed to be behaving correctly. Previous work along this line includes (1) the alibi notion, (2) corroboration, and (3) converting simple single-fault intermittent diagnosis tasks of combinational logic to dynamic programming. Another approach presents an intermittent diagnosis approach applicable to Discrete Event Systems.
[00107] The preceding discussion focused on providing an improved system and method of diagnosing faults of a system under test. As previously mentioned, such diagnostic testing can be implemented in a wide range of areas. For example, as shown in Figure 12, concepts disclosed in the present application may be embodied in a diagnostic device or system 130, including a body 132 and probes 134. Probes 134 are designed to be positioned in operative association with a device under test 136. Body 132 may include an input 138 and an output 140. The input 138 can include an alphanumeric keypad, stylus, voice, or other input design or interface known to input data or instructions. Output 140 may be any sort of display to display the results of a diagnostic investigation. Body 132 may also include a secondary set of inputs 142, wherein information detected by probes 134 is automatically input into diagnostic device 130.
[00108] It is to be understood that body 132 includes computational capabilities including at least a processor 144 and memory 146, which permit the processing of software code, including code incorporating the concepts described herein. Still further, diagnostic device or system 130 may include output 148, for connection to an output device 150 to permit the printing of hardcopies of output reports regarding the results of the diagnostic investigation.
[00109] It is to be appreciated that the above description may be implemented on customized diagnostic devices, and/or may be included as part of hand-held computers, laptops, desktops or other computing devices, including personal digital assistants. Still further, the diagnostic device or system 130 is intended only as an example of how the concepts of the present application may be implemented.
[00110] In another embodiment, Figure 12 may not include probes 134, but rather the diagnostics may be undertaken on computer software operating on the diagnostic device, or associated with another device having computational capabilities.
[00111] In another embodiment illustrated in Figure 13, the diagnostic device or system 160 is itself embedded as part of a larger overall system 162 which may, for example, include components 164-170 shown in operative connection with each other via solid lines 172. The diagnostic device or system 160, in turn is shown to be in operative connection with the components 164-170 via dotted lines 174. It is to be appreciated Figure 13 is a high level example of a system in which a diagnostic device or system according to the present application may be used. The purpose of the diagnostic device or system 160 is to identify faults in the overall system 162 and initiate repairs without any human intervention. Examples of such overall systems would be reprographic equipment, automobiles, spacecraft, and airplanes, among others.
[00112] It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also, various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.

Claims

1. A method for diagnosing any combination of persistent and intermittent faults or intermittent faults in a system, comprising: obtaining a behavior of a system by measuring or probing the system at particular points; investigating a predicted behavior of a modeled system of the system by drawing inferences based on at least conditional probabilities, prior observations and component models of the modeled system; comparing the predictions to their corresponding points in the system; determining if conflicts or deviations exist between the measured behavior and the predicted behavior; adjusting the conditional probabilities to more and more accurately reflect action faults in the system, so they will accept a possibility that a subsequent measurement and/or probing will produce a different value or output; and using the conflicts or deviations between the predicted behavior and the actual behavior to obtain a final result based on a new value or output.
2. The method of claim 1 further including, generating an error output when the final result is not reached within a predetermined number of measurements and/or probes.
3. The method according to claim 1 wherein the final result includes isolating the component causing the fault or faults.
4. The method according to claim 1 further including a next stage of measuring and/or probing, which includes determining a next location and/or point to measure and/or probe.
5. The method according to claim 1 wherein the system model, models the dynamic behavior of the system.
6. The method according to claim 5 wherein the dynamic behavior of the system is viewed as a sequence of non-temporal events.
7. The method according to claim 1 wherein intermittent failures arise from a stochastic physical process which is intermittent at any level of abstraction of the system model.
8. A method of troubleshooting a real world system having multiple components, and containing any combination of intermittent or non-intermittent faults, the method comprising: generating a model of the real world system, the system model including multiple model components; associating with each model component of the system model, two probabilities, (1) a probability the real world component deviates from its design such that it may exhibit a malfunction, and (2) (for potentially intermittent components) a conditional probability the faulted component malfunctions when observed; recalculating the conditional probabilities associated with possible fault causes, to guide a next stage of observation and/or measurement; undertaking a next stage of observation and/or measurement; determining when enough data has been obtained; and reaching a final result.
9. The method of claim 8 further including, generating an error output when an acceptable final result is not reached within a predetermined number of measurements and/or observations.
10. The method according to claim 8 wherein reaching the final result includes isolating the component causing the fault or faults.
11. The method according to claim 1 wherein the step of undertaking the next stage of observation and/or measurement includes determining a next location to observe and/or take a measurement.
12. The method according to claim 8 wherein the system model, models the dynamic behavior of the system.
13. The method according to claim 12 wherein the dynamic behavior of the system is viewed as a sequence of non-temporal events.
14. The method according to claim 8 wherein intermittent failures arise from a stochastic physical process which is intermittent at any level of abstraction of the system model.
15. A computer program product for use with a computing device, the computer product comprising: a computer useable medium having computer readable program code embodied therein for, diagnosing any combination of persistent and intermittent faults or intermittent faults in a system by, obtaining a behavior of a system by measuring or probing the system at particular points; investigating a predicted behavior of a modeled system of the system by drawing inferences based on at least conditional probabilities, prior observations and component models; comparing the predictions to their corresponding points in the system; determining if conflicts or deviations exist between the measured behavior and the predicted behavior; adjusting the conditional probabilities to more and more accurately reflect action faults in the system, so they will accept a possibility that a subsequent measurement will produce a different value; and using the conflicts or deviations between the predicted behavior and the actual behavior to obtain a final result based on a new value or output.
16. The computer program product of claim 15 further including, generating an error output when an acceptable final result is not reached within a predetermined number of measurements and/or probes.
17. The computer program product according to claim 15 wherein obtaining the final result includes isolating the component causing the fault or faults.
18. The computer program product according to claim 15 wherein the step of undertaking a next stage of observation and/or measurement includes determining a next location to observe and/or take a measurement.
19. The computer program product according to claim 15 wherein the system model, models the dynamic behavior of the system.
20. The computer program product according to claim 19 wherein the dynamic behavior of the system is viewed as a sequence of non-temporal events.
21. The computer program product according to claim 15 wherein intermittent failures arise from a stochastic physical process which is intermittent at any level of abstraction of the system model.
PCT/US2007/085527 2007-05-24 2007-11-26 Diagnosing intermittent faults WO2008143701A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2010509324A JP5140722B2 (en) 2007-05-24 2007-11-26 Method and program for determining intermittent failures and method for troubleshooting real world systems
EP07864789A EP2153325B1 (en) 2007-05-24 2007-11-26 Diagnosing intermittent faults

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US93152407P 2007-05-24 2007-05-24
US93152607P 2007-05-24 2007-05-24
US60/931,526 2007-05-24
US60/931,524 2007-05-24
US11/940,493 US8024610B2 (en) 2007-05-24 2007-11-15 Diagnosing intermittent faults
US11/940,493 2007-11-15

Publications (1)

Publication Number Publication Date
WO2008143701A1 (en) 2008-11-27

Family

ID=40032205

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/085527 WO2008143701A1 (en) 2007-05-24 2007-11-26 Diagnosing intermittent faults

Country Status (4)

Country Link
US (1) US8024610B2 (en)
EP (1) EP2153325B1 (en)
JP (1) JP5140722B2 (en)
WO (1) WO2008143701A1 (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9563525B2 (en) * 2009-06-02 2017-02-07 Palo Alto Research Center Incorporated Minimum cardinality candidate generation
US8386849B2 (en) * 2010-01-29 2013-02-26 Honeywell International Inc. Noisy monitor detection and intermittent fault isolation
AT511577B1 (en) * 2011-05-31 2015-05-15 Avl List Gmbh MACHINE IMPLEMENTED METHOD FOR OBTAINING DATA FROM A NON-LINEAR DYNAMIC ESTATE SYSTEM DURING A TEST RUN
DE102011079034A1 (en) 2011-07-12 2013-01-17 Siemens Aktiengesellschaft Control of a technical system
US8689055B2 (en) 2011-07-28 2014-04-01 International Business Machines Corporation Detecting device impairment through statistical monitoring
US8732524B2 (en) * 2011-08-03 2014-05-20 Honeywell International Inc. Systems and methods for using a corrective action as diagnostic evidence
US10203231B2 (en) * 2014-07-23 2019-02-12 Hach Company Sonde
CN104614989B (en) * 2014-12-26 2017-04-05 北京控制工程研究所 Improve the spacecraft attitude sensory perceptual system Optimal Configuration Method of fault diagnosability
CN105117331B (en) * 2015-08-17 2018-04-13 浪潮(北京)电子信息产业有限公司 Coincidence correctness test case recognition methods and device towards location of mistake
CN105204499B (en) * 2015-10-09 2018-01-02 南京航空航天大学 Helicopter collaboration formation method for diagnosing faults based on Unknown Input Observer
US10621061B2 (en) * 2016-11-28 2020-04-14 B. G. Negev Technologies amd Applications Ltd. at Ben-Gurion University Combined model-based approach and data driven prediction for troubleshooting faults in physical systems
US10735271B2 (en) * 2017-12-01 2020-08-04 Cisco Technology, Inc. Automated and adaptive generation of test stimuli for a network or system
US11321504B2 (en) * 2018-05-09 2022-05-03 Palo Alto Research Center Incorporated Learning constitutive equations of physical components with constraints discovery
CN109032107B (en) * 2018-06-05 2021-07-20 国家电网公司 Equipment fault signal frequency prediction method based on Bayesian classification
CN110865630B (en) * 2019-11-14 2022-07-05 深圳供电局有限公司 Acceptance method and system for built-in program of intelligent substation
US11815560B2 (en) * 2020-02-12 2023-11-14 Dit-Mco International Llc Methods and systems for wire harness test results analysis
CN113688041B (en) * 2021-08-24 2023-01-31 紫光展锐(重庆)科技有限公司 Pressure testing method, system, storage medium and terminal
US20230104347A1 (en) * 2021-09-24 2023-04-06 Palo Alto Research Center Incorporated Methods and systems for fault diagnosis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5293323A (en) * 1991-10-24 1994-03-08 General Electric Company Method for fault diagnosis by assessment of confidence measure
US6230109B1 (en) * 1995-05-16 2001-05-08 The United States Of America As Represented By The Secretary Of The Navy Multiconductor continuity and intermittent fault analyzer with distributed processing and dynamic stimulation
US20020193920A1 (en) * 2001-03-30 2002-12-19 Miller Robert H. Method and system for detecting a failure or performance degradation in a dynamic system such as a flight vehicle
US6751536B1 (en) * 2002-12-04 2004-06-15 The Boeing Company Diagnostic system and method for enabling multistage decision optimization for aircraft preflight dispatch
US20050251364A1 (en) * 2004-05-06 2005-11-10 Pengju Kang Sensor fault diagnostics and prognostics using component model and time scale orthogonal expansions

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4766595A (en) * 1986-11-26 1988-08-23 Allied-Signal Inc. Fault diagnostic system incorporating behavior models
JPH0721029A (en) * 1993-07-05 1995-01-24 Komatsu Ltd Inference device
JPH07311754A (en) * 1994-05-19 1995-11-28 Ricoh Co Ltd Learning machine
JPH08221378A (en) * 1995-02-10 1996-08-30 Ricoh Co Ltd Learning machine
JPH08305853A (en) * 1995-04-28 1996-11-22 Mitsubishi Electric Corp Method and device for object recognition and decision making based upon recognition
US6182258B1 (en) * 1997-06-03 2001-01-30 Verisity Ltd. Method and apparatus for test generation during circuit design
US6167352A (en) * 1997-06-26 2000-12-26 Agilent Technologies, Inc. Model-based diagnostic system with automated procedures for next test selection
US6947797B2 (en) * 1999-04-02 2005-09-20 General Electric Company Method and system for diagnosing machine malfunctions
US6675138B1 (en) * 1999-06-08 2004-01-06 Verisity Ltd. System and method for measuring temporal coverage detection
US6484135B1 (en) * 1999-08-30 2002-11-19 Hewlett-Packard Company Method for adaptive test generation via feedback from dynamic emulation
US6684359B2 (en) * 2000-11-03 2004-01-27 Verisity Ltd. System and method for test generation with dynamic constraints using static analysis
GB0307406D0 (en) * 2003-03-31 2003-05-07 British Telecomm Data analysis system and method
US7665067B2 (en) * 2003-09-15 2010-02-16 Cadence Design (Israel) Ii Ltd. Method and system for automatically creating tests
US7356443B2 (en) * 2003-12-17 2008-04-08 Agilent Technologies, Inc. Systems and methods for analyzing the selection of measurements of a communication network
JP2005309616A (en) * 2004-04-19 2005-11-04 Mitsubishi Electric Corp Facility equipment failure diagnosis system and failure diagnostic rule creation method
JP2005309077A (en) * 2004-04-21 2005-11-04 Fuji Xerox Co Ltd Fault diagnostic method, fault diagnostic system, transporting device, and image forming apparatus, and program and storage medium
US7519564B2 (en) * 2004-11-16 2009-04-14 Microsoft Corporation Building and using predictive models of current and future surprises

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2153325A4 *

Also Published As

Publication number Publication date
EP2153325A4 (en) 2011-05-18
JP5140722B2 (en) 2013-02-13
EP2153325B1 (en) 2013-01-23
US8024610B2 (en) 2011-09-20
JP2010528364A (en) 2010-08-19
EP2153325A1 (en) 2010-02-17
US20080294578A1 (en) 2008-11-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07864789

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2010509324

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 2007864789

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE