Troubleshooting is a form of problem solving most often applied to the repair of failed products or processes. It is a logical, systematic search for the source of a problem so that it can be solved and the product or process made operational again. Troubleshooting is needed to develop and maintain complex systems where the symptoms of a problem can have many possible causes. It is used in many fields such as engineering, system administration, electronics, automotive repair, and diagnostic medicine. Troubleshooting requires identification of the malfunction(s) or symptoms within a system. Then, experience is commonly used to generate possible causes of the symptoms. Determining which cause is most likely is often a process of elimination, ruling out potential causes one by one. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.
In general, troubleshooting is the identification or diagnosis of “trouble” in a system caused by a failure of some kind. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining the causes of these symptoms.
A system can be described in terms of its expected, desired, or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. (For example, selecting the “print” option from various computer applications is intended to result in a hardcopy emerging from some specific device.) Any unexpected or undesirable behavior is a symptom. Troubleshooting is the process of isolating the specific cause or causes of the symptom. Frequently the symptom is a failure of the product or process to produce any results. (Nothing was printed, for example.)
The methods of forensic engineering are especially useful in tracing problems in products or processes, and a wide range of analytical techniques are available to determine the cause or causes of specific failures. Corrective action can then be taken to prevent further failures of a similar kind. Preventative action is possible using FMEA and FTA before full scale production, and these methods can also be used for failure analysis.
Most discussion of troubleshooting, and especially training in formal troubleshooting procedures, tends to be domain specific, even though the basic principles are universally applicable.
Usually troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. So the initial focus is often on recent changes to the system or to the environment in which it exists. (For example, a printer that “was working when it was plugged in over there”.) However, there is a well-known principle that correlation does not imply causality. (For example, the failure of a device shortly after it has been plugged into a different outlet does not necessarily mean that the events were related; the failure could have been a coincidence.) Therefore troubleshooting demands critical thinking rather than magical thinking.
It is useful to consider the common experience we have with light bulbs. Light bulbs “burn out” more or less at random; eventually the repeated heating and cooling of a bulb’s filament, and fluctuations in the power supplied to it, cause the filament to crack or vaporize. The same principle applies to most other electronic devices, and similar principles apply to mechanical devices. Some failures are part of the normal wear and tear of components in a system.
A basic principle in troubleshooting is to start from the simplest and most probable possible problems first. This is illustrated by the old saying “When you see hoof prints, look for horses, not zebras”, or, to use another maxim, the KISS principle. This principle underlies the common complaint about help desks or manuals: that they sometimes first ask “Is it plugged in, and does that receptacle have power?” This should not be taken as an affront; rather, it is a reminder to always check the simple things first before calling for help.
A troubleshooter could check each component in a system one by one, substituting known good components for each potentially suspect one. However, this process of “serial substitution” can be considered degenerate when components are substituted without regard to a hypothesis concerning how their failure could result in the symptoms being diagnosed.
Efficient methodical troubleshooting starts with a clear understanding of the expected behavior of the system and the symptoms being observed. From there the troubleshooter forms hypotheses on potential causes, and devises (or perhaps references a standardized checklist of) tests to eliminate these prospective causes. Two common strategies used by troubleshooters are to check for frequently encountered or easily tested conditions first (for example, checking to ensure that a printer’s light is on and that its cable is firmly seated at both ends), and to “bisect” the system (for example in a network printing system, checking to see if the job reached the server to determine whether a problem exists in the subsystems “towards” the user’s end or “towards” the device).
This latter technique can be particularly efficient in systems with long chains of serialized dependencies or interactions among their components. It is simply the application of a binary search across the chain of dependencies.
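The bisection strategy can be sketched in a few lines of code. The chain of printing stages and the `probe` function below are invented for illustration; a real probe would be whatever intermediate check the system offers (a server queue, a log entry, a test page):

```python
# Hypothetical sketch of "bisection" troubleshooting: given a chain of
# stages where each stage depends on all previous ones, a probe at
# position i reports whether everything up to i is still good. A binary
# search then isolates the first failing stage in O(log n) probes instead
# of checking each stage in turn. Stage names and probe are assumptions.

STAGES = ["application", "spooler", "network", "print server", "printer"]
FIRST_BAD = 3  # pretend the print server is the broken component

def probe(i: int) -> bool:
    """Return True if stages 0..i (inclusive) are all working."""
    return i < FIRST_BAD

def bisect_failure(n_stages: int) -> int:
    """Find the index of the first failing stage with O(log n) probes."""
    lo, hi = 0, n_stages - 1  # invariant: first failure lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid + 1   # everything up to mid works; look later
        else:
            hi = mid       # failure is at mid or earlier
    return lo

print("first failing stage:", STAGES[bisect_failure(len(STAGES))])
```

For the five-stage chain above, three probes suffice where serial checking could need five.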
Simple and intermediate systems are characterized by lists or trees of dependencies among their components or subsystems. More complex systems contain cyclical dependencies or interactions (feedback loops). Such systems are less amenable to “bisection” troubleshooting techniques.
It also helps to start from a known good state, the best example being a computer reboot. A cognitive walkthrough is also a good thing to try. Comprehensive documentation produced by proficient technical writers is very helpful, especially if it provides a theory of operation for the subject device or system.
A common cause of problems is bad design, for example bad human factors design, where a device could be inserted backward or upside down due to the lack of an appropriate forcing function (behavior-shaping constraint), or a lack of error-tolerant design. This is especially bad if accompanied by habituation, where the user just doesn’t notice the incorrect usage, for instance if two parts have different functions but share a common case so that it isn’t apparent on a casual inspection which part is being used.
Troubleshooting can also take the form of a systematic checklist, troubleshooting procedure, flowchart, or table that is made before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought about the steps to take and organizes the troubleshooting into the most efficient process. Troubleshooting tables can be computerized to make them more efficient for users.
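A computerized troubleshooting table can be as simple as data plus a lookup. In this minimal sketch, each symptom maps to an ordered list of checks, cheapest and most likely first (the “horses, not zebras” ordering); all symptoms and checks are invented examples, not drawn from any real procedure:

```python
# A troubleshooting table encoded as data: symptom -> ordered checklist.
# Entries are illustrative assumptions only.
TROUBLESHOOTING_TABLE = {
    "nothing prints": [
        "Is the printer powered on?",
        "Is the cable seated at both ends?",
        "Did the job reach the print server?",
        "Is the driver installed and selected?",
    ],
    "no network": [
        "Is the link light on?",
        "Can you ping the gateway?",
        "Is DNS resolving?",
    ],
}

def checks_for(symptom: str) -> list[str]:
    """Return the ordered checklist for a symptom, or a fallback step."""
    return TROUBLESHOOTING_TABLE.get(symptom, ["Symptom not in table: escalate."])

for step in checks_for("nothing prints"):
    print("-", step)
```

Keeping the procedure as data, rather than code, makes it easy for non-programmers to review and extend the checklist.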
One of the core principles of troubleshooting is that reproducible problems can be reliably isolated and resolved. Often considerable effort and emphasis in troubleshooting is placed on reproducibility … on finding a procedure to reliably induce the symptom to occur.
Once this is done then systematic strategies can be employed to isolate the cause or causes of a problem; and the resolution generally involves repairing or replacing those components which are at fault.
Some of the most difficult troubleshooting issues relate to symptoms that are only intermittent. In electronics this often is the result of components that are thermally sensitive (since resistance of a circuit varies with the temperature of the conductors in it). Compressed air can be used to cool specific spots on a circuit board and a heat gun can be used to raise the temperatures; thus troubleshooting of electronics systems frequently entails applying these tools in order to reproduce a problem.
In computer programming race conditions often lead to intermittent symptoms which are extremely difficult to reproduce; various techniques can be used to force the particular function or module to be called more rapidly than it would be in normal operation (analogous to “heating up” a component in a hardware circuit) while other techniques can be used to introduce greater delays in, or force synchronization among, other modules or interacting processes.
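The “heating up” technique for software races can be illustrated in miniature. The counter below is deliberately unsafe, and the harness shortens the interpreter’s thread-switch interval and hammers the critical section to raise the odds of reproducing the lost-update bug. This is Python-specific and illustrative; the lost update is probabilistic, so any single run may still come out correct:

```python
# Stress harness for an intentionally racy read-modify-write counter.
# Parameters (thread count, iteration count, switch interval) are
# illustrative; real stress settings are workload-specific.
import sys
import threading

class UnsafeCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        # Read-modify-write without a lock: a thread switch between the
        # read and the write loses an update.
        current = self.value
        self.value = current + 1

def stress(n_threads: int = 8, n_increments: int = 10_000) -> int:
    sys.setswitchinterval(1e-6)  # force frequent thread switches
    counter = UnsafeCounter()
    threads = [
        threading.Thread(
            target=lambda: [counter.increment() for _ in range(n_increments)]
        )
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Anything below n_threads * n_increments is a lost update.
    return counter.value

result = stress()
print("expected", 8 * 10_000, "got", result)
```

The final count can never exceed the number of increments attempted; a shortfall, when it appears, is the reproduced symptom.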
Intermittent issues can be defined as follows:
An intermittent fault is one which occurs irregularly or inconsistently.
In particular, there is a distinction between the frequency of occurrence and a “known procedure to consistently reproduce” an issue. For example, knowing that an intermittent problem occurs “within” an hour of a particular stimulus or event … but that sometimes it happens in five minutes and other times it takes almost an hour … does not constitute a “known procedure”, even if the stimulus does increase the frequency of observable exhibitions of the symptom.
Nevertheless, sometimes troubleshooters must resort to statistical methods … and can only find procedures to increase the symptom’s occurrence to a point at which serial substitution or some other technique is feasible. In such cases, even when the symptom seems to disappear for significantly longer periods, there is a low confidence that the root cause has been found and that the problem is truly solved.
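A back-of-the-envelope calculation shows why confidence stays low: if the fault was occurring at a roughly known rate before a repair, and we assume (a big assumption) that occurrences follow a Poisson process, then the chance of seeing zero faults in time t is exp(-rate × t), and we can solve for the fault-free observation time needed to reach a given confidence:

```python
# Estimate how long a system must run fault-free before we gain a chosen
# confidence that an intermittent fault is actually gone, under the
# (assumed) model that fault occurrences follow a Poisson process.
import math

def observation_time_needed(rate_per_hour: float, confidence: float) -> float:
    """Hours of fault-free operation needed so P(fault still present
    yet unobserved) drops below 1 - confidence."""
    # Solve exp(-rate * t) <= 1 - confidence for t.
    return -math.log(1.0 - confidence) / rate_per_hour

# A fault seen about twice per hour before the repair:
hours = observation_time_needed(rate_per_hour=2.0, confidence=0.95)
print(f"observe for about {hours:.2f} fault-free hours")  # ≈ 1.50
```

The rarer the fault, the longer the required vigil, which is exactly why rarely occurring symptoms leave troubleshooters with low confidence in any fix.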
Also, tests may be run to stress certain components to determine if those components have failed.
Isolating single component failures which cause reproducible symptoms is relatively straightforward.
However, many problems only occur as a result of multiple failures or errors. This is particularly true of fault tolerant systems, or those with built-in redundancy. Features which add redundancy, fault detection and failover to a system may also be subject to failure, and enough different component failures in any system will “take it down.”
Even in simple systems the troubleshooter must always consider the possibility that there is more than one fault. (Replacing each component, using serial substitution, and then swapping each new component back out for the old one when the symptom is found to persist, can fail to resolve such cases. More importantly the replacement of any component with a defective one can actually increase the number of problems rather than eliminating them).
Note that, while we talk about “replacing components” the resolution of many problems involves adjustments or tuning rather than “replacement.” For example, intermittent breaks in conductors — or “dirty or loose contacts” might simply need to be cleaned and/or tightened. All discussion of “replacement” should be taken to mean “replacement or adjustment or other maintenance.”
Root cause analysis
Root cause analysis (RCA) is a class of problem solving methods aimed at identifying the root causes of problems or events. The practice of RCA is predicated on the belief that problems are best solved by attempting to correct or eliminate root causes, as opposed to merely addressing the immediately obvious symptoms. By directing corrective measures at root causes, it is hoped that the likelihood of problem recurrence will be minimized. However, it is recognized that complete prevention of recurrence by a single intervention is not always possible. Thus, RCA is often considered to be an iterative process, and is frequently viewed as a tool of continuous improvement.
RCA is initially a reactive method of problem detection and solving: the analysis is done after an event has occurred. With accumulated expertise, however, RCA becomes a proactive method, able to forecast the possibility of an event before it occurs.
Root cause analysis is not a single, sharply defined methodology; there are many different tools, processes, and philosophies of RCA in existence. However, most of these can be classed into five, very-broadly defined “schools” that are named here by their basic fields of origin: safety-based, production-based, process-based, failure-based, and systems-based.
- Safety-based RCA descends from the fields of accident analysis and occupational safety and health.
- Production-based RCA has its origins in the field of quality control for industrial manufacturing.
- Process-based RCA is basically a follow-on to production-based RCA, but with a scope that has been expanded to include business processes.
- Failure-based RCA is rooted in the practice of failure analysis as employed in engineering and maintenance.
- Systems-based RCA has emerged as an amalgamation of the preceding schools, along with ideas taken from fields such as change management, risk management, and systems analysis.
Despite the seeming disparity in purpose and definition among the various schools of root cause analysis, there are some general principles that could be considered as universal. Similarly, it is possible to define a general process for performing RCA.
General principles of root cause analysis
- Aiming performance improvement measures at root causes is more effective than merely treating the symptoms of a problem.
- To be effective, RCA must be performed systematically, with conclusions and causes backed up by documented evidence.
- There is usually more than one potential root cause for any given problem.
- To be effective the analysis must establish all known causal relationships between the root cause(s) and the defined problem.
- Root cause analysis transforms an old culture that reacts to problems to a new culture that solves problems before they escalate, creating a variability reduction and risk avoidance mindset.
General process for performing and documenting an RCA-based Corrective Action
Notice that RCA (in steps 3, 4 and 5) forms the most critical part of successful corrective action, because it directs the corrective action at the root of the problem. That is to say, it is effective solutions we seek, not root causes. Root causes are secondary to the goal of prevention, and are only revealed after we decide which solutions to implement.
1. Define the problem.
2. Gather data/evidence.
3. Ask why and identify the causal relationships associated with the defined problem.
4. Identify which causes, if removed or changed, will prevent recurrence.
5. Identify effective solutions that prevent recurrence, are within your control, meet your goals and objectives, and do not cause other problems.
6. Implement the recommendations.
7. Observe the recommended solutions to ensure effectiveness.
8. Apply a variability-reduction methodology for problem solving and problem avoidance.
Root cause analysis techniques
- Barrier analysis – a technique used particularly in process industries. It is based on tracing energy flows, with a focus on barriers to those flows, to identify how and why the barriers did not prevent the energy flows from causing harm.
- Bayesian inference
- Causal factor tree analysis – a technique based on displaying causal factors in a tree-structure such that cause-effect dependencies are clearly identified.
- Change analysis – an investigation technique often used for problems or accidents. It is based on comparing a situation that does not exhibit the problem to one that does, in order to identify the changes or differences that might explain why the problem occurred.
- Current Reality Tree – a method developed by Eliyahu M. Goldratt in his Theory of Constraints that guides an investigator to identify and relate all root causes using a cause-effect tree whose elements are bound by rules of logic (Categories of Legitimate Reservation). The CRT begins with a brief list of the undesirable things we see around us, and then guides us towards one or more root causes. This method is particularly powerful when the system is complex, there is no obvious link between the observed undesirable things, and a deep understanding of the root cause(s) is desired.
- Failure mode and effects analysis (FMEA)
- Fault tree analysis
- 5 Whys
- Ishikawa diagram, also known as the fishbone diagram or cause and effect diagram
- Kepner-Tregoe Problem Analysis – a root cause analysis process developed in 1958, which provides a fact-based approach to systematically rule out possible causes and identify the true cause
- Pareto analysis
- RPR Problem Diagnosis – an ITIL-aligned method for diagnosing IT problems
Common cause analysis (CCA) and common mode analysis (CMA) are evolving engineering techniques, applied to complex technical systems, for determining whether common root causes in hardware, software, or highly integrated system interactions may contribute to human error or improper operation of a system. Systems are analyzed for root causes and causal factors to determine the probability of failure modes, fault modes, or common-mode software faults due to escaped requirements. Complete testing and verification are also used to ensure that complex systems are designed without common causes that lead to severe hazards. Common cause analysis is sometimes required as part of the safety engineering tasks for theme parks, commercial and military aircraft, spacecraft, complex control systems, large electrical utility grids, nuclear power plants, automated industrial controls, medical devices, and other safety-critical systems with complex functionality.
Basic elements of root cause
- Materials
  - Defective raw material
  - Wrong type for job
  - Lack of raw material
- Machine / Equipment
  - Incorrect tool selection
  - Poor maintenance or design
  - Poor equipment or tool placement
  - Defective equipment or tool
- Environment
  - Orderly workplace
  - Job design or layout of work
  - Surfaces poorly maintained
  - Physical demands of the task
  - Forces of nature
- Management
  - No or poor management involvement
  - Inattention to task
  - Task hazards not guarded properly
  - Other (horseplay, inattention…)
  - Stress demands
  - Lack of process
- Methods
  - No or poor procedures
  - Practices are not the same as written procedures
  - Poor communication
- Management system
  - Training or education lacking
  - Poor employee involvement
  - Poor recognition of hazard
  - Previously identified hazards were not eliminated
- 4ME (Man, Machine, Materials, Method and Environment)
The 5 Whys is a question-asking method used to explore the cause/effect relationships underlying a particular problem. Ultimately, the goal of applying the 5 Whys method is to determine a root cause of a defect or problem.
The following example demonstrates the basic process:
- My car will not start. (the problem)
- Why? – The battery is dead. (first why)
- Why? – The alternator is not functioning. (second why)
- Why? – The alternator belt has broken. (third why)
- Why? – The alternator belt was well beyond its useful service life and has never been replaced. (fourth why)
- Why? – I have not been maintaining my car according to the recommended service schedule. (fifth why, root cause)
The questioning for this example could be taken further to a sixth, seventh, or even greater level. This would be legitimate, as the “five” in 5 Whys is not gospel; rather, it is postulated that five iterations of asking why is generally sufficient to get to a root cause. The real key is to encourage the troubleshooter to avoid assumptions and logic traps and instead to trace the chain of causality in direct increments from the effect through any layers of abstraction to a root cause that still has some connection to the original problem.
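The car example above can be sketched as a walk down a chain of causes. The `causes` mapping encodes the example’s causal knowledge; the loop is the 5 Whys itself, asking “why?” until no deeper cause is known:

```python
# A toy 5 Whys: follow a cause-of mapping from the stated problem down
# to the deepest known cause. The mapping reproduces the car example.
causes = {
    "car will not start": "battery is dead",
    "battery is dead": "alternator is not functioning",
    "alternator is not functioning": "alternator belt has broken",
    "alternator belt has broken":
        "belt was beyond its service life and never replaced",
    "belt was beyond its service life and never replaced":
        "car not maintained per the recommended service schedule",
}

def five_whys(problem: str, max_depth: int = 5) -> list[str]:
    """Trace the causal chain from a problem down to its deepest known cause."""
    chain = [problem]
    while chain[-1] in causes and len(chain) <= max_depth:
        chain.append(causes[chain[-1]])
    return chain

for depth, cause in enumerate(five_whys("car will not start")):
    print(f"why #{depth}: {cause}" if depth else f"problem: {cause}")
```

Of course, the hard part in practice is building the `causes` mapping itself, which is exactly the knowledge the investigator must gather and verify at each step.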
The technique was originally developed by Sakichi Toyoda and was later used within Toyota Motor Corporation during the evolution of their manufacturing methodologies. It is a critical component of problem solving training delivered as part of the induction into the Toyota Production System. The architect of the Toyota Production System, Taiichi Ohno, described the 5 whys method as “… the basis of Toyota’s scientific approach … by repeating why five times, the nature of the problem as well as its solution becomes clear.” The tool has seen widespread use beyond Toyota, and is now used within Kaizen, lean manufacturing, and Six Sigma.
While the 5 Whys is a powerful tool for engineers or technically savvy individuals to help get to the true causes of problems, it has been criticized by Teruyuki Minoura, former managing director of global purchasing for Toyota, as being too basic a tool to analyze root causes to the depth that is needed to ensure that the causes are fixed. Reasons for this criticism include:
- Tendency for investigators to stop at symptoms rather than going on to lower level root causes.
- Inability to go beyond the investigator’s current knowledge – the investigator cannot find causes that they do not already know.
- Lack of support to help the investigator to ask the right “why” questions.
- Results aren’t repeatable – different people using 5 Whys come up with different causes for the same problem.
These can be significant problems when the method is applied through deduction only. On-the-spot verification of the answer to the current “why” question, before proceeding to the next, is recommended as a good practice to avoid these issues.
Trial and error
Trial and error, or trial by error, is a general method of problem solving, fixing things, or for obtaining knowledge. “Learning doesn’t happen from failure itself but rather from analyzing the failure, making a change, and then trying again.”
In the field of computer science, the method is called generate and test. In elementary algebra, when solving equations, it is “guess and check”.
In trial and error, one selects a possible answer, applies it to the problem and, if it is not successful, selects (or generates) another possibility that is subsequently tried. The process ends when a possibility yields a solution.
In some versions of trial and error, the option that is a priori viewed as the most likely one should be tried first, followed by the next most likely, and so on until a solution is found, or all the options are exhausted. In other versions, options are simply tried at random.
This approach is more successful with simple problems and in games, and is often resorted to when no apparent rule applies. This does not mean that the approach need be careless, for an individual can be methodical in manipulating the variables in an attempt to sort through possibilities that may result in success. Nevertheless, this method is often used by people who have little knowledge in the problem area.
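The “most likely first” variant can be sketched directly: candidate fixes are tried in descending order of a priori probability, stopping at the first that works. The candidates, probabilities, and the test oracle here are all illustrative assumptions:

```python
# Ordered trial and error: sort options by prior likelihood and try each
# until one succeeds. Candidates and the oracle are invented examples.
candidates = [
    ("reseat the cable", 0.50),
    ("restart the spooler", 0.30),
    ("reinstall the driver", 0.15),
    ("replace the printer", 0.05),
]

def works(fix: str) -> bool:
    """Stand-in oracle: pretend only restarting the spooler helps."""
    return fix == "restart the spooler"

def ordered_trial_and_error(options):
    """Try options from most to least likely; return (fix, trials) or (None, trials)."""
    trials = 0
    for fix, _prob in sorted(options, key=lambda o: o[1], reverse=True):
        trials += 1
        if works(fix):
            return fix, trials
    return None, trials

fix, trials = ordered_trial_and_error(candidates)
print(f"solved by {fix!r} after {trials} trials")
```

When the priors are any good, this ordering minimizes the expected number of trials compared with trying options at random.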
Ashby (1960, section 11/5) offers three simple strategies for dealing with the same basic exercise-problem, and they have very different efficiencies. Suppose there are 1000 on/off switches which have to be set to a particular combination by random-based testing, with each test taking one second. [This is also discussed in Traill (1978/2006, section C1.2).] The strategies are:
- the perfectionist all-or-nothing method, with no attempt at holding partial successes. This would be expected to take more than 10^301 seconds [i.e. 2^1000 seconds, or 3.5×10^291 centuries!];
- a serial-test of switches, holding on to the partial successes (assuming that these are manifest) would take 500 seconds; while
- a parallel-but-individual testing of all switches simultaneously would take only one second.
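Under Ashby’s assumptions, the three expected times can be written down directly; the short calculation below follows his argument rather than running a simulation:

```python
# Expected times for Ashby's three strategies on the 1000-switch
# exercise, at one test per second. Figures follow his argument.
N = 1000

# Strategy 1: test whole random configurations. There are 2^N equally
# likely settings, so the expected number of one-second trials is on the
# order of 2^N.
all_or_nothing_seconds = 2 ** N

# Strategy 2: test switches one at a time, keeping partial successes.
# Each switch is already correct with probability 1/2, so on average
# N/2 of them need a corrective one-second test.
serial_seconds = N // 2

# Strategy 3: test all switches simultaneously in one one-second step.
parallel_seconds = 1

print(f"all-or-nothing: ~10^{len(str(all_or_nothing_seconds)) - 1} seconds")
print(f"serial:          {serial_seconds} seconds")
print(f"parallel:        {parallel_seconds} second")
```

The spread between the strategies — roughly 10^301 seconds versus 500 versus 1 — is the whole point of the exercise.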
Note the tacit assumption here that no intelligence or insight is brought to bear on the problem. However, the existence of different available strategies allows us to consider a separate (“superior”) domain of processing — a “meta-level” above the mechanics of switch handling — where the various available strategies can be randomly chosen. Once again this is “trial and error”, but of a different type.
Ashby’s book develops this “meta-level” idea, and extends it into a whole recursive sequence of levels, successively above each other in a systematic hierarchy. On this basis he argues that human intelligence emerges from such organization: relying heavily on trial-and-error (at least initially at each new stage), but emerging with what we would call “intelligence” at the end of it all. Thus presumably the topmost level of the hierarchy (at any stage) will still depend on simple trial-and-error.
Traill (1978/2006) suggests that this Ashby-hierarchy probably coincides with Piaget‘s well-known theory of developmental stages. [This work also discusses Ashby’s 1000-switch example; see §C1.2]. After all, it is part of Piagetian doctrine that children learn by first actively doing in a more-or-less random way, and then hopefully learn from the consequences — which all has a certain resemblance to Ashby’s random “trial-and-error”.
The basic strategy in many fields?
Trial and error has been seen as the basic strategy underlying many knowledge-gathering systems, at least in their initial phase. Four such systems are identified:
- Darwinian evolution which “educates” the DNA of the species!
- The brain of the individual (just discussed);
- The “brain” of society-as-such (including the publicly-held body of science); and
- The immune system.
An ambiguity: can we have “intention” during a “trial”?
In the Ashby-and-Cybernetics tradition, the word “trial” usually implies random-or-arbitrary, without any deliberate choice. However, amongst non-cyberneticians, “trial” will often imply a deliberate subjective act by some adult human agent (e.g. in a courtroom or laboratory). So that has sometimes led to confusion.
Of course the situation becomes even more confusing if one accepts Ashby’s hierarchical explanation of intelligence, and its implied ability to be deliberate and to creatively design — all based ultimately on non-deliberate actions! The lesson here seems to be that one must simply be careful to clarify the meaning of one’s own words, and indeed the words of others. [Incidentally, it seems that consciousness is not an essential ingredient for intelligence as discussed above.]
Trial and error has a number of features:
- solution-oriented: trial and error makes no attempt to discover why a solution works, merely that it is a solution.
- problem-specific: trial and error makes no attempt to generalise a solution to other problems.
- non-optimal: trial and error is generally an attempt to find a solution, not all solutions, and not the best solution.
- needs little knowledge: trial and error can proceed where there is little or no knowledge of the subject.
It is possible to use trial and error to find all solutions or the best solution, when a testably finite number of possible solutions exist. To find all solutions, one simply makes a note and continues, rather than ending the process, when a solution is found, until all solutions have been tried. To find the best solution, one finds all solutions by the method just described and then comparatively evaluates them based upon some predefined set of criteria, the existence of which is a condition for the possibility of finding a best solution. (Also, when only one solution can exist, as in assembling a jigsaw puzzle, then any solution found is the only solution and so is necessarily the best.)
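The exhaustive variant just described can be sketched over a testably finite space: rather than stopping at the first solution, record every candidate that passes the test, then rank the survivors by a predefined criterion. The toy problem (integer pairs summing to 10) is purely illustrative:

```python
# Exhaustive trial and error: enumerate a finite candidate space, keep
# every solution, then pick the best by a criterion fixed in advance.
from itertools import product

def is_solution(pair) -> bool:
    x, y = pair
    return x + y == 10

candidates = product(range(11), repeat=2)          # the finite space
all_solutions = [p for p in candidates if is_solution(p)]

# "Best" is defined in advance: here, the most balanced pair.
best = min(all_solutions, key=lambda p: abs(p[0] - p[1]))

print(len(all_solutions), "solutions; best:", best)
```

Note that the ranking criterion must exist before the search begins, matching the text’s point that a predefined set of criteria is a precondition for finding a best solution.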
Trial and error has traditionally been the main method of finding new drugs, such as antibiotics. Chemists simply try chemicals at random until they find one with the desired effect. In a more sophisticated version, chemists select a narrow range of chemicals it is thought may have some effect. (The latter case can be alternatively considered as a changing of the problem rather than of the solution strategy: instead of “What chemical will work well as an antibiotic?” the problem in the sophisticated approach is “Which, if any, of the chemicals in this narrow range will work well as an antibiotic?”) The method is used widely in many disciplines, such as polymer technology to find new polymer types or families.
The scientific method can be regarded as containing an element of trial and error in its formulation and testing of hypotheses. Also compare genetic algorithms, simulated annealing and reinforcement learning – all varieties of search which apply the basic idea of trial and error.
Biological evolution is also a form of trial and error. Random mutations and sexual genetic variations can be viewed as trials and poor reproductive fitness, or lack of improved fitness, as the error. Thus after a long time ‘knowledge’ of well-adapted genomes accumulates simply by virtue of them being able to reproduce.
Bogosort, a conceptual sorting algorithm (that is extremely inefficient and impractical), can be viewed as a trial and error approach to sorting a list. However, typical simple examples of bogosort do not track which orders of the list have been tried and may try the same order any number of times, which violates one of the basic principles of trial and error. Trial and error is actually more efficient and practical than bogosort; unlike bogosort, it is guaranteed to halt in finite time on a finite list, and might even be a reasonable way to sort extremely short lists under some conditions.
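The contrast can be made concrete: bogosort may reshuffle into an ordering it has already rejected, while iterating over `itertools.permutations` tries each ordering exactly once and is therefore guaranteed to halt on a finite list:

```python
# Bogosort versus disciplined trial and error on the same sorting task.
import random
from itertools import permutations

def is_sorted(seq) -> bool:
    return all(a <= b for a, b in zip(seq, seq[1:]))

def bogosort(items, max_shuffles=100_000):
    """Shuffle until sorted; repeats orderings, so it only halts with luck
    (the cap here is a safety net, not part of the algorithm)."""
    items = list(items)
    shuffles = 0
    while not is_sorted(items) and shuffles < max_shuffles:
        random.shuffle(items)
        shuffles += 1
    return items, shuffles

def systematic_sort(items):
    """Try every distinct ordering once; bounded by n! trials."""
    for trial, perm in enumerate(permutations(items), start=1):
        if is_sorted(perm):
            return list(perm), trial
    raise AssertionError("unreachable: a finite list has a sorted ordering")

print(systematic_sort([3, 1, 2]))
```

Even the systematic version is still exponential in the list length, which is why trial and error over orderings is only reasonable for extremely short lists.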
Issues with trial and error
Trial and error is usually a last resort for a particular problem, as it has a number of drawbacks. For one, trial and error is tedious and monotonous. It is also very time-consuming; chemical engineers must sift through millions of candidate chemicals before they find one that works. Fortunately, computers are well suited to trial and error: they do not succumb to boredom as humans do, and can potentially perform thousands of trial-and-error attempts in the blink of an eye.