By Kevin Stewart

One definition of Root Cause Analysis (RCA) is:
Root Cause Analysis is any structured process used to understand the causes of past events for the purpose of preventing recurrence.

This basic premise is the reason an RCA is done.

On the surface it always appears to be a simple matter; however, there are always pitfalls and nuances.

One such pitfall that RCA investigators or facilitators face is something I call the “problem is fixed” syndrome. In my work at plants I would run across situations where a problem occurred and a solution was implemented. The particular solution used may or may not have been arrived at by using RCA. In either case the solution is implemented and the “problem is fixed.”

How is this statement validated as being true? Those involved will justify the solution by the simple fact that the problem hasn’t recurred, at least not in the short term, which unfortunately is sometimes the focus of plant management due to pressures, career goals or other reasons. On the surface this may seem difficult to argue with – after all, the problem is fixed – or is it?

In the cases I have been involved with, what has really happened is that the MTBF (Mean Time Between Failures) of the problem is long, say 5 years or greater. I was involved in two investigations where the incident hadn’t happened in the previous 5 years and most likely wouldn’t happen for another 5 years. Investigations had been performed, and solutions were offered and implemented.

When asked about the effectiveness of the solutions, the evidence given was that the incident hadn’t recurred, so the solution must have been effective. On the surface this may appear difficult to argue against, since it is true that the problem hasn’t recurred. However, by looking at the MTBF of the incident, you can point out that because the MTBF is long, the effectiveness of the solution cannot be judged until enough time has passed for the problem to have had a realistic chance to recur. At this particular time, doing nothing, or implementing any other proffered solution, would appear just as effective, since the problem was unlikely to recur yet anyway. You can easily see that if a facility is not careful, it could be “fixing problems” with long MTBFs and claiming success without actually having provided effective solutions. This argument supports a thorough and complete RCA that is based on the cause and effect principle and supported by evidence, to ensure an effective solution is implemented.
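To put rough numbers on this argument, here is a minimal sketch in Python. It rests on assumptions that are mine, not taken from the investigations above: failures are modeled with a constant failure rate (an exponential model), and the function names `prob_no_recurrence` and `quiet_years_needed` are made up for illustration. Under that model, the chance of seeing no failure during an observation window when nothing was actually fixed is exp(-window / MTBF).

```python
import math


def prob_no_recurrence(mtbf_years: float, window_years: float) -> float:
    """Chance of zero failures in the window if NOTHING was fixed.

    Assumption: constant failure rate (exponential model), rate = 1 / MTBF.
    """
    if mtbf_years <= 0 or window_years < 0:
        raise ValueError("MTBF must be positive and the window non-negative")
    return math.exp(-window_years / mtbf_years)


def quiet_years_needed(mtbf_years: float, max_chance: float) -> float:
    """Failure-free years needed before 'no recurrence' by luck alone
    becomes less likely than max_chance (same exponential assumption)."""
    if mtbf_years <= 0 or not 0 < max_chance < 1:
        raise ValueError("MTBF must be positive and 0 < max_chance < 1")
    return -mtbf_years * math.log(max_chance)


if __name__ == "__main__":
    mtbf = 5.0  # years, the order of magnitude mentioned in the text
    for window in (0.5, 1, 2, 5):
        p = prob_no_recurrence(mtbf, window)
        print(f"{window:>4} yr without recurrence: {p:.0%} likely even with no fix")
    years = quiet_years_needed(mtbf, 0.05)
    print(f"Quiet years needed for <5% chance of luck: {years:.1f}")
```

With a 5-year MTBF, this model says a quiet year would be seen about 82% of the time even if no solution had been implemented, and it would take roughly 15 failure-free years before luck alone became an unlikely explanation. Real failure patterns may not follow this model, but the point stands: a short quiet period says very little about a long-MTBF problem.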

In one of the cases above, the solution was to do more frequent maintenance to ensure the problem was identified. While this would have worked for anything with an MTBF longer than the maintenance interval chosen, it would not have worked for anything with an MTBF shorter than that interval. In addition to not working in all cases, it would have increased the cost of maintenance significantly. In this particular case, a little more investigation and a few additional causes on the chart identified that some external damage had been done and not reported, which caused the issue. If they could fix the unreported damage issue, they would have an effective solution that covered the situation that brought this incident on. It would also most likely prevent other incidents that hadn’t even happened yet.

In this case you can see that the offered solution would have appeared to work just fine, and since they did “something,” everyone feels good about the work and the “effective” solution.

The other incident involved someone who had recently returned to work after an extended leave. During an operating situation, this employee correctly followed the procedure posted at the unit, but the posted procedure was incorrect. The solution was to replace the incorrect posted procedure at that operating unit. While replacing the procedure was necessary, they would not know whether it was effective for quite a while. Again, a little more investigation and a few more causes identified that there was no process for replacing posted procedures around the plant when they were modified. If this were fixed, then an effective solution would be in place. You can see that here, too, plant management would be thrilled because an investigation was done, something was put in place and the problem hasn’t happened again. I’m sure you can also see that this situation could very well happen again, either at this unit or at other similar pieces of equipment.

Both of these examples also point out that a good RCA must be done using valid principles and evidence for the causes, and that you must not stop too soon! Stopping too soon is another common mistake in RCA – but that is another tip.

In the meantime, be aware of incidents with long MTBFs and of offered solutions that are based on poor analysis or inappropriate causes.



What are your thoughts on conducting an RCA facilitation / Investigation and how much time have you spent preparing the analysis and implementing solutions?  Do you have a successful tip worth sharing or discussing? We look forward to reading your feedback and perspective via comments below or let’s connect on our LinkedIn Group – ARMS Reliability – Reliability & RCA for further discussion.

