6 Steps to Holistic RCA - An Intro

Bob Latino
Mar 22, 2024

Are We Talking the Same Language? | Part 2

REFLECTION ON PART I

In Part I of this series, I focused on my unconventional journey to finding and joining the CHOLearning Community. As the title of the blog indicates, both communities use the terms ‘Reliability’ and ‘RCA’, but the terms mean very different things to each.

Having spent nearly 40 years in the field of Reliability Engineering (where RCA is a critical element of such a holistic system), this confused me. So, in PART I of this series, I outlined what Reliability meant to me to narrow this gap in my understanding.

The purpose of this series is simply to raise awareness in the HOP community of the Reliability and RCA principles I have grown up with. This will help improve the way the two communities communicate with each other.

PART II - 6 Steps to Holistic RCA

STEP 1 – TYPES OF RCA CANDIDATES

STEP 2 – CATEGORIZING & QUANTIFYING RCA CANDIDATES

STEP 3 – ORGANIZING AN RCA TEAM

STEP 4 – DEVELOPING A DATA COLLECTION STRATEGY

PART III

STEP 5 – EVENT RECONSTRUCTION USING A LOGIC TREE

STEP 6 – GERMINATION OF A FAILURE

Let’s explore these steps of a holistic RCA approach, which further demonstrates that RCA is not simply a task, but a complete system.

STEP 1 – TYPES OF RCA CANDIDATES

Before we can get into how to analyze an undesirable outcome, we need to be sure what we are analyzing is an appropriate use of our scarce time and resources. We cannot, and should not, apply the same level of analytic discipline to all ‘failures’ that occur.

When we work in a reactive environment and the cultural norm is to conduct an ‘RCA’ on the failure-of-the-day (i.e., crisis management), we need to recognize that when everything is urgent, nothing is urgent. This is a firefighting culture, and it is both frustrating to work in and very stressful to our lives in general.

While most view RCA as a reactive tool (and, as applied, to a large degree it is), a properly applied RCA is equally capable of being proactive. Let’s explore how this can be:

Most RCAs are reactive because they hit either an internal or an external trigger. An internal trigger is usually a threshold or criterion that must be met (such as described above in Table 1) for a formal RCA to be commissioned. External triggers are usually regulatory in nature. If there is a regulatory violation, a formal RCA is likely to be commissioned by the violating party, by regulatory investigators, or by both.

RCAs are rarely conducted proactively, because there are few (if any) regulatory drivers to require them to be done.

STEP 2 – CATEGORIZING & QUANTIFYING RCA CANDIDATES

As we did in Part I of this series, I will use these graphics to put together a process flow diagram for an effective RCA system. The puzzle will be completed and put together at the end of the series.

As unexpected and undesirable events enter our daily funnel and interrupt our opportunistic (scheduled) plans, we must demonstrate our resilience to deal with ‘life’ as it pertains to our everyday work. Such unexpected events will come from every department: maintenance, operations, EH&S, quality, stores, procurement, and the like.

We need a means of putting these events into proper categories (i.e., buckets) so that they can be properly prioritized and addressed.

While as Reliability professionals we would prefer to work on the unacceptable risks (proactive activities), the unfortunate reality is that we must address the here-and-now (reaction) first. The entire goal of Reliability Engineering is to ‘control the fix’, versus the ‘fix controlling us’. However, as discussed in Part I of this series, we would need a very progressive leadership to understand and support this vision. To this end, let’s work on how to ‘control the fix’ in this blog.

In my career, I have always broken down undesirable outcomes into 3 buckets (see Table 2):

So, as RCA candidates fill our funnel, they will be assessed, quantified, and prioritized within these 3 buckets. The bucket determines the breadth and depth of the analysis needed to address the undesirable outcome.

Certainly, triggered analyses are going to require a more formal RCA. They are delivered to us on a silver platter, usually accompanied by people in ‘suits’.

Next on our bucket list are chronic failures. Remember, these are no longer viewed as failures; they are part of the job. They are woven into the cost of doing business and even get cost-of-living increases as they are accommodated in the budget. So how do we convince leadership that we should dedicate scarce resources to conducting RCAs on these seemingly unimportant instances?

For the purposes of this blog, we will summarize what an Opportunity Analysis is (Table 3). We will just use a couple of line items to make our point.

Table 3: Sample Opportunity Analysis Line Items

Legend: Impact/Occurrence = (LPO + MP + Materials) per occurrence, on average

LPO = Lost Profit Opportunity/Downtime

MP = Manpower

MATERIALS = Materials Used at Cost

Line item #2 is a healthcare case study from our RCA book. ‘Blood Redraws’ refers to when patients like us go to the Emergency Department (ED) of a hospital and they draw blood for whatever reason. A ‘redraw’ is when something didn’t go as planned the first time, and they must do it again. This results in additional lab time, lab tech time, transport time for samples, real estate in the ED being used longer than necessary (and the hospital is not being paid for that time due to the error), and extra supplies to take the sample (syringe, gauze, etc.). This is a real case that helps me make the point about the annual costs of chronic failures.

As you can tell from these chronic failures, what makes them unique is their frequency/yr. That alone bubbles them to the top of the list when a full analysis is sorted from highest to lowest annual loss. We typically find that the Pareto Split applies: 20% or less of the failure modes account for 80% or more of the losses. This helps us make the business case to leadership for why we should be doing RCA on some targeted non-triggered events.
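To make the arithmetic concrete, here is a minimal Python sketch of how such a ranking might be computed. It assumes the annual loss of a line item is its frequency per year multiplied by the Impact/Occurrence from the legend (LPO + MP + Materials). The line items, the dollar figures, and the names `LineItem`, `rank_by_annual_loss`, and `significant_few` are all hypothetical, made up for illustration; they are not the values from Table 3.

```python
from dataclasses import dataclass


@dataclass
class LineItem:
    name: str
    occurrences_per_year: float
    lpo: float        # Lost Profit Opportunity per occurrence
    manpower: float   # manpower cost per occurrence
    materials: float  # materials used, at cost, per occurrence

    @property
    def impact_per_occurrence(self) -> float:
        # Legend: Impact/Occurrence = LPO + MP + Materials
        return self.lpo + self.manpower + self.materials

    @property
    def annual_loss(self) -> float:
        # Assumption: annual loss = frequency/yr x impact per occurrence
        return self.occurrences_per_year * self.impact_per_occurrence


def rank_by_annual_loss(items):
    """Sort line items from highest to lowest annual loss."""
    return sorted(items, key=lambda i: i.annual_loss, reverse=True)


def significant_few(items, share=0.80):
    """Top-ranked items whose cumulative loss first reaches `share` of the total."""
    ranked = rank_by_annual_loss(items)
    threshold = share * sum(i.annual_loss for i in ranked)
    picked, running = [], 0.0
    for item in ranked:
        if running >= threshold:
            break
        picked.append(item)
        running += item.annual_loss
    return picked


# Hypothetical figures, for illustration only.
items = [
    LineItem("Sporadic event C", 1, lpo=25_000, manpower=4_000, materials=1_000),
    LineItem("Chronic event A", 1_000, lpo=300, manpower=100, materials=20),
    LineItem("Chronic event B", 500, lpo=60, manpower=30, materials=10),
    LineItem("Recurring event D", 12, lpo=1_000, manpower=400, materials=100),
    LineItem("Rare event E", 2, lpo=700, manpower=200, materials=100),
]

for item in rank_by_annual_loss(items):
    print(f"{item.name}: {item.annual_loss:,.0f} per year")

print("Significant few:", [i.name for i in significant_few(items)])
```

In this made-up data set, the chronic event with a small impact per occurrence outranks the single expensive event purely because of its frequency, which is the point of the paragraph above.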

STEP 3 – ORGANIZING AN RCA TEAM

Given today’s staffing issues related to finding qualified personnel, getting people into team meetings for problem-solving activities is often difficult. So, as stated in Part I of this series, one of the key Principles of Reliability is ‘Priority’. Leadership must mandate the proper time for such meetings to happen. This brings up the old adage: ‘what interests my boss, fascinates me’ 😊.

Since this is simply a blog, I will not get into the specific roles and responsibilities of team members. Those are outlined in our book, Root Cause Analysis: Improving Performance for Bottom-Line Results, if there is any interest.

However, I wanted to provide some rules of thumb when it comes to forming an RCA team:

STEP 4 – DEVELOPING A DATA COLLECTION STRATEGY

Now that we have targeted a qualified candidate for RCA (we know the potential ROI) and we have the right perspectives on the team, we need to strategize on how to collect preliminary data. For the RCA to be efficient, we need evidence: the more data we collect to prove or disprove our hypotheses, the less we will rely on hearsay and let it fly as fact. In any professional investigative occupation, hearsay is NOT a valid form of evidence.

To help summarize the key data collection categories, we will use the 5-Ps approach*:

Table 5: The 5-P’s Data Collection Strategy

Developing a data collection strategy involves the team creating the list above, based on the situation at hand. It will never include everything we will need, but if we can get 50% to 60% upfront, that’s a huge head start on the overall RCA. Once the list is made, each task must be assigned to someone, along with a due date.
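As a rough illustration of what such a task list could look like when tracked in code, here is a minimal Python sketch. The category names assume the 5-P's are Parts, Position, People, Paper, and Paradigms; since the table is not reproduced in this text, treat them as an assumption. The tasks, owners, dates, and the names `DataTask`, `uncovered_categories`, and `overdue` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Assumed 5-P category names; the table itself is not reproduced in this text.
FIVE_PS = ("Parts", "Position", "People", "Paper", "Paradigms")


@dataclass
class DataTask:
    category: str   # one of FIVE_PS
    item: str       # what needs to be collected
    owner: str      # who is responsible for collecting it
    due: date       # when it is due
    done: bool = False

    def __post_init__(self):
        if self.category not in FIVE_PS:
            raise ValueError(f"unknown 5-P category: {self.category!r}")


def uncovered_categories(tasks):
    """Categories with no task listed yet, in FIVE_PS order."""
    covered = {t.category for t in tasks}
    return [c for c in FIVE_PS if c not in covered]


def overdue(tasks, today):
    """Open tasks whose due date has already passed."""
    return [t for t in tasks if not t.done and t.due < today]


# Hypothetical tasks, for illustration only.
tasks = [
    DataTask("Parts", "Bag and tag the failed component", "Maintenance lead", date(2024, 4, 1)),
    DataTask("People", "Interview the on-shift operators", "RCA facilitator", date(2024, 4, 3)),
    DataTask("Paper", "Pull the last 12 months of work orders", "Planner", date(2024, 4, 5), done=True),
]

print("No tasks yet for:", uncovered_categories(tasks))
print("Overdue:", [t.item for t in overdue(tasks, today=date(2024, 4, 4))])
```

A shared spreadsheet does the same job; the point is only that every item carries a category, an owner, and a due date, so the team can see what has not been covered and what is late.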

In the next blog, PART III, we will delve into the remaining steps of the RCA Process Flow.

STEP 5 – EVENT RECONSTRUCTION USING A LOGIC TREE

Are all RCA approaches created equal (is all RCA equivalent to the 5-Whys, as many would have us think)?
How do mechanistic versus adaptive systems play into a formal RCA System?
What’s the difference between complex versus complicated in a formal RCA?
Is there a correlation between Reliability and Safety?

STEP 6 – GERMINATION OF A FAILURE

I hope our view of ‘Reliability/RCA’ so far resonates and aligns with the traditional principles of the HOP community. I see absolute parallels between these two worlds, and the principles complement each other; they do not compete with or contradict each other, IMHO.

In PART III, we will delve into the details of the two most common RCA methods and tools (5-Whys and Logic Trees). This will be where we can start to interject the HOP and HPI principles to demonstrate the unity in purpose I keep talking about. Then we will round out the RCA ‘system’ and how to measure the effectiveness of the RCA efforts. Thanks for your interest in this topic.

Do you want to engage with Bob on this topic? Join us next week for our March Webinar.

The Role of HOP in a Holistic Reliability Engineering System

March 27, 2024, 11:00 AM – 12:00 PM EDT | Webinar
