Without truly understanding the key elements (and possessing the necessary skills) to conduct a thorough, effective investigation, people run the risk of missing key causal factors of an incident while conducting the actual analysis. This could potentially result in not identifying all possible solutions including those that may be more cost effective, easier to implement, or more effective at preventing recurrence.
Here we outline the 5 key steps of an incident investigation which precede the actual analysis.
1. Secure the incident scene
- Identify and preserve potential evidence
- Control access to the scene
- Document the scene using your ‘Incident Response Template’ (Do you have one?)
2. Select investigation team
- The functions that must be filled are:
- Incident Investigation Lead
- Evidence Gatherer
- Evidence Preservation Coordinator
- Communications Coordinator
- Interview Coordinator
Other important considerations for the selection of team members include:
- Ensure team members have the desirable traits (What are they?)
- The nature of the incident (How does this impact team selection?)
- Choose the right people from inside and outside the organization (How do you decide?)
- Appropriate size of the team (What is the optimum team size?)
*Our Incident Investigator training course examines each of these considerations and more, giving you the knowledge to select investigation team members wisely.
3. Plan the investigation
Upon receiving the initial call:
- Get the preliminary What, When, Where, and Significance
- Determine the status of the incident
- Understand any sensitivities
- If necessary and appropriate, issue a request to isolate the incident area
- Escalate notifications as appropriate
The preliminary briefing:
- Investigation Lead to present a preliminary briefing to the investigating team
- Prepare a team investigation plan
4. Collect the facts supported by evidence
- Be prepared and ready to lead or participate in an investigation at all times to ensure timeliness and thoroughness.
- Have your “Go Bag” ready with useful items to help you secure the scene, take photographs, document the details of the scene and collect physical evidence.
- Collect as much information as possible…analyze later
- Inspect the incident scene
- Gather facts and evidence
- Conduct interviews
*While every step in the Incident Prevention Process is crucial, step 4 requires a particularly distinct set of skills. A lot of time in our Incident Investigator training course is dedicated to learning the techniques and skills required to get this step done right.
5. Establish a timeline
This can be the quickest way to group information from many sources.
- Sticky notes can be used on poster paper to start rearranging information on a timeline. Use different colors for precise data versus imprecise, and list the source of the information on each note.
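The same idea can be roughed out in software. The sketch below sorts collected facts into time order while keeping the source and a precise/imprecise flag on each entry, much like the colored notes. The `TimelineEntry` class, its field names, and the sample entries are all hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    """One note on the timeline (hypothetical structure, not from any tool)."""
    when: datetime   # best-known time of the event
    what: str        # the fact being recorded
    source: str      # where the information came from
    precise: bool    # True for precise data, False for an estimate


def build_timeline(entries):
    """Return the entries ordered by time, earliest first."""
    return sorted(entries, key=lambda e: e.when)


def format_entry(entry):
    """Mark imprecise times with '~', like using a different note color."""
    marker = " " if entry.precise else "~"
    return f"{marker}{entry.when:%H:%M} {entry.what} [source: {entry.source}]"


# Invented sample data for illustration only.
notes = [
    TimelineEntry(datetime(2024, 1, 5, 14, 40), "Area isolated", "supervisor interview", False),
    TimelineEntry(datetime(2024, 1, 5, 14, 12), "Pump alarm raised", "control system log", True),
    TimelineEntry(datetime(2024, 1, 5, 14, 30), "Operator arrived at pump", "operator interview", False),
]

timeline = build_timeline(notes)
for entry in timeline:
    print(format_entry(entry))
```

Keeping the source on every entry makes it easier to go back and firm up an imprecise time later, which is the same reason for writing it on each note.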
After steps 1-5 comes the Root Cause Analysis of the incident, solution implementation and tracking, and reporting back to the organization:
6. Determine the root causes of the incident
7. Identify and recommend solutions to prevent recurrence of similar incidents
8. Implement the solutions
9. Track effectiveness of solutions
10. Communicate findings throughout the organization
*Steps 6-10 are taught in detail at our Root Cause Analysis Facilitator training course.
To learn more about the difference between our Incident Investigator and RCA Facilitator training courses, check out our previous blog article. And of course, if you would like to discuss how to implement or improve your organization’s incident prevention process, please contact us.
Author: Bruce Ballinger
To have a successful implementation and adoption of your new RCA program, it’s crucial to have all the elements of an effective and efficient program clearly identified and agreed upon in advance.
Here’s a high-level look at the elements that will need to be defined:
RCA Goals and Objective Alignment
Define the goals and objectives of the program and ensure they are in alignment with corporate/facility/department goals and objectives
Status of Current RCA Effort
Perform a maturity assessment of existing RCA program to be used as a baseline to measure future improvements
Key Performance Indicators
Identify KPIs with baselines and future targets to be used for measuring progress towards meeting program goals and objectives
Formal RCA Threshold Criteria
Determine which incidents will trigger a formal RCA and estimate how many triggered events may occur in the upcoming year
RCA and Solution Tracking Systems
Identify which internal tracking systems will be used to track status/progress of open RCAs and implemented solutions
Roles and Responsibilities
Identify specifically who will have a role in the RCA effort, including the program sponsor, champion, and RCA facilitators
Determine who will be trained in the chosen RCA methodology and to what level and in what time frame
RCA Effort Oversight and Management
Identify who (or what committees or groups) will be responsible for managing tracking systems, decisions on solution implementation, program modifications over time, and general program performance
Conduct a process mapping exercise to document RCA management from the beginning of a triggered incident to completion of implemented solutions, including their impact on the organization’s goals and objectives.
Human Change Management Plan
Develop a Change Management plan, including a detailed communication plan, that specifically targets those whose job duties will be affected by the RCA effort.
Create a checklist to monitor RCA effort implementation including action items, responsible parties and due dates
We recommend conducting a workshop in order to define each of these crucial elements of your RCA program.
The workshop should be conducted for what we call a “functional unit”, which ideally is no larger than a plant or facility; however, it can be modified to accommodate multiple facilities.
Common elements of a functional unit include:
- A common trigger diagram
- Common KPIs
- The same Program Champion
- Members are interdependent and share responsibility for functional unit performance
By structuring programs to fit within the goals and objectives of the business, or “functional unit”, rather than applying a 'one size fits all' solution, effective and long lasting results can be realized.
Implementing a new RCA program or need to reinvigorate your current one? ARMS can help you create a customized plan for its successful adoption. Contact Us for more information.
Author: David Wilbur, CEO - Vetergy Group
To begin, we must draw the distinction between error and failure. Error describes something that is not correct or a mistake; operationally this would be a wrong decision or action. Failure is the lack of success; operationally this is a measurable output where objectives were not met. Failures audit our operational performance, unfortunately quite often with catastrophic consequences: irredeemable financial impact, loss of equipment, irreversible environmental impact or loss of life. Failure occurs when an unrecognized and uninterrupted error becomes an incident that disrupts operations.
Individual Centered Approach
The traditional approach to achieving reliable human performance centers on individuals and the elimination of error and waste. Human error is the basis of study with the belief that in order to prevent failures we must eliminate human error or the potential for it. Systems are designed to create predictability and reliability through skills training, equipment design, automation, supervision and process controls.
The fundamental assumptions are that people are erratic and unpredictable, that highly trained and experienced operators do not make mistakes and that tightly coupled complex systems with prescribed operations will keep performance within acceptable tolerances to eliminate error and create safety and viability.
This approach can only produce a limited return on investment. As a result, many organizations experience a plateau in performance and seek enhanced methods to improve and close gaps in performance.
An Alternative Philosophy
Error is embraced rather than evaded; sources of error are minimized and programs focus on recognition of error in order to disturb the pathway of error to becoming failure.
Rare exceptions notwithstanding, we must understand that people do not set out to cause failure; rather, their desire is to succeed. People are a component of an integrated, multi-dimensional operating framework. In fact, human beings are the source of resilience in operations. Operators have an irreplaceable capacity to recognize and correct for error and adapt to changes in operating conditions, design variances and unanticipated circumstances.
In this approach, human error is accepted as ubiquitous and cannot be categorically eliminated through engineering, automation or process controls. Error is embraced as a system product rather than an obstacle; sources of error are minimized and programs focus on recognition of error in order to disturb its pathway to becoming failure. System complexity does not assure safety. While system safety components mitigate risk, as systems become more complex, error becomes obscure and difficult to recognize and manage.
Concentrating on individuals creates a culture of protectionism and blame, which worsens the obscurity of error. A better philosophy distributes accountability for variance and promotes a culture of transparency, problem solving and improvement. Leading this shift can only begin at the organizational level through leadership and example.
The Operational Juncture™
In contrast to the individual-centered view, a better approach to creating Operational Resilience is formed around the smallest unit of Human Factors Analysis, called the Operational Juncture™. The Operational Juncture describes the concurrence of people who are given a task to operate tools and equipment, guided by conflicting objectives, within an operational setting that includes physical, technological, and regulatory pressures, and who are provided with information. Within this concurrence, choices are made that lead to outcomes, both desirable and undesirable.
It is within this multidimensional concurrence that we can influence the reliability of human performance. Understanding this concurrence directs us away from blaming individuals and towards determining why the system responded the way it did in order to modify the structure. Starting at this juncture, we can preemptively design operational systems and reactively probe causes of failure. We take a holistic view of accountability, shifting it away from merely the actions of individuals and towards all of the components that make up the Operational Juncture. This is not a wholesale change in the way safety systems function, but an enhanced viewpoint that captures deeper, more meaningful and more effective ways to generate profitable and safe operations.
A practical approach to analyzing human factors in designing and evaluating performance creates both reliability and resilience. Reliability is achieved by exposing system weaknesses and vulnerabilities that can be corrected to enhance reliability in future and adjacent operations. Resilience emerges when we expose and correct deep organizational philosophy and behaviors.
Resilience is born in the organizational culture where individuals feel supported and regarded. Teams operate with deep ownership of organizational values, recognize and respect the tension between productivity and protection, and seek to make right choices. Communication occurs with trust and transparency. Leadership respects and gives careful attention to insight and observation from all levels of the organization. In this culture, people will self-assess, teams will synergize and cooperate to develop new and creative solutions when unanticipated circumstances arise. Individuals will hold each other accountable.
Safety within Operational Resilience is something an organization does, not something that is created or attained. A successful program will deliver a top-down institutionalization of culture that produces a bottom-up emergence of resilience.
These days, many enterprise-level organizations are likely to have similar operations in multiple locations regionally or even worldwide. When a piece of equipment fails or a safety incident occurs at one site, the company investigates the problem and identifies solutions or corrective actions. Naturally, the team wants to capture the lessons learned and share them with other sites that have similar equipment, processes and potential incidents.
Advanced tools like the RealityCharting® software enable teams to share results of an Apollo Root Cause Analysis (RCA) across multiple layers of stakeholders. However, a large multinational enterprise might have dozens of different investigations going on at any given time. At the highest levels, decision-makers don’t necessarily want to see granular information about specific causes at any given plant. They need a top-down perspective of problems and patterns that are affecting the entire organization.
At ARMS Reliability, many of our clients have expressed a similar need. Our solution? Using classification tags to create and apply a consistent taxonomy to all root cause analyses performed for a given organization. Rolled up into a composite report, these tags reveal enterprise-wide trends and issues, allowing management to create action plans for tackling these systemic issues. For example, classification tags might uncover a large number of problems related to a lack of preventative maintenance on a certain type of pump, or a systemic non-compliance with a required safety process.
A classification taxonomy can be scalable and configured to an organization's goals and processes. Think of these classifications like buckets that can be applied at any level of the RCA — e.g., to the root causes or solutions, to individual contributing causes, or simply to the RCA investigation in general.
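As a rough sketch of how such a roll-up might work, the example below counts how many causes carry each tag across several completed RCAs. The data layout, site names, causes, and tag names are all hypothetical. They are not the export format of RealityCharting® or any other tool.

```python
from collections import Counter

# Hypothetical export: each completed RCA lists its causes and the tags that
# were applied to them. All names and tags are invented for illustration.
completed_rcas = [
    {
        "site": "Plant A",
        "causes": [
            {"text": "BEARING NOT LUBRICATED", "tags": ["preventative maintenance"]},
            {"text": "NO LUBRICATION SCHEDULE", "tags": ["preventative maintenance", "procedures"]},
        ],
    },
    {
        "site": "Plant B",
        "causes": [
            {"text": "LOCKOUT STEP SKIPPED", "tags": ["safety process compliance"]},
            {"text": "PUMP SEAL WORN", "tags": ["preventative maintenance"]},
        ],
    },
]


def roll_up_tags(rcas):
    """Count how many causes carry each tag across all RCAs."""
    counts = Counter()
    for rca in rcas:
        for cause in rca["causes"]:
            counts.update(cause["tags"])
    return counts


tag_counts = roll_up_tags(completed_rcas)
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```

The same counting could be done at the level of solutions or whole investigations simply by changing which records carry the tags.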
Keep in mind: The Apollo Root Cause Analysis method is centered around a free-thinking approach to solving problems. That’s what makes the methodology so powerful — it doesn’t lead you down any generic predetermined pathways by asking leading questions or categorizing various causes or effects in any way. At ARMS Reliability, we advocate applying classification tags only after the root cause analysis investigation is completed, so you keep the free-thinking causal analysis and organize it later, for the purpose of rolling the findings up into a deeper systemic view.
Taxonomies can range from 5–20 categories into the hundreds. For example, here we’ve used a human factors taxonomy to tag causes as organizational influences and other people-centric issues.
Reports can provide a summary of how many causes were classified under the various tags.
In another example, an organization bases its taxonomy of reliability issues on the ISO 14224 - Collection and exchange of reliability and maintenance data for equipment.
The taxonomy options are endless. Most organizations we work with have their own unique systems of classifications. It’s really all about codifying the types of information your organization most needs to capture.
If adding classifications to your Root Cause Analyses would be useful for your organization, contact ARMS Reliability. We’d be glad to show you more about what we’re doing with other clients and help you develop a taxonomy that works best for your needs.
How do they end up there?
Many of us have them. The invisible “graveyard” where good intentions (AKA – corrective actions from your root cause analysis investigation) went to die.
We all know that all the time and money spent on a root cause analysis investigation and identifying solutions is worthless if the solutions are not implemented. An investigation can usually be done within a week but solutions can take much longer to implement. They sometimes require the involvement of multiple teams or departments, regulatory agencies, engineering, planning, budgeting, and the list goes on and on. For these reasons, it can be challenging to stay on top of all the corrective actions you identified in your investigation, who’s responsible, and the status of an action item at any given time.
We can offer a few basic tips that will give you a head start in tracking action items effectively:
- Be clear about who is responsible for each corrective action. You don’t want to create the opportunity for people to be able to pass the buck with “I thought Bob was going to do it”.
- Have a mechanism in place by which the implementation of corrective actions can be tracked.
- Give ownership of a solution to an individual, not a group or department.
- Assign a due date for each corrective action.
- Support people in their efforts to implement corrective actions.
- Make sure you follow up on each corrective action – check back with the individual responsible to make sure that progress is being made.
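To make these tips concrete, here is a minimal sketch of a tracking mechanism: each action has one named owner and a due date, and a simple check lists the open actions that need follow-up. The `CorrectiveAction` class, its fields, and the sample actions are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CorrectiveAction:
    """A single corrective action (hypothetical structure for illustration)."""
    description: str
    owner: str        # one named individual, not a group or department
    due: date
    done: bool = False


def overdue(actions, today):
    """Return open actions whose due date has passed."""
    return [a for a in actions if not a.done and a.due < today]


def reminders(actions, today):
    """Build one follow-up line per overdue action, addressed to its owner."""
    return [
        f"{a.owner}: '{a.description}' was due {a.due.isoformat()}"
        for a in overdue(actions, today)
    ]


# Invented sample data.
actions = [
    CorrectiveAction("Add lubrication task to PM schedule", "Bob", date(2024, 3, 1)),
    CorrectiveAction("Replace worn coupling guard", "Alice", date(2024, 3, 15), done=True),
    CorrectiveAction("Update start-up procedure", "Carol", date(2024, 4, 30)),
]

for line in reminders(actions, today=date(2024, 4, 1)):
    print(line)
```

Because every action names exactly one owner, each reminder has an unambiguous recipient.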
But even these “basics” are easier said than done.
In reality, most likely you come out of your root cause analysis investigation with a list of action items for which various people are responsible. Then everyone goes about their regular workdays and may or may not remember to follow through on any additional tasks they were assigned. Even if you have an appointed person to follow up with the action items and make sure they’re on track, it can be difficult to keep up with who has done what. Many managers rely on an Excel spreadsheet to manually track what has and hasn’t been done, due dates, and so forth. But this puts a lot of pressure on one person to keep up with everything – to manually send reminders to folks who haven’t completed their tasks and to enter the information properly when it has been done.
Even when the Excel file has been carefully kept up-to-date, it often lives locally on the manager’s hard drive, and other members of the team don’t have any visibility as to what has and hasn’t been done.
If your RCA program is starting to mature it may be time to consider an enterprise solution to help you better manage all your investigations.
Corrective action tracking inside of an enterprise RCA tool can help you maintain visibility and accountability by tracking the status of action items and assigned solutions. Team members get sent automatic reminders of incomplete or overdue action items and they can easily update the status of their assigned tasks, instantly informing everyone when a task has been completed. You can also create personalized dashboards with reports showing open, completed, or overdue corrective actions.
In addition to effective action tracking, an enterprise RCA solution can more broadly help your company implement and manage an effective overall root cause analysis program.
Here are some of the main features to look for:
Enterprise-wide visibility of your RCA program
Expand the RCA knowledge base and accessibility across an organization.
Leverage information from previous investigations in your current investigation.
Classify and tag files for easy search-ability. Create custom tags incorporating company or industry standards.
Create and share interactive KPI reports
Build reports on your chosen metrics and visually display key performance indicators in tables, charts and graphics.
Specify which reports are most important to you for immediate dashboard display on your home page.
Preserve integrity by securely collecting and storing evidence and important reference files.
Store company corporate standards or reference files such as frequently referenced industry documents in a central location for immediate access when facilitating an RCA.
Communicate with all users through on-page messaging that lets you quickly share information, receive feedback and record comments.
Remember, in order to resurrect your RCA investigation corrective actions, start with the basics that we listed at the beginning of this article. But also keep in mind – the more mature your RCA program becomes, or the larger and more complex your organization, the larger and more complex your problems become. So when you’re ready to alleviate this pain point altogether, consider whether an enterprise RCA solution might be the next step in your program’s development.
“How long should an RCA take?”
This question is similar to “How long is a piece of string?”
I have heard of one plant manager who stipulated a maximum of two hours for an RCA to be conducted in his organisation. Another expects at least “brainstormed” solutions before the conclusion of day one – within 6 or 7 hours. It is not uncommon for a draft report to be required within 48 hours of the RCA.
The following three tips may help you meet tight deadlines when time expectations are short. One advantage of the Apollo Root Cause Analysis methodology is that it is a fast process, but the “driver” has to be on the ball to achieve the desired outcomes – effective solutions.
Tip #1 You Define The Problem
Imagine the RCA has been triggered by an unplanned incident or event which falls into any of the safety, environment, production, quality, equipment failure or similar categories. You have been appointed as the facilitator by a superior/manager who is responding to the particular event. Your superior/manager may understand the trigger mechanism and may well nominate the problem title.
For example, “upper arm laceration”, “ammonia spill”, “production delay” and so forth could be the offering you make to the team as the starting point for the analysis. Typically, as facilitator you will have gathered some of the “facts” from first responder reports, interviews, data sheets, photographs and so on. So a good first step is to draft a problem definition statement, including the significance reflected by the consequences or impacts. The team then has a starting point to commence the analysis, albeit the problem statement may change as more detail is provided.
Ideally, you will have already created a file in RealityCharting™ and the Problem Definition table can be projected onto a screen or even onto the clear wall where your charting will be done with the Post-It™ notes. The team members’ information ought to have been entered and can be confirmed quickly in this display. You might even show the Incident Report format and focus on the disclaimer option you have selected deliberately: Purpose: To prevent recurrence, not place blame.
This preparatory work could save at least 20 minutes of the team members’ time and enable an immediate launch into the analysis phase.
Important: Save yourself hours of re-work and potential embarrassment by saving the file as soon as this first process is complete, if you haven’t already done so, and thereafter on a regular basis. Maintain some form of version control so that the evolution of the chart in the following day/s can be tracked if necessary.
If you are particularly well-resourced the chart development might be recorded on the software simultaneously as the hard copy is created on the wall space. A small team might choose to create the chart directly via the software and a decent projection medium.
Tip #2 Direct The Analysis
It is critical that your initiative in preparing the problem definition is not considered by the team members as disenfranchising them. The analysis step whereby all have an opportunity to contribute should ensure that they feel they have “ownership” of the problem.
To reinforce this, it is advisable to choose a sequence for addressing each member, typically from left to right or vice versa depending on the seating arrangements. This establishes, first, that one person speaks at a time; second, that each and every statement will be documented; and third, that every person has equal opportunity. Your prompt and verbatim recording of each piece of information will provide the discipline required to minimise idle chatter, which wastes time because it distracts focus. When you have a series of “pass” comments from team members because the process has exhausted their immediate knowledge of events, launch the chart creation.
It is worthwhile reminding the team that the information items that have been recorded and posted in the parking area may not appear in their original form on the chart or, in some cases, at all. Because the information gathering casts a wide net to capture as much knowledge as possible regarding what happened, when and why, there will be no particular focus. But because the items come from people with experience and expertise or intimate knowledge of events and circumstances, they have some value. The precise value will be determined by where the information sits in the cause and effect logic that starts at the problem and is connected by “caused by” relationships.
Important: Cause text should be written in CAPITAL LETTERS. It will be easier to read/decipher for the team at the time and perhaps from photographs of the chart later. Similarly using caps in the software itself means that projection of the chart is more effective and the printing of various views is enhanced.
Tip #3 The "How and If" of Creating a RealityChart
Many proponents tap the existing understanding of the event by capturing as many of the action causes as possible. These may arrive via a 5 WHYS process, for example, which starts at the Primary Effect.
Plant Stopped (Problem or Primary Effect)
Why? Feed pump not pumping
Why? Broken Coupling
Why? Motor Bearing Seized
Why? Bearing race Collapsed
The Apollo RCA method requires use of the expression “caused by?” to connect cause and effect relationships. Understanding that there must be at least one action and one condition helps reveal the “hidden” causes and especially the condition causes which do not come to mind initially.
To support this expression and the essential “why”, consider asking “how”. This may be employed initially by the most impartial member of your team who has been engaged specifically because of his/her lack of association with the problem and can sincerely ask the supposedly “dumb” questions. Invariably these questions generate more causes or a more precise arrangement of the existing causes. A “How does that happen exactly?” question can drive the team to take the requisite “baby steps”. This also often exposes differences between “experts” and the resolution of these differences is always illuminating.
The facilitator needs to be aware of the need to softly “challenge” the team’s understanding while ensuring the application of sufficient rigour to generate the best representation of causal relationships. This can be done in a neutral manner by using the “IF” proposition.
Given that every effect requires at least two causes, you can then address the team with the proposition: “If ‘one exists’ and ‘three exists’ (two conditions), then with ‘four added’ (the action), will the effect be ‘eight’ every time?” Using this technique on each causal element will generate the clarity and certainty being sought to understand the causes of the problem. If every “equation” (causal element) in the chart is “real” and the causes themselves are “real” (substantiated by evidence) then the team is well-placed to consider the types of controls it could implement to prevent recurrence of the problem.
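The “at least one action and one condition” rule lends itself to a simple check. The sketch below is a hypothetical, simplified representation of part of a chart and is not how RealityCharting stores or validates causes. It reuses two causes from the 5 WHYS example above; the condition cause and the action/condition labels are assumptions made for illustration.

```python
# Hypothetical representation of part of a cause-and-effect chart: each effect
# maps to its causes, and each cause is labeled "action" or "condition".
# The labels and the condition cause are invented for this example.
chart = {
    "PLANT STOPPED": [("FEED PUMP NOT PUMPING", "action")],
    "FEED PUMP NOT PUMPING": [
        ("COUPLING BROKE", "action"),
        ("NO STANDBY PUMP", "condition"),
    ],
}


def incomplete_effects(chart):
    """Return effects lacking at least one action and one condition cause."""
    problems = {}
    for effect, causes in chart.items():
        kinds = {kind for _, kind in causes}
        missing = sorted({"action", "condition"} - kinds)
        if missing:
            problems[effect] = missing
    return problems


gaps = incomplete_effects(chart)
for effect, missing in gaps.items():
    print(f"{effect}: ask 'caused by?' again, missing {', '.join(missing)}")
```

A check like this only flags where a cause type is absent. Whether each causal relationship is real still has to be tested by the team with the “IF” proposition and supported by evidence.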
The more causes which are revealed the more opportunities the team has to identify possible solutions.
To speed up the RCA process:
Step 1 Facilitator gathers event information and fills out the Problem Definition Statement.
Step 2 Facilitator directs the information gathering, casting a wide net and systematically requesting information from participants.
Step 3 Use the information gathered to build a RealityChart™ with actions based on what happened, then look for other causes such as conditions which may initially be hidden. Use “how” and “if” to help validate that causal relationships are logical.
With a completed chart the solution finding step can begin.
Creating a common reality is a part of the foundation of the Apollo Root Cause Analysis methodology. It is important that language and definitions are consistent among all parties involved. When the Apollo Root Cause Analysis methodology is applied correctly everyone who participates truly understands the value of the problem, what the solutions are and how they will affect the problem.
Establishing a universal reality is a bigger challenge than you might think. No one shares the exact same experiences or interprets information in the exact same manner. Good problem solvers know to take these different perspectives into account as they forge a path to the solutions.
Just as individuals apply their own unique perspective when conducting specific RCAs, companies apply their unique organizational culture when implementing an RCA process. Establishing company standards by defining an RCA champion with clear expectations and implementation procedures in place will keep your organization on the path to RCA success.
Another way to stay at the top of your game is to learn from the experiences of industry peers. Here we take a look at a conversation between Tom, an Engineering Team Lead and RCA champion, and Jack, an expert Apollo Root Cause Analysis methodology instructor.
Tom (Engineering Team Lead):
I have found that sometimes engineers and technicians do not have a real understanding of the meaning of “root cause.” They tend to think of it as a single poor design feature or failure like a “loose nut” or a single cause of the issue or failure. They seemed to be surprised when I recently identified ten root causes on the last job. They were confused and could not get their heads around having ten root causes. They said, “But what was the real single root cause?”
Jack (RCA Instructor):
You are so right. Many people have a preconceived idea that there can only be one root cause. They are driven by this perception to that end. It is quite a limiting concept for those people. They can become quite tunneled in their thinking, offering a close-minded approach to their problems rather than an all-embracing search for knowledge and information that could lead to enlightenment. Some anecdotal information even suggests that this mind frame is taught, and it is quite difficult to rattle their cages and try to shift their paradigms. How do you define root cause?
Tom (Engineering Team Lead):
I define root cause as an opportunity for improvement. A single root cause cannot exist on its own; there must also be at least one condition. Here, I cannot come across as too much of a know-it-all or people roll their eyes, so I need a snappy go-to response that is brief and simple and does not make me come across as a nerd or a geek. That’s just where I work, as there are no formal RCA people in this division – we all share the work on investigations and most are engineering failure investigations that I do of my own volition, and share with my team. In your experience, what are the major setbacks you have seen with people applying the RCA process? I’d like to get better and avoid these mistakes.
Jack (RCA Instructor):
You are doing a great job, persevere. Changing people’s perspectives takes time, especially if you are the only one flying the flag. A major key to success is making sure you are asking enough questions and following a process that demands these questions be asked. Sometimes people take shortcuts to speed up the process…less to think about…less time…must be better! And they can still argue that they have a solution. For simple problems this may even work and they could achieve a satisfactory result, but for complex problems this approach simply doesn’t come close to being comprehensive enough. The lack of knowledge and training in this area now comes back to bite them and their problems invariably don’t go away. Without a solid RCA foundation and process in place, the structures within the company they work for won’t raise any red flags that something may be incorrect or ineffective in any way, so the end product of a subpar RCA (the report) is accepted. If management doesn’t embrace the change then reverting to old acceptable habits is just easier. The key to avoiding these major failures lies in overcoming the resistance to change. Involving your team in the RCA process and sharing your successes with management is a great way to gain support.
Tom (Engineering Team Lead):
I have now gotten into the habit of doing an initial draft RCA live in front of my team. I draft the RCA in a bound book which I have dedicated to this purpose and follow the cause and effect pathways like the software. I feel this approach is more relatable for my team and I am able to get their input quickly. We are usually able to identify half a dozen possible causes in just a few minutes. Afterwards I go to the software and expand on it. Then I formalize and save the RCA in the software, which checks all my work.
Hope you are in Sydney sometime soon, Jack. Your teaching techniques really work and I liked your style. I think in 20 years of taking training your lessons are the ones that have stuck the most with me.
If you have questions or ideas to share and would like to connect with people who have been trained in the Apollo Root Cause Analysis methodology with ARMS Reliability, join our Apollo Root Cause Analysis methodology discussion group on LinkedIn.