Root Cause Analysis (RCA)

Root Cause Analysis (RCA) is a structured problem-solving methodology used to identify the underlying causes of failures, defects, or adverse events — and to develop corrective actions that prevent recurrence rather than simply addressing symptoms. In maintenance and reliability contexts, RCA investigates why an asset failed, why a process broke down, or why a safety incident occurred, tracing the failure chain from observable symptoms back through contributing factors to the fundamental causes that, if corrected, would prevent the failure from recurring.

RCA is distinguished from basic troubleshooting by its depth of investigation and its scope of corrective action. Troubleshooting identifies what failed and restores operation. RCA identifies why it failed at every causal level — physical cause (what broke), human cause (what decision or action allowed it), and latent cause (what systemic condition made the failure possible) — and develops corrective actions at each level. Without RCA, maintenance organizations repair the same failures repeatedly because the conditions that produced them remain unchanged.

RCA is closely related to but distinct from Root Cause Failure Analysis (RCFA). RCFA is specifically applied to equipment failures — a specialized application of RCA methodology to the physical, operational, and maintenance system causes of asset failures. RCA is the broader methodology applicable to any failure type — equipment, process, safety, quality — while RCFA is the equipment-specific application of the same analytical approach.

Why RCA Matters

Recurring failures are the primary driver of reactive maintenance in most industrial operations. When a bearing fails, is replaced, and fails again six months later under the same conditions, the maintenance team has addressed the physical failure without addressing its cause. Each recurrence consumes labor hours, spare parts, and downtime — a compounding cost that a single effective RCA investigation would eliminate. The financial case for RCA is straightforward: the cost of a thorough investigation is fixed; the cost of recurrence is open-ended.
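
The fixed-versus-open-ended comparison can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and every dollar figure are hypothetical placeholders, not benchmarks from any real operation.

```python
import math


def cumulative_recurrence_cost(cost_per_failure: float, failures: int) -> float:
    """Total cost of repairing the same failure `failures` times."""
    return cost_per_failure * failures


def breakeven_failures(investigation_cost: float, cost_per_failure: float) -> int:
    """Smallest number of avoided recurrences that pays for the investigation."""
    if cost_per_failure <= 0:
        raise ValueError("cost_per_failure must be positive")
    return math.ceil(investigation_cost / cost_per_failure)


# Hypothetical figures: labor, parts, and downtime per bearing failure,
# and the one-time cost of a thorough investigation.
COST_PER_FAILURE = 8_000
INVESTIGATION_COST = 12_000

print(breakeven_failures(INVESTIGATION_COST, COST_PER_FAILURE))
print(cumulative_recurrence_cost(COST_PER_FAILURE, 5))
```

Under these assumed numbers, the investigation pays for itself once it prevents two recurrences, while five uninvestigated recurrences cost 40,000.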

Beyond direct repair cost, recurring failures signal systemic problems that affect broader reliability performance. A lubrication-related bearing failure that recurs across multiple assets of the same type indicates a lubrication program deficiency that is affecting the entire population, not just the individual assets that have failed so far. RCA that identifies and corrects the systemic cause — incorrect lubricant specification, contamination ingress, inadequate interval — eliminates the failure mode across all affected assets simultaneously, multiplying the return on the investigation investment.

RCA also builds organizational reliability knowledge. Documented RCA investigations accumulate into a failure knowledge base — a record of what failed, why, and what corrective actions were effective — that informs maintenance strategy development, training programs, and procurement decisions. Organizations with mature RCA programs make better maintenance investment decisions because they have documented evidence of what failure modes their assets actually experience and what interventions actually work.

How RCA Works in Practice

The RCA Process

A structured RCA follows six steps:

Step 1 — Data Collection: Define and scope the problem. Collect all available data about the failure event — failure timeline, operating conditions at time of failure, maintenance history, inspection records, operator observations, and physical evidence from the failed components. The quality of the RCA is directly determined by the quality and completeness of the data collected. Incomplete data collection produces incomplete causal analysis.

Step 2 — Assessment: Identify all possible causes through structured brainstorming and causal mapping. Cast a wide net — consider equipment design, operating conditions, maintenance practices, lubrication, contamination, human factors, and management system factors. Tools used in this step include fishbone (Ishikawa) diagrams, fault tree analysis, and process mapping. The goal is to ensure that no significant causal pathway is excluded from consideration before the investigation narrows.

Step 3 — Investigate Potential Causes: Test and evaluate the candidate causes identified in Step 2. Use physical evidence, data analysis, and subject matter expertise to determine which potential causes were actually present and which can be eliminated. Narrow the causal list to the root causes — the conditions that, if absent, would have prevented the failure. Distinguish between physical root causes (the component or mechanism that failed), human root causes (the decision or action that allowed the physical failure to develop), and latent root causes (the systemic condition — policy, procedure, training gap, or resource constraint — that made the human cause possible).

Step 4 — Plan Corrective Action: Develop corrective actions for each identified root cause. Corrective actions should address root causes at the appropriate level — a physical cause may require a design change or material upgrade; a human cause may require a procedure change or training; a latent cause may require a management system or resource change. Corrective actions that address only physical causes without addressing human and latent causes allow the same human and systemic conditions to produce new failures in new forms.

Step 5 — Implement the Solution: Execute the corrective actions developed in Step 4. Assign ownership, define completion criteria, and set implementation deadlines. Document the corrective actions in the CMMS work order or corrective action tracking system so that implementation progress is visible and accountable.

Step 6 — Track Progress and Revise: Monitor the corrective actions after implementation to confirm that the failure mode has been eliminated. If the failure recurs, either the corrective actions were not fully implemented, the root cause identification was incomplete, or additional causal factors were not identified in the initial investigation. RCA is an iterative process — verification that corrective actions have produced the expected outcome is as important as the investigation itself.
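
Steps 3 and 4 can be sketched as a small data model: each confirmed root cause is tagged with its level, and a check flags any cause that no corrective action addresses. This is a minimal illustration in Python. The class names, the `unaddressed_causes` helper, and the bearing example are hypothetical and do not come from any standard RCA tool.

```python
from dataclasses import dataclass
from enum import Enum


class CauseLevel(Enum):
    PHYSICAL = "physical"  # the component or mechanism that failed
    HUMAN = "human"        # the decision or action that allowed it
    LATENT = "latent"      # the systemic condition behind the human cause


@dataclass(frozen=True)
class RootCause:
    level: CauseLevel
    description: str


@dataclass(frozen=True)
class CorrectiveAction:
    addresses: RootCause
    description: str


def unaddressed_causes(causes, actions):
    """Return the root causes that no corrective action targets."""
    addressed = {action.addresses for action in actions}
    return [cause for cause in causes if cause not in addressed]


# Hypothetical bearing failure, for illustration only.
film_breakdown = RootCause(CauseLevel.PHYSICAL, "lubricant film breakdown")
missed_interval = RootCause(CauseLevel.HUMAN, "relubrication interval not maintained")
no_route_tracking = RootCause(CauseLevel.LATENT, "lube route completion is not tracked")

causes = [film_breakdown, missed_interval, no_route_tracking]
actions = [CorrectiveAction(film_breakdown, "replace bearing and relubricate")]

for cause in unaddressed_causes(causes, actions):
    print(f"No corrective action for {cause.level.value} cause: {cause.description}")
```

In this example only the physical cause has a corrective action, so the human and latent causes are reported as gaps in the plan.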

Common RCA Tools

5 Whys: The simplest RCA technique — asking “why” repeatedly until the underlying cause is reached. Effective for straightforward failure chains but limited for complex failures with multiple interacting causes. Best used as a starting point for investigation rather than the complete analytical method.

Fishbone (Ishikawa) Diagram: A visual tool that organizes potential causes into categories, commonly Man, Machine, Method, Material, Measurement, and Environment, with some variants adding Management, to ensure structured brainstorming covers all causal domains. Useful for the assessment step to prevent tunnel vision on the most obvious causal pathway.

Fault Tree Analysis (FTA): A top-down logic diagram that maps the causal relationships between a failure event and its contributing causes, using AND and OR logic gates to represent how causes combine to produce the failure. FTA is particularly useful for complex failures with multiple contributing causes that must occur simultaneously or in sequence.
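
The AND/OR gate logic of a fault tree can be sketched in a few lines of Python. This is a minimal illustration rather than a full FTA tool: the `BasicEvent` and `Gate` classes, the `occurs` function, and the example tree are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class BasicEvent:
    name: str


@dataclass
class Gate:
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]]


def occurs(node: Union[Gate, BasicEvent], present: Set[str]) -> bool:
    """Evaluate whether a node occurs, given the basic events that are present."""
    if isinstance(node, BasicEvent):
        return node.name in present
    results = [occurs(child, present) for child in node.inputs]
    if node.kind == "AND":
        return all(results)  # every input must occur
    if node.kind == "OR":
        return any(results)  # any single input is enough
    raise ValueError(f"unknown gate kind: {node.kind!r}")


# Hypothetical tree: the bearing fails if contamination enters
# (damaged seal AND washdown nearby) OR the lubricant viscosity is wrong.
bearing_failure = Gate("OR", [
    Gate("AND", [BasicEvent("seal damaged"), BasicEvent("washdown near housing")]),
    BasicEvent("incorrect viscosity"),
])

print(occurs(bearing_failure, {"seal damaged"}))
print(occurs(bearing_failure, {"seal damaged", "washdown near housing"}))
```

A damaged seal alone does not satisfy the AND gate, so the first call evaluates to `False`; adding the washdown condition makes the top event occur.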

Apollo RCA / PROACT: Structured RCA methodologies developed specifically for industrial reliability applications, providing a systematic causal factor charting approach that distinguishes physical, human, and latent root causes and ensures corrective actions are developed at each level.

RCA in Lubrication Management

Lubrication-related failures are among the most common equipment failure modes in asset-intensive operations — and among the most amenable to RCA-driven elimination. When a bearing fails due to lubrication causes, the physical root cause (lubricant film breakdown, contamination, incorrect viscosity) is typically identifiable from the failure mode and oil analysis results. The human and latent root causes — why the lubrication specification was incorrect, why contamination was not controlled, why the interval was not maintained — reveal the lubrication program gaps that must be corrected to prevent recurrence.

Lubrication engineers conducting RCA on lubrication failures analyze maintenance records, lube route completion history, oil analysis trends, and physical evidence from failed components to develop the complete causal picture. Corrective actions may include lubricant specification changes, contamination control upgrades, interval adjustments, technician retraining, or procedure modifications — addressing the failure at every causal level rather than simply replacing the failed component and resuming the same lubrication practices that produced the failure.

RCA by Industry

Manufacturing: RCA in manufacturing addresses both equipment failures and process quality failures — the same methodology applies to a recurring bearing failure on a production motor and to a recurring product defect in a manufacturing process. Manufacturing operations with Six Sigma and lean manufacturing programs typically have established RCA methodologies (8D, DMAIC) that can be extended to equipment reliability applications. Linking equipment RCA findings to production performance data — connecting failure events to OEE losses — strengthens the business case for corrective action investment by quantifying the production impact of the recurring failure mode.

Mining: RCA in mining is applied to both equipment failures and safety incidents — the same causal analysis discipline serves both reliability and safety management objectives. High-value mobile equipment failures and processing plant failures that affect ore throughput are the primary equipment RCA triggers. Mining operations with mature reliability programs conduct formal RCA on all failures above a defined cost or downtime threshold, accumulating failure knowledge across the asset population to inform maintenance strategy development and fleet management decisions.

Oil and Gas: RCA in oil and gas is mandated for process safety incidents and near-misses under process safety management regulations, which makes structured causal analysis a regulatory compliance requirement as well as a reliability tool. The same RCA methodology applied to safety incidents is applied to significant equipment failures. Oil and gas operations tend to have some of the most developed RCA programs of any industry, because the regulatory requirement for incident investigation has built the organizational capability and cultural acceptance that reliability-focused RCA programs in other industries must develop from scratch.

Crane and Rigging: RCA in crane and rigging operations is triggered by both equipment failures and lifting incidents. Dropped loads, near-misses, and structural findings during inspection all warrant causal investigation. ASME B30 and OSHA standards call for investigation and documentation of crane incidents. RCA findings on crane failures are particularly valuable because the safety consequences of recurring crane failures are severe: the corrective actions identified through RCA directly reduce the risk of incidents that could result in serious injury or fatality.

Common RCA Program Failures

Stopping at the physical cause: The most common RCA failure is an investigation that identifies what broke — the bearing, the seal, the weld — but does not ask why it broke or what systemic conditions made the failure possible. Physical cause identification supports the repair; human and latent cause identification prevents recurrence. An RCA that stops at the physical cause is a repair report, not a root cause analysis.

Corrective actions that address symptoms rather than causes: Replacing the failed component, increasing inspection frequency, and adding the asset to the watch list are symptom responses. They manage the consequences of the failure mode without eliminating it. RCA corrective actions should eliminate the conditions that produced the failure, not increase surveillance of a failure mode that is allowed to continue.

No tracking of corrective action implementation: RCA value is realized only when corrective actions are implemented and sustained. An investigation whose recommendations are not assigned to owners, not tracked to completion, and not verified for effectiveness produces analysis without improvement. Every RCA corrective action should have an owner, a completion date, and a verification step in the CMMS or corrective action tracking system.
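
The owner, completion date, and verification requirements can be expressed as a simple check on each action item. The sketch below is a hypothetical record layout in Python; the `ActionItem` fields and the `tracking_gaps` helper are assumptions for illustration and do not reflect the schema of any particular CMMS.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    due: Optional[date]
    completed_on: Optional[date] = None
    verified_effective: bool = False


def tracking_gaps(item: ActionItem, today: date) -> List[str]:
    """List what is missing before this action can be considered closed out."""
    gaps = []
    if not item.owner:
        gaps.append("no owner")
    if item.due is None:
        gaps.append("no completion date")
    elif item.completed_on is None and today > item.due:
        gaps.append("overdue")
    if item.completed_on is not None and not item.verified_effective:
        gaps.append("not verified")
    return gaps


# Hypothetical action item, for illustration only.
item = ActionItem(
    description="Change lubricant specification on conveyor bearings",
    owner="reliability engineer",
    due=date(2024, 3, 1),
)
print(tracking_gaps(item, today=date(2024, 4, 1)))
```

An action with an owner and a due date that has passed without completion is reported as overdue; once completed, it stays flagged until its effectiveness is verified.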

RCA triggered only by major failures: Organizations that conduct formal RCA only for significant failures miss the opportunity to investigate and eliminate the lower-consequence recurring failures that collectively consume the largest portion of reactive maintenance labor. A structured trigger system — defining the cost, downtime, or recurrence criteria that initiate formal RCA — ensures that investigation resources are applied systematically rather than only in response to the most visible events.

No failure knowledge base: RCA investigations that are completed, filed, and forgotten do not accumulate into organizational reliability knowledge. Documented RCA findings should be searchable and accessible — enabling maintenance engineers to check whether a new failure has been investigated before and what corrective actions were found to be effective. A failure knowledge base built from documented RCA investigations is one of the highest-value reliability knowledge assets an organization can develop.
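
At its simplest, a searchable failure knowledge base is a collection of structured records plus a lookup. The following Python sketch assumes a hypothetical `RcaRecord` layout and a `find_prior_investigations` helper; a real system would more likely query the CMMS or a document store.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RcaRecord:
    asset_type: str
    failure_mode: str
    root_causes: Tuple[str, ...]
    effective_actions: Tuple[str, ...]


def find_prior_investigations(
    records: List[RcaRecord], asset_type: str, failure_mode: str
) -> List[RcaRecord]:
    """Return past RCA records matching the asset type and failure mode."""
    wanted = (asset_type.lower(), failure_mode.lower())
    return [
        r for r in records
        if (r.asset_type.lower(), r.failure_mode.lower()) == wanted
    ]


# Hypothetical records, for illustration only.
knowledge_base = [
    RcaRecord(
        asset_type="Conveyor motor",
        failure_mode="Bearing seizure",
        root_causes=("contamination ingress", "no seal inspection task"),
        effective_actions=("upgraded seals", "added seal check to PM"),
    ),
    RcaRecord(
        asset_type="Gearbox",
        failure_mode="Gear pitting",
        root_causes=("incorrect viscosity",),
        effective_actions=("corrected lubricant specification",),
    ),
]

for record in find_prior_investigations(knowledge_base, "conveyor motor", "bearing seizure"):
    print(record.effective_actions)
```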

RCA vs. Related Methodologies

  • Root Cause Analysis (RCA): The broad methodology for identifying underlying causes of any failure type — equipment, process, safety, quality — and developing corrective actions that prevent recurrence. Applicable across all failure domains.
  • Root Cause Failure Analysis (RCFA): The equipment-specific application of RCA methodology, focused on the physical, operational, and maintenance system causes of asset failures. RCFA is RCA applied to equipment reliability. See: Root Cause Failure Analysis (RCFA).
  • Failure Mode and Effects Analysis (FMEA): A proactive analytical tool that identifies potential failure modes and their consequences before failures occur. RCA is reactive — applied after failure; FMEA is proactive — applied to prevent failure. RCA findings frequently inform FMEA by providing real failure mode data from actual events. See: Failure Mode and Effects Analysis (FMEA).
  • Reliability-Centered Maintenance (RCM): A structured methodology for determining the appropriate maintenance strategy for each asset based on failure modes and consequences. RCA findings inform RCM by providing field evidence of actual failure modes and their causes. See: Reliability-Centered Maintenance (RCM).

Frequently Asked Questions

What is Root Cause Analysis?

Root Cause Analysis (RCA) is a structured problem-solving methodology used to identify the underlying causes of failures, defects, or adverse events and to develop corrective actions that prevent recurrence. In maintenance and reliability contexts, RCA investigates why an asset failed or why a process broke down, tracing the failure chain from observable symptoms back to physical causes (what broke), human causes (what decision or action allowed it), and latent causes (what systemic condition made the failure possible). RCA is distinguished from basic troubleshooting by its depth — it identifies why failures occurred at every causal level, not just what failed.

What is the difference between RCA and RCFA?

RCA is the broad methodology applicable to any failure type — equipment, process, safety, or quality. RCFA is the equipment-specific application of RCA methodology, focused on the physical, operational, and maintenance system causes of asset failures. In practice, the analytical approach is the same — both trace causal chains from observable failure to underlying root causes — but RCFA applies specifically to equipment reliability investigations while RCA is used for the full range of organizational failures including process, safety, and quality events.

When should RCA be conducted?

RCA should be triggered by any failure that meets defined criteria for investigation, typically based on failure cost, downtime impact, safety consequence, or recurrence frequency. Most operations define formal RCA triggers: failures above a defined repair cost, failures that produce more than a defined number of hours of downtime, safety incidents and near-misses, and failures that recur within a defined period of a previous repair. The trigger criteria should be defined in advance so that RCA resources are allocated systematically rather than based on visibility or management attention in the immediate aftermath of the failure event.
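
Predefined trigger criteria of this kind translate naturally into a rule check. The Python sketch below is one possible shape for it. The threshold values, field names, and the `rca_trigger_reasons` function are hypothetical, and each site would set its own limits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RcaTriggers:
    repair_cost_limit: float
    downtime_hours_limit: float
    recurrence_window_days: int


@dataclass(frozen=True)
class FailureEvent:
    repair_cost: float
    downtime_hours: float
    safety_related: bool = False
    # Days since the previous repair of the same failure; None if there was none.
    days_since_last_repair: Optional[int] = None


def rca_trigger_reasons(event: FailureEvent, triggers: RcaTriggers) -> List[str]:
    """Return every trigger criterion the event meets (empty list: no formal RCA)."""
    reasons = []
    if event.repair_cost > triggers.repair_cost_limit:
        reasons.append("repair cost")
    if event.downtime_hours > triggers.downtime_hours_limit:
        reasons.append("downtime")
    if event.safety_related:
        reasons.append("safety incident or near-miss")
    if (event.days_since_last_repair is not None
            and event.days_since_last_repair <= triggers.recurrence_window_days):
        reasons.append("recurrence")
    return reasons


# Hypothetical thresholds and event, for illustration only.
triggers = RcaTriggers(repair_cost_limit=10_000, downtime_hours_limit=8,
                       recurrence_window_days=180)
event = FailureEvent(repair_cost=2_500, downtime_hours=3, days_since_last_repair=120)
print(rca_trigger_reasons(event, triggers))
```

Here a low-cost, short-downtime failure still qualifies for formal RCA because it recurred within the assumed 180-day window.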

How does a CMMS support RCA?

A CMMS supports RCA by providing the maintenance history, work order records, parts consumption data, and failure event timeline that RCA data collection requires. CMMS failure coding — recording the failure mode, failure cause, and corrective action on every work order — builds the failure frequency data that identifies recurring failure modes warranting formal RCA investigation. Corrective actions identified through RCA are tracked in the CMMS as work orders or action items, ensuring implementation is assigned, monitored, and verified. Over time, the CMMS failure history accumulates into the failure knowledge base that informs both ongoing RCA investigations and broader maintenance strategy development.
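
As a rough illustration of how failure coding surfaces RCA candidates, the Python sketch below counts repeated failure modes per asset and counts how many distinct assets share a failure cause. The work order fields and sample rows are hypothetical and do not represent the export format of any specific CMMS.

```python
from collections import Counter

# Hypothetical failure-coded work orders, for illustration only.
work_orders = [
    {"asset_id": "PUMP-01", "failure_mode": "bearing seizure", "failure_cause": "contamination"},
    {"asset_id": "PUMP-01", "failure_mode": "bearing seizure", "failure_cause": "contamination"},
    {"asset_id": "PUMP-02", "failure_mode": "seal leak", "failure_cause": "misalignment"},
    {"asset_id": "FAN-07", "failure_mode": "bearing seizure", "failure_cause": "contamination"},
]


def recurring_failures(orders, min_count=2):
    """Return (asset, failure mode) pairs seen at least `min_count` times."""
    counts = Counter((wo["asset_id"], wo["failure_mode"]) for wo in orders)
    return {key: n for key, n in counts.items() if n >= min_count}


def assets_per_cause(orders):
    """Count the distinct assets affected by each coded failure cause."""
    assets = {}
    for wo in orders:
        assets.setdefault(wo["failure_cause"], set()).add(wo["asset_id"])
    return {cause: len(ids) for cause, ids in assets.items()}


print(recurring_failures(work_orders))
print(assets_per_cause(work_orders))
```

The first result points to a repeat failure on a single asset; the second shows a cause appearing on more than one asset, which may indicate a systemic issue worth a formal investigation.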

Eliminate Recurring Failures With Redlist

Redlist captures failure codes, maintenance history, and corrective action records on every work order — giving reliability teams the failure data foundation needed to conduct effective RCA and track corrective actions through to verified completion.
