Author: Jason Ballentine

Developing a maintenance strategy requires careful consideration and due process. Yet from what I’ve seen, many organizations are making obvious errors right from the start — missteps that can torpedo the success of the strategies they’re trying so hard to put in place.

Without further ado, here are five common maintenance strategy mistakes:

  1. Relying solely on original equipment manufacturer (OEM) or vendor recommendations.

It seems like a good idea — you’d think the people who made or sold the equipment would know best. It’s what they don’t know that can hurt you.

Outside parties don’t know how a piece of equipment functions at your facility. They don’t understand how much this equipment is needed, the cost of failure, whether there’s any redundancy within the system… OEM and vendor maintenance guidelines are geared to maximize the availability and reliability of the machine, but their strategies might not be appropriate for your unique circumstances or needs. As a result, your team could end up over-maintaining the equipment, which can actually create more problems than it solves. The more you mess with a piece of equipment, the more you introduce the possibility of error or failure. Some things, in some situations, are better left alone.

What’s more, OEMs and vendors have a vested interest in selling more spare parts (so they can make more money). That means that their replacement windows might not be accurate or appropriate to your business needs. Rather than relying on calendar-driven replacement, your maintenance strategy might focus more on inspecting the equipment to proactively identify any issues or deterioration, then repairing or replacing only as needed.

It’s fine to use OEM/vendor maintenance guidelines as a starting point. Just make sure you thoroughly review their recommendations to see if they align with your unique needs for the given piece of equipment. Don’t just blindly accept them — make sure they fit first.

  2. Relying heavily on generic task libraries for your maintenance strategy.

This is surprisingly common. Some organizations purchase a very generic set of activities for a piece of equipment or equipment category, and attempt to use them to drive maintenance strategy. But generic libraries are even worse than OEM/vendor recommendations because they are just that — generic. They aren’t written for the specific equipment make and model you have. They might even include tasks that simply don’t apply, such as “inspect the belt” on a pump that uses an entirely different drive mechanism. Once a mechanic attempts to perform one of these generic, ill-suited tasks, he or she stops trusting your overall maintenance strategy. Without credibility and compliance, you might as well not have a strategy at all.

Like OEM and vendor recommendations, generic task libraries can help you get started on a robust maintenance strategy, if (and only if) you carefully examine them first and only use the tasks that make sense for your particular equipment and operational needs.

  3. Failing to include a criticality assessment in your strategy decisions.

If you choose and define tasks without factoring in criticality, you run the risk of wasted effort and faulty maintenance. Think about it: If a piece of equipment is low on the criticality scale, you might be okay to accept a generic strategy and be done with it. But for equipment that’s highly critical to the success of your operations, you need to capture as much detail as possible when selecting and defining tasks. How can you know which is which without fully assessing the relative importance of each piece of equipment (or group of equipment) to the overall performance of your site?
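To illustrate what even a lightweight criticality assessment can look like, here is a minimal sketch using a simple likelihood-times-consequence score on 1–5 scales; the equipment names, scores, and bucket cut-offs are invented examples, not a recommended scoring scheme.

```python
# Minimal sketch of a criticality ranking using likelihood x consequence
# on 1-5 scales. Equipment names and scores are invented examples.
EQUIPMENT = [
    # (name, likelihood of failure 1-5, consequence of failure 1-5)
    ("Feed pump A", 4, 5),
    ("Feed pump B (redundant spare)", 4, 2),
    ("Conveyor drive", 2, 4),
    ("Office HVAC unit", 3, 1),
]

def criticality(likelihood: int, consequence: int) -> int:
    """Simple risk-matrix style score: higher means more critical."""
    return likelihood * consequence

ranked = sorted(EQUIPMENT, key=lambda e: criticality(e[1], e[2]), reverse=True)

for name, likelihood, consequence in ranked:
    score = criticality(likelihood, consequence)
    # Bucket the score so the level of strategy detail can be scaled to criticality.
    level = "high" if score >= 15 else "medium" if score >= 8 else "low"
    print(f"{name:32s} score={score:2d}  criticality={level}")
```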

  4. Developing maintenance strategies in a vacuum.

Sometimes, organizations will hire an outside consultant to develop maintenance strategies and send them off to do it, with no input from or connection with the maintenance team (or the broader parts of the organization). Perhaps they figure, “you’re the expert, you figure it out.” Here’s the problem: For a maintenance strategy to be successful, it must be developed within the big picture. You’ve got to talk to the mechanic who’ll be doing the work, the planner for that work, and the reliability engineer who’ll be responsible for the performance of that equipment, production, or operation. Their input is extremely valuable, and their buy-in is absolutely critical. Without it, even the best maintenance strategy can be met with resistance and non-compliance.

  5. Thinking of maintenance strategy development as a “one-and-done” effort.

For some organizations, the process of developing a maintenance strategy from the ground up seems like something you do once and just move on. But things change — your business needs change, the equipment you have on site changes, personnel changes, and much more. That’s why it’s vitally important to keep your maintenance strategies aligned with the current state of your operations.

In fact, a good maintenance strategy is built with the idea of future revisions in mind. That means the strategy includes clear-cut plans for revisiting and optimizing the strategy periodically. A good strategy is also designed to make those revisions as easy as possible by capturing all of the knowledge that went into your strategy decisions. Don’t just use Microsoft Word or put tasks directly into the system without documenting the basis for the decisions you made. What were your considerations? How did you evaluate them? What ultimately swayed your decision? In the future, if the key factors or circumstances change, you’ll be able to evaluate those decisions more clearly, without having to guess or rely on shaky recall.

If you’ve found yourself making any of these mistakes, don’t despair. Most errors and missteps can be addressed with an optimization project. In fact, ARMS Reliability specializes in helping organizations make the most of their maintenance strategies. Contact us to learn more.


As outlined in our previous blog article, “RCA Program Development: The Key Steps of Designing Your Program”, there are 11 key steps to a successful RCA program. Last month we introduced the first two steps – Defining Goals and Current Status. In this article we’ll break down steps 3 and 4 – Setting KPIs for your RCA program and establishing trigger thresholds to initiate an RCA.

  3. Key Performance Indicators

Key Performance Indicators, or KPIs, are the benchmarks used to measure the success of a program or effort. They can generally be divided into two categories: leading indicators and lagging indicators. Both measure the degree to which progress is being made towards a specific goal. Leading indicators tend to be intermediate objectives that move you towards the ultimate goal; they can be measured over a short period and act as mileposts to gauge how you’re tracking. Lagging indicators are often the goals themselves. If the relationship between the two is correctly defined, then achieving the short-term (leading) indicators virtually guarantees achieving the long-term goals.

To provide perspective when measuring progress using KPIs, a baseline must first be established. Baselines for the selected KPIs should be based on at least three years of historical performance. Once these are established, goals or targets for improvement should be set for a period going forward, say three years. This process should be reviewed at least annually, with baselines and targets adjusted accordingly.
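Below is a minimal sketch of how a baseline and forward targets could be calculated, assuming three years of history for a single lagging indicator (annual unplanned downtime hours) and a hypothetical 10%-per-year improvement goal; all figures are illustrative.

```python
# Minimal sketch: derive a KPI baseline from three years of history and set
# forward targets. All values are illustrative assumptions, not real data.
from statistics import mean

history = {2021: 420.0, 2022: 390.0, 2023: 450.0}   # hypothetical downtime hours/year

baseline = mean(history.values())                    # three-year historical baseline

# Hypothetical goal: reduce downtime by 10% per year over the next three years.
targets = {}
target = baseline
for year in (2024, 2025, 2026):
    target *= 0.90
    targets[year] = round(target, 1)

print(f"Baseline (3-yr mean): {baseline:.1f} h")
for year, value in targets.items():
    print(f"Target {year}: {value} h")
# Review at least annually and adjust the baseline and targets as new data arrives.
```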

  4. Formal RCA Threshold Criteria

An effective incident prevention program will have RCAs performed at two levels: 1) on an informal or ad hoc basis for smaller, nuisance-level problems that may be specific to individuals or departments; and 2) on a formal level where challenges to the organization’s goals exist.

Leaders must communicate the organizational trigger criteria, but they should also encourage and support teams and individuals in setting their own trigger criteria as well. When your employees can solve smaller day-to-day problems more effectively, your organization will realize the benefits of proactive problem solving, because many smaller problems will be rectified before they can manifest themselves into larger organization-level problems.

For RCA to be a core competency at all levels of the organization, and for people to proactively prevent organizational problems, it is important to have clear guidance for formal RCAs. This is the function of the Trigger Criteria diagram. High-level challenges should be formally identified and assigned a threshold that, when exceeded, will automatically trigger a formal RCA. Triggers should generally be leading indicators of some form and derived from specific organizational goals, or KPIs. They are the trip wires that engage the RCA process for finding solutions to problems that are inhibiting organizational goal achievement.

Organizations at higher levels of maturity will most often have triggers for multiple categories including safety, environmental compliance, revenue loss, unbudgeted costs, production loss, and sometimes repeat incidents.
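As a concrete illustration, here is a minimal sketch of how such trigger criteria could be expressed and checked; the categories follow those listed above, but every threshold value and the sample incident are hypothetical.

```python
# Minimal sketch of formal-RCA trigger criteria. Threshold values and the
# sample incident are hypothetical, for illustration only.
TRIGGERS = {
    "safety": {"recordable_injuries": 1},          # any recordable injury
    "environmental": {"reportable_releases": 1},   # any reportable release
    "unbudgeted_cost": {"usd": 50_000},            # cost overrun threshold
    "production_loss": {"hours": 24},              # downtime threshold
}

def triggered_categories(incident: dict) -> list[str]:
    """Return the categories whose thresholds this incident meets or exceeds."""
    hits = []
    for category, limits in TRIGGERS.items():
        for metric, threshold in limits.items():
            if incident.get(category, {}).get(metric, 0) >= threshold:
                hits.append(category)
    return hits

incident = {
    "production_loss": {"hours": 36},
    "unbudgeted_cost": {"usd": 20_000},
}

hits = triggered_categories(incident)
print("Formal RCA required:", bool(hits), "| triggered by:", hits)
```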

For a deeper dive into the topic of trigger thresholds and scaling your RCA investigation, check out our whitepaper “Matching the Scale of Your RCA Investigation to the Significance of the Incident”.

In this blog series, we’ve now covered:

  • Defining Goals and Current Status
  • Setting KPIs and Establishing Trigger Thresholds

But of course, there is more to setting up your RCA program for success. ARMS Reliability’s RCA experts can assist you with designing your complete RCA program or reinvigorating your current one. This of course includes assisting with determining the status of your current RCA effort, walking you through the process of establishing and aligning goals, helping you set KPIs for your program, and establishing trigger thresholds that make sense for your organization. Learn more about our recommended facilitated workshop that covers all 11 of the key steps, and contact us for more information.

 


Author: Jason Ballentine

Many organizations believe that making sound maintenance decisions requires a whole lot of data. It’s a logical assumption — you do need to know things like the number of times an event has occurred, its duration, the number of spare parts needed, and the number of people engaged in addressing the event; plus the impact on the business and the reason why it happened.

A lot of this information is captured in your Computerized Maintenance Management System (CMMS). The more detail you have, the more accurate results you can get from maintenance scenario simulation tools like Isograph’s Availability Workbench™. Unfortunately, your CMMS data may be lacking enough detail to yield optimal results.

It’s enough to make anybody want to throw his or her hands up and put off the decision indefinitely. If you do, you could be making a big mistake.

No matter what, you’re still going to have to make a decision. You have to.

The truth is, you can still do a lot with limited or poor quality data, supported by additional sources of knowledge. Extract any and all information you have available, not just what is in the CMMS. Document what you’ve got, then use it to make a timely decision that’s as informed as possible.

Don’t get caught up in the fact that it’s not perfect data — circumstances in the real world are hardly ever ideal. In fact, as reliability engineers, most of the data we get is related to failure, which is exactly what we’re trying to avoid. Actually, if we are tracking failures, having less data means we are likely doing our jobs well because that means we are experiencing a low number of failures.

The bottom line is: we can’t afford to sit and wait for more data to make decisions, and neither can you.

Gather as much information as you can from all available sources:

CMMS

In an ideal world, this is the master data record of all activities performed.  As discussed previously, that is almost never the case; however, this is an important starting point to reveal where data gaps exist.

Personal experience and expertise

There’s a wealth of information stored within the experience of people who are familiar with any given piece of equipment. Consider holding a facilitated workshop to gather insight on the equipment’s likely performance. Even a series of informal conversations can yield useful opinions and real-world experiences.

The Original Equipment Manufacturer (OEM)

Most OEMs will have documentation you can access, possibly also a user forum you can mine for additional information.

Industry databases, e.g., the Offshore and Onshore Reliability Data Handbook (OREDA) and the Process Equipment Reliability Database (PERD) from the Center for Chemical Process Safety (CCPS)

Some information is available in these databases, but it’s generic — not specific to your unique site or operating context. For example, you can find out how often a certain type of pump fails, but you can’t discover whether that pump is being used on an oil platform, refinery, power station or mine site. Industry data does, however, provide useful estimates on which you can base your calculations and test your assumptions.
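For a sense of how a generic rate can still feed a first-pass calculation, here is a minimal sketch; the failure rate below is a made-up illustration, not an actual OREDA or PERD value.

```python
# Minimal sketch: turn a generic industry failure rate into a first estimate.
# The rate is a hypothetical placeholder, not an actual OREDA/PERD figure.
FAILURES_PER_MILLION_HOURS = 25.0    # assumed generic rate for a pump type
HOURS_PER_YEAR = 8760

rate_per_hour = FAILURES_PER_MILLION_HOURS / 1_000_000
expected_failures_per_year = rate_per_hour * HOURS_PER_YEAR
implied_mttf_years = 1 / expected_failures_per_year

print(f"Expected failures per year: {expected_failures_per_year:.2f}")
print(f"Implied MTTF: {implied_mttf_years:.1f} years")
# Treat this as a starting assumption to test against site experience,
# not as a site-specific prediction.
```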

Capture all these insights in an easily accessible way, then use what you’ve learned to make the best decision currently possible. And be sure to record the basis for your decision for future reference. If you get better data down the road, you can always go back and revise your decisions — after all, most maintenance strategies should remain dynamic by design.

Don’t let a lack of data paralyze you into inaction. Gather what you can, make a decision, see how it works, and repeat. It’s a process of continuous improvement, which given the right framework is simple and efficient.

Availability Workbench™, Reliability Workbench™, FaultTree+™, and Hazop+™ are trademarks of Isograph Limited, the author and owner of products bearing these marks. ARMS Reliability is an authorized distributor of those products, and a trainer in respect of their use.


Author: Jason Ballentine

As with any budget, you’ve only got a certain amount of money to spend on maintenance in the coming year. How do you make better decisions so you can spend that budget wisely and get maximum performance out of your facility?

It is possible to be strategic about allocating funds if you understand the relative risk and value of different approaches. As a result, you can get more bang for the same bucks.

How can you make better budget decisions?

It can be tempting to just “go with your gut” on these things. However, by taking a systematic approach to budget allocation, you’ll make smarter decisions — and, more importantly, you’ll have concrete rationales for why you made them — which can be improved over time. Work to identify the specific pieces of equipment (or types of equipment) that are most critical to your business, then compare the costs and risks of letting that equipment run to failure against the costs and risks of performing proactive maintenance on it. Let’s take a closer look at how you can do that.

4 steps to maximize your maintenance budget

1.  Assign a criticality level for each piece of equipment. Generally, this is going to result in a list of equipment that would cause the most pain — be it financial, production loss, safety, or environmental pain — in the event of failure. Perform a Pareto analysis for maximum detail. 

2.  For your most critical equipment, calculate the ramifications of a reactive/run-to-failure approach.

  • Quantify the relative risk of failure. (You can use the RCMCost™ module of Isograph’s Availability Workbench™ to better understand the risk of different failure modes.)
  • Quantify the costs of failure. Keep in mind that equipment failures can affect multiple aspects of your business in different ways — not just direct hard costs. In every case, consider all possible negative effects, including potential risks.
    • Maintenance: Staff utilization, spare parts logistics, equipment damage, etc.
    • Production Impact: Downtime, shipment delays, stock depletion or out-of-stock, rejected/reworked product, etc.
    • Environmental Health & Safety (EHS) Impact: Injuries, actual/potential releases to the environment, EPA visits/fines, etc.
    • Business Impact: Lost revenue, brand damage, regulatory issues, etc.

For a more detailed explanation of the various potential costs of failure, consult our eBook, Building a Business Case for Maintenance Strategy Optimization.

3.  Next, calculate the impact of a proactive maintenance approach for this equipment.

  • Outline the tasks that would best mitigate existing and potential failure modes
  • Evaluate the cost of performing those tasks, based on the staff time and resources required to complete them.
  • Specify any risks associated with the proactive maintenance tasks. These risks could include the possibility of equipment damage during the maintenance task, induced failures, and/or infant mortality for newly replaced or reinstalled parts.

4. Compare the relative risk costs between these approaches for each maintenance activity. This will show you where to focus your maintenance budget for maximum return.
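To make the comparison in step 4 concrete, here is a minimal sketch of an annualized risk-cost comparison for a single failure mode; every MTBF, interval, and cost figure is an illustrative assumption, not output from RCMCost™ or any real study.

```python
# Minimal sketch: compare annualized risk cost of run-to-failure versus a
# proactive PM program for one failure mode. All figures are illustrative.
HOURS_PER_YEAR = 8760

# Run-to-failure assumptions
mtbf_hours_reactive = 2.5 * HOURS_PER_YEAR       # assumed failure every ~2.5 years
cost_per_failure = 80_000                        # repair + downtime + EHS exposure

# Proactive assumptions
pm_interval_hours = 0.5 * HOURS_PER_YEAR         # PM every 6 months
cost_per_pm = 4_000                              # labour, parts, brief outage
mtbf_hours_with_pm = 8 * HOURS_PER_YEAR          # residual failures still occur

reactive_annual = (HOURS_PER_YEAR / mtbf_hours_reactive) * cost_per_failure
proactive_annual = ((HOURS_PER_YEAR / pm_interval_hours) * cost_per_pm
                    + (HOURS_PER_YEAR / mtbf_hours_with_pm) * cost_per_failure)

print(f"Run-to-failure, annualized risk cost: ${reactive_annual:,.0f}")
print(f"Proactive PM,   annualized cost:      ${proactive_annual:,.0f}")
print("Favor proactive maintenance" if proactive_annual < reactive_annual
      else "Favor run-to-failure")
```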

When is proactive maintenance not the best plan?

For the most part, you’ll want to allocate more of your budget towards proactive maintenance for equipment that has the highest risk and the greatest potential negative impact in the event of failure. Proactive work is more efficient so your team can get more done for the same dollar value. Letting an item run to failure can create an “all hands on deck” scenario under which nothing else gets done, whereas many proactive tasks can be performed quickly and possibly even concurrently.

That said, it’s absolutely true that sometimes run-to-failure is the most appropriate approach for even a critical piece of equipment. For example, a maintenance team might have a scheduled task to replace a component after five years, but that component doesn’t really age: the only known failure mode is getting struck by lightning. No matter how old that component is, the risk is the same. Performing replacement maintenance on this type of component might actually cost more than simply letting it run until it fails. (In these cases, a proactive strategy would focus on minimizing the impact of a failure event by adding redundancy or stocking spares.) But you can’t know that without quantifying the probability and cost of failure.

Side note: Performing this analysis can help you see where your maintenance budget could be reduced without a dramatic negative effect on performance or availability. Alternatively, this analysis can help you demonstrate the likely impact of a forced budget reduction. This can be very helpful in the event of budget pressure coming down from above.   

At ARMS Reliability, we help organizations understand how to forecast, justify and prioritize their maintenance budgets for the best possible chances of success. Contact us to learn more.

Availability Workbench™, Reliability Workbench™, FaultTree+™, and Hazop+™ are trademarks of Isograph Limited, the author and owner of products bearing these marks. ARMS Reliability is an authorised distributor of those products, and a trainer in respect of their use.

In our previous blog article, “RCA Program Development: The Key Steps of Designing Your Program”, we provided a high-level outline of the eleven key elements that need to be defined in order to have an effective root cause analysis program in your organization. Now, in a series of articles, we’ll break down each of those eleven key elements in further detail, expanding on the important considerations that need to be taken into account, starting here with your goals and your current status.

  1. RCA Goals and Objective Alignment

So, the first question is, “What are we trying to achieve with the RCA effort?” The answer from an overarching perspective can be found in the organization’s goals and objectives. Every organization, and individual for that matter, has a set of goals and objectives that are used as yardsticks to measure both short and long term performance. It is critically important that the RCA effort be in complete alignment with the organization’s and individuals’ goals and objectives. We do this by using the goals and objectives to guide us in identifying the Key Performance Indicators (KPIs) of the RCA effort and setting the Threshold Criteria (Trigger Diagram) for determining when a formal RCA must be performed. (Learn more about setting Threshold Criteria.) If the alignment is true, then there will be tangible, measurable improvement in goals and objectives achievement over time.

  2. Status of Current RCA Effort

Every organization will have some form of RCA in practice, whether it is formalized or ad hoc. It is worthwhile spending some time assessing the status, or maturity level, of the existing RCA process. Maturity can be placed in one of four general categories.

  • Level 1: Learning and Development
  • Level 2: Efficient
  • Level 3: Self-Actualizing
  • Level 4: Pro-Active

Level 1, Learning and Development, is where most organizations without a formalized RCA program find themselves. Management recognizes a need for a formal problem-solving method, but the focus is primarily on training. There is little or no structure in place to support the trained facilitators and no well-defined KPIs or threshold criteria guidance. At this stage the organization will usually gain some organizational improvements from the elimination of problems, but in an inefficient “learn as you go” manner.

At Level 2, Efficient, formal RCA triggers and KPIs are in place and are aligned with business goals and objectives in advance of RCA training. This would include clear definitions of RCA roles and responsibilities as well as identification of supporting infrastructure such as RCA status tracking, effectiveness of implemented corrective actions and the like.

In the Self-Actualizing level, the effectiveness of the trained problem-solvers has matured through experience. Thus, their ability to solve organizational problems has resulted in a documented achievement of the program KPIs and resulting improvements to the organization’s goals and objectives. The organization is now in the continuous process of tightening the bandwidth of the KPIs to yield greater return to the bottom line. The RCA facilitators are now highly confident, efficient, and effective in eliminating impediments to achieving goals and objectives.

In the Pro-Active level, your organization has now integrated the RCA process into its core culture. Effective problem elimination is the norm and expected at all levels of the organization. People no longer look to place blame for problems but instead are focused on prevention and elimination. Return on investment, for both monetary and health/safety/environmental issues, is extremely high, acting as both gratification and motivation. RCA has become a core competency within your culture, whereby people are intolerant of solving problems ineffectively the first time and are finding proactive ways to use RCA to prevent problems from occurring in the first place.

There are existing methods or surveys that can be used to determine an organization’s maturity level. Why is this important? Determining your current maturity level draws a line in the sand showing where you started, or took a renewed focus, in this journey of developing your RCA program. You can set goals around where you want to be over a period of time and look back to see how your program has actually evolved.

This article has given you a glimpse into the first two key elements but of course, there is more to setting up your RCA program for success. ARMS Reliability’s RCA experts can assist you with designing your complete RCA program or reinvigorating your current one. This of course includes assisting with determining the status of your current RCA effort and walking you through the process of establishing and aligning goals. Learn more about our recommended facilitated workshop and contact us for more information.


Without truly understanding the key elements (and possessing the necessary skills) to conduct a thorough, effective investigation, people run the risk of missing key causal factors of an incident while conducting the actual analysis. This could potentially result in not identifying all possible solutions, including those that may be more cost effective, easier to implement, or more effective at preventing recurrence.

Here we outline the 5 key steps of an incident investigation which precede the actual analysis.

1. Secure the incident scene

  • Identify and preserve potential evidence
  • Control access to the scene
  • Document the scene using your ‘Incident Response Template’ (Do you have one?)

2. Select investigation team

The functions that must be filled are:

  • Incident Investigation Lead
  • Evidence Gatherer
  • Evidence Preservation Coordinator
  • Communications Coordinator
  • Interview Coordinator

Other important considerations for the selection of team members include:

  • Ensure team members have the desirable traits (What are they?)
  • The nature of the incident (How does this impact team selection?)
  • Choose the right people from inside and outside the organization (How do you decide?)
  • Appropriate size of the team (What is the optimum team size?)

*Our Incident Investigator training course examines each of these considerations and more, giving you the knowledge to select investigation team members wisely.

3. Plan the investigation

Upon receiving the initial call:

  • Get the preliminary What, When, Where, and Significance
  • Determine the status of the incident
  • Understand any sensitivities
  • If necessary and appropriate, issue a request to isolate the incident area
  • Escalate notifications as appropriate

The preliminary briefing:

  • Investigation Lead to present a preliminary briefing to the investigating team
  • Prepare a team investigation plan

4. Collect the facts supported by evidence

Tips:

  • Be prepared and ready to lead or participate in an investigation at all times to ensure timeliness and thoroughness.
  • Have your “Go Bag” ready with useful items to help you secure the scene, take photographs, document the details of the scene and collect physical evidence.
  • Collect as much information as possible…analyze later
  • Inspect the incident scene
  • Gather facts and evidence
  • Conduct interviews

*While every step in the Incident Prevention Process is crucial, step 4 requires a particularly distinct set of skills. A lot of time in our Incident Investigator training course is dedicated to learning the techniques and skills required to get this step done right.

5. Establish a timeline

This can be the quickest way to group information from many sources

Tip:

  • Stickers can be used on poster paper to start rearranging information on a timeline. Use different colors for precise data versus imprecise, and list the source of the information on each note.

After steps 1-5 comes the Root Cause Analysis of the incident, solution implementation and tracking, and reporting back to the organization:

6. Determine the root causes of the incident

7. Identify and recommend solutions to prevent recurrence of similar incidents

8. Implement the solutions

9. Track effectiveness of solutions

10. Communicate findings throughout the organization

*Steps 6-10 are taught in detail at our Root Cause Analysis Facilitator training course.

To learn more on the difference between our Incident Investigator versus RCA Facilitator training courses, check out our previous blog article and of course, if you would like to discuss how to implement or improve your organization’s incident prevention process, please contact us.

Author: Bruce Ballinger

To have a successful implementation and adoption of your new RCA program, it’s crucial to have all the elements of an effective and efficient program clearly identified and agreed upon in advance.

Here’s a high-level look at the elements that will need to be defined:

RCA Goals and Objective Alignment

Define the goals and objectives of the program and assure they are in alignment with corporate/facility/department goals and objectives

Status of Current RCA Effort

Perform a maturity assessment of existing RCA program to be used as a baseline to measure future improvements

Key Performance Indicators

Identify KPIs with baselines and future targets to be used for measuring progress towards meeting program goals and objectives

Formal RCA Threshold Criteria

Determine which incidents will trigger a formal RCA and estimate how many triggered events may occur in the upcoming year

RCA and Solution Tracking Systems

Identify which internal tracking systems will be used to track the status and progress of open RCAs and implemented solutions

Roles and Responsibilities

Identify specifically who will have a role in the RCA effort, including the program sponsor, champion, and RCA facilitators

Training Strategy

Determine who will be trained in the chosen RCA methodology and to what level and in what time frame

RCA Effort Oversight and Management

Identify who (or what committees or groups) will be responsible for managing tracking systems, decisions on solution implementation, program modifications over time, and general program performance

Process Mapping

Conduct a process mapping exercise to document RCA management from the beginning of a triggered incident to the completion of implemented solutions, including their impact on the organization’s goals and objectives.

Human Change Management Plan

Develop a Change Management plan, including a detailed communication plan, that specifically targets those whose job duties will be affected by the RCA effort.

Implementation Tracking

Create a checklist to monitor RCA effort implementation including action items, responsible parties and due dates

We recommend conducting a workshop in order to define each of these crucial elements of your RCA program.

The workshop should be conducted for what we call a “functional unit”, which ideally is no larger than a plant or facility; however, the approach can be modified to accommodate multiple facilities.

Common elements of a functional unit include:

  • A common trigger diagram
  • Common KPIs
  • The same Program Champion
  • Members are interdependent and share responsibility for functional unit performance

By structuring programs to fit within the goals and objectives of the business, or “functional unit”, rather than applying a ‘one size fits all’ solution, effective and long-lasting results can be realized.

Implementing a new RCA program or need to reinvigorate your current one? ARMS can help you create a customized plan for its successful adoption. Contact Us for more information

Author: Scott Gloyna

For any given asset, there are typically dozens of different predictive or preventive maintenance tasks that could be performed; however, selecting the right tasks that contribute effectively to your overall strategy can be tricky. The benefit is the difference between meeting production targets and the alternative: lost revenue, late-night callouts, and added stress from unplanned downtime events.

Step 1: Build out your FMEA (Failure Mode Effects Analysis) for the asset under consideration. 

Make sure you get down to appropriate failure modes in enough detail so that the causes are understood and you can identify the proper maintenance to address each specific failure mode.

Once you’ve made a list of failure modes, it’s time for detailed analysis. If you want to be truly rigorous, perform the following analysis for every potential failure mode. Depending on the criticality of the asset, you can simplify by paring down your list to include only the failure modes that are most frequent or result in significant downtime.
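A minimal sketch of how an FMEA worksheet row could be captured is shown below, assuming a simple dataclass representation; the component, failure modes, and tasks are invented examples, not from any real analysis.

```python
# Minimal sketch of an FMEA worksheet record. The rows are invented examples.
from dataclasses import dataclass, field

@dataclass
class FailureMode:
    component: str
    mode: str                  # how it fails
    cause: str                 # why it fails
    effects: list[str]         # production, safety, environmental consequences
    candidate_tasks: list[str] = field(default_factory=list)

fmea = [
    FailureMode(
        component="Centrifugal pump P-101",
        mode="Mechanical seal leak",
        cause="Seal face wear",
        effects=["Process fluid release", "Unplanned shutdown of the train"],
        candidate_tasks=["Weekly visual leak inspection", "Planned seal replacement"],
    ),
    FailureMode(
        component="Centrifugal pump P-101",
        mode="Bearing seizure",
        cause="Loss of lubrication",
        effects=["Extended downtime", "Secondary shaft damage"],
        candidate_tasks=["Vibration monitoring route", "Periodic lubrication"],
    ),
]

for row in fmea:
    print(f"{row.component}: {row.mode} ({row.cause}) -> {', '.join(row.effects)}")
```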

Step 2: Identify the consequences of each failure mode on your list.

Failure modes can result in multiple types of negative impact. Typically, these failure effects include production costs, safety risks, and environmental impacts. It is your job to identify the effects of each failure mode and quantify them in a manner that allows them to be reviewed against your business’s goals. Often, when I am facilitating a maintenance optimization study, people will say things like “There is no effect when that piece of equipment fails.” If that’s the case, why is that equipment there? All failures have effects; they may just be small or hard to quantify, perhaps because of available workarounds, or because there is a certain amount of time after the failure before an effect is realized.

Step 3: Understand the failure rate for each particular mode.

Gather information on the failure rates from any available industry data and personnel with experience on the asset or a similar asset and installation, as well as any records of past failure events at your facility. This data can be used to evaluate the frequency of failure through a variety of methods — ranging from a simple Mean Time To Failure (MTTF) to a more in-depth review utilizing Weibull distributions.

(Note: The Weibull module of Isograph’s Availability Workbench™ can help you to quickly and easily understand the likelihood of different failure modes occurring.)
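A minimal sketch of the simpler end of that spectrum is below: estimating MTTF and a constant failure rate from a handful of past failure records. The service times are hypothetical, and a Weibull fit would be the more rigorous next step.

```python
# Minimal sketch: estimate MTTF and a constant (exponential) failure rate
# from past failure records. The times to failure are hypothetical.
from math import exp
from statistics import mean

times_to_failure_hours = [4300, 6100, 5200, 7400, 4800]   # hypothetical records

mttf = mean(times_to_failure_hours)
failure_rate = 1 / mttf                        # failures per operating hour

def reliability(t_hours: float) -> float:
    """Probability of surviving t hours under the constant-rate assumption."""
    return exp(-failure_rate * t_hours)

print(f"MTTF: {mttf:.0f} h  (rate = {failure_rate:.2e} failures/h)")
print(f"Chance of surviving a 6-month run (4380 h): {reliability(4380):.0%}")
```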

Step 4: Make a list of possible reactive, planned or inspection tasks to address each failure mode.

Usually, you start by listing the actions you take when that failure mode occurs (reactive maintenance). Then broaden your list to any potential preventive maintenance and/or inspection tasks that could help prevent the failure mode from happening, or reduce the frequency at which it occurs.

  • Reactive tasks
    • Replacement
    • Repair
  • Preventive tasks
    • Daily routines (clean, adjust, lubricate)
    • Periodic overhauls, refurbishments, etc.
    • Planned replacement
  • Inspection tasks
    • Manual (sight, sound, touch)
    • Condition monitoring (vibration, thermography, ultrasonics, x-ray and gamma ray)

Step 5: Gather details about each potential task.

In order to compare and contrast different tasks, you have to understand the requirements of each:

  • What exactly does the task entail? (basic description)
  • How long would the work take?
  • How long would it take to start the work after shutdown/failure?
  • Who would do the work?
  • What labor costs are involved? (the hourly rates of the employees or outside contractors who would perform the task)
  • Would any spare parts be required? If so, how much would they cost?
  • Would you need to rent any specialized equipment? If so, how much would it cost?
  • Do you have to take the equipment offline? If so, for how long?
  • How often would you need to perform this task (frequency)?

A key consideration for inspection tasks only: What is the P-F interval for this failure mode? This is the window between the time you can detect a potential failure (P) and when it actually fails (F) — similar to calculating how long you can drive your car after the fuel light comes on before you actually run out of fuel. Understanding the P-F interval is key in determining the interval for each inspection task.

The P-F interval can vary from hours to years and is specific to the type of inspection, the specific failure mode and even the operating context of the machinery.

It can be hard to determine the P-F interval precisely, but it is very important to make the best approximation you can, because of the impact it has on task selection and frequency.
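A minimal sketch of how the P-F interval can drive inspection frequency is shown below, using the common rule of thumb of inspecting at roughly half the P-F interval or less; the 90-day figure is an invented example.

```python
# Minimal sketch: derive an inspection interval from an estimated P-F interval.
# The 90-day P-F interval is an invented example.
def inspection_interval(pf_interval_days: float, chances_to_detect: int = 2) -> float:
    """Interval giving `chances_to_detect` inspection opportunities inside the P-F window."""
    return pf_interval_days / chances_to_detect

pf_days = 90   # assumed: vibration detects bearing wear ~3 months before functional failure
interval = inspection_interval(pf_days)

print(f"P-F interval: {pf_days} days -> inspect every {interval:.0f} days")
# Shorter intervals give more detection opportunities at higher inspection cost;
# the right balance depends on the consequence of missing the failure.
```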

Step 6: Evaluate the lifetime costs of different maintenance approaches.

Once you understand the cost and frequency of different failure modes, as well as the cost and frequency of various maintenance tasks to address them, you can model the overall lifetime costs of various options.

For example, say you have a failure mode with a moderate business impact — enough to affect production, but not nosedive your profits for the quarter. If that failure mode has a mean time between failures (MTBF) of six months, you might take a very aggressive maintenance approach. On the other hand, if that failure mode only happens once every ten years, your approach would be very different. “Run to Failure” is often a completely legitimate choice, but you need to understand and be able to justify that choice.

These calculations can be done manually, in spreadsheets, or using specialized modeling software such as the RCMCost™ module of Isograph’s Availability Workbench™.
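For a sense of the manual version of that calculation, here is a minimal sketch comparing run-to-failure with planned replacement over a ten-year horizon; the MTBF values, task costs, and failure costs are all illustrative assumptions.

```python
# Minimal sketch: lifetime cost of run-to-failure versus planned replacement
# over a fixed horizon. All MTBFs and costs are illustrative assumptions.
HORIZON_YEARS = 10

def run_to_failure_cost(mtbf_years: float, failure_cost: float) -> float:
    return (HORIZON_YEARS / mtbf_years) * failure_cost

def planned_replacement_cost(interval_years: float, task_cost: float,
                             residual_mtbf_years: float, failure_cost: float) -> float:
    planned = (HORIZON_YEARS / interval_years) * task_cost
    residual = (HORIZON_YEARS / residual_mtbf_years) * failure_cost
    return planned + residual

# Case A: failure mode recurring every 6 months -- aggressive maintenance pays off.
print("6-month MTBF :",
      f"run-to-failure ${run_to_failure_cost(0.5, 30_000):,.0f}",
      "vs planned",
      f"${planned_replacement_cost(0.25, 2_000, 5, 30_000):,.0f}")

# Case B: random (non-ageing) failure every ~10 years -- replacement doesn't
# reduce the rate, so run-to-failure comes out ahead.
print("10-year MTBF :",
      f"run-to-failure ${run_to_failure_cost(10, 30_000):,.0f}",
      "vs planned",
      f"${planned_replacement_cost(5, 2_000, 10, 30_000):,.0f}")
```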

Ultimately, you choose the least expensive maintenance task that provides the best overall business outcome.

Ready to learn more? Gain the skills needed to develop optimized maintenance strategies through our training course: Introduction to Maintenance Strategy Development.


Author: David Wilbur, CEO – Vetergy Group

To begin, we must draw the distinction between error and failure. Error describes something that is not correct, or a mistake; operationally this would be a wrong decision or action. Failure is the lack of success; operationally this is a measurable output where objectives were not met. Failures audit our operational performance, unfortunately quite often with catastrophic consequences: irredeemable financial impact, loss of equipment, irreversible environmental impact, or loss of life. Failure occurs when an unrecognized and uninterrupted error becomes an incident that disrupts operations.

Individual-Centered Approach

The traditional approach to achieving reliable human performance centers on individuals and the elimination of error and waste. Human error is the basis of study with the belief that in order to prevent failures we must eliminate human error or the potential for it. Systems are designed to create predictability and reliability through skills training, equipment design, automation, supervision and process controls.

The fundamental assumptions are that people are erratic and unpredictable, that highly trained and experienced operators do not make mistakes and that tightly coupled complex systems with prescribed operations will keep performance within acceptable tolerances to eliminate error and create safety and viability.

This approach can only produce a limited return on investment. As a result, many organizations experience a plateau in performance and seek enhanced methods to improve and close gaps in performance.

An Alternative Philosophy

Error is embraced rather than evaded; sources of error are minimized and programs focus on recognition of error in order to disturb the pathway of error to becoming failure. 

Rare exceptions notwithstanding, we must understand that people do not set out to cause failure; rather, their desire is to succeed. People are a component of an integrated, multi-dimensional operating framework. In fact, human beings are the spring of resiliency in operations. Operators have an irreplaceable capacity to recognize and correct for error and adapt to changes in operating conditions, design variances and unanticipated circumstances.

In this approach, human error is accepted as ubiquitous and cannot be categorically eliminated through engineering, automation or process controls. Error is embraced as a system product rather than an obstacle; sources of error are minimized and programs focus on recognition of error in order to disturb its pathway to becoming failure. System complexity does not assure safety. While system safety components mitigate risk, as systems become more complex, error becomes obscure and difficult to recognize and manage.

Concentrating on individuals creates a culture of protectionism and blame, which worsens the obscurity of error. A better philosophy distributes accountability for variance and promotes a culture of transparency, problem solving and improvement. Leading this shift can only begin at the organizational level through leadership and example.

The Operational Juncture™

In contrast to the individual-centered view, a better approach to creating Operational Resilience is formed around the smallest unit of Human Factors Analysis, called the Operational Juncture™. The Operational Juncture describes the concurrence of people, given a task, operating tools and equipment, guided by conflicting objectives, within an operational setting that includes physical, technological, and regulatory pressures, and provided with information: the point where choices are made that lead to outcomes, both desirable and undesirable.

It is within this multidimensional concurrence that we can influence the reliability of human performance. Understanding this concurrence directs us away from blaming individuals and towards determining why the system responded the way it did, in order to modify the structure. Starting at this juncture, we can preemptively design operational systems and reactively probe causes of failure. We take a holistic view of accountability, shifting it away from merely the actions of individuals and towards all of the components that make up the Operational Juncture. This is not a wholesale change in the way safety systems function, but an enhanced viewpoint that captures deeper, more meaningful and more effective ways to generate profitable and safe operations.

A practical approach to analyzing human factors in designing and evaluating performance creates both reliability and resilience. Reliability is achieved by exposing system weaknesses and vulnerabilities that can be corrected to enhance reliability in future and adjacent operations. Resilience emerges when we expose and correct deep organizational philosophy and behaviors.

Resilience is born in the organizational culture where individuals feel supported and regarded. Teams operate with deep ownership of organizational values, recognize and respect the tension between productivity and protection, and seek to make right choices. Communication occurs with trust and transparency. Leadership respects and gives careful attention to insight and observation from all levels of the organization. In this culture, people will self-assess, teams will synergize and cooperate to develop new and creative solutions when unanticipated circumstances arise. Individuals will hold each other accountable.

Safety within Operational Resilience is something an organization does, not something that is created or attained. A successful program will deliver a top-down institutionalization of culture that produces a bottom-up emergence of resilience.
