Yearly Archives: 2016


Without truly understanding the key elements of (and possessing the necessary skills for) a thorough, effective investigation, people run the risk of missing key causal factors of an incident while conducting the actual analysis. This can result in failing to identify all possible solutions, including those that may be more cost effective, easier to implement, or more effective at preventing recurrence.

Here we outline the five key steps of an incident investigation that precede the actual analysis.

1. Secure the incident scene

  • Identify and preserve potential evidence
  • Control access to the scene
  • Document the scene using your ‘Incident Response Template’ (Do you have one?)

2. Select investigation team

The functions that must be filled are:

  • Incident Investigation Lead
  • Evidence Gatherer
  • Evidence Preservation Coordinator
  • Communications Coordinator
  • Interview Coordinator

Other important considerations for the selection of team members include:

  • Ensure team members have the desirable traits (What are they?)
  • The nature of the incident (How does this impact team selection?)
  • Choose the right people from inside and outside the organization (How do you decide?)
  • Appropriate size of the team (What is the optimum team size?)

*Our Incident Investigator training course examines each of these considerations and more, giving you the knowledge to select investigation team members wisely.

3. Plan the investigation

Upon receiving the initial call:

  • Get the preliminary What, When, Where, and Significance
  • Determine the status of the incident
  • Understand any sensitivities
  • If necessary and appropriate, issue a request to isolate the incident area
  • Escalate notifications as appropriate

The preliminary briefing:

  • Investigation Lead to present a preliminary briefing to the investigating team
  • Prepare a team investigation plan

4. Collect the facts supported by evidence

Tips:

  • Be prepared and ready to lead or participate in an investigation at all times to ensure timeliness and thoroughness.
  • Have your “Go Bag” ready with useful items to help you secure the scene, take photographs, document the details of the scene and collect physical evidence.
  • Collect as much information as possible…analyze later
  • Inspect the incident scene
  • Gather facts and evidence
  • Conduct interviews

*While every step in the Incident Prevention Process is crucial, step 4 requires a particularly distinct set of skills. A lot of time in our Incident Investigator training course is dedicated to learning the techniques and skills required to get this step done right.

5. Establish a timeline

This can be the quickest way to group information from many sources.

Tip:

  • Stickers can be used on poster paper to start rearranging information on a timeline. Use different colors for precise data versus imprecise, and list the source of the information on each note.

After steps 1-5 come the Root Cause Analysis of the incident, solution implementation and tracking, and reporting back to the organization:

6. Determine the root causes of the incident

7. Identify and recommend solutions to prevent recurrence of similar incidents

8. Implement the solutions

9. Track effectiveness of solutions

10. Communicate findings throughout the organization

*Steps 6-10 are taught in detail at our Root Cause Analysis Facilitator training course.

To learn more on the difference between our Incident Investigator versus RCA Facilitator training courses, check out our previous blog article and of course, if you would like to discuss how to implement or improve your organization’s incident prevention process, please contact us.

Author: Bruce Ballinger

 

To have a successful implementation and adoption of your new RCA program, it’s crucial to have all the elements of an effective and efficient program clearly identified and agreed upon in advance.

Here’s a high-level look at the elements that will need to be defined:

  1. RCA Goals and Objective Alignment
    • Define the goals and objectives of the program and ensure they are in alignment with corporate/facility/department goals and objectives
  2. Status of Current RCA Effort
    • Perform a maturity assessment of the existing RCA program to be used as a baseline for measuring future improvements
  3. Key Performance Indicators
    • Identify KPIs with baselines and future targets to be used for measuring progress towards meeting program goals and objectives
  4. Formal RCA Threshold Criteria
    • Determine which incidents will trigger a formal RCA and estimate how many triggered events may occur in the upcoming year
  5. RCA and Solution Tracking Systems
    • Identify which internal tracking systems will be used to track the status and progress of open RCAs and implemented solutions
  6. Roles and Responsibilities
    • Identify specifically who will have a role in the RCA effort, including the program sponsor, champion, and RCA facilitators
  7. Training Strategy
    • Determine who will be trained in the chosen RCA methodology, to what level, and in what time frame
  8. RCA Effort Oversight and Management
    • Identify who (or what committees or groups) will be responsible for managing tracking systems, decisions on solution implementation, program modifications over time, and general program performance
  9. Process Mapping
    • Map the RCA management process from the start of a triggered incident to completion of implemented solutions, including their impact on the organization's goals and objectives
  10. Human Change Management Plan
    • Develop a change management plan, including a detailed communication plan, that specifically targets those whose job duties will be affected by the RCA effort
  11. Implementation Tracking
    • Create a checklist to monitor RCA effort implementation, including action items, responsible parties, and due dates

We recommend conducting a workshop in order to define each of these crucial elements of your RCA program.

The workshop should be conducted for what we call a “functional unit,” which ideally is no larger than a plant or facility; however, it can be modified to accommodate multiple facilities.

Common elements of a functional unit include:

  • A common trigger diagram
  • Common KPIs
  • The same Program Champion
  • Members who are interdependent and share responsibility for functional unit performance

By structuring programs to fit within the goals and objectives of the business, or “functional unit”, rather than applying a ‘one size fits all’ solution, effective and long-lasting results can be realized.

Implementing a new RCA program or need to reinvigorate your current one? ARMS can help you create a customized plan for its successful adoption. Contact Us for more information.

Author: Scott Gloyna

For any given asset there are typically dozens of different predictive or preventive maintenance tasks that could be performed; however, selecting the right maintenance tasks that contribute effectively to your overall strategy can be tricky. The benefit is the difference between meeting production targets and the alternative of lost revenue, late-night callouts, and added stress from unplanned downtime events.

Step 1: Build out your FMEA (Failure Mode and Effects Analysis) for the asset under consideration.

Make sure you get down to appropriate failure modes in enough detail so that the causes are understood and you can identify the proper maintenance to address each specific failure mode.

Once you’ve made a list of failure modes, then it’s detailed analysis time. If you want to be truly rigorous, perform the following analysis for every potential failure mode. Depending on the criticality of the asset you can simplify by paring down your list to include only the failure modes that are most frequent or result in significant downtime.

Step 2: Identify the consequences of each failure mode on your list.

Failure modes can result in multiple types of negative impact. Typically, these failure effects include production costs, safety risks, and environmental impacts. It is your job to identify the effects of each failure mode and quantify them in a manner that allows them to be reviewed against your business’s goals. Often when I am facilitating a maintenance optimization study, people will say things like “There is no effect when that piece of equipment fails.” If that’s the case, why is that equipment there? All failures have effects; they may just be small or hard to quantify, perhaps because of available workarounds, or maybe there is a certain amount of time after the failure before an effect is realized.
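To make those effects comparable, it helps to express each one in the same units. Here is a minimal sketch in Python (the function name and all figures are illustrative assumptions, not data from any study) that rolls the effects of a single failure mode up into one cost per event:

```python
# Minimal sketch: roll the effects of one failure mode up into a cost per event.
# All names and numbers below are illustrative assumptions.

def cost_per_event(downtime_hours: float,
                   lost_production_per_hour: float,
                   repair_labour_cost: float,
                   parts_cost: float,
                   other_costs: float = 0.0) -> float:
    """Total business impact of a single occurrence of a failure mode."""
    return (downtime_hours * lost_production_per_hour
            + repair_labour_cost + parts_cost + other_costs)

# Example: an 8-hour outage on a line worth $2,000/hour, plus labour and parts.
print(cost_per_event(8, 2_000, 1_500, 3_200))   # -> 20700.0
```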

Step 3: Understand the failure rate for each particular mode.

Gather information on the failure rates from any available industry data and personnel with experience on the asset or a similar asset and installation, as well as any records of past failure events at your facility. This data can be used to evaluate the frequency of failure through a variety of methods — ranging from a simple Mean Time To Failure (MTTF) to a more in-depth review utilizing Weibull distributions.

(Note: The Weibull module of Isograph’s Availability Workbench™ can help you to quickly and easily understand the likelihood of different failure modes occurring.)
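As a rough illustration of the idea (and not the Availability Workbench™ workflow itself), the sketch below computes a simple MTTF from a handful of hypothetical failure ages and, if SciPy happens to be installed, fits a two-parameter Weibull distribution to the same data:

```python
# Illustrative sketch only -- not the Availability Workbench workflow.
import math

failure_ages = [1800, 2400, 2900, 3500, 4100]   # hypothetical failure ages, in operating hours

mttf = sum(failure_ages) / len(failure_ages)     # simple Mean Time To Failure
print(f"MTTF ~ {mttf:.0f} h")

try:
    from scipy.stats import weibull_min          # optional: only if SciPy is installed
    shape, _, scale = weibull_min.fit(failure_ages, floc=0)
    mean_life = scale * math.gamma(1 + 1 / shape)
    # shape > 1 suggests wear-out; shape close to 1 suggests random (exponential) failures
    print(f"Weibull shape = {shape:.2f}, scale = {scale:.0f} h, mean life = {mean_life:.0f} h")
except ImportError:
    pass
```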

Step 4: Make a list of possible reactive, planned or inspection tasks to address each failure mode.

Usually, you start by listing the actions you take when that failure mode occurs (reactive maintenance). Then broaden your list to any potential preventive maintenance and/or inspection tasks that could help prevent the failure mode from happening, or reduce the frequency at which it occurs.

  • Reactive tasks
    • Replacement
    • Repair
  • Preventive tasks
    • Daily routines (clean, adjust, lubricate)
    • Periodic overhauls, refurbishments, etc.
    • Planned replacement
  • Inspection tasks
    • Manual (sight, sound, touch)
    • Condition monitoring (vibration, thermography, ultrasonics, x-ray and gamma ray)

Step 5: Gather details about each potential task.

In order to compare and contrast different tasks, you have to understand the requirements of each (one simple way to record the answers is sketched after this list):

  • What exactly does the task entail? (basic description)
  • How long would the work take?
  • How long would it take to start the work after shutdown/failure?
  • Who would do the work?
  • What labor costs are involved? (the hourly rates of the employees or outside contractors who would perform the task)
  • Would any spare parts be required? If so, how much would they cost?
  • Would you need to rent any specialized equipment? If so, how much would it cost?
  • Do you have to take the equipment offline? If so, for how long?
  • How often would you need to perform this task (frequency)?

A key consideration for inspection tasks only: What is the P-F interval for this failure mode? This is the window between the time you can detect a potential failure (P) and when it actually fails (F) — similar to calculating how long you can drive your car after the fuel light comes on before you actually run out of fuel. Understanding the P-F interval is key in determining the interval for each inspection task.

The P-F interval can vary from hours to years and is specific to the type of inspection, the specific failure mode and even the operating context of the machinery.

It can be hard to determine the P-F interval precisely, but it is very important to make the best approximation possible because of the impact it has on task selection and frequency.
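A common rule of thumb, which should be treated as an assumption to validate for your own operating context rather than a universal law, is to set the inspection interval so that at least two inspections fall inside the estimated P-F interval; that way a single missed or inconclusive inspection still leaves a chance to catch the potential failure:

```python
# Rule-of-thumb sketch: fit at least two inspections inside the P-F interval.
def inspection_interval(pf_interval_hours: float, inspections_in_window: int = 2) -> float:
    return pf_interval_hours / inspections_in_window

# Example: vibration analysis with an estimated 720-hour P-F interval
print(inspection_interval(720))   # -> 360.0 hours between inspections
```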

Step 6: Evaluate the lifetime costs of different maintenance approaches.

Once you understand the cost and frequency of different failure modes, as well as the cost and frequency of various maintenance tasks to address them, you can model the overall lifetime costs of various options.

For example, say you have a failure mode with a moderate business impact — enough to affect production, but not nosedive your profits for the quarter. If that failure mode has a mean time between failures (MTBF) of six months, you might take a very aggressive maintenance approach. On the other hand, if that failure mode only happens once every ten years, your approach would be very different. “Run to Failure” is often a completely legitimate choice, but you need to understand and be able to justify that choice.

These calculations can be done manually, in spreadsheets, or using specialized modeling software such as the RCMCost™ module of Isograph’s Availability Workbench™.

Ultimately you try to choose the least expensive maintenance task that provides the best overall business outcome.
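As a minimal illustration of that comparison (all figures below are hypothetical, and the averaged-rate model is far simpler than what RCMCost™ actually does), the sketch compares the expected annual cost of run-to-failure against planned replacement for a single failure mode:

```python
# Hypothetical comparison of two strategies for one failure mode.
# Assumes a simple average-rate model; dedicated tools use far richer models.

HOURS_PER_YEAR = 8760

def run_to_failure_annual_cost(mtbf_hours, repair_cost, downtime_hours, downtime_cost_per_hour):
    failures_per_year = HOURS_PER_YEAR / mtbf_hours
    return failures_per_year * (repair_cost + downtime_hours * downtime_cost_per_hour)

def planned_replacement_annual_cost(interval_hours, task_cost, residual_failures_per_year,
                                    repair_cost, downtime_hours, downtime_cost_per_hour):
    planned = (HOURS_PER_YEAR / interval_hours) * task_cost
    unplanned = residual_failures_per_year * (repair_cost + downtime_hours * downtime_cost_per_hour)
    return planned + unplanned

# Failure mode with a 6-month MTBF (~4380 h), $5k repair, 12 h downtime at $2k/h
print(run_to_failure_annual_cost(4380, 5_000, 12, 2_000))                     # ~ $58k/yr
# Planned replacement every 3 months at $2k, assumed to cut unplanned failures to 0.2/yr
print(planned_replacement_annual_cost(2190, 2_000, 0.2, 5_000, 12, 2_000))    # ~ $13.8k/yr
```

Whichever option shows the lower total cost at an acceptable level of risk becomes the candidate strategy for that failure mode.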

 Ready to learn more? Gain the skills needed to develop optimized maintenance strategies through our training course: Introduction to Maintenance Strategy Development


These days, many enterprise-level organizations are likely to have similar operations in multiple locations, regionally or even worldwide. When a piece of equipment fails or a safety incident occurs at one site, the company investigates the problem and identifies solutions or corrective actions. Naturally, the team wants to capture the lessons learned and share them with other sites that have similar equipment, processes, and potential incidents.

Advanced tools like the RealityCharting® software allow teams to share the results of an Apollo Root Cause Analysis (RCA) with multiple members of their team. However, a multinational company could have dozens of simultaneous investigations. At the most senior levels, decision-makers do not necessarily want to see detailed information about specific causes at one plant. They need a more general view of the problems and patterns affecting the entire organization.

At ARMS Reliability, many of our clients have expressed a similar need. Our solution? Use classification tags to create and apply a consistent taxonomy across all root cause analyses performed for a given organization. In a composite report, these tags reveal company-wide trends and issues, allowing management to create action plans to address these systemic problems. For example, classification tags can reveal a large number of problems related to a lack of preventive maintenance on a certain type of pump, or a systemic non-compliance with a required safety process.

A classification taxonomy can be shaped and configured to an organization’s goals and processes. Think of these as classifications that can be applied at any level of the RCA, for example to root causes or solutions, to individual contributing causes, or simply to the RCA investigation in general.

Keep in mind: the Apollo Root Cause Analysis method is centered on a free-thinking approach to problem solving. That is what makes the methodology so powerful: it does not lead you down any predetermined generic path by asking leading questions or categorizing the various causes or effects in any way. At ARMS Reliability, we advocate applying classification tags once the root cause analysis investigation is complete, so the free-thinking causal analysis is preserved and organized afterwards, in order to obtain results with a deeper systemic view.

Taxonomies can range from 5 to 20 categories up into the hundreds. For example, here we have used a human factors taxonomy to tag causes such as organizational influences and other person-related issues.

screenshot 1.png

Reports can provide a summary of how many causes were classified under the various tags:

screenshot 2v2.jpg

In another example, an organization bases its taxonomy of reliability issues on ISO 14224 – Collection and exchange of reliability and maintenance data for equipment.

The taxonomy options are endless. Most organizations we work with have their own classification systems. It is really all about codifying the types of information your organization most needs to capture.

If you think adding classifications to your root cause analyses would be useful for your organization, contact ARMS Reliability. We would be glad to show you more about what we are doing with other clients and help you develop a taxonomy that works for your needs.

Author: David Wilbur, CEO – Vetergy Group

To begin, we must draw the distinction between error and failure. Error describes something that is not correct, or a mistake; operationally, this would be a wrong decision or action. Failure is the lack of success; operationally, this is a measurable output where objectives were not met. Failures audit our operational performance, unfortunately quite often with catastrophic consequences: irredeemable financial impact, loss of equipment, irreversible environmental impact, or loss of life. Failure occurs when an unrecognized and uninterrupted error becomes an incident that disrupts operations.

Individual Centered Approach

The traditional approach to achieving reliable human performance centers on individuals and the elimination of error and waste. Human error is the basis of study with the belief that in order to prevent failures we must eliminate human error or the potential for it. Systems are designed to create predictability and reliability through skills training, equipment design, automation, supervision and process controls.

The fundamental assumptions are that people are erratic and unpredictable, that highly trained and experienced operators do not make mistakes and that tightly coupled complex systems with prescribed operations will keep performance within acceptable tolerances to eliminate error and create safety and viability.

This approach can only produce a limited return on investment. As a result, many organizations experience a plateau in performance and seek enhanced methods to improve and close gaps in performance.

An Alternative Philosophy

Error is embraced rather than evaded; sources of error are minimized and programs focus on recognition of error in order to disturb the pathway of error to becoming failure. 

Slight exceptions notwithstanding, we must understand that people do not set out to cause failure; rather, their desire is to succeed. People are a component of an integrated, multi-dimensional operating framework. In fact, human beings are the spring of resiliency in operations. Operators have an irreplaceable capacity to recognize and correct for error and adapt to changes in operating conditions, design variances and unanticipated circumstances.

In this approach, human error is accepted as ubiquitous and cannot be categorically eliminated through engineering, automation or process controls. Error is embraced as a system product rather than an obstacle; sources of error are minimized and programs focus on recognition of error in order to disturb its pathway to becoming failure. System complexity does not assure safety. While system safety components mitigate risk, as systems become more complex, error becomes obscure and difficult to recognize and manage.

Concentrating on individuals creates a culture of protectionism and blame, which worsens the obscurity of error. A better philosophy distributes accountability for variance and promotes a culture of transparency, problem solving and improvement. Leading this shift can only begin at the organizational level through leadership and example.

The Operational Juncture™

In contrast to the individual-centered view, a better approach to creating Operational Resilience is formed around the smallest unit of Human Factors Analysis, called the Operational Juncture™. The Operational Juncture describes the concurrence of people, given a task to operate tools and equipment, guided by conflicting objectives within an operational setting (including physical, technological, and regulatory pressures), and provided with information, where choices are made that lead to outcomes both desirable and undesirable.

It is within this multidimensional concurrence that we can influence the reliability of human performance. Understanding this concurrence directs us away from blaming individuals and towards determining why the system responded the way it did, in order to modify the structure. Starting at this juncture, we can preemptively design operational systems and reactively probe causes of failure. We shift the assignment of accountability away from merely the actions of individuals and towards all of the components that make up the Operational Juncture. This is not a wholesale change in the way safety systems function, but an enhanced viewpoint that captures deeper, more meaningful and more effective ways to generate profitable and safe operations.

A practical approach to analyzing human factors in designing and evaluating performance creates both reliability and resilience. Reliability is achieved by exposing system weaknesses and vulnerabilities that can be corrected to enhance reliability in future and adjacent operations. Resilience emerges when we expose and correct deep organizational philosophy and behaviors.

Resilience is born in the organizational culture where individuals feel supported and regarded. Teams operate with deep ownership of organizational values, recognize and respect the tension between productivity and protection, and seek to make right choices. Communication occurs with trust and transparency. Leadership respects and gives careful attention to insight and observation from all levels of the organization. In this culture, people will self-assess, teams will synergize and cooperate to develop new and creative solutions when unanticipated circumstances arise. Individuals will hold each other accountable.

Safety within Operational Resilience is something an organization does, not something that is created or attained. A successful program will deliver a top-down institutionalization of culture that produces a bottom-up emergence of resilience.


These days, many enterprise-level organizations are likely to have similar operations in multiple locations regionally or even worldwide. When a piece of equipment fails or a safety incident occurs at one site, the company investigates the problem and identifies solutions or corrective actions. Naturally, the team wants to capture the lessons learned and share them with other sites that have similar equipment, processes and potential incidents.

Advanced tools like the RealityCharting® software enable teams to share results of an Apollo Root Cause Analysis (RCA) across multiple layers of stakeholders. However, a large multinational enterprise might have dozens of different investigations going on at any given time. At the highest levels, decision-makers don’t necessarily want to see granular information about specific causes at any given plant. They need a top-down perspective of problems and patterns that are affecting the entire organization.

At ARMS Reliability, many of our clients have expressed a similar need. Our solution? Using classification tags to create and apply a consistent taxonomy to all root cause analyses performed for a given organization. Rolled up into a composite report, these tags reveal enterprise-wide trends and issues, allowing management to create action plans for tackling these systemic issues. For example, classification tags might uncover a large number of problems related to a lack of preventative maintenance on a certain type of pump, or a systemic non-compliance with a required safety process.

A classification taxonomy can be scalable and configured to an organization’s goals and processes. Think of these classifications like buckets that can be applied at any level of the RCA — e.g., to the root causes or solutions, to individual contributing causes, or simply to the RCA investigation in general.

Keep in mind: The Apollo Root Cause Analysis method is centered around a free-thinking approach to solving problems. That’s what makes the methodology so powerful — it doesn’t lead you down any generic predetermined pathways by asking leading questions or categorizing various causes or effects in any way. At ARMS Reliability, we advocate applying classification tags only after the root cause analysis investigation is completed, so you keep the free-thinking causal analysis and organize it later, for the purpose of rolling the findings up into a deeper systemic view.

Taxonomies can range from 5–20 categories into the hundreds. For example, here we’ve used a human factors taxonomy to tag causes as organizational influences and other people-centric issues.

screenshot 1.png


Reports can provide a summary of how many causes were classified under the various tags:

screenshot 2v2.jpg


In another example, an organization bases its taxonomy of reliability issues on the ISO 14224 – Collection and exchange of reliability and maintenance data for equipment.

 

screenshot 3v3.jpg


The taxonomy options are endless. Most organizations we work with have their own unique systems of classifications. It’s really all about codifying the types of information your organization most needs to capture.
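As a rough sketch of the roll-up idea (a hypothetical data layout, not the RealityCharting® data model), tagged causes gathered from many RCAs can be counted into a composite summary:

```python
# Hypothetical tag roll-up across many RCAs -- not the RealityCharting data model.
from collections import Counter

# Each entry: (rca_id, cause, [classification tags applied after the analysis])
tagged_causes = [
    ("RCA-014", "PM on pump P-101 overdue",       ["Preventive maintenance lapse"]),
    ("RCA-021", "Permit step skipped",             ["Safety process non-compliance"]),
    ("RCA-032", "Seal replaced past service life", ["Preventive maintenance lapse"]),
    ("RCA-040", "Lock-out not verified",           ["Safety process non-compliance"]),
]

summary = Counter(tag for _, _, tags in tagged_causes for tag in tags)
for tag, count in summary.most_common():
    print(f"{tag}: {count}")
```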

If adding classifications to your Root Cause Analyses would be useful for your organization, contact ARMS Reliability. We’d be glad to show you more about what we’re doing with other clients and help you develop a taxonomy that works best for your needs.

Author: Dan DeGrendel

Regardless of industry or discipline, we can probably all agree that routine maintenance — sometimes referred to as preventative, predictive, or even scheduled maintenance — is a good thing. Unfortunately, through the years I’ve found that most companies don’t have the robust strategies they need.

Typical issues and the kinds of trouble they can create:

1. Lack of structure and schedule

In many cases, routine tasks are just entries on a to-do list of work that needs to be performed — with nothing within the work pack to drive compliance. In particular, a list of tasks beginning with “Check” that gives no guidance on acceptable limits can have limited value. The result can be a “tick and flick” style routine maintenance program that fails to identify impending failure warning conditions.

2. Similar assets, similar duty, different strategies

Oftentimes, maintenance views each piece of equipment as a standalone object, with its own unique maintenance strategy. As a result, one organization could have dozens of maintenance strategies to manage, eating up time and resources. In extreme cases, this can lead to similar assets having completely different recorded failure mechanisms and routine tasks, worded differently, grouped differently and structured differently within the CMMS.

3. Operational focus 

Operations might be reluctant to take equipment out of service for maintenance, so they delay or even cancel the appropriate scheduled maintenance. At times this decision is driven by the thought that the repair activity is the same whether performed in a planned or reactive manner. But experience tells us that without maintenance, the risk is even longer downtime and more expensive repairs when something fails.

4. Reactive routines

Sometimes, when an organization has been burned in the past by a preventable failure, they overcompensate by performing maintenance tasks more often than necessary. The problem is, the team might be wasting time doing unnecessary work — worse still, it might even increase the likelihood of future problems, simply because unnecessary intrusive maintenance can increase the risk of failure.

5. Over-reliance on past experience 

There’s no substitute for direct experience and expertise. But when tasks and frequencies are based solely on opinions and “what we’ve always done” — rather than sound assumptions — maintenance teams can run into trouble through either over- or under-maintaining. Without documented assumptions, business decisions are based on little more than a hunch. “Doing what we’ve always done” might not be the right approach for the current equipment, with the current duty, in the current business environment (and it certainly makes future review difficult).

6. Failure to address infrequent but high consequence failures 

Naturally, routine tasks account for the most common failure modes. They should, however, also address failures that happen less frequently but may have a significant impact on the business. Developing a maintenance plan that addresses both types prevents unnecessary risk. For example, a bearing may be set up on a lubrication schedule, but if there’s no plan to detect performance degradation due to a lubrication deficiency, misalignment, material defect, etc., then undetected high-consequence failures can occur.

7. Inadequate task instructions

Developing maintenance guidelines and best practices takes time and effort. Yet, all too often, the maintenance organization fails to capture all that hard-won knowledge by creating clear, detailed instructions. Instead, they fall back on the maintenance person’s knowledge — only to lose it when a person leaves the team. Over time, incomplete instructions can lead to poorly executed, “bandaid-style” tasks that get worse as the months go by.

8. Assuming new equipment will operate without failure for a period of time

There’s a unique situation that often occurs when new equipment is brought online. Maintenance teams assume they have to operate the new equipment first to see how it fails before they can identify and create the appropriate maintenance tasks. It’s easy to overlook the fact that they likely have similar equipment with similar points of failure. Their data from related equipment provides a basic foundation for constructing effective routine maintenance.

9. Missing opportunity to improve

If completed tasks aren’t reviewed regularly to gather feedback on instructions, tools needed, spare parts needed, and frequency, the maintenance process never gets better. The quality and effectiveness of the tasks then degrade over time and, with them, so does the equipment.

10. Doing what we can and not what we should 

Too often, maintenance teams decide which tasks to perform based on their present skill sets — rather than equipment requirements. Technical competency gaps can be addressed with a training plan and/or new hires, as necessary, but the tasks should be driven by what the equipment needs.

Without a robust routine maintenance plan, you’re nearly always in reactive mode — conducting ad-hoc maintenance that takes more time, uses more resources, and could incur more downtime than simply taking care of things more proactively. What’s worse, it’s a vicious cycle. The more time maintenance personnel spend fighting fires, the more their morale, productivity, and budget erode. The less effective routine work that is performed, the more equipment uptime and business profitability suffer. At a certain point, it takes a herculean effort simply to regain stability and prevent further performance declines.

Here’s the good news: An optimized maintenance strategy, constructed with the right structure, is simpler and easier to sustain. By fine-tuning your approach, you make sure your team is executing the right number and type of maintenance tasks, at the right intervals, in the right way, using an appropriate amount of resources and spare parts. And with a framework for continuous improvement, you can ultimately drive towards higher reliability, availability, and more efficient use of your production equipment.

Want to learn more? Check out our next blog in this series, Plans Can Always Be Improved:  Top 5 Reasons to Optimize Your Maintenance Strategy.


Author: Dan DeGrendel

Maintenance optimization doesn’t have to be time-consuming or difficult. Really, it doesn’t. Yet many organizations simply can’t get their maintenance teams out of a reactive “firefighting mode” so they can focus on improving their overall maintenance strategy.

Stepping back to evaluate and optimize does take time and resources, which is why some organizations struggle to justify the project. They lack the data and/or the framework to demonstrate the real, concrete business value that can be gained.

And even when organizations do start to work on optimization, sometimes their efforts stall when priorities shift, results are not immediate and the overall objectives fade from sight.

If any of these challenges sound familiar, there are some very convincing reasons to forge ahead with maintenance optimization:

1. You can make sure every maintenance task adds value to the business

Through the optimization process, you can eliminate redundant and unnecessary maintenance activities, and make sure your team is focused on what’s really important. You’ll outline the proper maintenance tasks, schedules and personnel assignments; then incorporate everything into the overall equipment utilization schedule and departmental plans to help drive compliance. Over time, an optimized maintenance strategy will save time and resources — including reducing the hidden costs of insufficient maintenance (production downtime, scrap product, risks to personnel or equipment and expediting and warehousing of spare parts, etc.).

2. You’ll be able to plan better

Through the optimization process, you’ll be allocating resources to various tasks and scheduling them throughout the year. This gives you the ability to forecast resource needs, by trade, along with spare parts and outside services. It also helps you create plans for training and personnel development based on concrete needs.

3. You’ll have a solid framework for a realistic maintenance budget

The plans you establish through the optimization process give you a real-world outline of what’s needed in your maintenance department, why it’s needed, and how it will impact your organization. You can use this framework to establish a realistic budget with strong supporting rationales to help you get it approved. Any challenges to the budget can be assessed and a response prepared to indicate the impact on performance that any changes might make.

4. You’ll just keep improving

Optimization is a project that turns into an ongoing cycle of performing tasks, collecting feedback and data, reviewing performance, and tweaking maintenance strategies based on current performance and business drivers.

5. You’ll help the whole business be more productive and profitable

Better maintenance strategies keep your production equipment aligned to performance requirements, with fewer interruptions. That means people can get more done, more of the time. That’s the whole point, isn’t it?

Hopefully, this article has convinced you of the benefits of optimizing your maintenance strategies. Ready to get started or re-energize your maintenance optimization project? Check out our next blog article, How To Optimize Your Maintenance Strategy: A 1,000-Foot View.


Author: Dan DeGrendel

Optimizing your maintenance strategy doesn’t have to be a huge undertaking. The key is to follow core steps and best practices using a structured approach. If you’re struggling to improve your maintenance strategy — or just want to make sure you’ve checked all the boxes — here’s a 1000-foot view of the process.

1. Sync up

  • Identify key stakeholders from maintenance, engineering, production, and operations — plus the actual hands-on members of your optimization team.
  • Get everybody on board with the process and trained in the steps you’re planning to take. A mix of short awareness sessions and detailed education sessions for the right people is vital for success.
  • Make sure you fully understand how your optimized maintenance strategies will be loaded and executed from your Computerized Maintenance Management System (CMMS).

2. Organize

  • Review/revise the site’s asset hierarchy for accuracy and completeness. Standardize the structure if possible.
  • Gather all relevant information for each piece of equipment.
    • Empirical data sources: CMMS, FMEA (Failure Mode and Effects Analysis) studies, industry standards, OEM recommended maintenance
    • Qualitative data sources: Team knowledge and past records

3. Prioritize

  • Assign a criticality level for each piece of equipment; align this to any existing risk management framework
  • Consider performing a Pareto analysis to identify equipment causing the most production downtime, highest maintenance costs, etc. (a minimal sketch follows this list)
  • Determine the level of analysis to perform on each resulting criticality level
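Here is a minimal Pareto sketch (the equipment names and downtime figures are made up for illustration) that flags the assets responsible for roughly 80% of recorded downtime:

```python
# Minimal Pareto sketch -- equipment names and downtime figures are made up.
downtime_hours = {
    "Crusher 01": 120, "Conveyor 03": 85, "Pump P-101": 60,
    "Screen 02": 25,   "Pump P-102": 15,  "Fan 07": 5,
}

total = sum(downtime_hours.values())
cumulative = 0.0
vital_few = []
for asset, hours in sorted(downtime_hours.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += hours
    vital_few.append(asset)
    if cumulative / total >= 0.8:        # stop once ~80% of downtime is covered
        break

print(vital_few)   # the assets that warrant the most detailed analysis
```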

4. Strategize

  • Using the information you’ve gathered, define the failure modes, or apply an existing library template. Determine existing and potential failure modes for each piece of equipment.
  • Assign tasks to mitigate the failure modes.
  • Assign resources to each task (e.g., the time, number of mechanics, tools, spare parts needed, etc.)
  • Compare various options to determine the most cost-effective strategy
  • Bundle selected activities to develop an ideal maintenance task schedule (considering shutdown opportunities). Use standard grouping rules if available.

This is your proposed new maintenance strategy.

5. Re-sync

  • Review the proposed maintenance strategy with the stakeholders you identified above, then get their buy-in and/or feedback (and adjust as needed)

6. Go!

  • Implement the approved maintenance strategy by loading all of the associated tasks into your CMMS — ideally through direct integration with your RCM simulation software, or otherwise manually or via an Excel sheet loader.

7. Keep getting better

  • Continue to collect information from work orders and other empirical and qualitative data sources.
  • Periodically review maintenance tasks so you can make continual improvements.
  • Monitor equipment maintenance activity for unanticipated defects, new equipment and changing plant conditions. Update your maintenance strategy accordingly.
  • Build a library of maintenance strategies for your equipment.
  • Take what you’ve learned and the strategies and best practices you’ve developed and share them across the entire organization, wherever they are relevant.

Of course, this list provides only a very high-level view of the optimization process.

If you’re looking for support in optimizing your maintenance strategies, or want to understand how to drive ongoing optimization, ARMS Reliability is here to help.


Author: Philip Sage, CMRP, CRL

Traditionally, SAP is populated with Master Data with no real consideration of future reliability improvement. Only once maintenance is actually being executed does the real pressure of underperforming assets drive consideration of the reliability strategy. At that point, the mechanics of what’s required for ongoing reliability improvement, based upon the SAP Master Data structure, are exposed and are, quite typically, almost unviable.

The EAM system is meant to support reliability. Getting your EAM system to support reliability requires some firm understanding of what must happen. If we look a little closer at reliability and the phases of life of an asset, we can see why the EAM settings must vary and not be fixed.

The initial reliability performance of any system is actually determined by its design and component selection.

This is probably not a big surprise for anyone close to reliability, but it may spark some debate from those who have not heard this before.

As evidence to support this statement, a newly commissioned and debugged system should operate nearly failure free for an initial period of time and only become affected by chance failures on some components. An even closer inspection can show that during this period, we can expect that most wear out failures would be absent after a new machine or system is placed into service. During this “honeymoon period” preventative replacement is actually not necessary nor would an inspection strategy provide benefit until such time as wear (or unpredictable wear) raises the possibility of a failure. Within this honeymoon period the components of the system behave exponentially and fail due to their individual chance failures only. They should only be replaced if they actually fail and not because of some schedule. Minor lubrication or service might be required, but during this initial period, the system is predominantly maintenance free and largely free from failure.
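To make the “behave exponentially” point concrete, here is a small sketch that assumes, purely for illustration, a constant chance-failure rate: the probability of surviving the next 1,000 hours is the same for a new component as for one that is already 5,000 hours old, which is why age-based preventive replacement buys nothing during this period.

```python
import math

# Assumption for illustration: a constant chance-failure rate (exponential life).
lam = 1.0 / 8760.0                      # e.g. one chance failure per ~year of operation

def survival(t_hours: float) -> float:
    """Probability of running t more hours without a chance failure."""
    return math.exp(-lam * t_hours)

print(survival(1000))                           # new component
print(survival(6000) / survival(5000))          # 5,000-h-old component: identical value
```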

Here is where the first hurdle occurs.

After the initial period of service has passed, then it is reasonable to expect both predictable and unpredictable forms of wear out failures to gradually occur and increase in rate, as more components reach their first wear out time.

Now, if repair maintenance (fixing failures) is the only strategy practiced, then the system failure rate would be driven by the sporadic arrivals of component wear-out failures, which will predictably rise rather drastically, then fluctuate wildly, resulting in “good” days followed by “bad” days. The system failure rate, driven by component wear-out failures, would finally settle to a comparatively high random failure rate, predominantly caused by the wear out of components then occurring in an asynchronous manner.

With a practice heavily dependent upon repair maintenance, the strength of the storeroom becomes critical, as it makes or breaks the system availability, which can only be maintained by fast and efficient firefighting repairs. The speed at which corrective repairs can be actioned, and the logistical delays encountered, drive the system’s availability performance.

From this environment, “maintenance heroes” are born.

As the initial honeymoon period passes, the overall reliability of the system becomes a function of the maintenance policy, i.e. the overhaul, parts replacement, and inspection schedules.

The primary role of the EAM is to manage these schedules.

The reduction or elimination of predictable failures is meant to be managed through preventative maintenance tasks, housed inside the EAM that counter wear out failures. Scheduled inspections help to counter the unpredictable failure patterns of other components.

If the EAM is properly configured for reliability, there is a tremendous difference in the reliability of a system. The system reliability becomes a function of whether or not preventative maintenance is practiced or “only run to failure then repair” maintenance is practiced. As a hint: the industry wide belief is that some form of preventative practice is better than none at all.

Preventative maintenance is defined as the practice that prevents wear failure by preemptively replacing, discarding, or overhauling components to “prevent” failure. For long-life systems, the concept revolves around making a minimal repair, made by replacement of the failed component, with the system then restored to service in “like new” condition. Repair maintenance was defined as a strategy that waits until a component in the system fails during the system’s operation.

If the EAM is not programmed correctly, or if the preventative tasks are not actioned, then the reliability of a system can fall to ridiculously low levels, where random failures of components of the recoverable system plague the performance and start the death spiral into full reactive maintenance.

This is quite costly, as in order to be even marginally effective the additional requirement is a fully stocked storeroom, which raises inventory carrying costs. Without a well-stocked storeroom, there are additional logistical delays associated with each component that are additive in their impact on system availability and uptime, and so system availability becomes a function of spare parts.

An ounce of prevention goes a long way.

Perhaps everything should be put on a PM schedule…? This is actually the old school approach, and I find it still exists in practice all over the world.

The reliability of a system is an unknown hazard and is affected by the relative timing of the preventative task. This timing comes from the EAM in the form of a work order which is supposed to be generated relative to the wear out of the component. How well this task aligns with reality is quite important. If the preventative work order produced by the EAM system comes out at the wrong time, there is a direct adverse effect on system reliability.

EAM systems are particularly good at forecasting the due date of the next work order and creating a work order to combat a component wear-out failure. However, wear is not always easily predicted by the EAM, and so we see in practice that not all EAM-generated work orders suppress the wear-out failures. One reason for this variance is that the EAM work order was produced on a calendar time base, with a programmed periodicity established in the past to predict future wear performance.

We don’t always get this right.

As a result, we generate work orders for work that is not required, or for work that should have been performed before the component failed but is issued only after it has failed.

Maybe this sounds familiar?

Calendar-based forecasts assume wear is constant with time. It is not.

A metric based on operating hours is often a more complete and precise predictor of a future failure. It’s true that most EAM systems today allow predictable work to be actioned and released by either calendar time or operating hours, and allow other types of time-indexed counters to trigger PM work orders.

A key to success is producing the work order just ahead of the period of increased risk of failure due to wear. Whether driven by the calendar or some other counter, this anticipation of failure, and the work order to combat it, is what we call the traditional view of maintenance.
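One way to picture “just ahead of the period of increased risk” (an illustrative calculation, not how SAP PM computes due dates) is to trigger the PM at the operating age where a fitted wear-out distribution reaches a small agreed probability of failure, for example the B10 life of a Weibull model:

```python
import math

# Illustrative only -- not how SAP PM computes due dates.
# Weibull wear-out model: shape (beta) > 1 indicates wear-out; eta is the characteristic life.
def age_at_failure_probability(beta: float, eta: float, p: float) -> float:
    """Operating hours at which the cumulative probability of failure reaches p."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)

# Example: beta = 3.0, eta = 12,000 h; release the PM work order around the B10 life.
b10 = age_at_failure_probability(3.0, 12_000, 0.10)
print(f"Trigger PM at roughly {b10:.0f} operating hours")   # ~ 5,670 h
```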


This all sounds simple enough.

The basic job of a reliability engineer is to figure out when something will likely fail based on its past performance and schedule a repair or part change. The EAM functionality is used to produce a work order ahead of the failure, and if that work is performed on-time, we should then operate the system with high reliability.

The reliability side of this conjecture, even when combined with an EAM to support it, is problematic.

If the work order is either ill-timed from the EAM or not performed on time during the maintenance work execution, there is an increased finite probability that the preventative task will not succeed in its purpose to prevent a failure. Equally devastating, if the PM schedule is poorly aligned or poorly actioned, the general result mirrors the performance expected from a repair maintenance policy, and the system can decay into a ridiculously low level of reliability, with near constant sporadic wear out of one of the many components within the system.

When preventative maintenance is properly practiced so that it embraces all components known to be subject to wear out, a repairable system can operate at high reliability and availability with a very low “pure chance” failure rate and do so for indefinitely long periods of time.

Determining what to put into the EAM is really where the game begins.

FIND OUT MORE AT:

MASTERING ENTERPRISE ASSET MANAGEMENT WITH SAP, 23-26 October 2016, Crown Promenade, Melbourne

Phil Sage will be running a full day workshop “Using SAP with Centralised Planning to Continually Improve RCM Derived Maintenance Strategies” Wednesday 26 October

Come learn what works, and what does not work, as you integrate SAP EAM to support your reliability and excellence initiatives, which are needed to be best in class in asset management. The workshop covers how and where these tools fit into an integrated SAP framework, what is required to make the process work, and the key links between reliability excellence, failure management and work execution using SAP PM.