Author | Jaco Cronje
Have you ever found yourself sitting in yet another root cause analysis session, unable to even get close to a possible cause, just because of that sinking realisation that, yet again, you do not have enough information about the failure? This is the stage where the breakdown has been fixed, and you were too late. Everything is up and running again. Production is back to chasing the next target. You might ask yourself: How can you find the missing information?
In this article, we define the critical point in an incident investigation where most of this information goes missing: the moment just after the incident or failure, when the very first people arrive at the scene and observe it as-is, just after, or maybe even as, it occurred. These are the very first eyewitnesses. We call this time frame The Golden Hour and will elaborate on how effectively it can be used to preserve information and make it available for successful root cause analysis (RCA).
The term golden hour originates in trauma medicine, where the first hour after severe trauma is considered the most critical in determining a successful emergency treatment outcome. In maintenance, preserving all information within the first hour of the incident or failure will go a long way towards achieving a successful root cause analysis and preventing recurrence.
The Golden Hour
To explain the concepts, we will use a process pump as an example of an asset. A functional failure can be anything from a simple trip of the electric motor to a catastrophic failure resulting in a significant containment loss. We will illustrate this with a worst-case scenario and presume a major containment loss, with parts lying everywhere …
24 March @ 01:03 – The two-way radio on your shoulder crackles and blares in your ear. After your long night shift, you have just warmed up your supper. “MAINTENANCE! ANYONE FROM MAINTENANCE! COME IN …” You sadly leave your warm and tasty dinner, grab your camera, safety gloves, hard hat and notebook and rush to the crime scene! Time is already against you because Production wants the plant up and running as soon as possible. Your conundrum is whether you will have enough time to gather sufficient evidence.
Firstly, let us establish some first-line responsibilities. We can confirm that production operators are responsible for operating assets using standard operating procedures (SOPs) and a specific production specification to ensure that the product is produced at, inter alia, the desired quality and rate. When an asset experiences a functional failure, it naturally no longer makes the product according to the desired production specification and/or quality. Usually, the first line of “defence” called in from a Maintenance point of view is the technician or artisan. Here we have identified two key roles: the operator and the artisan or technician. These key role-players are our response teams, and they will be the people who will capture all of the evidence and information during the Golden Hour.
To further define and unpack the details of what unfolds during the Golden Hour, we are going to use the following five steps as a framework for our root cause analysis:
1. Safety first
2. Spot the difference
3. Collect the evidence
4. Restore the asset
5. Recording and reporting
Some of the explanations within the five steps may sound repetitive because at least four of the five steps often happen simultaneously during the Golden Hour.
1. Safety first
24 March @ 01:04 – Production: Before Maintenance is tasked with looking at the pump, are there any significant process-related deviations in the control room that raise safety alarms? If so, log them immediately and stop Maintenance from entering an unsafe area. Once everything is confirmed as safe from the control room, you can look and listen at the scene on site. If there is a containment loss (process fluid, fluid from the pump, bearing casing, etc) that prevents safe access to the asset for repairs or prevents other safe operations, follow standard safe-making procedures to make the area safe. Ensure that fluid samples are taken for later analysis. If you find a foreign object that was not there during normal operations and you cannot identify it, do not move it; rather, wait for Maintenance to arrive to identify the part. DO NOT throw anything away or clean anything unnecessarily without first taking a photo of its current state or a sample of the debris or dirt.
After the area has been declared safe by Production (if the failure has been catastrophic and a process leak has been observed), Maintenance can be called to the scene.
24 March @ 01:10 – Maintenance: If unidentified debris or loose parts from the pump were displaced during the failure and prevent you from safely getting close enough to repair the pump, remove them carefully, with rigging assistance where required. Before moving anything, take a sample of any leaking fluids (oil or process fluid, if Production has not already done so) and take a photo of the parts in their current state. If photos cannot be taken, write down where the parts are (or draw the scene) as accurately as possible before making the area safe. DO NOT discard or throw away any parts, and do not clean anything unnecessarily before taking samples of debris or dirt and recording a complete account of the scene’s current state.
Once both parties have declared the area safe to work, move on to the next step.
2. Spot the difference
While making the area safe (if required, depending on the failure severity), looking for differences and collecting evidence, you are also recording as much as possible. Spotting the difference means investigating exactly what is different from normal operating conditions and how many deviations can be spotted. This can range from straightforward to near impossible. Now you may say: “But this is exactly solving the root cause.” Not entirely – it is not as simple as that.
24 March between 01:15 and 01:45 – Production: Within the control room, and if available, take note of process trends (temperature, pressure, flowrate, electrical information, etc) and what exactly started deviating from the production specification before and after the incident.
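Where the control system can export trend data, the check described above could be sketched as a simple scan for the first sample that falls outside the production specification. This is a minimal illustration only: the function name `first_out_of_spec`, the pressure values, the specification limits and the year in the timestamps are all assumptions made for the example and do not come from any particular plant or system.

```python
from datetime import datetime


def first_out_of_spec(samples, low, high):
    """Return the first (timestamp, value) sample outside [low, high], or None."""
    for timestamp, value in samples:
        if not (low <= value <= high):
            return timestamp, value
    return None


# Hypothetical discharge-pressure trend (bar) leading up to the 01:03 call.
# The year is arbitrary; the article only gives the day and time.
pressure_trend = [
    (datetime(2024, 3, 24, 0, 40), 6.1),
    (datetime(2024, 3, 24, 0, 50), 6.0),
    (datetime(2024, 3, 24, 0, 58), 5.2),
    (datetime(2024, 3, 24, 1, 2), 3.4),
    (datetime(2024, 3, 24, 1, 3), 0.0),
]

# Assumed specification limits for this parameter: 5.5 to 6.5 bar.
first_deviation = first_out_of_spec(pressure_trend, low=5.5, high=6.5)
print(first_deviation)
```

The point of the sketch is that the first deviation may appear several minutes before anyone reacts, which is exactly the kind of detail worth noting for the RCA.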
Extract the alarm history for the previous 24 hours if it is available. If it is unavailable, note the lack of information. This can help the RCA to update the asset information and add critical parameters in future. Write down exactly what you experienced and what triggered your response in reporting the incident. At the scene, if there was containment loss, write it down. From a production point of view, take your SOP and write down (before safe-making) the positions and indications of the process valves, switches and gauges related to the pump and the pump’s process. Compare them to the required settings stated in the SOP and note any differences.
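The comparison between the as-found positions and the SOP can be thought of as a simple difference between two lists. The sketch below is one hedged way to express it; the valve names, the settings and the function `sop_deviations` are hypothetical and stand in for whatever the actual SOP specifies.

```python
def sop_deviations(required, observed):
    """Return {item: (required, observed)} for every SOP item that differs.

    An item that was never recorded on site shows up with an observed
    value of None, so a gap in the notes is reported as a deviation too.
    """
    deviations = {}
    for item, expected in required.items():
        actual = observed.get(item)  # None means the position was not recorded
        if actual != expected:
            deviations[item] = (expected, actual)
    return deviations


# Hypothetical SOP settings and as-found readings for the pump.
sop_required = {
    "suction_valve": "open",
    "discharge_valve": "open",
    "bypass_valve": "closed",
}
as_found = {
    "suction_valve": "closed",
    "discharge_valve": "open",
}

deviations = sop_deviations(sop_required, as_found)
for item, (expected, actual) in deviations.items():
    print(f"{item}: SOP requires {expected!r}, found {actual!r}")
```

Reporting an unrecorded item as a deviation is a deliberate choice here: a missing reading is itself information the RCA team needs to know about.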
24 March between 01:15 and 01:45 – Maintenance: Confirm with Production why the pump has stopped if it is not blatantly obvious (in more than the required number of pieces or scattered all over the place)! For example, if it has just tripped out electrically, was it a low-flow or high-current trip? If so, feel whether the pump still turns before attempting a restart or anything else. If it can restart, listen for any unusual noises. In any other scenario, start by looking for anything out of place when you get to the scene. Are all the parts intact? If not, write down what is not in accordance with the pump’s original specification (loose bolts, wires, coupling, cracked casing, etc). If you do not have the technical drawings to compare to or do not know the specification, note that as well. This can aid the RCA by showing that the specifications should be requested from the original equipment manufacturer (OEM) and that the technical documentation should be updated. Now use your senses of touch and smell to check whether anything, such as the pump bearing box or electric motor, is hotter than normal or whether the oil smells different from what it usually does.
3. Collect the evidence
When you get to step No. 3, you may think: “Well, what have we been doing all this time?” Sure, we have collected substantial evidence specifically through our senses and hopefully noted all of it down on a template or job card. This step explicitly addresses physical or tangible evidence. Those things we say we can “bag and tag.”
24 March between 01:15 and 01:45 – Production: As noted in the safe-making stage, collect any process-specific evidence from the incident. Compile the sequence of events, and keep the specific product still inside the pump or asset and/or the exact product that was scrapped due to a deviation in quality or leaked out due to containment loss. Anything else collected from the site while cleaning that is related to the incident should not be discarded before it has been critically reviewed by Production, Maintenance, or a process-safety or process specialist.
24 March between 01:15 and 01:45 – Maintenance: All pump parts collected before the repair that cannot be reused during the repair must be removed and tagged. These are to be inspected further in the workshop to determine why the component or part is defective or has failed. Do not clean the part before bagging and tagging it, as this can remove critical evidence pointing to the cause of failure. Preserve as much of the part and its current condition as possible. If any parts cannot be removed from the site for further investigation or inspection before repairs, ensure that the people required to assist with or be present for the enquiry are contacted as soon as possible so that the examination can take place. Alternatively, before altering or restoring the parts, take photos or draw the scene, and write up notes about the scene as accurately as possible.
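To make “bag and tag” concrete, the information on a tag could be captured as a small record like the one sketched below. The field names, the tag number format, the asset code and the `ready_for_workshop` rule are illustrative assumptions, not a prescribed standard; a real site would follow its own job card or CMMS fields.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EvidenceTag:
    """One tagged piece of physical evidence (all fields are hypothetical)."""
    tag_id: str
    asset: str
    description: str
    found_at: str
    collected_by: str
    collected_on: datetime
    photo_taken: bool = False
    cleaned: bool = False

    def ready_for_workshop(self) -> bool:
        # Assumed rule: a part goes to the workshop only if its as-found
        # state was photographed and it has not been cleaned.
        return self.photo_taken and not self.cleaned


# Example tag; the asset code, names and year are made up for illustration.
tag = EvidenceTag(
    tag_id="EV-001",
    asset="PUMP-101",
    description="Cracked coupling half",
    found_at="Floor, about 2 m from pump baseplate",
    collected_by="Night-shift artisan",
    collected_on=datetime(2024, 3, 24, 1, 30),
    photo_taken=True,
)
print(tag.ready_for_workshop())
```

Making the record immutable (`frozen=True`) mirrors the intent of the tag itself: once evidence is logged, the as-found description should not be quietly altered.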
4. Restore the asset
24 March @ 01:45 and beyond … At this stage, we can safely say the Golden Hour is over. The time to collect all the evidence, obtain the facts, and grab all the information has passed. Everything has been cleaned up, and repairs to the pump have started. By now, and depending on the criticality of the pump, Production is either moving on to monitor the rest of the process if the pump is not critical, or jumping up and down and imploring Maintenance to hurry up and get the pump running because, according to Murphy’s law, the standby pump might suffer the same fate!
We are not going into too much detail with this step because it is not part of the Golden Hour. However, the main goal here is to ensure that the pump is restored to the desired production specification using the correct spare parts supplied by the OEM. Maintenance does the restoration according to the proper procedures per the OEM instructions and specifications using the correct tools.
Once that has been done, the pump should be commissioned with both parties present, using a commissioning checklist (again, depending on the asset’s criticality), and should be monitored during commissioning to ensure that it performs its required function.
5. Recording and reporting
24 March from 01:03 to 25 March @ 07:00 – Production: Judging by the time stamp, it is safe to assume that recording and reporting are essential throughout the other four steps. For Production, these recordings can be made in production logs, shift reports and CMMS notifications, which can be logged against the specific asset. Production should also take primary responsibility for logging the sequence of events and the official incident (depending on severity) on the company incident reporting/logging/HSSE system for investigation. One standard location (preferably the Engineering store or workshop) should be identified where the samples or physical evidence gathered by Production can be taken for further processing or investigation.
24 March from 01:03 to 25 March @ 07:00 – Maintenance: Similar to Production above, Maintenance should record all of their findings on a shift report, job card, or CMMS notification specifically logged against the asset during the time of information gathering and during the restoration. All documented and photographic evidence should be stored on an electronic server in a standard location that is easily accessible to all involved in further investigating the incident and solving the root cause. Be sure to coordinate and cross-check all information with Production regarding the sequence of events while also capturing evidence on-site with Production.
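Cross-checking the sequence of events between the two teams amounts to merging two logs into one timeline ordered by time. The sketch below shows one possible way to do that; the entry format, the notes and the function `merge_timeline` are assumptions for illustration, with the times loosely taken from the scenario in this article and an arbitrary year.

```python
from datetime import datetime


def merge_timeline(*logs):
    """Combine several logs of (timestamp, source, note) into one list sorted by time.

    Python's sort is stable, so entries with the same timestamp keep the
    order in which their logs were passed in.
    """
    entries = [entry for log in logs for entry in log]
    return sorted(entries, key=lambda entry: entry[0])


# Hypothetical log entries from each team.
production_log = [
    (datetime(2024, 3, 24, 1, 3), "Production", "Maintenance called by radio"),
    (datetime(2024, 3, 24, 1, 4), "Production", "Control room checked for process deviations"),
    (datetime(2024, 3, 24, 1, 15), "Production", "Process trends and alarm history noted"),
]
maintenance_log = [
    (datetime(2024, 3, 24, 1, 10), "Maintenance", "Arrived on scene, samples and photos taken"),
    (datetime(2024, 3, 24, 1, 15), "Maintenance", "As-found condition of pump recorded"),
]

timeline = merge_timeline(production_log, maintenance_log)
for timestamp, source, note in timeline:
    print(f"{timestamp:%d %B %H:%M} | {source:<11} | {note}")
```

A single merged timeline makes gaps and contradictions between the two accounts easier to spot before the RCA session starts.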
25 March @ 07:30 – Even though it is morning, you were looking forward to your supper, so you rewarm it once again and settle down to eat while mulling over lessons learned and the value of The Golden Hour, and then your two-way radio starts to crackle …
Conclusion – beyond golden hour
In this article, we have examined the concept of The Golden Hour. The aim was, first and foremost, to illustrate how quickly critical information can get lost if we do not take care of those vital minutes just after an incident has occurred. Second, we know that time is of the essence and that, ideally, one should not spend more than 15–30 minutes gathering information or evidence. Following these five steps and knowing the essential elements to look out for will hopefully optimise the time spent.
The key takeaway is that Production and Maintenance staff should be aware of this and be well-trained to ensure the information is gathered, recorded, reported and stored correctly. The information collected will then be useful in the root cause analysis and in future instances as positive case studies or for continuous improvement or best practice initiatives.