I recently attended a conference discussion about using dashboards and exception monitoring to make production maintenance engineers more efficient by directing their attention where and when it is needed, along with the diagnostic information appropriate to the situation. This allows potential issues to be detected earlier, in time for planned fixes rather than unplanned outages.
The classical dashboard model involves a high-level overview with composite KPIs, where deviations can be spotted and drilled into for diagnosis. This can be as simple as a daily pump status report of average on/off times, power consumption, and pressures. Oddly enough, we’ve written this report for Shell. These reports are run and delivered on a fixed schedule, or are made available when an engineer or tech specifically calls for them.
Exception-based reporting monitors the system on an ongoing basis and raises warnings to the user when something needs to be reviewed. In an exception monitoring paradigm, you find problems much earlier because a deviation doesn’t have to pull the entire trend line off course before it’s brought to your attention. Put another way, you find it earlier because you’re looking at fewer numbers rather than the average, and the system knows the difference between a dip caused by normal variation in the data and a true downward trend. In many cases, the trend and its rate of change are more important than the average, but a simple dashboard has no way of knowing whether a change is transient or continuing, while an exception-based system can be more proactive about monitoring potential issues.
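To make the difference between a transient dip and a continuing drop concrete, here is a minimal Python sketch. It is only one possible rule, not the method used by any particular product: it assumes a list of readings ordered oldest to newest, treats the first few as a baseline, and flags the series only when several consecutive recent readings all sit well below that baseline. The function name, parameters, and default values are all hypothetical.

```python
from statistics import mean, stdev


def sustained_drop(readings, baseline_n=10, run_length=3, k=2.0):
    """Return True when the last `run_length` readings are ALL more than
    `k` standard deviations below the mean of the first `baseline_n` readings.

    A single low reading followed by recovery does not trigger the flag;
    a run of consecutive low readings does. All defaults are illustrative.
    """
    if len(readings) < baseline_n + run_length:
        return False  # not enough history to judge

    baseline = readings[:baseline_n]
    threshold = mean(baseline) - k * stdev(baseline)

    recent = readings[-run_length:]
    return all(value < threshold for value in recent)


# Made-up pressure readings, for illustration only.
steady = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
one_off_dip = steady + [95, 100, 101]   # dips once, then recovers
real_decline = steady + [97, 96, 95]    # keeps falling

print(sustained_drop(one_off_dip))
print(sustained_drop(real_decline))
```

A daily average would barely move in either case, which is why the rule looks at a short run of recent values instead.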
An additional advantage to the exception monitoring paradigm is that it operates on a push model that can safely be left until it asks for your attention. A good manager will keep an eye on the high level detail under his purview and occasionally spot check details. A monitoring structure that requires attention be placed on a myriad of dials and graphs for hundreds or even thousands of pieces of equipment is subject to oversights, oversights that lead to costly unplanned outages or worse problems. The push model frees the monitor to focus on strategic tasks with the knowledge that tactical needs will be brought to his attention as soon as the problem could reasonably be diagnosed. Spot checks are still useful, but now the engineer has additional detail at his disposal while making these checks.
Finally, the exception model reduces the expectation that the user will check the dashboard every day with their full attention. When running dashboards or other daily reports, a problem is at best a single red cell on a large spreadsheet. More generally, it’s a single abnormal number in a sea of numbers that are within the same tolerance ranges every day. Exceptional well behavior should be called out in an exceptional manner, in a report, status message, or dashboard whose arrival is an event, and not in the expected daily report to be reviewed at leisure on the off chance there’s a problem contained in it.
SCADA monitors and other fault detection systems have their place, ensuring critical safety factors are met and that systems shut down in a controlled manner in the event of a true emergency. Exception monitoring allows the engineers to describe complex scenarios using data from a myriad of sources, where the expected system outcome is a red flag, and the engineer’s reaction is to review the data closely, with the attention that comes from the certainty that there’s actually a problem to find.
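As a rough illustration of what describing a scenario across several sources might look like, the Python sketch below combines three readings that could plausibly live in different systems. Every field name and threshold here is an assumption made up for the example, not something taken from a real installation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PumpSnapshot:
    # Hypothetical fields; each could come from a different system.
    discharge_pressure_psi: float  # e.g. from SCADA
    power_kw: float                # e.g. from a power meter
    days_since_service: int        # e.g. from a maintenance log


def review_reasons(snapshot: PumpSnapshot) -> List[str]:
    """Return the reasons this pump should be reviewed (empty if none).

    The thresholds are placeholders an engineer would set from experience.
    """
    reasons = []
    if snapshot.power_kw > 50 and snapshot.discharge_pressure_psi < 80:
        reasons.append("high power draw with low discharge pressure")
    if snapshot.days_since_service > 180 and snapshot.power_kw > 50:
        reasons.append("elevated power draw on a pump overdue for service")
    return reasons


print(review_reasons(PumpSnapshot(70.0, 60.0, 200)))
print(review_reasons(PumpSnapshot(100.0, 40.0, 30)))
```

No single reading in the first snapshot would necessarily trip a safety limit; it is the combination that raises the flag.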
This doesn’t mean that the engineer is excused from monitoring the big picture either. Those dashboards still have a very important place in monitoring overall system health, status and trends. Exception monitoring is there to ensure that a true problem will be caught, so that you avoid the costly pump failure that could have been predicted if only the engineer had happened to look at that pump’s data that day.
We also talked about how creating diagnostic dashboards serves as a useful tool for documenting the diagnostic knowledge of an aging engineering corps. Asking an experienced engineer what data he would like to see when a condition he’s flagged as exceptional occurs is the same as having him explain to a younger engineer what steps he would take to diagnose a fledgling problem on the verge of becoming an expensive outage. Creating a diagnostic dashboard in this situation helps the experienced engineer do his job more easily, and at the same time helps a younger engineer do his job more correctly.
When the engineer gets the report, he not only sees the problem but immediately has all the data that he would otherwise have to go find in a variety of systems. This can save him minutes or even hours of data collection, allowing him to resolve the problem and move on.
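One way to picture a report that only exists when there is something to see, and that arrives with its diagnostic data already attached, is the Python sketch below. It is a simplified assumption of how such a report could be assembled: the function name, the shape of the report, and the lookup functions are all hypothetical stand-ins for queries against real systems.

```python
from typing import Any, Callable, Dict, List, Optional


def build_exception_report(
    equipment_id: str,
    reasons: List[str],
    diagnostic_sources: Dict[str, Callable[[str], Any]],
) -> Optional[Dict[str, Any]]:
    """Return a report only when there is something to review.

    `diagnostic_sources` maps a label to a lookup function that takes the
    equipment id. In practice each lookup would query a different system;
    here they are placeholders.
    """
    if not reasons:
        return None  # nothing exceptional, so nothing is sent

    diagnostics = {
        label: fetch(equipment_id)
        for label, fetch in diagnostic_sources.items()
    }
    return {
        "equipment_id": equipment_id,
        "reasons": list(reasons),
        "diagnostics": diagnostics,
    }


# Stand-in lookups returning made-up values, for illustration only.
sources = {
    "recent_pressures": lambda eid: [92, 88, 81],
    "last_service_note": lambda eid: "seal replaced",
}

print(build_exception_report("pump-7", [], sources))
print(build_exception_report("pump-7", ["pressure trending down"], sources))
```

The list of lookups is where the experienced engineer's answer to "what would you want to see?" ends up written down.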
In addition to the possibilities arising from custom application development, there are a number of purpose-built tools on the market that provide this functionality, such as P2’s Explorer and Sentinel products. Microsoft provides a number of monitoring tools for data in motion that operate on a daily, rather than real-time, basis. These principles can be used in any data management solution, whether monitoring data flowing into a data warehouse or watching it fly by in a real-time SCADA system.
The primary lesson for me was: don’t send users a daily report and expect them to routinely check for the absence of errors. Only send the report when there’s something to see, and then improve that report with a dashboard giving up-to-date and targeted diagnostic information. When you do that, you’re adding the diagnostic knowledge to the enterprise, not just to an engineer’s head.