IT incidents are in general best put right quickly. Performance in this area can be tracked as MTTR, or mean time to repair. As a metric, MTTR has its advantages. One of these is that it is simple to handle.
You add up all the resolution times for incidents handled, then divide by the number of incidents. Hey presto, you have your MTTR. You can drive it down over time by targeting resolution or repair times under the MTTR, which helps restore productivity too.
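As a minimal sketch of that arithmetic, the Python snippet below adds up a handful of resolution times and divides by the number of incidents. The figures and the function name `mttr` are invented purely for illustration; in practice the times would come from your own incident records.

```python
# Hypothetical resolution times, in hours, for five incidents (illustrative only).
resolution_hours = [1.5, 2.0, 0.5, 3.0, 1.0]


def mttr(times):
    """Mean time to repair: total resolution time divided by incident count."""
    if not times:
        raise ValueError("at least one incident is required")
    return sum(times) / len(times)


print(mttr(resolution_hours))  # 8.0 hours over 5 incidents -> 1.6
```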
However, for disaster recovery, MTTR on its own may be dangerously simple as a metric.
The potential problem comes from averaging over large numbers of incidents without understanding whether the time to repair is about the same in each case or whether it varies widely from one incident to the next.
An overall low MTTR can hide one or two much larger values for the time to repair an incident. It only takes one long resolution time for one important IT resource to cause serious interruption to business and possible damage to the organisation’s reputation.
Understanding the degree of spread of values around the mean is therefore important. Running a check to see if any values have exceeded a predefined maximum acceptable level is one way. Calculating the standard deviation of all the values is another. A smaller standard deviation means the values are in general more closely grouped around the mean.
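Both checks can be sketched in a few lines of Python. The example below is an illustration under assumed data, not a prescribed method: the two sets of repair times, the four-hour limit and the name `spread_report` are all made up, and the population standard deviation (`statistics.pstdev`) is just one reasonable choice. The two sets share the same MTTR, yet only one of them hides a long outage.

```python
from statistics import mean, pstdev


def spread_report(times, max_acceptable):
    """Summarise repair times: the mean, the spread, and any values over the limit."""
    return {
        "mttr": mean(times),
        "std_dev": pstdev(times),  # population standard deviation
        "over_limit": [t for t in times if t > max_acceptable],
    }


# Two hypothetical sets of repair times (hours) with the same mean of 2.0.
steady = [2.0, 2.0, 2.0, 2.0, 2.0]
spiky = [0.5, 0.5, 0.5, 0.5, 8.0]

MAX_ACCEPTABLE_HOURS = 4.0  # assumed limit, for illustration only

for name, times in (("steady", steady), ("spiky", spiky)):
    print(name, spread_report(times, MAX_ACCEPTABLE_HOURS))
```

The steady set has a standard deviation of zero and nothing over the limit, while the spiky set has a much larger standard deviation and one eight-hour repair that the average alone would never reveal.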
Driving down MTTR can then be tackled by first understanding its component parts. MTTR can be divided into four parts. MTTI (mean time to identify) is the average time needed to detect an incident or disaster. MTTK (mean time to know) is the average time required to identify the cause.
MTTF (mean time to fix) is the mean time to implement a solution. And finally, MTTV (mean time to verify) is the mean time to confirm that the solution is working. Out of these four, MTTK is often the critical one, as the increasing complexity of IT systems tends to make this component grow disproportionately. For disaster recovery in particular, reducing MTTK and ensuring every total repair time stays below an acceptable maximum (the recovery time objective, or RTO) are key goals.
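To make the breakdown concrete, here is a rough Python sketch that averages each of the four phases (identify, know, fix, verify) across a few incidents, picks out the largest component, and flags any incident whose total repair time exceeds the acceptable maximum. The incident figures, the four-hour limit and the helper names are hypothetical, and the sketch assumes each phase is timed separately for every incident.

```python
from statistics import mean

PHASES = ("identify", "know", "fix", "verify")

# Hypothetical per-phase times in hours for three incidents (illustrative only).
incidents = [
    {"identify": 0.25, "know": 1.0, "fix": 0.5, "verify": 0.25},
    {"identify": 0.5, "know": 3.0, "fix": 1.0, "verify": 0.5},
    {"identify": 0.25, "know": 0.5, "fix": 0.5, "verify": 0.25},
]


def phase_means(records):
    """Average each phase across all incidents."""
    return {phase: mean(r[phase] for r in records) for phase in PHASES}


def total_time(record):
    """Total repair time for one incident: the sum of its four phases."""
    return sum(record[phase] for phase in PHASES)


def over_rto(records, rto_hours):
    """Indexes of incidents whose total repair time exceeds the limit."""
    return [i for i, r in enumerate(records) if total_time(r) > rto_hours]


means = phase_means(incidents)
overall_mttr = mean(total_time(r) for r in incidents)
largest_phase = max(means, key=means.get)

RTO_HOURS = 4.0  # assumed acceptable maximum, for illustration only

print(means)
print("MTTR:", overall_mttr)
print("Largest component:", largest_phase)
print("Incidents over the limit:", over_rto(incidents, RTO_HOURS))
```

With these made-up numbers, the time spent finding the cause is the largest component, and the second incident breaches the four-hour limit even though the overall MTTR sits below it.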