As every good disaster recovery practitioner knows, a DR plan must be tested if it is to have real credibility. DR managers are often limited to simulations of disasters, with varying degrees of effectiveness. In general, the two dimensions that you will seek to optimise in your disaster recovery testing are the realism of the simulation and the involvement of people in the organisation. Yet, as we’ll see below, there is still one kind of disaster that you can stage for real, without sending your enterprise into a tailspin.
In ascending order of realism, the options for simulating disasters are: models or computer simulations; games specifically designed to mimic disaster recovery situations; and situations that management imposes on the organisation (denial of access to headquarters, for example). Computer simulations are a useful first step because they allow a ‘non-invasive’ glimpse of what might happen if a production facility or data centre went down. Playing a DR game with real people from the organisation makes disaster situations more meaningful and obliges players to think through different options.
While you might insist that everybody in the organisation play the game at some point, there’ll always be a dimension missing: scale, meaning the widespread effect a disaster can have on a whole site, rather than on just a few people at a time. To bring this dimension into play, you may have to simulate a disaster by declaring a location inaccessible for the morning, temporarily blocking phone or Internet access, or decreeing that all personnel will work from home for the day.
Finally, while it’s true that you wouldn’t try to create a fire, a flood, an earthquake or any similar catastrophe, there is another disaster that can reasonably be created, provided you take all necessary precautions first. It also happens to be the biggest culprit for enterprise downtime overall: the IT system crash, whether it hits a server, an application, a hard drive or a network connection. Some companies now run programs that randomly stop such parts of the IT infrastructure to test recovery or resilience. Netflix, with its Chaos Monkey application, is one example.
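To make the idea of randomly stopping part of the infrastructure more concrete, here is a minimal sketch in Python. It is an illustration only, not Netflix’s code: the names `Component`, `pick_victim`, `restart` and `run_drill` are hypothetical, and the ‘infrastructure’ is just a list of in-memory objects, so nothing real is ever stopped. The `dry_run` flag, on by default, stands in for the precautions mentioned above.

```python
import random
from dataclasses import dataclass


@dataclass
class Component:
    """A stand-in for a server, application, disk or network link (hypothetical)."""
    name: str
    running: bool = True


def pick_victim(components, rng):
    """Choose one running component at random, or None if nothing is running."""
    candidates = [c for c in components if c.running]
    return rng.choice(candidates) if candidates else None


def restart(component):
    """Hypothetical recovery procedure: bring the component back and report success."""
    component.running = True
    return component.running


def run_drill(components, recover, rng, dry_run=True):
    """Stop one random component and check that the recovery procedure restores it.

    With dry_run=True the victim is only reported, never stopped.
    """
    victim = pick_victim(components, rng)
    if victim is None:
        return None
    if dry_run:
        return {"victim": victim.name, "stopped": False, "recovered": None}
    victim.running = False          # the simulated crash
    recovered = recover(victim)     # the recovery step under test
    return {"victim": victim.name, "stopped": True, "recovered": recovered}


if __name__ == "__main__":
    rng = random.Random()  # pass a seeded Random for repeatable drills
    estate = [Component("web-server"), Component("database"), Component("vpn-link")]
    print(run_drill(estate, restart, rng, dry_run=True))   # rehearsal: reports only
    print(run_drill(estate, restart, rng, dry_run=False))  # simulated crash and recovery
```

In a real drill, the crash and recovery steps would act on actual systems, which is exactly why a rehearsal mode, a limited scope and an agreed time window are worth putting in place first.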