Business Continuity Concepts – Part 1: planning
Learn the Language
When planning for Disaster Recovery*, there are 3 major concerns that need to be addressed: Maximum Tolerable Downtime, Recovery Time Objective, and Recovery Point Objective. These should be discussed by all major stakeholders rather than decided unilaterally by a single person such as a CIO or IT director. Input from the finance department, the CEO, and other high-level directors is crucial for proper planning.
A disaster, in this discussion, is defined as any event that significantly impedes the normal carrying-on of business. It could be something non-dramatic such as the phone system going down or something as serious as a tornado that wipes out the entire building.
This brings up another important discussion point. A disaster in this context doesn’t have to completely shut all parts of the business down. It just needs to impede normal functioning enough that, if it were to last for a period of time, the business would no longer be sustainable. For example, if the phone systems go down, most employees would not be affected as immediately as they would be by a hurricane. However, a sustained phone outage would eventually lead to an unrecoverable loss of business.
Finally, you can’t plan for everything. AT&T’s list of natural disasters alone has 17 items on it, including lightning, earthquakes, blizzards, and heat waves. When you add terrorism, sabotage, unrelated power failures, etc., there are more eventualities than any company could plan for. So part of this discussion needs to be, “What are the events that are most likely to affect us?” A lot of disaster planning will be similar across events anyway.
Your discussion of Business Continuity must address 3 critical areas:
Maximum Tolerable Downtime (MTD)
This is the length of time that a business, or specific business unit, can remain non-functional without causing serious risk of business failure. Some companies could probably go for days or more without computer systems, while a major on-line retailer like Amazon would be in serious trouble after only a few minutes.
Recovery Time Objective (RTO)
Given that you know what your MTD is, in what time-frame are you going to aim to be back up? This needs to be a shorter time than the MTD, or you’ll be out of business. The formulation of this number will be based on the cost of various recovery schemes. For example, you could have a complete duplicate of your data center ready to go at the flip of a switch, but the cost of maintaining two facilities might be unacceptably large.
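The trade-off between recovery speed and cost can be sketched in a few lines of Python. This is only an illustration: the option names, recovery times, and costs below are invented placeholders, not real quotes, and the rule "pick the cheapest option that recovers inside the MTD" is one simple way to frame the decision, not the only one.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    recovery_hours: float  # estimated time to restore service (assumed figure)
    annual_cost: float     # estimated yearly cost (assumed figure)


def cheapest_viable_option(
    options: Sequence[RecoveryOption], mtd_hours: float
) -> Optional[RecoveryOption]:
    """Return the cheapest option whose recovery time is shorter than the MTD."""
    viable = [o for o in options if o.recovery_hours < mtd_hours]
    # default=None covers the case where no option can beat the MTD.
    return min(viable, key=lambda o: o.annual_cost, default=None)


# Hypothetical options for illustration only.
options = [
    RecoveryOption("duplicate data center", 0.5, 500_000),
    RecoveryOption("restore to rented servers", 8, 60_000),
    RecoveryOption("rebuild from offsite backups", 72, 10_000),
]

choice = cheapest_viable_option(options, mtd_hours=24)
print(choice.name if choice else "no option meets the MTD")
```

With a 24 hour MTD, the duplicate data center is viable but far more expensive than needed, while the slowest option would miss the MTD entirely. The recovery time of the chosen option then becomes a realistic candidate for the RTO.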
Recovery Point Objective (RPO)
What point in time are you trying to get back to? A day before the disaster happened? An hour? How much lost business time and data is acceptable given the cost?
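One practical consequence of the RPO is how often data must be backed up or replicated. As a rough sketch, assume backups run at a fixed interval and that a disaster can strike just before the next one, so the worst-case data loss is about one full interval. The function name and the sample intervals here are illustrative assumptions.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss (one backup interval) fits within the RPO.

    Simplifying assumption: backups complete instantly and always succeed.
    """
    return backup_interval <= rpo


# Hypothetical schedules checked against a one-hour RPO.
one_hour_rpo = timedelta(hours=1)
print(meets_rpo(timedelta(hours=24), one_hour_rpo))    # nightly backups
print(meets_rpo(timedelta(minutes=15), one_hour_rpo))  # 15-minute snapshots
```

Nightly backups cannot satisfy a one-hour RPO under this assumption, so either the backup frequency or the RPO has to change, and that is a cost discussion for the stakeholders.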
Create the Plan
Once the values for these 3 BC concepts have been agreed upon, it is imperative to put the plan on paper so that there is a physical copy. Getting signatures from the major stakeholders is also a best practice and will reduce finger pointing down the road.
These 3 concepts are just one part of your disaster recovery plan. They provide focus for planning and also give you predefined goals when an incident does occur. Specific strategies still need to be created for business data and infrastructure, personnel, and workspace.
In the case of a data related disaster, all eyes are going to be on the IT staff while the rest of the company sits and waits. As an IT administrator you will want the business continuity plan to be well documented so that you don’t receive any undue pressure. This plan can also be pointed to when making budget requests, in order to help validate the necessity of new purchases and investments.
Finally, remember that no Disaster Recovery Plan is set in stone. Threats change often, and so do the objectives for business continuity. New systems become mission critical and some older ones decline in use. As a company grows, acceptable downtimes often shrink. Make sure that you are reviewing and testing your DR plans every 6 months. The last thing you want to do is wait 2 or 3 years only to find out that, while your backups are fine, you lack another critical piece needed to get systems back up and running again.
* (see ISO 22313)