SME_ITDR: Start of Day and End of Day as known recovery points


Had a discussion the other day about how an application should recover.

It’s obvious that real-time replication of data, databases, and “stuff” is NOT the same as a restart after a recovery.

(Well, maybe not so obvious to the techies who love these things.)

I made a case to a group of IT folks that an “application”, or suite of applications, or a portfolio of applications had to accomplish certain functions:

  • Determine the state of the world when it starts
  • Verify the data and files being presented are “correct” (i.e., IAAA checked)
  • Align all its data and files to the “correct” starting point
  • Run any transaction logs necessary to bring the app up to speed (i.e., recovery point runs up to current time or last good transaction time)
  • Allow for correction and catch-up to external sources
  • Preserve trails to demonstrate “correct” recovery

If the “application” can NOT do all of these things then it is doomed to a completely manual recovery.
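To make that list concrete, here is a minimal sketch of the restart sequence in Python. Everything in it is my own illustration, not any real product: the `Restart` class, the checksum manifest, and the trail messages are all hypothetical names. It assumes the "correct" state of each file can be captured as a SHA-256 digest at the recovery point.

```python
import hashlib
from dataclasses import dataclass, field


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Restart:
    manifest: dict   # hypothetical: file name -> SHA-256 expected at the recovery point
    files: dict      # file name -> bytes actually presented at restart
    trail: list = field(default_factory=list)   # evidence of what was done

    def verify(self) -> list:
        """List files that are missing or do not match the manifest."""
        bad = sorted(
            name for name, digest in self.manifest.items()
            if name not in self.files or sha256(self.files[name]) != digest
        )
        self.trail.append(f"verify: {len(self.manifest) - len(bad)} ok, bad={bad}")
        return bad

    def run(self, replay_log: list) -> bool:
        """Walk the steps in order; give up (manual recovery) on bad input."""
        self.trail.append(f"state: {len(self.files)} files presented")
        if self.verify():
            self.trail.append("abort: manual recovery required")
            return False
        self.trail.append("align: all files at recovery point")
        for entry in replay_log:
            self.trail.append(f"replay: {entry}")
        self.trail.append("catch-up: external sources may now resubmit")
        return True


restart = Restart({"accounts.dat": sha256(b"balances")},
                  {"accounts.dat": b"balances"})
recovered = restart.run(["txn-1", "txn-2"])
```

The point of the `trail` list is the last bullet: whatever the recovery does, it leaves behind something an auditor can read.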

When an enterprise has thousands of “applications” (usually intertwined in a rat’s nest of complexity), the likelihood of a timely, successful recovery is directly proportional to the enterprise’s “luck”. (More likely to win the Lotto.)

And, it’s interesting what Leadership, Regulators, External Auditors, Internal Auditors, and Risk Managers will accept as “proof” this will all work when needed.

Glad it’s not my paycheck on the “pass line” at this particular crap shoot.

# – # – # – # – #  

SME_ITDR: Two weeks? Most businesses can’t recover at all!

New Standard: Two-Week Disaster Preparedness
What message are you telling people about disaster preparedness?
Eric Holdeman | March 31, 2015

*** begin quote ***

Three-day or 72-hour disaster preparedness messages have dominated the national message for decades when it comes to how long you should tell people to be prepared for disasters.

My thinking on this started to change in 2005 following Hurricane Katrina. We were about to launch a big public education campaign in King County, Wash., called “Three Days, Three Ways.” The three-way message was have a plan, build a kit and get training. At the time, when I checked with American Red Cross and Federal Emergency Management Agency about the possibility of that message changing, they said “no” so we went ahead with the campaign so as to standardize and not confuse people with different messages.

Now 10 years hence, Hurricane Sandy was another learning lesson and the great quake that could happen any day still is looming in our future. Many emergency management agencies in locations where you can have a huge regional disaster have moved on to telling their communities to become prepared for a week.

*** end quote ***

From an ITDR pov, most businesses can’t suffer ANY interruption.

They can’t realign their data after a disaster.

Might as well close and start over.

If the recovery has not been designed in from the start — like start of day / end of day recovery points with automatic transaction capture to allow high speed replay — then you can forget that recovery and restart.
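For the techies, a toy sketch of what "designed in from the start" could look like. This is an assumption-laden illustration: the `DayBook` class and its method names are made up, and the journal lives in memory here, whereas a real one would have to sit on durable storage that survives the disaster.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class DayBook:
    balances: dict = field(default_factory=dict)
    snapshot: dict = field(default_factory=dict)   # start-of-day recovery point
    journal: list = field(default_factory=list)    # transactions captured since

    def start_of_day(self) -> None:
        """Take a known recovery point and start a fresh journal."""
        self.snapshot = copy.deepcopy(self.balances)
        self.journal = []

    def post(self, account: str, amount: int) -> None:
        """Capture the transaction first, then apply it."""
        self.journal.append((account, amount))
        self._apply(account, amount)

    def _apply(self, account: str, amount: int) -> None:
        self.balances[account] = self.balances.get(account, 0) + amount

    def recover(self) -> None:
        """Fall back to the start-of-day point and replay the journal."""
        self.balances = copy.deepcopy(self.snapshot)
        for account, amount in self.journal:
            self._apply(account, amount)


book = DayBook({"A": 100})
book.start_of_day()
book.post("A", -30)
book.post("B", 30)
book.balances = {"A": 999}   # simulate the state being lost or corrupted
book.recover()
print(book.balances)         # {'A': 70, 'B': 30}
```

Capturing the transaction before applying it is the design choice that matters: anything applied but never journaled cannot be replayed.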


# – # – # – # – # 

SME_ITDR: In ITDR, there are no “silver bullets”


Disaster recovery as a service wipes out traditional DR plans
by: Paul Korzeniowski

Disaster recovery planning and infrastructure builds vex IT managers. Cloud services offer lower costs and more flexibility, but not without risk.

*** begin quote ***

How to construct a DR plan

First, outline potential disasters for the data center: Hazardous weather, power outages, vendors’ systems going offline, employee sabotage or outsider attacks are all possibilities.

Identify which of its hundreds of applications the corporation needs online immediately. Audit the list and prioritize by importance to daily operations.

Next, source and install redundant data center infrastructure — servers, software, network connections, storage — to support the applications. Disaster recovery plans cannot escape cost considerations; an offsite data center is expensive.

*** end quote ***

I would assert that this is EXACTLY how NOT to construct an ITDR plan.

I’d also assert that neither “the Cloud” nor “DRaaS” (Disaster Recovery as a Service) is the “Silver Bullet”. (With apologies to Coors Light.)

In the old mainframe days, professionals recognized what I call the partial recovery sequence: IT hardware is too expensive to duplicate, so let’s triage.

Since getting the tapes from Iron Mountain and going to Sungard or Comdisco took time, ITDR started with Business Continuity Planning (BCP).

And, BCP required Business Process Reengineering (i.e., what will the Business do until IT recovers, and what brings “money in the door”; note we don’t care about “out the door”, they can wait).

Early in my career, I noted an interesting behavior. I call it “Everything is critical UNTIL I have to pay for it! Then, nothing is.”

May sound funny, but the minute IT starts doing DR, it’s like money is no object.

Here’s an interesting experience I had at a large Financial Institution that shall remain nameless.

The Business Units said: “I can’t afford any down time, nor can I afford any data loss when I do take down time.” (Now that in and of itself is an interesting requirements statement, but this is about SME_ITDR, not situation appraisal.) No problem. Synchronous data replication to a bunker near the production data center, sync rep to a bunker near the recovery center, sync rep from the far bunker to the recovery center. Never lose any data ever. Price tag is $20M for about the first 5 applications; incremental after that in discrete chunks. Where’s the checkbook, Senior Business Unit Head Honcho?

The response was “what can I get for free?” (i.e., what level of service costs zero).

Easy answer: TANSTAAFL!! (“There Ain’t No Such Thing As A Free Lunch”, from Robert Heinlein’s classic.)

Clearly, IT can NOT do ITDR in the absence of the Business — both from a cost and a process point of view.

The best that IT can do alone is to keep “cutting the homogeneous datacenter” into smaller and smaller discrete modules of service (i.e., like the data center is the motherboard and everything “plugs in” through discrete, well-defined interfaces).

At that previously mentioned large Financial Institution that shall remain nameless, the application portfolio had about 700 applications. An analysis of their Remedy data showed that, to recover any SINGLE application, one was required to recover about ⅔ of the portfolio.
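That kind of analysis boils down to walking the dependency graph. Here is a rough Python sketch, assuming the dependencies have already been extracted into a simple "app depends on these apps" mapping. The application names and the `recovery_set` function are invented for illustration; this is toy data, not the institution’s.

```python
from collections import deque


def recovery_set(app: str, depends_on: dict) -> set:
    """Every app that must be recovered for `app` to run, itself included."""
    needed = {app}
    queue = deque([app])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, ()):
            if dep not in needed:      # also guards against dependency cycles
                needed.add(dep)
                queue.append(dep)
    return needed


# Hypothetical six-application portfolio.
deps = {
    "billing": ["ledger", "customer"],
    "ledger": ["reference-data"],
    "customer": ["reference-data", "billing"],   # a cycle, as in real life
    "reporting": ["ledger"],
    "reference-data": [],
    "hr": [],
}

needed = recovery_set("billing", deps)
print(sorted(needed), f"{len(needed)} of {len(deps)} applications")
```

Run this for every application in the portfolio and the size of each recovery set tells you how "discrete" your modules of service really are.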

And, needless to say, that’s not happening any time soon.

Bottom line: One must design Business Processes that are recoverable; then the technology can be recoverable. Translated into IT-ese, start with the cart; not the horse.

— 30 —

SME_ITDR: The interesting part about ITDR is timing

Different types of measures can be included in disaster recovery plan (DRP). Disaster recovery planning is a subset of a larger process known as business continuity planning and includes planning for resumption of applications, data, hardware, electronic communications (such as networking) and other IT infrastructure.
Disaster recovery – Wikipedia, the free encyclopedia

# – # – # – # – # 

The interesting part of splitting this into parts is there is no holistic view of a recovery.

The timing of a recovery is essential to success. It really doesn’t matter if one can “turn the lights on” (i.e., power up backup components and get the parts running). What matters is resyncing the whole “mess”.

From my first assignment in ITDR to my latest, no one seems to understand that.

From the internal job schedulers, to the external third parties, and in all the intra-system interfaces: everything must be set to a common point in time.
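A small sketch of what "a common point in time" means in practice. It assumes each component can only be restored to a fixed set of recovery points; the component names, timestamps, and the `common_recovery_point` function are all hypothetical.

```python
from datetime import datetime


def common_recovery_point(points: dict):
    """Latest timestamp every component can be restored to, or None."""
    if not points:
        return None
    common = set.intersection(*(set(stamps) for stamps in points.values()))
    return max(common) if common else None


start_of_day = datetime(2015, 4, 6, 6, 0)
midday = datetime(2015, 4, 6, 12, 0)
end_of_day = datetime(2015, 4, 6, 18, 0)

# Hypothetical recovery points held by each part of the "mess".
points = {
    "job scheduler": [start_of_day, midday, end_of_day],
    "database": [start_of_day, midday],
    "third-party feed": [start_of_day, end_of_day],
}

target = common_recovery_point(points)
print(target)   # start of day: the only point all three share
```

Notice that the later recovery points do not help. The whole environment can only go back to a point that every participant holds, and everything after that point has to be replayed.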

And, the “real world” keeps on going without you. So that makes “catching up” even harder.

So, “recovery” must be automated. Push that “big red easy button”, with apologies to Staples, and the systems must “automagically” on command: instantiate the recovery environment, fall back to a known good restart point, replay the entire transaction “book” between the recovery point and the disaster, and present systems for acceptance by Business Users.

That’s a tall order.

In my first assignment, the arithmetic worked out that regardless of when during the processing week a disaster occurred, the environment would always be ready on the following Monday. (Quite a novel discovery. And it shook the Business and IT Leadership awake: “Hey, we need a better BCP for a Monday disaster!”)

Unfortunately, without the holistic view, everyone sees “trees”, but not the “forest”.

It’s a good thing that disasters are relatively rare. Most corporations don’t survive them.

— 30 —