Failover: Messy Realities

Posted by: Michael Nygard on 04/19/2010

People who don't live in operations can carry some funny misconceptions in their heads. Some of my personal faves:

  • Just add some servers!
  • I want a report of every configuration setting that's different between production and QA!
  • We're going to make sure this (outage) never happens again!

I've recently been reminded of this during some discussions about disaster recovery. This topic seems to breed misconceptions. Somewhere, I think most people carry around a mental model of failover that looks like this:

Normal operations transitions directly and cleanly to failed over

That is, failover is essentially automatic and magical.

Sadly, there are many intermediate states that aren't found in this mental model. For example, there can be quite some time between failure and it's detection. Depending on the detection and notification, there can be quite a delay before failover is initiated at all. (I once spoke with a retailer whose primary notification mechanism seemed to be the Marketing VP's wife.)

Once you account for delays, you also have to account for faulty mechanisms. Failover itself often fails, usually due to configuration drift. Regular drills and failover exercises are the only way to ensure that failover works when you need it. When the failover mechanisms themselves fail, your system gets thrown into one of these terminal states that require manual recovery.

Just off the cuff, I think the full model looks a lot more like this:

Many more states exist in the real world, including failure of the failover mechanism itself.

It's worth considering each of these states and asking yourself the following questions:

  • Is the state transition triggered automatically or manually?
  • Is the transition step executed by hand or through automation?
  • How long will the state transition take?
  • How can I tell whether it worked or not?
  • How can I recover if it didn't work?

About Michael Nygard

Michael Nygard

Michael strives to raise the bar and ease the pain for developers across the country. He shares his passion and energy for improvement with everyone he meets, sometimes even with their permission. Michael has spent the better part of 20 years learning what it means to be a professional programmer who cares about art, quality, and craft. He's always ready to spend time with other developers who are fully engaged and devoted to their work--the "wide awake" developers. On the flip side, he cannot abide apathy or wasted potential.

Michael has been a professional programmer and architect for nearly 20 years. During that time, he has delivered running systems to the U. S. Government, the military, banking, finance, agriculture, and retail industries. More often than not, Michael has lived with the systems he built. This experience with the real world of operations changed his views about software architecture and development forever.

He worked through the birth and infancy of a Tier 1 retail site and has often served as "roving troubleshooter" for other online businesses. These experiences give him a unique perspective on building software for high performance and high reliability in the face of an actively hostile environment.

Most recently, Michael wrote "Release It! Design and Deploy Production-Ready Software", a book that realizes many of his thoughts about building software that does more than just pass QA, it survives the real world. Michael previously wrote numerous articles and editorials, spoke at Comdex, and co-authored one of the early Java books.

More About Michael »

NFJS, the Magazine

May Issue Now Available
  • Client-Side MVC with Spine.js, Part 1

    by Craig Walls
  • On Prototypal Inheritance, Part 2

    by Raju Gandhi
  • Making use of Scala Lazy Collections

    by Venkat Subramaniam
  • Integration Testing Web Applications Using Gradle

    by Kenneth Kousen
Learn More »