Wednesday, January 23, 2008

Joel on Availability: Broken

Joel Spolsky posted an article today on system availability, entitled "Five Whys," in which he puts forward some widely held, but in my opinion misguided, notions about the viability of building highly available systems.

To paraphrase, Joel argued that there are classes of failures that can't be anticipated, and that instead of trying to anticipate them or measuring your progress in system hardening through SLA metrics, you should focus on your diagnosis, root-cause, and resolution processes.

I agree those processes are necessary, but they are not sufficient. Building highly available software is possible. The processes, tools, and infrastructure necessary to guarantee availability do have a cost. Here is my analysis of where Joel's argument goes off track:

a) Joel describes an example of a poorly configured router causing a serious outage. He proceeds to argue that creating an SLA around the router isn't the right thing to do, on the grounds that many failure events are "unknown unknowns" or Black Swans. However, the poorly configured router is clearly an example of a "known known" and not a Black Swan. Someone busted the configuration. There are a number of easy ways this could have been prevented: peer review of configurations, automated configuration deployment, configuration validation tools, and checklists are all viable, relatively low-cost measures that would have eliminated this problem (a sketch of a simple configuration check follows below).
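To make "configuration validation tools" concrete, here is a minimal sketch of the kind of check I mean. The policy lines and sample config are hypothetical, purely for illustration; a real tool would be driven by your own network standards:

    # Minimal sketch of a configuration validation check.
    # The policy lines and sample config below are made up for illustration.
    REQUIRED_LINES = [
        "service password-encryption",
        "no ip http server",
        "logging host 10.0.0.5",
    ]

    def missing_policy_lines(config_text):
        """Return required policy lines that are absent from the config."""
        present = {line.strip() for line in config_text.splitlines()}
        return [rule for rule in REQUIRED_LINES if rule not in present]

    sample_config = """\
    hostname edge-router-1
    service password-encryption
    no ip http server
    """

    print(missing_policy_lines(sample_config))  # -> ['logging host 10.0.0.5']

Even a script this small, run before a config is pushed, turns "someone busted the configuration" from a surprise into a caught error.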

b) Joel further argues that shooting for "6 nines" of availability is a ridiculous goal, since it allows less than 30 seconds of downtime per year. He has a point: almost any event will last more than a few seconds, so "6 nines" measured over a year is effectively 100%. If you're shooting for 6 nines, then you need to measure over a longer period than a year. That said, it doesn't invalidate the whole notion of a minute-based availability metric. There is a material difference between 99%, 99.9%, 99.99%, and 99.999% availability; each step is an order of magnitude. 99% is 5,260 minutes of downtime a year -- 87 hours, or 3.6 days! 99.9% is 526 minutes, or 8.7 hours, of downtime per year. There is sufficient granularity in those numbers to be statistically significant over a year. I don't believe you can argue that black-swan events justify 3.6 days of downtime per year, and ideally you'd be well under 8.7 hours per year too, making 4 nines a very reasonable target.
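For reference, here is a quick back-of-the-envelope sketch of how those downtime figures fall out, assuming an average year of 365.25 days:

    # Downtime per year implied by a given availability percentage.
    # Assumes an average year of 365.25 days; figures are illustrative only.
    MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

    for availability in (99.0, 99.9, 99.99, 99.999, 99.9999):
        downtime_minutes = MINUTES_PER_YEAR * (1 - availability / 100)
        print(f"{availability}% -> {downtime_minutes:,.1f} min/year"
              f" ({downtime_minutes / 60:,.2f} hours)")

The output lines up with the numbers above: roughly 5,260 minutes at 99%, 526 minutes at 99.9%, 53 minutes at 99.99%, and barely half a minute at "6 nines."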

c) The entire notion that Joel describes of an "unexpected unexpected" in a deterministic computer system seems far-fetched to me. Large-scale systems only appear to be Brownian in nature when they're poorly built and poorly understood. Computer systems are deterministic in nature. What part of a deterministic system's behavior should be "unexpected"? I understand that many people may not take the time to understand the system, or the system may have evolved to a state of complexity beyond human comprehension, but that needn't be the case.

d) SLAs matter because you need to set a standard for yourself and strive to meet it. Failing to set a standard is a recipe for complacency.

e) Joel argues that continuous improvement after an event occurs is the best way to increase availability. Continuous improvement is good, but depending on the number of lurking issues it can take a very long time. Continuous improvement will drive up system availability only if the rate of fixing root causes is higher than the rate of introducing new problems. If system entropy is increasing, then your continuous improvement process will continually fall behind. (The toy model below illustrates the point.)
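A toy backlog model makes the rate argument concrete. This is purely my own illustration with made-up numbers, not anything from Joel's article:

    # Toy model: the backlog of latent defects grows whenever the rate of
    # introducing problems exceeds the rate of fixing root causes.
    # All numbers below are made up for illustration.
    def defect_backlog(initial, introduced_per_month, fixed_per_month, months):
        backlog = initial
        history = []
        for _ in range(months):
            backlog = max(0, backlog + introduced_per_month - fixed_per_month)
            history.append(backlog)
        return history

    # Fixing 5 root causes a month while shipping 8 new defects: the backlog grows.
    print(defect_backlog(initial=20, introduced_per_month=8, fixed_per_month=5, months=12))
    # -> [23, 26, 29, ..., 56]

If the fix rate doesn't exceed the introduction rate, post-incident improvement alone never catches up, no matter how disciplined the process is.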

f) Joel also argues that some events occur over a long enough time period that they can't be predicted. "They're the kind of things that happen so rarely it doesn't even make sense to use normal statistical methods like 'mean time between failure.' What's the 'mean time between catastrophic floods in New Orleans?'" Over a long enough time scale you absolutely can estimate the "mean time between catastrophic floods in New Orleans." It is a mean, not an absolute prediction; means account for variance. You can't say precisely that "in 2012 there will be another category-5 hurricane," but you can state the probability and use that to define an MTBF. In fact, that analysis had been done, the risk had been assessed, and the US chose not to make the necessary investments. The structural flaws in the levees had been identified in 1986. I for one am glad that the government is not throwing up its hands and saying "major floods can't be predicted or mitigated," but is instead investing further in flood and disaster models and deciding how to respond.
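The arithmetic behind an MTBF estimate is straightforward. As a rough sketch -- the probability below is a made-up placeholder, not real flood data:

    # Rough MTBF estimate from an annual event probability.
    # The 2% figure is a hypothetical placeholder, not actual flood statistics.
    annual_probability = 0.02  # chance of the event in any given year

    # With a constant annual probability, the expected time between events
    # is the reciprocal of that rate.
    mtbf_years = 1 / annual_probability
    print(f"Mean time between events: {mtbf_years:.0f} years")  # -> 50 years

You won't know which year the event lands in, but you can absolutely reason about how often it lands, and budget your mitigation accordingly.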

I do agree with some of his points.

1) Availability has a cost. You need to decide as an organization what level of availability you and your customers want to achieve, estimate the costs, and assess at what point you want to make that investment. You then need to invest in the processes, tools, technologies, infrastructure, and high-quality vendors that allow you to meet that target.

2) I agree that 99.999% and 99.9999% are effectively equivalent to 100% when measured in minutes over a year. A year's worth of minutes isn't statistically significant at those levels. His point about AT&T is valid when measured over a single year, but if AT&T's last significant outage was in 1991, then they've clearly proven that high availability is possible.

3) You must identify the root cause. Patching the symptoms does not increase stability.

4) Continuous improvement processes are good. You must have them. But don't run them only after a failure; audits and simulated failure scenarios are valuable tools as well.

We must not accept that higher-quality software and systems at lower cost are impossible. We must continue to innovate so that we can deliver higher quality at lower cost, or someone else will. In every case, the answer is to raise our standards and push the creative engineering talent we have to meet them. SLAs are important and valuable yardsticks.
