Wednesday, January 23, 2008

Joel on Availability: Broken

Joel Spolsky posted an article today on system availability entitled "Five Whys". In it he puts forward some widely held, but in my opinion misguided, notions about the viability of building highly available systems.

To paraphrase, Joel argued that there are classes of failures that can't be anticipated, and that rather than trying to anticipate them, or measuring your progress in system hardening through SLA metrics, you should focus on your diagnosis, root-cause, and resolution processes.

I agree those processes are necessary, but they are not sufficient. Building highly available software is possible. The processes, tools, and infrastructure necessary to guarantee availability do have a cost. Here is my analysis of where Joel's argument goes off track:

a) Joel describes an example of a poorly configured router causing a serious outage. He proceeds to argue that creating an SLA around the router isn't the right thing to do, based on the opinion that many failure events are "unknown unknowns" or Black Swans. However, the poorly configured router is clearly an example of a "known known," not a Black Swan. Someone busted the configuration. There are a number of easy ways this could have been prevented: peer review of configurations, automated configuration installation, configuration validation tools, and checklists are all viable, relatively low-cost solutions that would have eliminated this problem.
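
As an illustration, here is a minimal sketch of what an automated configuration validation check could look like. The dict-based config format, field names, and rules below are hypothetical, invented purely for this example:

    # Minimal sketch of pre-deployment configuration validation. The config
    # format and the rules below are hypothetical examples.

    def validate_config(config: dict) -> list[str]:
        """Return human-readable violations; an empty list means the config passes."""
        errors = []
        if not config.get("peer_reviewed"):
            errors.append("config has not been peer reviewed")
        if config.get("bgp_max_prefixes", 0) <= 0:
            errors.append("bgp_max_prefixes must be set to a positive limit")
        for iface in config.get("interfaces", []):
            if not iface.get("description"):
                errors.append(f"interface {iface.get('name', '?')} is missing a description")
        return errors

    candidate = {
        "peer_reviewed": True,
        "bgp_max_prefixes": 500_000,
        "interfaces": [{"name": "ge-0/0/0"}],
    }
    for violation in validate_config(candidate):
        print("REJECTED:", violation)  # block the deploy instead of paging someone later

The point is that the check runs before the config ships, turning a potential outage into a rejected change.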

b) Joel further argues that aiming for "6 nines" of availability is a ridiculous goal, as it allows less than 30 seconds of downtime per year. He has a point: almost any event will last longer than a few seconds, so "6 nines" measured over a year is effectively 100%. If you're shooting for 6 nines, you need to measure over a longer period than a year. That said, this doesn't invalidate the whole notion of a minute-based availability metric. There is a material difference between 99%, 99.9%, 99.99%, and 99.999% availability; each step is an order of magnitude. 99% is 5,256 minutes of downtime a year, which is 87.6 hours, or about 3.65 days! 99.9% is 525.6 minutes, or about 8.8 hours, of downtime per year. There is sufficient granularity in those numbers to be statistically significant over a year. I don't believe you can argue that black-swan events should cause 3.65 days of downtime per year, and ideally you'd be well under 8.8 hours/year too, making 4 nines a very reasonable target.
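
To make those orders of magnitude concrete, here is a small Python sketch that computes the yearly downtime budget implied by each availability level:

    # Yearly downtime budget implied by each availability level.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    for nines in ("99", "99.9", "99.99", "99.999", "99.9999"):
        downtime = MINUTES_PER_YEAR * (1 - float(nines) / 100)
        print(f"{nines}% -> {downtime:,.1f} minutes/year ({downtime / 60:,.2f} hours)")

Running it shows 99% allows 5,256 minutes a year while 99.9999% allows about half a minute, which is exactly why the latter needs a longer measurement window.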

c) The notion Joel describes of an "unexpected unexpected" in a deterministic computer system seems far-fetched to me. Large-scale systems only appear Brownian in nature when they're poorly built and poorly understood. Computer systems are deterministic. What part of a deterministic system's behavior should be "unexpected"? I understand that many people don't take the time to understand the system, or that the system may have evolved to a state of complexity beyond human comprehension, but that needn't be the case.

d) SLAs matter because you need to set a standard for yourself and strive to meet it. Failing to set a standard is a recipe for complacency.

e) Joel argues that continuous improvement after an event occurs is the best way to increase availability. Continuous improvement is good, but depending on the number of lurking issues it can take a very long time. Continuous improvement will drive up system availability only if the rate of fixing root causes is higher than the rate of introducing new problems. If system entropy is increasing, then your continuous-improvement process will continually fall behind.
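
A toy model illustrates the point. The rates below are made-up numbers, chosen only to show the dynamic:

    # Toy model: each month some latent defects are introduced and some root
    # causes are fixed. The backlog of lurking issues shrinks only when the
    # fix rate exceeds the introduction rate.

    def lurking_issues(initial: float, introduced: float, fixed: float, months: int) -> float:
        backlog = initial
        for _ in range(months):
            backlog = max(0.0, backlog + introduced - fixed)
        return backlog

    print(lurking_issues(50, introduced=4, fixed=6, months=36))  # 0.0: improvement wins
    print(lurking_issues(50, introduced=6, fixed=4, months=36))  # 122.0: entropy wins

Same improvement process, same effort, opposite outcomes; the sign of the net rate is what matters.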

f) Joel also argues that some events occur over a long enough time period that they can't be predicted: "They're the kind of things that happen so rarely it doesn't even make sense to use normal statistical methods like 'mean time between failure.' What's the 'mean time between catastrophic floods in New Orleans?'" Over a long enough time scale you absolutely can estimate the mean time between catastrophic floods in New Orleans. It is a mean, not an absolute prediction; means account for variance. You can't say precisely that "in 2012 there will be another Category 5 hurricane," but you can state the probability and use that to define an MTBF. In fact, that analysis had been done, the risk had been assessed, and the US chose not to make the necessary investments. The structural flaws in the levees had been identified in 1986. I for one am glad that the government is not throwing up its hands and saying "major floods can't be predicted or mitigated" but is instead investing further in flood and disaster models and deciding how to respond.
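
An MTBF also translates directly into odds over a planning horizon. Here is a sketch, using an assumed 100-year MTBF purely as an example figure, not an actual estimate for New Orleans:

    # Probability of at least one event within a planning horizon, given a
    # mean time between failures. The 100-year MTBF is an assumed example
    # figure, not an actual estimate for New Orleans.

    def prob_at_least_one(mtbf_years: float, horizon_years: int) -> float:
        annual_p = 1.0 / mtbf_years
        return 1.0 - (1.0 - annual_p) ** horizon_years

    print(f"{prob_at_least_one(100, 30):.0%}")  # ~26% chance within 30 years

A roughly one-in-four chance over a 30-year planning horizon is exactly the kind of number a risk assessment can act on.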

I do agree with some of his points.

1) Availability has a cost. You need to decide as an organization what level of availability you and your customers want to achieve, estimate the costs, and assess at what point you want to make that investment. You then need to invest in the processes, tools, technologies, infrastructure, and high-quality vendors that allow you to meet that target.

2) I agree that 99.999% or 99.9999% are effectively equivalent to 100% when measured in minutes over a year. A year's worth of minutes isn't statistically significant at those levels. His point about AT&T is valid when measured over a year, but if AT&T's last significant outage was in 1991, then they've clearly proven that high availability is possible.

3) You must identify the root cause. Patching the symptoms does not increase stability.

4) Continuous improvement processes are good. You must have them. But don't run them only after a failure: audits and simulated failure scenarios are valuable tools as well.

We must not accept that higher-quality software and systems at lower cost are impossible. We must continue to innovate so that we can deliver higher quality at lower cost, or someone else will. In every case, the answer is to raise the standards and push the creative engineering talent we have to meet them. SLAs are important and valuable yardsticks.

Wednesday, January 9, 2008

Creating an Environment for Debate

Debate, particularly technical debate, is healthy. Debate helps flesh out the issues. It helps build shared context. It illuminates hidden requirements or hidden biases.

Debate can be passionate, even heated. Debate is not comfortable for everyone. Debate can favor those most skilled in the techniques of debate. Debate can leave people feeling dissatisfied or disenfranchised. Debate can ruin a team.

The question is: how do you create an environment that fosters productive debate while avoiding the negative aspects?

Debate is a process. It has steps and guidelines, triggers and rules. Here are the process guidelines that I've used to help promote healthy debate:

a) Recognize debate is a process. It has a beginning, middle, and end. Debate will not continue indefinitely without resolution. The goal is for the participants to all reach a joint conclusion. That may not be possible, but the ultimate decision should be intellectual not emotional, and all participants should understand each perspective.

b) Recognize the value and necessity of debate as a constructive force. Give it the time necessary.

c) Empower constructive dissension. Elevate those who are more naturally quiet and seek their input. Encourage new and different perspectives.

d) Do NOT allow intimidation. Intimidation discourages debate and is a technique commonly used to shut down discussion.

e) Require transparency with relevant participants. Discussions and decisions should be documented. Individuals should not be allowed to say one thing to one person and a different thing to another person.

f) Encourage multi-modal communication. Certain individuals speak better in person with a whiteboard. Others aren't able to formulate ideas effectively without time and express themselves better in written format. Do not preclude one or the other.

g) Eliminate emotion. Emotion has no place here. Whether the debate is technical, philosophical, or religious, it should be intellectual, not emotional, and never physical.

h) Facilitate face-to-face discussion. Create a level playing ground. Encourage discussion. Do not allow either party to shut down or attempt to intimidate. The goal is to move the conversation from combative to constructive debate.

i) Watch for signals of an impasse. Typically impasses can be resolved with better requirements, or better understanding of different perspectives, or just further debate. However, sometimes the debate boils down to a core difference in beliefs that can't be resolved without following both paths to see the outcome, which often just isn't possible. Once it is boiled down to a core belief without emotion and a full understanding of both sides, often the parties can "agree to disagree". At that point, a decision is necessary and it isn't that one person is more right or more wrong.

j) Validate that everyone feels they were heard. If an impasse is reached but individuals feel their point wasn't listened to, or that they weren't able to communicate it effectively, then the discussion is not yet complete. Everyone must feel heard in order to support whatever decision is made.

k) Clearly identify the decider. If there is an impasse, then at the end of the debate there will be a decision. Participants need to know who that decider is and how he or she intends to decide. Those expectations need to be communicated.

In my experience, the number one trap that people fall into is that they allow intimidation. Intimidation is corrosive to constructive debate and a constructive corporate culture.

The number two trap is that there is no clear decider and no clear path to resolution.

The third trap is poor facilitation. If the moderator can't facilitate a productive conversation, then the pressure for resolution builds, but the resolution comes without everyone feeling heard.