Principles of Large Scale Systems: February 2008

The principles of manufacturing safety can be applied very directly to improving availability in a large-scale operational environment:

A) eliminate hazards
B) keep aisles and paths clear
C) minimize the diversity and complexity of tasks
D) make everyone walk, not run
E) encourage reporting of injury or poor safety behavior
F) give everyone a big red button
G) create a culture of safety and reinforce it constantly
H) audit frequently
I) create a written record and post-mortem for every event

Most manufacturing injuries and most large-scale systems failures are caused by human error. As a result, you want to address the limitations of the humans in the system: their attention span fails; they're clumsy at times; they naturally take short-cuts; they make assumptions based on past experience and apply them to new circumstances; they have a natural bias toward trying to fix a problem themself rather than report it; and they tend to prioritize based on recent direction rather than against global perspectives.

These faults have led to a desire to eliminate all humans from large-scale systems operations and availability. This is a fine goal, but we're far from achieving it still. As imperfect as they are humans still have better adaptive diagnosis, problem solving, and dynamic decision-making abilities than we've been able to instantiate in code.

In large-scale systems operations, the environment is typically virtual rather than physical. Eliminating hazards is about fixing the known architectural points of weakness, having backups, eliminating single-points-of-failure, applying security patches, and keeping the datacenters at the right temperature.

Keeping the aisles clear means keeping cruft from piling up -- eliminating unnecessary bugs and errors, having clean intelligible logs, making configuration consistent and readable, prioritizing understandability over cleverness in code, having clean cabling, intelligent machine names, clear boundaries between production and test, and high signal-to-noise alerts.

Minimizing the diversity of tasks is about reducing variance in the system, reducing the number of tasks that operators have to be proficient at, and reducing the number of steps in those tasks. We've all seen 2-page step-by-step release processes and we've all seen them break down somewhere along the way. We've all seen operators assume that one environment is identical to another and apply the same command or configuration change with disastrous consequence.

Making everyone walk, not run is about controlling the natural instinct of an operator to try to speed up when they're behind in a task or that the system is crashing around them. It is also about keeping a level head and consistent measured progress and instilling processes and practices that require the operators to not skip steps or take short-cuts.

How many times has someone noticed a problem or introduced a problem during a task and said "oh, I'll come back and fix that later" only to forget to fix it and have it cause a failure later? In the same way safety risks and incidents need to be reported so that appropriate steps can be taken and nothing is forgotten, operational risks need to be reported. This can be as simple as requiring all observed problems to be logged as a ticket. The lighter-weight this process, the more likely a report will be filed. Anonymous reports should be allowed too.

Giving everyone a big red button empowers the operator to stop something from happening before the situation gets worse. An operator should be empowered to stop any software from launching or to decide at any point to stop an activity that appears to be too risky.

All the best practices and processes will atrophy if the priority is perceived to shift away from safety, or in the operational world -- away from system availability. Creating a culture is about management constantly reinforcing the importance. Posters help and work. Creating a culture requires developing a sense of pride in the team that they're good at it -- the best. Creating the culture requires transparency into the issues and visibility into the progress against those. It also requires creating a shared sense of failure when any one individual fails.

Auditing frequently is a good process to reinforce the discipline and keep looking for ways to improve even if their hasn't been an incident in months or years. That heartbeat and that constant emphasis and diligence help set the standard and will undoubtedly always uncover something new.

The written record is a key piece. It requires management to review in detail the event and really understand not just the proximal cause of the incident, but all the related causes.

Principles of Large Scale Systems

Thursday, February 21, 2008

Operational Availability Is Similar to Manufacturing Safety

Blog Archive