Friday, April 18, 2008

The Ownership Trap

In healthy corporate cultures, individuals feel a sense of ownership for their actions, their technologies, their products, and their company. Engineering teams have a sense of “ownership” for their services, their customer experience, their costs, their dates, and their commitments.

Focusing on ownership to the exclusion of leadership is a trap. Ownership can exist without leadership, and ownership without leadership leads to bunker mentalities -- boundary infighting, failure to consider the customer, and a lack of vision and forward momentum.

A heavy focus on ownership can unintentionally lead to an under-emphasis on leadership. In particular, there is a difference between owning something tangible -- a technology, a service, a host, a roadmap -- and owning a space. A team might perceive themselves as owning “Oracle deployments” instead of seeing themselves as owning “relational data usage at the company.” They might see themselves as owning a metric like “number of pager events” rather than owning “end-user service availability.” That slight definitional difference requires the team to consider, for example, customer communication during a service outage as part of what they own.

Owners often feel accountable for their tactical deliverables and commitments. Owners must clearly articulate boundaries in support of ownership clarity and effective separation of concerns. Owners must focus on the here-and-now realities.

Leaders naturally think about their service in the broader context. They are looking for opportunities by which their technology or business service continues to grow and increase in scope and value. They are focused on best-of-breed.

Leaders aren’t worried so much about in-scope or out-of-scope and are focused on doing-the-right-thing. They use ownership to make sure accountability, commitment, and clarity exist, but don’t use ownership as a way to segment themselves from the rest of the organization.

Leaders think less about roles & responsibilities and more about making sure they do the right thing for the customer. Clarity of roles & responsibilities is necessary for project execution efficiency, but it is a means to an end, not an end in itself.

Leaders naturally seek to expand their service footprint. They actively consult with customers, vendors, and peer teams trying to find the best solution. They are proactive rather than reactive. They view transparency as a key to success and black-box encapsulation as a limitation to thinking broadly.

Leaders seek out opportunity for impact. They view a new customer as an opportunity rather than a burden. They view tools for driving cross-organizational change as an opportunity to help get involved and lead the entire organization, rather than as an affront to their autonomy.

Leaders communicate passionately. Status reports, demos, and presentations aren’t distractions but vehicles to effect positive change and get feedback.

There is a leadership trap too. Leaders who fail to feel a strong sense of ownership are like a gust of wind or a crashing wave; they can cause excitement, but their impact, if any, is short-term. Leaders who push ownership elsewhere lose their effectiveness over time as their credibility wears away.

Strong owners commit. They feel accountable for success or failure. They feel accountable to dates and deadlines. They feel responsible for long-term manageability and maintenance even for spaces that they do not directly own. Without this sense of ownership, leadership is not lasting.

The best leaders are owners and the best owners are leaders.

Thursday, April 17, 2008

Intuitives vs Sensories in Software Engineering

Engineers often fall into one of two Myers-Briggs personality classifications -- ISTJ or INTJ. Engineers tend to be (I)ntroverted, (T)hinking, and (J)udgmental.

Introverted -- Most engineers are motivated by their solitary pursuits. This is manifested in the need for quiet, often dark, isolated work spaces. Many engineers are introverted enough to try to avoid conflict at all costs, and most engineers have to train themselves to be comfortable presenting to large groups.

Thinking -- Logic is more important than feeling. Engineers are renowned for following a rational approach; argument is based on logical assumptions. Engineers tend to be drawn to math, physics, and economics as alternatives to their core engineering discipline.

Judgmental -- Most engineers are decisive. They have a clear opinion and don’t sway between perspectives. They are comfortable making technical decisions even in the face of unknowns.

The place where engineers most differ is between i(N)tuitives and (S)ensories.

Sensory people trust what they can touch and feel. They are very grounded in the here-and-now and the realities of the current situation. They have a visceral understanding of the current issues and feel a need to improve them. They are happiest when they can feel the pulse of the system and get into the code. They are very detail-oriented. These individuals make great engineers who produce very high-quality code and can dig in and fix most problems with a hands-on understanding of the system. Pragmatism is much more important than purism. In extreme cases, they distrust purism as impossible and believe that prediction and modeling are wasted exercises. They distrust anyone’s statements that they can anticipate the problems. Sensories often prefer to stay close to the implementation.

Intuitives tend to be the opposite. Touching and feeling the system isn’t particularly important. They understand the system at a conceptual or abstract level. They can manipulate the system based on a mental understanding of the structures and the principles behind those structures. They favor purism as it makes it easier for them to manipulate the system mentally. They focus on anticipating how the system will behave as future requirements impact the design or implementation. They are constantly modeling the system mentally, and they may not feel the need to write that model down since they can easily manipulate it in their heads. In many cases details are a distraction. Imperfections introduced by “hacks” and other organic evolution of the system are annoyances rather than practical realities that need to be dealt with. Intuitives often gravitate toward higher-level architecture and design and move away from the code.

The challenge in an engineering environment is that, poorly managed, these two approaches come into direct conflict. Strong intuitives are easily frustrated by sensories who can’t or won’t consider a purist model. Strong sensories are unpersuaded by intuitive arguments and are frustrated by an intuitive’s unwillingness to deal with the hands-on realities of the system.

Interestingly, both camps can become champions of process. Process is concrete and can give sensories the comfort that they have some control over the hands-on phases and steps. Intuitives like the sense of organization and structure provided by process.

The flip side is that processes which introduce significant communication overhead chafe against the introverted nature of the typical engineer, and processes which subvert localized decision-making authority chafe against the judgmental nature of the typical engineer. As a result, process may make sense to both parties, but both groups will resist any process deemed to take away control.

Strong sensories can become very uncomfortable and frustrated with intuitives. Intuitives will often ignore the details that sensories find so important; as a result, intuitives often treat a sensory’s approach and input as a minor annoyance. On the other hand, purist ideals forecast into the future make sensories extremely uncomfortable: they aren’t grounded in the here and now, yet they affect what needs to be done here and now. Strong sensories can come to entirely distrust intuitives, especially given one or two examples where intuitives have gone off track.

It isn’t hard for an intuitive to go off track. Situations change, and future-casting is not future-proof. The business dynamics might shift, meaning the approach an intuitive postulated is no longer the right approach. Intuitives will tend to write this off with a “no regrets” philosophy: “we made the best prediction given the data available.” The data changed, therefore the prediction changed.

However, for a sensory this is confirmation that future-casting is impossible and that, had they focused on the here and now, they wouldn’t have wasted effort or be shifting course and undoing work unnecessarily.

Process becomes the tool that the sensory uses to make sure he or she doesn’t feel that pain again. On one tack, the sensory will push for all “intuitive” design or prediction to be put into concrete form -- diagrams, documents, decision trees, Gantt-chart project plans, or prototypes. For the intuitive this is reasonable, as it allows him or her to communicate better, but ultimately unnecessary, as all of this already exists in their head and the process of documenting it is not particularly revealing.

The other tack is to bake into the process the notion that prediction is not possible and that the only viable approach is hands-on evolution of the existing system. There are subgroups of the agile programming community today that argue prediction is impossible and not a worthwhile endeavor.

Agility and flexibility appeal to the intuitive as he or she is constantly future-casting and sees the need to adapt as new data becomes available. Waterfall also appeals to the intuitive, as it allows for some form of purism up-front before the messy, detail-oriented task of implementation begins. In the same way agile methods can be employed to limit the damage an intuitive can cause with future-casting, waterfall can be employed by intuitives to limit the churn sensories can cause by constantly pushing on the details. That said, in waterfall the intuitive does have to deal with some amount of “annoying detail” -- detail that can be ignored within the lighter-weight iterative constructs of a methodology like Scrum. Therefore, intuitives will also gravitate toward agile methods.

In the end, extremes fail. Strong engineers of the intuitive bent can indeed future-cast to a certain extent, and their passion for purism can give the system and architecture structure. However, future-casting too far, or investing too much before the hands-on details come to light or customer feedback is incorporated into the design, is precarious. Similarly, launching into hands-on development without attempting any anticipation or prediction, and without a sense of the framework or future requirements, can ultimately create a house-of-cards architecture that requires more and more investment to maintain.

Engineering organizations need to recognize the strengths and pitfalls of each personality type and balance them. Favoring one over the other is problematic. Details matter, and anticipatory designs matter. The tension between the two camps, properly managed, is constructive. Processes which encourage intuitives to write down some of their thinking and expose their assumptions are valuable, to a point. Processes which favor hands-on experience earlier and faster and place appropriate priority on concrete details are important too. Prototyping works very well.

Ultimately, the discussion and debate between the two is the most valuable product of this inherent personality conflict.

Thursday, February 21, 2008

Operational Availability Is Similar to Manufacturing Safety

The principles of manufacturing safety can be applied very directly to improving availability in a large-scale operational environment:

A) eliminate hazards
B) keep aisles and paths clear
C) minimize the diversity and complexity of tasks
D) make everyone walk, not run
E) encourage reporting of injury or poor safety behavior
F) give everyone a big red button
G) create a culture of safety and reinforce it constantly
H) audit frequently
I) create a written record and post-mortem for every event

Most manufacturing injuries and most large-scale systems failures are caused by human error. As a result, you want to address the limitations of the humans in the system: their attention span fails; they're clumsy at times; they naturally take short-cuts; they make assumptions based on past experience and apply them to new circumstances; they have a natural bias toward trying to fix a problem themselves rather than report it; and they tend to prioritize based on recent direction rather than against a global perspective.

These faults have led to a desire to eliminate humans from large-scale systems operations altogether. This is a fine goal, but we're still far from achieving it. As imperfect as they are, humans still have better adaptive diagnosis, problem solving, and dynamic decision-making abilities than we've been able to instantiate in code.

In large-scale systems operations, the environment is typically virtual rather than physical. Eliminating hazards is about fixing the known architectural points of weakness, having backups, eliminating single-points-of-failure, applying security patches, and keeping the datacenters at the right temperature.

Keeping the aisles clear means keeping cruft from piling up -- eliminating unnecessary bugs and errors, having clean intelligible logs, making configuration consistent and readable, prioritizing understandability over cleverness in code, having clean cabling, intelligent machine names, clear boundaries between production and test, and high signal-to-noise alerts.

Minimizing the diversity of tasks is about reducing variance in the system, reducing the number of tasks that operators have to be proficient at, and reducing the number of steps in those tasks. We've all seen 2-page step-by-step release processes and we've all seen them break down somewhere along the way. We've all seen operators assume that one environment is identical to another and apply the same command or configuration change with disastrous consequence.

Making everyone walk, not run is about controlling the natural instinct of an operator to speed up when they're behind in a task or when the system is crashing around them. It is also about keeping a level head, making consistent measured progress, and instilling processes and practices that require operators not to skip steps or take short-cuts.

How many times has someone noticed a problem or introduced a problem during a task and said "oh, I'll come back and fix that later" only to forget to fix it and have it cause a failure later? In the same way safety risks and incidents need to be reported so that appropriate steps can be taken and nothing is forgotten, operational risks need to be reported. This can be as simple as requiring all observed problems to be logged as a ticket. The lighter-weight this process, the more likely a report will be filed. Anonymous reports should be allowed too.

Giving everyone a big red button empowers the operator to stop something from happening before the situation gets worse. An operator should be empowered to stop any software from launching or to decide at any point to stop an activity that appears to be too risky.
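Here is a minimal sketch of one way to wire up such a button in runbook automation -- my illustration, not a specific tool, and the flag-file path and step names are hypothetical. Any operator can create the flag file, and the automation refuses to proceed past the next step boundary:

```python
# A "big red button" sketch: automation checks a shared abort flag
# between steps, and any operator may set it.
# The flag path and steps below are hypothetical.
import os
import sys

ABORT_FLAG = "/var/run/deploy.abort"  # operator runs: touch /var/run/deploy.abort

def run_step(name, action):
    """Run one step of the procedure, unless the red button was pressed."""
    if os.path.exists(ABORT_FLAG):
        print(f"ABORTED before step {name!r}: red button pressed", file=sys.stderr)
        sys.exit(1)
    print(f"running step {name!r}")
    action()

run_step("drain traffic", lambda: None)    # placeholder actions
run_step("push new build", lambda: None)
run_step("restore traffic", lambda: None)
```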

All the best practices and processes will atrophy if the priority is perceived to shift away from safety -- or, in the operational world, away from system availability. Creating a culture is about management constantly reinforcing that importance. Posters help and work. Creating a culture requires developing a sense of pride in the team that they're good at it -- the best. Creating the culture requires transparency into the issues and visibility into the progress against them. It also requires creating a shared sense of failure when any one individual fails.

Auditing frequently is a good way to reinforce the discipline and keep looking for ways to improve, even if there hasn't been an incident in months or years. That heartbeat, that constant emphasis and diligence, helps set the standard and will invariably uncover something new.

The written record is a key piece. It requires management to review the event in detail and really understand not just the proximal cause of the incident, but all the contributing causes.

Wednesday, January 23, 2008

Joel on Availability: Broken

Joel Spolsky posted an article today on system availability, entitled "Five Whys". In it he postulates some widely held but, in my opinion, misguided notions about the viability of building highly available systems.

To paraphrase, Joel argued that there are classes of failures that can't be anticipated, and that instead of trying to anticipate them or measure your progress in system hardening through SLA metrics, you should focus on your diagnosis, root-cause, and resolution processes.

I agree those processes are necessary, but they are not sufficient. Building highly available software is possible. The processes, tools, and infrastructure necessary to guarantee availability do have a cost. Here is my analysis of where Joel's argument goes off track:

a) Joel describes an example of a poorly configured router causing a serious outage. He proceeds to argue that creating an SLA around the router isn't the right thing to do, based on the opinion that many failure events are "unknown unknowns" or Black Swans. However, the poorly configured router is clearly an example of a "known known" and not a Black Swan. Someone busted the configuration. There are a number of easy ways this could have been prevented: peer review of configuration, automated configuration installation, configuration validation tools, and checklists are all viable, relatively low-cost solutions that would have eliminated this problem.
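To make that concrete, here is a minimal sketch of what an automated pre-deployment configuration check might look like -- my illustration, not anything Joel describes, and the key=value format and required keys are hypothetical. The point is that a gate this simple turns a busted configuration from a surprise outage into a blocked deployment:

```python
#!/usr/bin/env python3
"""Minimal pre-deployment configuration check (a sketch).

The file format and rules are hypothetical; a real deployment would
encode its own invariants (valid gateways, MTU ranges, and so on).
"""
import sys

REQUIRED_KEYS = {"hostname", "gateway", "netmask"}  # hypothetical invariants

def validate(path):
    """Return a list of problems found in one key=value config file."""
    errors, config = [], {}
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            if "=" not in line:
                errors.append(f"{path}:{lineno}: not a key=value pair: {line!r}")
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    for key in REQUIRED_KEYS - config.keys():
        errors.append(f"{path}: missing required key {key!r}")
    return errors

if __name__ == "__main__":
    problems = [e for p in sys.argv[1:] for e in validate(p)]
    for problem in problems:
        print(problem, file=sys.stderr)
    sys.exit(1 if problems else 0)  # non-zero exit blocks the push
```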

b) Joel further argues that trying to have "6 nines" availability is a ridiculous goal, as it allows roughly 30 seconds of downtime per year. He has a point: most any event will last more than seconds, so "6 nines" measured over a year is effectively 100%. If you're shooting for 6 nines, you need to measure over a longer period than a year. That said, it doesn't invalidate the whole notion of a minute-based availability metric. There is a material difference between 99%, 99.9%, 99.99%, and 99.999% availability -- an order of magnitude at each step. 99% is 5,256 minutes of downtime a year -- 87.6 hours, or 3.65 days! 99.9% is about 526 minutes, or roughly 8.8 hours, of downtime per year. There is sufficient granularity in those numbers to be statistically significant over a year. I don't believe you can argue that black-swan events should cause 3.65 days of downtime per year, and ideally you'd be well under 8.8 hours/year too, making four nines a very reasonable target.
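The arithmetic is easy to make concrete. A quick sketch of the downtime budget at each availability level, measured over one year:

```python
# Downtime budget per availability level, measured over one year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines, availability in [(2, 0.99), (3, 0.999), (4, 0.9999),
                            (5, 0.99999), (6, 0.999999)]:
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.6f}): "
          f"{downtime_min:8.2f} min/year = {downtime_min / 60:7.3f} hours/year")

# 2 nines (0.990000):  5256.00 min/year =  87.600 hours/year
# 3 nines (0.999000):   525.60 min/year =   8.760 hours/year
# 4 nines (0.999900):    52.56 min/year =   0.876 hours/year
# 5 nines (0.999990):     5.26 min/year =   0.088 hours/year
# 6 nines (0.999999):     0.53 min/year =   0.009 hours/year (about 32 seconds)
```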

c) The entire notion that Joel describes of an "unexpected unexpected" in a deterministic computer system seems far-fetched to me. Large-scale systems only appear to be Brownian in nature when they're poorly built and poorly understood. Computer systems are deterministic in nature. What part of a deterministic system's behavior should be "unexpected"? I understand that many people may not take the time to understand the system, or the system may have evolved to a state of complexity beyond human comprehension, but that needn't be the case.

d) SLAs matter because you need to set a standard for yourself and strive to meet it. Failing to set a standard is a recipe for complacency.

e) Joel argues that continuous improvement after an event occurs is the best way to increase availability. Continuous improvement is good, but depending on the number of lurking issues it can take a very long time. Continuous improvement will drive up system availability only if the rate of fixing root causes is higher than the rate of introducing new problems. If system entropy is increasing, your continuous-improvement process will continue to fall behind.

f) Joel also argues that some events occur over a long enough time period that they can't be predicted: "They're the kind of things that happen so rarely it doesn't even make sense to use normal statistical methods like 'mean time between failure.' What's the 'mean time between catastrophic floods in New Orleans?'" Over a long enough time scale you absolutely can estimate the mean time between catastrophic floods in New Orleans. It is a mean, not an absolute prediction; means account for variance. You can't say precisely that "in 2012 there will be another class-5 hurricane," but you can state the probability and use that to define an MTBF. In fact, that analysis had been done, the risk had been assessed, and the US chose not to make the necessary investments -- the structural flaws in the levees had been identified in 1986. I for one am glad that the government is not throwing up its hands and saying "major floods can't be predicted or mitigated" but is instead investing further in flood and disaster models and deciding how to respond.
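To illustrate with a toy calculation (the probability below is invented for the example, not actual flood data): if an event occurs in any given year with probability p, independently of other years, its mean time between occurrences is 1/p years, and the chance of seeing at least one event in an n-year horizon follows directly:

```python
# Toy MTBF calculation for a rare event. The probability is invented
# for illustration; it is not real hurricane or flood data.
p_per_year = 0.02             # hypothetical: 2% chance in any given year
mtbf_years = 1 / p_per_year   # mean time between occurrences: 50 years

n = 30                        # planning horizon in years
p_within_horizon = 1 - (1 - p_per_year) ** n
print(f"MTBF: {mtbf_years:.0f} years; "
      f"P(at least one event in {n} years): {p_within_horizon:.1%}")
# MTBF: 50 years; P(at least one event in 30 years): 45.5%
```

A 50-year MTBF doesn't tell you which year the event arrives, but it is exactly the kind of number you need to decide whether a levee upgrade is worth its cost.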

I do agree with some of his points.

1) Availability has a cost. You need to decide as an organization what level of availability you and your customers want to achieve, estimate the costs, and assess at what point you want to make that investment. You then need to invest in the processes, tools, technologies, infrastructure, and high-quality vendors that allow you to meet that target.

2) I agree that 99.999% or 99.9999% are effectively equivalent to 100% when measured in minutes over a year. A year's worth of minutes isn't statistically significant at those levels. His point about AT&T is valid measured over a year, but if AT&T's last significant outage was in 1991, then they've clearly proven that high availability is possible.

3) You must identify the root cause. Patching the symptoms does not increase stability.

4) Continuous improvement processes are good. You must have them. But don't run them only after a failure: audits are valuable tools, and simulated failure scenarios are valuable too.

We must not accept that higher-quality software and systems at lower cost are impossible. We must continue to innovate so that we can deliver higher quality at lower cost, or someone else will. In every case, the answer is to raise the standards and push the creative engineering talent we have to meet them. SLAs are important and valuable yardsticks.

Wednesday, January 9, 2008

Creating an Environment for Debate

Debate, particularly technical debate, is healthy. Debate helps flesh out the issues. It helps build shared context. It illuminates hidden requirements or hidden biases.

Debate can be passionate, even heated. Debate is not comfortable for everyone. Debate can favor those most skilled in the techniques of debate. Debate can leave people feeling dissatisfied or disenfranchised. Debate can ruin a team.

The question is how do you create an environment that fosters productive debate while avoiding the negative aspects?

Debate is a process. It has steps and guidelines, triggers and rules. Here are the process guidelines that I've used to help promote healthy debate:

a) Recognize debate is a process. It has a beginning, middle, and end. Debate will not continue indefinitely without resolution. The goal is for the participants to all reach a joint conclusion. That may not be possible, but the ultimate decision should be intellectual not emotional, and all participants should understand each perspective.

b) Recognize the value and necessity of debate as a constructive force. Give it the time necessary.

c) Empower constructive dissension. Elevate those who are more naturally quiet and seek their input. Encourage new and different perspectives.

d) Do NOT allow intimidation. Intimidation discourages debate and is a technique commonly used to shut down discussion.

e) Require transparency with relevant participants. Discussions and decisions should be documented. Individuals should not be allowed to say one thing to one person and a different thing to another person.

f) Encourage multi-modal communication. Certain individuals speak better in person with a whiteboard. Others aren't able to formulate ideas effectively without time and express themselves better in written format. Do not preclude one or the other.

g) Eliminate emotion. Emotion doesn't have a place. Whether the debate is technical, philosophical, or religious, the debate should be intellectual not emotional and never physical.

h) Facilitate face-to-face discussion. Create a level playing ground. Encourage discussion. Do not allow either party to shut down or attempt to intimidate. The goal is to move the conversation from combative to constructive debate.

i) Watch for signals of an impasse. Typically impasses can be resolved with better requirements, or better understanding of different perspectives, or just further debate. However, sometimes the debate boils down to a core difference in beliefs that can't be resolved without following both paths to see the outcome, which often just isn't possible. Once it is boiled down to a core belief without emotion and a full understanding of both sides, often the parties can "agree to disagree". At that point, a decision is necessary and it isn't that one person is more right or more wrong.

j) Validate that everyone feels they were heard. If an impasse is reached, but individuals feel their point wasn't listened to or they weren't able to effectively communicate it, then the discussion is not yet complete. Everyone must feel heard to be able to support either decision.

k) Clearly identify the decider. If there is an impasse, then at the end of the debate there will be a decision. Participants need to know who that decider is and how he or she intends to decide. Those expectations need to be communicated.

In my experience, the number one trap that people fall into is that they allow intimidation. Intimidation is corrosive to constructive debate and a constructive corporate culture.

The number two trap is that there is no clear decider and no clear path to resolution.

The third trap is poor facilitation. If the moderator can't facilitate a productive conversation, then the pressure for resolution builds, but the resolution comes without everyone feeling heard.