Principles of Large Scale Systems

Friday, April 18, 2008

The Ownership Trap

In healthy corporate cultures, individuals feel a sense of ownership for their actions, their technologies, their products, and their company. Engineering teams have a sense of “ownership” for their services, their customer experience, their costs, their dates, and their commitments.

Focus on ownership to the exclusion of leadership is a trap. Ownership can exist without leadership. Ownership without leadership leads to bunker mentalities -- boundary infighting, failure to consider the customer, and a lack of vision and forward momentum.

A heavy focus on ownership can unintentionally lead to an under-emphasis on leadership. In particular, there is a difference between owning something tangible -- a technology, a service, a host, a roadmap -- and owning a space. A team might perceive themselves as owning “Oracle deployments” instead of seeing themselves as owning “Relational data usage at the company.” They might see themselves as responsible for owning a metric like “number of pager events” rather than owning “end-user service availability”. This slight definitional difference requires the individual to consider customer communication in the event of a service outage as part of what they own.

Owners often feel accountable for their tactical deliverables and commitments. Owners must clearly articulate boundaries in support of ownership clarity and effective separation of concerns. Owners must focus on the here-and-now realities.

Leaders naturally think about their service in the broader context. They are looking for opportunities by which their technology or business service continues to grow and increase in scope and value. They are focused on best-of-breed.

Leaders aren’t worried so much about in-scope or out-of-scope and are focused on doing-the-right-thing. They use ownership to make sure accountability, commitment, and clarity exist, but don’t use ownership as a way to segment themselves from the rest of the organization.

Leaders think less about roles & responsibilities and more about making sure they do the right thing for the customer. Clarity of roles & responsibilities is necessary for project execution efficiency, but it is a means to an end, not an end in and of itself.

Leaders naturally seek to expand their service footprint. They actively consult with customers, vendors, and peer teams trying to find the best solution. They are proactive rather than reactive. They view transparency as a key to success and black-box encapsulation as a limitation to thinking broadly.

Leaders seek out opportunity for impact. They view a new customer as an opportunity rather than a burden. They view tools for driving cross-organizational change as an opportunity to help get involved and lead the entire organization, rather than as an affront to their autonomy.

Leaders communicate passionately. Status reports, demos, and presentations aren’t distractions but are vehicles to effect positive change and get feedback.

There is a leadership trap too. Leaders who fail to feel a strong sense of ownership are like a gust of wind or a crashing wave; they can cause excitement, but their impact, if any, is short-term. Leaders who push ownership elsewhere lose their effectiveness over time as their credibility wears away.

Strong owners commit. They feel accountable for success or failure. They feel accountable to dates and deadlines. They feel responsible for long-term manageability and maintenance even for spaces that they do not directly own. Without this sense of ownership, leadership is not lasting.

The best leaders are owners and the best owners are leaders.

Thursday, April 17, 2008

Intuitives vs Sensories in Software Engineering

Engineers often fall into two Myers-Briggs personality classifications -- ISTJ or INTJ. Engineers tend to be (I)ntroverted, (T)hinking, and (J)udgmental.

Introverted -- Most engineers are motivated by their solitary pursuits. This is manifested in the need for quiet, often dark, isolated work spaces. Many engineers are introverted enough to try to avoid conflict at all costs, and most engineers have to train themselves to be comfortable presenting to large groups.

Thinking -- Logic is more important than feeling. Engineers are renowned for following a rational approach. Argument is based upon logical assumptions. Engineers tend to be drawn to math, physics, and economics as alternatives to their core engineering discipline.

Judgmental -- Most engineers are decisive. They have a clear opinion and don’t sway between perspectives. They are comfortable making technical decisions even in the face of unknowns.

The place where engineers most differ is between i(N)tuitives and (S)ensories.

Sensory people trust what they can touch and feel. They are very grounded in the here-and-now and the realities of the current situation. They have a visceral understanding of the current issues and feel a need to improve them. They are happiest when they can feel the pulse of the system and get into the code. They are very detail-oriented. These individuals make great engineers who produce very high-quality code and can dig in and fix most problems with a hands-on understanding of the system. Pragmatism is much more important than purism. In extreme cases, they distrust purism as impossible and believe that prediction and modeling are wasted exercises. They distrust anyone’s statements that they can anticipate the problems. Sensories often prefer to stay close to the implementation.

Intuitives tend to be the opposite. Touching and feeling the system isn’t particularly important. They understand the system at a conceptual or abstract level. They can manipulate the system based on a mental understanding of the structures and principles behind those structures. They favor purism as it makes it easier for them to manipulate the system mentally. They focus entirely on anticipating how the system will behave as future requirements impact the design or implementation. They are constantly modeling the system mentally. They may not feel the need to write down that model as they can easily manipulate it in their heads. In many cases details are a distraction. Certainly imperfections introduced by “hacks” and other organic evolution of the system are seen as annoyances rather than practical realities that need to be dealt with. Intuitives often gravitate toward higher-level architecture and design and move away from the code.

The challenge in an engineering environment is that, poorly managed, these two approaches are in direct conflict. Strong intuitives are easily frustrated by sensories who can’t or won’t consider a purist model. Strong sensories are unpersuaded by an intuitive argument and are frustrated by an intuitive’s unwillingness to deal with the hands-on realities of the system.

Interestingly, both camps can become champions of process. Process is concrete and can give sensories the comfort that they have some control over the hands-on phases and steps. Intuitives like the sense of organization and structure provided by process.

The flip side is that processes which introduce significant communication overhead chafe against the introverted nature of the typical engineer. Processes which subvert localized decision-making authority chafe against the judgmental nature of the typical engineer. As a result, processes may make sense to both parties, but both groups will resist any process deemed to take away control.

Strong sensories can become very uncomfortable and frustrated with intuitives. Intuitives will often ignore the details that sensories find so important. As a result, intuitives often consider a sensory’s approach and input as a minor annoyance. On the other hand, purist ideals forecast into the future make sensories extremely uncomfortable. They aren’t grounded in the here and now, but affect what needs to be done here and now. Strong sensories can come to entirely distrust intuitives, especially given one or two examples where intuitives have gone off-track.

It isn’t hard for an intuitive to go off track. Situations change. Future-casting is not future-proof. The business dynamics might change, meaning the approach an intuitive postulated is no longer the right approach. Intuitives will tend to write this off with a “no regrets” philosophy that says “we made the best prediction given the data available”. The data changed, so naturally the prediction changed.

However, for a sensory this is confirmation that future-casting is impossible and had they focused on the here and now, they wouldn’t have wasted effort or otherwise be shifting course or undoing work unnecessarily.

Process becomes the tool that the sensory uses to make sure he or she doesn’t feel that pain again in the future. In one tack, the sensory will push for all “intuitive” design or prediction to be put into concrete form – diagrams, documents, decision-trees, Gantt chart project plans, or prototypes. For the intuitive this is reasonable as it allows him or her to communicate better, but ultimately unnecessary as all this exists in their heads already and the process of documenting it is not particularly revealing.

The other tack is to bake into the process the notion that prediction is not possible and the only viable approach is hands-on evolution of the existing system. There are subgroups of the agile programming community today that argue prediction is impossible and not a worthwhile endeavor.

Agility and flexibility appeal to the intuitive as he or she is constantly future-casting and sees the need to adapt as new data becomes available. Waterfall is also appealing to the intuitive as it allows for some form of purism up-front before the messy detail-oriented task of implementation begins. In the same way agile methods can be employed to try to limit the damage an intuitive can cause with their future-casting, waterfall can be employed by intuitives to try to limit the churning sensories can cause by constantly pushing on the details. That said, in waterfall the intuitive does indeed have to deal with some amount of “annoying detail”. That annoying detail can be ignored with the lighter-weight iterative constructs of a methodology like scrum. Therefore, intuitives will also gravitate toward agile methods.

In the end, extremes fail. Strong engineers of the intuitive bent can indeed future-cast to a certain extent and their passion for purism can give the system and architecture structure. However, future-casting too far or investing too much before the hands-on details come to light or customer feedback is incorporated into the design is precarious. Similarly, launching into hands-on development without attempting any anticipation or prediction, and without a sense of the framework or future requirements, can ultimately create a house-of-cards architecture that requires more and more investment to maintain.

Engineering organizations need to recognize the strengths and pitfalls of each personality type and balance them. Favoring one over the other is problematic. Details matter, and anticipatory designs matter. The tension between the two camps, properly managed, is constructive. Processes which encourage intuitives to write down some of their thinking and expose their assumptions are valuable to a point. Processes which favor hands-on experience earlier and faster and place appropriate priority on the concrete details are important too. Prototyping works very well.

Ultimately, the discussion and debate between the two is the most valuable product of this inherent personality conflict.

Thursday, February 21, 2008

Operational Availability Is Similar to Manufacturing Safety

The principles of manufacturing safety can be applied very directly to improving availability in a large-scale operational environment:

A) eliminate hazards
B) keep aisles and paths clear
C) minimize the diversity and complexity of tasks
D) make everyone walk, not run
E) encourage reporting of injury or poor safety behavior
F) give everyone a big red button
G) create a culture of safety and reinforce it constantly
H) audit frequently
I) create a written record and post-mortem for every event

Most manufacturing injuries and most large-scale systems failures are caused by human error. As a result, you want to address the limitations of the humans in the system: their attention span fails; they're clumsy at times; they naturally take short-cuts; they make assumptions based on past experience and apply them to new circumstances; they have a natural bias toward trying to fix a problem themselves rather than report it; and they tend to prioritize based on recent direction rather than against global perspectives.

These faults have led to a desire to eliminate all humans from large-scale systems operations and availability. This is a fine goal, but we're still far from achieving it. As imperfect as they are, humans still have better adaptive diagnosis, problem solving, and dynamic decision-making abilities than we've been able to instantiate in code.

In large-scale systems operations, the environment is typically virtual rather than physical. Eliminating hazards is about fixing the known architectural points of weakness, having backups, eliminating single-points-of-failure, applying security patches, and keeping the datacenters at the right temperature.

Keeping the aisles clear means keeping cruft from piling up -- eliminating unnecessary bugs and errors, having clean intelligible logs, making configuration consistent and readable, prioritizing understandability over cleverness in code, having clean cabling, intelligent machine names, clear boundaries between production and test, and high signal-to-noise alerts.

Minimizing the diversity of tasks is about reducing variance in the system, reducing the number of tasks that operators have to be proficient at, and reducing the number of steps in those tasks. We've all seen 2-page step-by-step release processes and we've all seen them break down somewhere along the way. We've all seen operators assume that one environment is identical to another and apply the same command or configuration change with disastrous consequence.

Making everyone walk, not run is about controlling the natural instinct of an operator to try to speed up when they're behind in a task or when the system is crashing around them. It is also about keeping a level head, making consistent, measured progress, and instilling processes and practices that require the operators to not skip steps or take short-cuts.

How many times has someone noticed a problem or introduced a problem during a task and said "oh, I'll come back and fix that later" only to forget to fix it and have it cause a failure later? In the same way safety risks and incidents need to be reported so that appropriate steps can be taken and nothing is forgotten, operational risks need to be reported. This can be as simple as requiring all observed problems to be logged as a ticket. The lighter-weight this process, the more likely a report will be filed. Anonymous reports should be allowed too.

Giving everyone a big red button empowers the operator to stop something from happening before the situation gets worse. An operator should be empowered to stop any software from launching or to decide at any point to stop an activity that appears to be too risky.
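As a rough illustration of the big red button in software terms, here is a minimal Python sketch. It assumes a release is a list of named steps and that any operator can set a shared stop flag; the names `BigRedButton` and `run_release` and the example steps are hypothetical, not taken from any real tool.

```python
import threading


class BigRedButton:
    """A shared stop flag that any operator is allowed to press."""

    def __init__(self):
        self._stopped = threading.Event()
        self.pressed_by = None
        self.reason = None

    def press(self, operator, reason):
        # Record who stopped the activity and why, for the written record.
        self.pressed_by = operator
        self.reason = reason
        self._stopped.set()

    def is_pressed(self):
        return self._stopped.is_set()


def run_release(steps, button):
    """Run steps in order, checking the button before starting each one.

    Returns the names of the steps that actually ran.
    """
    completed = []
    for name, action in steps:
        if button.is_pressed():
            break  # stop before the situation gets worse
        action()
        completed.append(name)
    return completed


button = BigRedButton()
steps = [
    ("drain traffic", lambda: None),
    # Hypothetical: an operator sees trouble during this step and presses the button.
    ("deploy build", lambda: button.press("on-call operator", "error rate spiking")),
    ("restore traffic", lambda: None),
]
done = run_release(steps, button)
print(done, button.pressed_by, button.reason)
```

The check happens between steps, so a step already in flight finishes before the release halts. Whether that is the right granularity depends on how reversible each step is.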

All the best practices and processes will atrophy if the priority is perceived to shift away from safety, or in the operational world -- away from system availability. Creating a culture is about management constantly reinforcing the importance. Posters help and work. Creating a culture requires developing a sense of pride in the team that they're good at it -- the best. Creating the culture requires transparency into the issues and visibility into the progress against those. It also requires creating a shared sense of failure when any one individual fails.

Auditing frequently is a good process to reinforce the discipline and keep looking for ways to improve even if there hasn't been an incident in months or years. That heartbeat and that constant emphasis and diligence help set the standard and will almost always uncover something new.

The written record is a key piece. It requires management to review in detail the event and really understand not just the proximal cause of the incident, but all the related causes.

Wednesday, January 23, 2008

Joel on Availability: Broken

Joel Spolsky posted an article today on system availability entitled "Five Whys". In it he put forward some widely held but, in my opinion, misguided notions about the viability of building highly available systems.

To paraphrase, Joel argued that there are classes of failures that can't be anticipated and that instead of trying to anticipate them or measure your progress in system hardening through SLA metrics, you should focus on your diagnosis, root cause, and resolution process.

I agree those processes are necessary, but they are not sufficient. Building highly available software is possible. The processes, tools, and infrastructure necessary to guarantee availability do have a cost. Here is my analysis of where Joel's argument goes off track:

a) Joel describes an example of a poorly configured router causing a serious outage. He proceeds to argue that creating an SLA around the router isn't the right thing to do based on the opinion that many failure events are "unknown unknowns" or Black Swans. However, the poorly configured router is clearly an example of a "known known" and not a Black Swan. Someone busted the configuration. There are a number of easy ways this could have been prevented. Peer review of configuration, automated configuration installation, configuration validation tools, and checklists are all viable, relatively low-cost solutions that would have eliminated this problem.

b) Joel further argues that trying to have "6 nines" availability is a ridiculous goal as it allows only about 30 seconds of downtime per year. He has a point that most any event will be more than seconds long and so "6 nines" measured in a year is effectively 100%. If you're shooting for 6 nines, then you need to measure over a longer time-period than a year. That said, it doesn't invalidate the whole notion of a minute-based availability metric. There is a material difference between 99%, 99.9%, 99.99% and 99.999% availability. Each step is an order of magnitude. 99% is 5,256 minutes of downtime a year -- about 87.6 hours or 3.65 days! 99.9% is about 526 minutes or 8.8 hours of downtime per year. There is sufficient granularity in those numbers to be statistically significant over a year. I don't believe you can argue that black-swan events should cause 3.65 days of downtime per year, and ideally you'd be well under 8.8 hours/year too, making four nines a very reasonable target.

c) The entire notion that Joel describes of an "unexpected unexpected" in a deterministic computer system seems far-fetched to me. Large-scale systems only appear to be Brownian in nature when they're poorly built and poorly understood. Computer systems are deterministic in nature. What part of a deterministic system's behavior should be "unexpected"? I understand that many people may not take the time to understand the system, or the system may have evolved to a state of complexity beyond human comprehension, but that needn't be the case.

d) SLAs matter because you need to set a standard for yourself and strive to meet it. Failing to set a standard is a recipe for complacency.

e) Joel argues that continuous improvement after an event occurs is the best solution to increasing availability. Continuous improvement is good, but depending on the number of lurking issues it can take a very long time. Continuous improvement will drive up system availability only if the rate of fixing root causes is higher than the rate of introducing new problems. If system entropy is increasing, then your continuous improvement process will continue to fall behind.

f) Joel also argues that some events occur over a long enough time period that they can't be predicted. "They're the kind of things that happen so rarely it doesn't even make sense to use normal statistical methods like 'mean time between failure.' What's the 'mean time between catastrophic floods in New Orleans?'" Over a long enough time scale you absolutely can predict the "mean time between catastrophic floods in New Orleans." It is a mean, not an absolute prediction. Means account for variance. You can't say precisely that "in 2012 there will be another Category 5 hurricane", but you can state the probability and use that to define an MTBF. In fact, that analysis had been done and the risk had been assessed and the US chose not to make the investments necessary. The structural flaws in the levees had been identified in 1986. I for one am glad that the government is not throwing up its hands and saying "major floods can't be predicted or mitigated" but is instead investing further in flooding and disaster models and attempting to decide how to respond.
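The arithmetic behind points b) and f) is easy to check. The Python sketch below computes the yearly downtime budget for each level of "nines", and the chance of at least one rare event in a given window from an MTBF. It assumes a 365-day year, and for the MTBF part it assumes events arrive as a Poisson process; that is a common simplification and my assumption, not something Joel's article or this post establishes. The function names and the 100-year MTBF are hypothetical.

```python
import math
from decimal import Decimal

MINUTES_PER_YEAR = Decimal(365 * 24 * 60)  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Downtime budget in minutes/year for an availability given as a string, e.g. "99.9"."""
    unavailable = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return MINUTES_PER_YEAR * unavailable


def probability_of_at_least_one_event(window_years, mtbf_years):
    """Chance of one or more events in the window, assuming Poisson arrivals."""
    return 1 - math.exp(-window_years / mtbf_years)


for level in ["99", "99.9", "99.99", "99.999", "99.9999"]:
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:.1f} minutes/year ({minutes * 60:.0f} seconds)")

# Hypothetical numbers: a catastrophic event with a 100-year MTBF, over a 30-year horizon.
print(f"{probability_of_at_least_one_event(30, 100):.0%} chance within 30 years")
```

Decimal is used so that inputs like "99.9" are handled exactly rather than picking up binary floating-point error. The MTBF does not say when the next event will happen, but it does give a probability that an investment decision can be weighed against.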

I do agree with some of his points.

1) Availability has a cost. You need to decide as an organization what level of availability you and your customers want to achieve, estimate the costs, and assess at what point you want to make that investment. You then need to invest in the processes, tools, technologies, infrastructure, and high-quality vendors that allow you to meet that goal.

2) I agree that 99.999% or 99.9999% are effectively equivalent to 100% when measured in minutes over a year. A year's worth of minutes isn't statistically significant at those levels. His point about AT&T is valid measured over a year, but if AT&T's last significant outage was in 1991 then they've clearly proven that high availability is possible.

3) You must identify the root cause. Patching the symptoms does not increase stability.

4) Continuous improvement processes are good. You must have them. But don't just do them after a failure. Audits and simulated failure scenarios are valuable tools too.

We must not accept that higher-quality software and systems at lower cost are impossible. We must continue to innovate so that we can deliver higher quality at lower cost or someone else will. In every case, the answer is to raise the standards and push the creative engineering talent we have to meet those standards. SLAs are important and valuable yard-sticks.

Wednesday, January 9, 2008

Creating an Environment for Debate

Debate, particularly technical debate, is healthy. Debate helps flesh out the issues. It helps build shared context. It illuminates hidden requirements or hidden biases.

Debate can be passionate, even heated. Debate is not comfortable for everyone. Debate can favor those most skilled in the techniques of debate. Debate can leave people feeling dissatisfied or disenfranchised. Debate can ruin a team.

The question is how do you create an environment that fosters productive debate while avoiding the negative aspects?

Debate is a process. It has steps and guidelines, triggers and rules. Here are the process guidelines that I've used to help promote healthy debate:

a) Recognize debate is a process. It has a beginning, middle, and end. Debate will not continue indefinitely without resolution. The goal is for the participants to all reach a joint conclusion. That may not be possible, but the ultimate decision should be intellectual not emotional, and all participants should understand each perspective.

b) Recognize the value and necessity of debate as a constructive force. Give it the time necessary.

c) Empower constructive dissension. Elevate those who are more naturally quiet and seek their input. Encourage new and different perspectives.

d) Do NOT allow intimidation. Intimidation discourages debate and is a technique commonly used to shut down discussion.

e) Require transparency with relevant participants. Discussions and decisions should be documented. Individuals should not be allowed to say one thing to one person and a different thing to another person.

f) Encourage multi-modal communication. Certain individuals speak better in person with a whiteboard. Others aren't able to formulate ideas effectively without time and express themselves better in written format. Do not preclude one or the other.

g) Eliminate emotion. Emotion doesn't have a place. Whether the debate is technical, philosophical, or religious, the debate should be intellectual not emotional and never physical.

h) Facilitate face-to-face discussion. Create a level playing ground. Encourage discussion. Do not allow either party to shut down or attempt to intimidate. The goal is to move the conversation from combative to constructive debate.

i) Watch for signals of an impasse. Typically impasses can be resolved with better requirements, or better understanding of different perspectives, or just further debate. However, sometimes the debate boils down to a core difference in beliefs that can't be resolved without following both paths to see the outcome, which often just isn't possible. Once it is boiled down to a core belief without emotion and a full understanding of both sides, often the parties can "agree to disagree". At that point, a decision is necessary and it isn't that one person is more right or more wrong.

j) Validate that everyone feels they were heard. If an impasse is reached, but individuals feel their point wasn't listened to or they weren't able to effectively communicate it, then the discussion is not yet complete. Everyone must feel heard to be able to support either decision.

k) Clearly identify the decider. If there is an impasse, then at the end of the debate there will be a decision. Participants need to know who that decider is and how he or she intends to decide. Those expectations need to be communicated.

In my experience, the number one trap that people fall into is that they allow intimidation. Intimidation is corrosive to constructive debate and a constructive corporate culture.

The number two trap is that there is no clear decider and no clear path to resolution.

The third trap is poor facilitation. If the moderator can't facilitate a productive conversation, then the pressure for resolution builds, but the resolution comes without everyone feeling heard.

Saturday, December 22, 2007

Creating Value Vs. Making Money

I distinctly separate the notions of creating value and making money. Generally you need to provide some value to the customer if you’re going to make money, but you can build substantial value without making any money.

Let me provide a few examples of endeavors that created incredible value but made relatively little money: the Interstate Highway System, the Internet, Netscape Communications Corporation, NASA, and Xerox PARC in the ’70s. There are plenty of initiatives that make gobs of money, but don’t create any lasting value: bottled water, handbag manufacturing, municipal services, dry cleaners, and Google ads.

As engineers, we are primarily engaged in the creation of value. Making money is often a secondary but necessary evil.

The question then becomes – how do you measure the value you’re creating independently of the money you’re making? One way is to simply ask yourself how much would the engineered assets be worth sold as a company? You can also quantify the value of the Intellectual Property as measured in patents, trademarks, and copyrights. You can also look at the net sum of money that everyone using that valuable engineered good is making.

Part of the point of this is that the engineering innovation isn’t always directly monetized. Google could have developed its search engine technology and sold it as packaged software, or attempted to charge people for every click. Instead, Google stumbled on text-based advertising as it looked to monetize the value it created through traffic. The core search algorithm helped acquire customers and page views that led to monetization through advertising.

Value creation and money making do not necessarily go hand-in-hand. To have a successful business you need to manage both. Think of value as potential energy and money-making as kinetic energy. If you aren’t increasing the potential energy of your business, your growth options are limited. If you aren’t turning potential energy into kinetic energy then you’re failing to convert your value into cash flow. Those companies that create value but don’t make money are generally bought for multiples far less than the full potential. Those companies that generate money but no additional value tend to stop growing. The best companies create an engine that generates more value and more money hand-in-hand.

Reuse by Copy vs Reuse by Reference

Copying code is a much maligned strategy for reuse. Conventional wisdom seems to hold that any reusable bits should be abstracted with a clean interface and the consumer of that code should use those interfaces directly and not change the underlying structure. Complexity must be abstracted.

Everyone has some experience where they forked a code-base and ended up with painful merge conflicts and spent a huge amount of energy converging.

However, copying code has significant advantages as a reuse strategy. Convergence is rarely as technically painful as it is politically painful. Often too many interfaces are exposed by reference leading to unnecessary complexity. Too often those interfaces are exposed without proper infrastructure to guarantee backwards compatibility, promote adoption, or track usage.

The simplest argument in favor of code-copying is that it allows you to use whatever code, tools, systems that you want to build your application. You’re not constrained in any way. So long as the application you’re building works, it doesn’t matter how clean it is.

I don’t care if my Tivo is written in Cobol or Perl so long as it works. I don’t care if they copied and hacked their Linux kernel. I don’t care if the source-code is littered with fragments of a dozen previous projects. As the end user I just care that the Tivo works.

HTML has been widely successful in large part because of the ease of code copying. Everyone has their own code.

So why reuse any software by reference? Ultimately it can be boiled down to a cost consideration. Every line of code that you are personally responsible for is a line of code that you need to support and maintain. If you copy code you’re responsible for any issues (security, Y2K, Daylight-Savings-Time). You’re responsible for having some expertise in the code you copied.

The reason to use another piece of code by reference is to reduce your support cost and the expertise you need to have on your own team. Before you do that though, you need to be confident that your vendor has support infrastructure in place, is willing to guarantee backwards compatibility, and will treat your requirements with the appropriate priority.

The worst case scenario is to reference a piece of software for which the vendor isn’t willing to guarantee backwards compatibility. In that instance, all advantage of by-reference just went out the window as you’re still stuck paying a support cost. Worse, every consumer of that software now has to pay that cost.

Done well, software-by-reference can create significant economies of scale. You benefit from updates that other customers requested long before you needed that capability. Bugs can be identified and fixed once and the fix applied 100 or 1,000 times, instead of needing to be fixed independently 1,000 times.

Striking the right balance is difficult. The goal is to share any code where centralized support costs are less than distributed support costs. However, in order for centralized support costs to be less, the actual bits that the centralized team supports need to be generic. Any “customized” code needs to be maintained and supported by the consumer, not by the centralized support team.

Unfortunately, these cost analyses often miss key details.

The following costs must be factored in when considering vending or consuming a by-reference service:
* the cost of maintaining backwards compatibility
* the cost of customer engagement for requirements
* the cost of deprecating functionality, for both the centralized team and the decentralized team
* the cost of training and documentation
* the cost of engaging clients in the upgrade process

The following costs must be factored in when considering vending or consuming a by-copy service:
* The cost of acquiring and continuously maintaining knowledge of that code
* The cost of identifying and fixing bugs in the copied code
* The cost of fixing externally-driven problems (security, Y2K, daylight-savings-time)
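
As a rough illustration, the two cost lists above can be totaled and compared as the number of consuming teams grows. This is a minimal Python sketch, not a real cost model: the class names, the split between one-time and per-consumer costs, and every dollar figure are made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class ByReferenceCosts:
    """Hypothetical annual costs of one centrally supported component."""
    backwards_compatibility: float  # paid once by the central team
    training_and_docs: float        # paid once by the central team
    customer_engagement: float      # per consumer: gathering requirements
    deprecation: float              # per consumer: retiring old functionality
    upgrade_engagement: float       # per consumer: walking clients through upgrades

    def total(self, consumers: int) -> float:
        fixed = self.backwards_compatibility + self.training_and_docs
        per_consumer = (self.customer_engagement + self.deprecation
                        + self.upgrade_engagement)
        return fixed + per_consumer * consumers


@dataclass
class ByCopyCosts:
    """Hypothetical annual costs each team pays for its own copy."""
    code_knowledge: float     # acquiring and maintaining expertise
    bug_fixing: float         # finding and fixing bugs in the copy
    external_problems: float  # security, Y2K, daylight-saving-time fixes

    def total(self, consumers: int) -> float:
        per_consumer = (self.code_knowledge + self.bug_fixing
                        + self.external_problems)
        return per_consumer * consumers


def cheaper_model(by_ref: ByReferenceCosts, by_copy: ByCopyCosts,
                  consumers: int) -> str:
    """Name the model with the lower total cost for this many consumers."""
    if by_ref.total(consumers) < by_copy.total(consumers):
        return "by-reference"
    return "by-copy"


# Illustrative numbers only.
by_ref = ByReferenceCosts(200_000, 50_000, 10_000, 5_000, 15_000)
by_copy = ByCopyCosts(30_000, 20_000, 10_000)

for n in (1, 5, 10, 50):
    print(n, by_ref.total(n), by_copy.total(n), cheaper_model(by_ref, by_copy, n))
```

With these invented figures the centralized team only pays off once there are roughly nine or more consumers, which is the economy-of-scale argument in miniature. Change the per-consumer engagement costs and the break-even point moves quickly.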

Engineers and managers looking to solve a problem with a piece of code often make the false assumption that controlling their destiny through by-copy semantics is better than taking a by-reference dependency on someone else. This implies that the customer has no confidence in the vendor. My advice is that if by-reference is the best solution based on the cost-analysis, then find a vendor you can trust. By properly leveraging internal or external vendors, a small team can add substantial value on top of by-reference solutions.

Vendors often believe they can provide a better solution centrally through by-reference. Sometimes that is the case, but sometimes the customer's requirements are sufficiently "on the fringe" relative to other customers that they truly won't get the support they need.

Both models work and are appropriate in different situations. The challenge is to find the balance and eliminate institutional biases while developing a proper cost model.

Tuesday, December 18, 2007

Data-Driven vs Metric-Driven

There is a fundamental difference between having a data-driven and a metric-driven company. Data-driven companies and individuals tend to be skeptical of any new metric. Metrics mask relevant data, which can lead to imprecision.

The Dow Jones Industrial Average is not a perfect indicator of economic health. It doesn’t speak to the balance of debt vs. free capital. It doesn’t speak to the value of the US dollar abroad. It doesn’t say anything about employment. It doesn’t even cover small-cap or mid-cap performance. That said, it is a reasonable metric for reporting on the overall health of the stock market.

Metric-driven companies focus on driving improvements in key metrics. Data-driven companies focus on completely understanding their business fundamentals. The question isn’t what is our per-development-hour professional-services cost and can we reduce it, but what drives our per-development-hour professional-services cost? Given those drivers, which can be directly influenced? What is the theoretical minimum cost? How far are we from theoretical minimum? What are other companies achieving? How much does it cost to reduce per-development-hour costs by $1/hr? At what point does it cost more than $1 to make $1?
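
The last two questions are plain arithmetic once the drivers are known. Here is a minimal Python sketch, assuming a list of hypothetical cost-reduction initiatives and an assumed number of development hours over the payback window; none of these names or figures come from real data.

```python
# Hypothetical initiatives: (name, one-time investment in $, savings in $/hour).
initiatives = [
    ("automate builds", 40_000, 3.00),
    ("better tooling", 25_000, 1.50),
    ("process overhaul", 150_000, 1.00),
]

DEVELOPMENT_HOURS = 20_000  # assumed hours worked over the payback window


def return_per_dollar(investment: float, savings_per_hour: float,
                      hours: int = DEVELOPMENT_HOURS) -> float:
    """Dollars saved for each dollar invested over the payback window."""
    return savings_per_hour * hours / investment


for name, investment, savings in initiatives:
    ratio = return_per_dollar(investment, savings)
    verdict = "worth doing" if ratio > 1 else "costs more than $1 to make $1"
    print(f"{name}: ${ratio:.2f} back per $1 invested ({verdict})")
```

A metric-driven view would rank the three by hourly savings alone. Dividing by the investment is what shows the third one loses money under these assumptions.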

Data-driven management is very time- and labor-intensive. Managing by metrics is a fine way to drive change without understanding all the data. That said, as you dig into the data, it often becomes clear that the obvious approach to driving down the metric is not the most fruitful.

Metrics-driven management is better than gut-instinct driven, but the best decisions require a solid analysis of the data.

Failure Costs

Some of the worst decisions I’ve ever made or seen made occurred when the perceived failure costs were high. Humans will inherently overestimate their probability of success if the cost of failure is high, even if subsequently failing has even higher costs.

If the cost of failure is higher than the acceptable threshold and there exists some human-driven solution that will mitigate that failure then the human will choose that solution independent of the probability of success or the perceived cost of additional failure.

Let’s imagine a situation has failure cost C. Let’s imagine that X is the threshold at which the decider deems failure unacceptable. Let’s imagine P to be the probability of success when performing action A to avoid failure cost C. Let’s imagine P’ to be the perceived probability of success when performing action A to avoid failure cost C. Let’s imagine Z is the cost of failing on action A.

So if C > X then the human will choose action A independent of P or Z. Humans typically do so by assigning P’ a higher probability of success than P and ignoring Z.
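
The gap between the biased choice and an expected-cost choice can be sketched in a few lines of Python. This is only an illustration under added assumptions: Z is treated as the total cost incurred if action A fails, a successful A is treated as costing nothing, and all the numbers are invented.

```python
def expected_cost_of_acting(p_success: float, z: float) -> float:
    """Expected cost of attempting action A (assumes success costs nothing
    and failure costs Z in total; both are assumptions of this sketch)."""
    return (1.0 - p_success) * z


def rational_choice(c: float, p: float, z: float) -> str:
    """Attempt A only if its expected cost beats accepting failure cost C."""
    return "attempt A" if expected_cost_of_acting(p, z) < c else "accept C"


def biased_choice(c: float, x: float, p: float, z: float) -> str:
    """The pattern described above: once C exceeds the threshold X,
    the decider attempts A regardless of P or Z."""
    if c > x:
        return "attempt A"
    return rational_choice(c, p, z)


def break_even_probability(c: float, z: float) -> float:
    """Success probability at which attempting A costs the same as accepting C."""
    return 1.0 - c / z


# Hypothetical numbers: a $1M failure, a $0.5M pain threshold, and a rescue
# plan with a real 10% chance of success that costs $5M if it fails.
C, X, P, Z = 1_000_000, 500_000, 0.10, 5_000_000

print(rational_choice(C, P, Z))
print(biased_choice(C, X, P, Z))
print(f"P' must exceed {break_even_probability(C, Z):.0%} to justify A")
```

In this example the decider would have to believe in roughly an 80% chance of success to justify the rescue plan, against a real chance of 10%. That inflation of P’ is the overestimate described above.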

Build vs Buy

I use the following decision criteria when evaluating build vs buy: does implementing this particular piece of technology create shareholder value?

In order for a piece of technology to create shareholder value it must meet the following preconditions:

a) It must be directly in support of the company’s core mission
b) It must differentiate the company in some way from its competitors
c) It must ship and launch in time to recognize that value
d) Its long-term incremental support costs must be less than the value it creates
e) All of these statements should remain true over a 5yr time window

A simple test is “if company X acquired us in 5yrs, would they consider this technology an asset or a support burden?”

The question is not “Could we do it better?” Given enough resources and bandwidth most any engineering team could deliver something as good or better. But investing your resources in something that doesn’t create shareholder value erodes shareholder value.

The question is not “Could we do it cheaper?” Evaluating this properly is tricky, though possible. If you fundamentally believe you can do it cheaper then it does indeed create a differentiation that builds shareholder value.

Another common answer is that there is no one who can meet our unique needs. The first question to ask there is “why are our needs unique?” The more you can use standardized techniques to do standard business practices the lower your costs will be. There are times where it is worth reinventing an inefficient business from bottom-up. It takes cojones to say everyone else who pioneered this path is wrong. Institutionalized inefficiencies do exist though. The question then is – are you changing an industry by adopting techniques from another industry? If so, then the opportunity to buy instead of build the appropriate pieces increases.

If there truly isn’t a good solution in the marketplace then it may make sense to invest. You aren’t limited to choosing to invest your own people talent. If there is an identified need in the market, there is quite probably someone trying to build a solution – some individual or startup or new product initiative at an existing company. Help them succeed. Invest in them. Champion their product. Become a reference customer.

The next answer is often “Vendors don’t give us the support we need.” Well, that can be true. Managing vendors properly is a skill. Most companies have to prioritize. If you’re not high on your vendor’s priority list, then you’re not going to get the attention. There are ways to combat this though. One is to squeeze, but not screw over your vendor in contract negotiations so they still feel incentive to make you happy. You can be one of the higher-margin customers if not a leading revenue customer, especially if you require very little support and hand-holding. Another technique is to maintain a very close working relationship with your vendor: regular contact and calls, demonstrating technical competency with the product and a sincere interest in their success.

Vendors can also take their products in different directions that no longer meet your needs. This puts your business at risk. There are two primary ways to combat this: the first is to join standards bodies working on standardizing the interfaces that you rely upon with that vendor. The second and related strategy is to define your own abstraction so that you aren’t dependent upon that vendor’s unique API capabilities. These are both attempts to drive your vendor into a position as a commodity player in the market. This is good for your business; you need to manage the risk that technology dependencies create.

The strategy I outline here encourages you to build-as-little-as-you-can-possibly-get-away-with. Everything you build should add substantial shareholder value. Monetizing that value is a different part of the equation. As engineers, often the best and primary way we contribute to the company is by building lasting shareholder value. We do that best by pushing the technology envelope for the business.

Accurate Estimation

Accurate estimation should be a required college course for everyone. Estimates are a daily part of life at any corporation. At the highest levels, managing a company is an exercise in anticipating the best avenues for growth and applying the appropriate resources to that problem. Doing that properly requires estimates. The more precise the estimates are, the better the decisions will be.

So how do you properly estimate? The first thing to do is recognize estimates will always have an error margin. The challenge is not just to get a first-order estimate, but to understand what drives the error.

In my experience, there are three primary ways to derive a first-order estimate. The first is to do a bottoms-up analysis of all available data and make predictions on each piece. The second is to build a model based on key variables which you can measure and get prior data. The third is to use gut instinct.

Very often the first two avenues prove either so much effort or so imprecise that gut instinct is applied instead.

Another form of “gut instinct” that has proven to have some value is in prediction markets where the average of a large sample size of individual votes is used to estimate. This has the advantage of potentially illuminating bimodal distributions that suggest there is a key binary driver in the prediction model.

In my experience, the most effective estimates are derived from combining all three. One person builds a model or runs data through an existing prediction model. Another person builds a bottoms-up analysis of the prediction. A third person has their own gut instinct and does a simple validation of that with other experienced individuals.

The rationalization process is the key to success in this approach. If one of the three methods is out-of-whack with the others (or worse, none are even close) the next step is to rationalize the estimates. If the model is predicting lower, why did the bottoms-up predict higher, or why did gut instinct predict higher? If you can resolve the discrepancies you often illuminate the key drivers that are unique to the particular value you’re predicting. That learning can then be built back into the model and built into the bottoms-up-analysis steps in the future.
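
One simple way to spot an out-of-whack estimate is to compare each method against the median of the three. The Python sketch below assumes a 25% tolerance, which is an arbitrary choice, and uses invented estimates; the function name is hypothetical.

```python
from statistics import median


def out_of_whack(estimates: dict, tolerance: float = 0.25) -> dict:
    """Return the estimates that differ from the median by more than
    `tolerance` (a fraction of the median), with their relative deviation."""
    mid = median(estimates.values())
    return {
        method: (value - mid) / mid
        for method, value in estimates.items()
        if abs(value - mid) / mid > tolerance
    }


# Hypothetical estimates, in engineer-weeks, from the three methods.
estimates = {"model": 40, "bottoms-up": 65, "gut instinct": 45}

for method, deviation in out_of_whack(estimates).items():
    print(f"{method} is {deviation:+.0%} from the median; find out why")
```

The code only flags the discrepancy. The valuable part is still the conversation about why the bottoms-up analysis saw work the model did not.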

Now the challenge is to model the potential errors. Potential errors are stated as risks and assumptions. All predictions make certain assumptions and all have certain risks. If you can quantify the potential impact of a missed assumption or a risk and assign a likelihood value, you can build a confidence distribution with a Monte Carlo simulation. This is especially helpful if the models, analyses, or gut instinct are biased toward sunny-day scenarios.
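
A minimal Monte Carlo sketch in Python shows the idea. It assumes each risk is independent and adds a fixed number of days when it fires. The base estimate, the risks, their likelihoods, and the function names are all hypothetical.

```python
import random


def simulate_estimate(base_days, risks, trials=10_000, seed=42):
    """Monte Carlo over independent risks.

    risks: list of (probability, extra_days) pairs.
    Returns the sorted list of simulated totals, one per trial.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    totals = []
    for _ in range(trials):
        total = base_days
        for probability, extra_days in risks:
            if rng.random() < probability:
                total += extra_days
        totals.append(total)
    totals.sort()
    return totals


def percentile(sorted_values, pct):
    """Simple percentile lookup in an already sorted list."""
    index = min(len(sorted_values) - 1, int(pct / 100 * len(sorted_values)))
    return sorted_values[index]


# Hypothetical project: a 60-day sunny-day estimate plus three risks.
risks = [
    (0.30, 15),  # vendor deliverable slips
    (0.20, 25),  # assumption about data volume turns out wrong
    (0.10, 40),  # major redesign needed
]

totals = simulate_estimate(60, risks)
for pct in (50, 80, 95):
    print(f"{pct}% confidence: {percentile(totals, pct)} days or fewer")
```

Quoting the 80% or 95% figure instead of the 60-day base is one way to counter a sunny-day bias. Real risks are rarely independent, so treat the output as a rough confidence range.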

The final, and often overlooked, step in the estimates process is to track your predictions vs your actual results. Measuring a large number of predictions and results allows you to model what your known error rate is and ascertain whether your processes are really improving.
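
Tracking can start as simply as recording each estimate next to its actual result and looking at the ratio. The following Python sketch uses an invented history and hypothetical names.

```python
from statistics import mean, stdev

# Hypothetical history of (estimated_days, actual_days) for past projects.
history = [(10, 14), (30, 33), (5, 9), (20, 19), (45, 60), (15, 21)]


def error_ratios(history):
    """Actual divided by estimate; 1.0 means the estimate was exact."""
    return [actual / estimate for estimate, actual in history]


ratios = error_ratios(history)
bias = mean(ratios)     # above 1.0 means estimates run optimistic
spread = stdev(ratios)  # how consistent the error is from project to project


def adjusted(estimate, bias=bias):
    """Scale a new estimate by the historical overrun factor."""
    return estimate * bias


print(f"average overrun factor: {bias:.2f}")
print(f"spread of overrun factor: {spread:.2f}")
print(f"a new 20-day estimate, adjusted: {adjusted(20):.1f} days")
```

If the average factor moves toward 1.0 and the spread shrinks over successive quarters, the estimation process is improving. If not, the process changes are not working yet.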

If you can improve your estimates, you will execute more cleanly, deliver more consistently, operate more efficiently, and invest more wisely.

Engineering Manager’s Bill of Rights

As an engineering manager you are a builder. You are first-and-foremost measured on the success of your team’s designs, implementation quality, and predictability of delivery in meeting the business objectives.

In order to do your job effectively and optimally, you and your management must defend a certain set of inalienable rights. To wit:

1. The right to your own commitments, estimates, and delivery dates
2. The right to the committed resources and budget
3. The right to define and design the solution to the requirements
4. The right to employ your own internal processes as necessary to deliver on the commitments
5. The right and responsibility to state your minimum quality bar for delivery
6. The right to reorganize
7. The right to transition staff from one task or role to another as appropriate to deliver the project
8. The right and responsibility to state your assumptions and risks
9. The right to restate commitments if priorities, requirements, resources or budget change due to external factors outside of your scope of control
10. The right to push the “stop” button in any launch

Without these rights, you have lost some of the flexibility you may need in order to deliver. You are no longer fully empowered. If you can't deploy your people as you see fit, but have to negotiate every responsibility change, your effectiveness is reduced. All of these are part of "committing". If you "commit" but aren't given the tools to deliver on those commitments, the commitment means little.

That said, there are certain rights that managers sometimes push for that are inappropriate. Most importantly, you do not have the right to privacy, isolation, or complete independence of design. Management or the customer may require review and you are required to respond.

You do not have the authority to ignore requirements. You do not have the authority to ignore changes in requirements or scope or budget levels. You do not have complete autonomy in your hiring or firing process. You do not have the authority to say a product should launch now, only that it is not ready yet.

In some cases, I've seen managers work diligently to defend their own rights while working equally hard to eliminate the rights of peer managers. "They can't take person X off this project, because we need A from that person." The push-back is that A is the commitment. So long as the peer manager is still committing to A, then it is their prerogative whether to use person X or Y.

That said, in most companies everything is a little less formal. Waving a bill of rights around isn't a very persuasive way to get anything done. But the bill of rights gives you something to reflect upon from time to time. If commitments aren't being treated as sacrosanct, assumptions and risks are being ignored, timelines are being artificially shortened, or friction is increasing when you want to make staff transitions then it is time to evaluate whether you need to start taking a firmer position.

Friday, December 7, 2007

Stages of Operational Maturity

I’ve had the opportunity, directly or indirectly, to observe the evolution of a handful of companies from technology-startup to major service provider. In my experience operational maturity evolves in 12 phases as companies mature.

1 – You have a handful of machines hacked together in some corner. One or two people, typically the engineers, support the whole thing. You don’t have vendor contracts in place; you probably bought these from Fry’s or CompUSA. Your production environment is probably shared with your corporate environment. Your network connection fails occasionally. Dev, Test, and Production builds are three different directories on disk. You hand-created all the configuration. There are probably quite literally dust and cobwebs on your hosts.

2 – You probably have 5-10 hosts now. You decided to start making manual backups when you think of it. You created a single host for the IT-related stuff that the company needs (email, intranet). You created a simple script to copy dev to test to production instances. You started to use the term scalability even though you’re not quite sure what it means. You may have started to look for a datacenter. You copy the config files and yell at anyone who touches them. You started writing a couple scripts to page your cell phone when something breaks. You’re less willing to let people touch things.

3 – You got a datacenter space. You’re probably still at 20-30 hosts as managing any more seems like a nightmare. People are constantly frustrated that the hosts don’t match up. Moving dev to test to production is a painful task that often breaks. You have a variety of hardware. You started looking at new hardware types that will serve you better. Your phone has become a pager or a Blackberry. You’re up a lot dealing with the system.

4 – You have enough hosts that you consider yourself “live” and running a “real system”. You’ve built some tools to automatically check whether hosts are in sync. You have a standard OS image that you use to build all hosts. You’ve begun to put some simple standard monitors in place. There is now an official operations function and perhaps even a separate IT function. Releases go out, typically in a big batch with lots of changes. You have a real network provider. You’re starting to consider whether you need multiple datacenters.

5 – You’ve invested in improved tool automation. You may have a network-bootable OS image. Terms like N+1 redundancy, disaster recovery, fault isolation, and security begin to enter the lexicon. Emails go back and forth about standardizing monitoring. You want more QA than you have. System administrators are highly skilled, specialized individuals with top-to-bottom knowledge of the system and are basking in the glow of a well running system that they can jump in and fix quickly. You’re pretty happy with where you’re at; looks like you could run this way for a while. You start to think about better cage design and start reading up on the latest deployment tools and configuration systems. You’re starting to re-evaluate your OS decision and when you’re going to upgrade. You’ve begun automating process restarts and other standard processes to deal with memory leaks and other transient problems. Life is good.

6 – The development team is scaling up as is the network. You’re now very definitely multi-datacenter as management knows that an earthquake would kill your business, plus investors and customers expect it. Things are starting to break. The monolithic deployment processes are creating a bottleneck for the organization. Releases are breaking. The number of steps it takes to perform an upgrade is killing your sysadmins or release engineers. Release management is becoming centralized and there is a push for tighter release management controls and QA. Problems are becoming more difficult to diagnose quickly. You have your first major network failure which hoses everything. You realize network configuration is a mess and you need a stronger network engineering team. Your datacenter provider who said they had plenty of space now tells you that either they no longer have space, or they’re considering closing the facility. Life sucks. You’re trying to control and contain the growth and change to get you back to stage 5.

7 – You give up on controlling and containing the growth. You realize the deployment management is a nightmare and make a few aborted attempts to improve it. Eventually you manage to strike on a solution that enables rapid roll-back, stores all host provisioning data and software packages centrally, and does network-booting and OS image installation easily. You work to transition ownership of development and staging environments to the development and QA teams. You attempt to decentralize ownership of releases and arrive at some truce between decentralization progress and centralized control. You tighten down monitoring.

8 – You’re managing the growth effectively. You’ve now seen enough issues that system administration is an exciting, creative pursuit that produces heroes on a consistent basis. System administrators aren’t touching configuration on any machine anymore. You can order the hardware you need. You’ve begun to automate some of the standard network administration operations. You have good subject-matter-experts in many of the critical domains. Monitors are well-fleshed-out. You’re starting to talk about your availability in “nines” – 2 nines, 3 nines. You aspire to 5 nines and tell everyone you’re going there. You have program management that is now supporting operations. You begin to talk about driving availability as a program. You have a standard failure-response model in place. You think you may be able to get out of reactive mode, but still find it is hard to put together documents or otherwise say where you’re going and what your requirements are.

9 – You’re now driving the business to meet your baseline SLA needs. If that’s 3 nines then you’re battening down processes, calling war team reviews, putting basic projections in place and trying to hold owners accountable. If that’s 4 or 5 nines, then you’re locking down change management processes, limiting the number of changes, growing QA, putting priority-based load-shedding in place, rate-capping transactions, and spending more on networking equipment designed to give you higher availability.

10 – Everyone realizes you now have a ‘real’ network. They begin to audit you. You deal with security audits, monitoring audits, asset management audits, ownership audits, escalation audits, and financial audits. Management is no longer worried that operations will fail and sink the company because you’ve met their comfort-zone availability number. You may be at 3 nines and management still says they want 5 nines, but in reality they’re comfortable with the current availability. Your response plans are working. Management begins to worry about new metrics -- new features, cost of operations, speed of roll-out and deployment. Efficiency or innovation-cycle becomes the new buzzword. Finance begins to interrogate your purchasing decisions. Engineering becomes frustrated that you can’t support their hardware requirements as quickly as you used to. You audit whether you have too much redundancy. You build a 3rd datacenter so you can go from 50% utilization of resources to 66% and maintain N+1 redundancy. Your program management is absorbed focusing on costs, optimizing delivery, and basic build-out work. You’re still not able to document where you’re going and what your requirements for development teams should be. Thank goodness availability is no longer a worry so you can focus on everything else.

11 – You drive up efficiency and begin to eliminate redundancy. You take away some of the bottlenecks to deployments that you conclude were giving a false-sense-of-security even though they were catching some things. You push hard on new projects. You're beginning to make a dent in speed, features, and cost. You try to go to just-in-time hardware purchases. You pass-the-pain to the engineering teams to model and drive down cost. You begin to walk that line between availability and efficiency. You are mostly fixing issues that came out of audits rather than the issues you know you still have. Buffers are gone, fixes aren’t happening, training programs aren’t sufficient to get new folks fully up-to-speed. Then comes the big crash. Risk increased; there was no way to deliver on all the new priorities without taking some risk. Now the CEO is frustrated about availability again. He or she was enjoying not thinking about it anymore. The CFO is frustrated that in his or her calculations you should have reduced costs faster. Operations morale is in the tank as no one is the hero anymore. Operations Management underestimated how much they could get done and is frustrated. Engineering is frustrated as obvious stuff isn’t happening correctly. Driving engineering teams is hard now as you have limited political capital. Life sucks again. You’re being driven and can barely find the time to drive. Key Operations Managers and Executives leave.

12 – You start to get back in the driver seat. Often new management comes in. They’re given a reprieve and grace period from some of the goals and pressures that were on the previous management. Availability is first-and-foremost again for a time. You still have to deal with the CFO, but priorities are clearer. You begin to develop more robust process discipline. Holes in the previous toolset are fixed as new leadership applies their patterns to the system. Routing control is improved; separation-of-concerns is improved to facilitate engineering; different services are allowed to have different SLAs and support models and processes adapt to support that; you finally write down the operational readiness requirements; tools are improved; the networking issues and complexities in switches, load-balancers, etc get some much needed attention from a new set of eyes. You begin to put component-level SLAs and cost models in place and audit for those. You generalize processes across the board and begin to apply those more uniformly. Your disaster planning is institutionalized. Audits are a breeze. You begin to audit your vendors and apply your processes to them. You’re driving again.

What happens next? I’ll tell you when I figure it out.
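
Two bits of arithmetic from the stages above are easy to make concrete: how much downtime a given number of nines allows, and why a third datacenter lifts N+1 utilization from 50% to roughly 66%. This is a minimal Python sketch, assuming a 365-day year and equally sized datacenters; the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime for an availability of N nines (3 nines = 99.9%)."""
    return MINUTES_PER_YEAR * 10 ** -nines


def max_utilization_n_plus_1(datacenters: int) -> float:
    """Highest average utilization that still lets the remaining sites
    absorb the full load when one equally sized datacenter is lost."""
    if datacenters < 2:
        raise ValueError("N+1 redundancy needs at least two datacenters")
    return (datacenters - 1) / datacenters


for nines in range(2, 6):
    print(f"{nines} nines: {downtime_minutes_per_year(nines):,.1f} minutes/year")

for sites in (2, 3, 4):
    print(f"{sites} datacenters: {max_utilization_n_plus_1(sites):.1%} usable")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the processes described for 4 or 5 nines are so much heavier than those for 3. Each extra datacenter frees a smaller slice of capacity than the one before it.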