Saturday, December 22, 2007

Creating Value Vs. Making Money

I distinctly separate the notions of creating value and making money. Generally you need to provide some value to the customer if you’re going to make money, but you can build substantial value without making any money.

Let me provide a few examples of endeavors that created incredible value but made relatively little money: the Interstate Highway System, the Internet, Netscape Communications Corporation, NASA, and Xerox PARC in the '70s. There are plenty of initiatives that make gobs of money but don't create any lasting value: bottled water, handbag manufacturing, municipal services, dry cleaners, and Google ads.

As engineers, we are primarily engaged in the creation of value. Making money is often secondary, a necessary evil.

The question then becomes: how do you measure the value you're creating independently of the money you're making? One way is to ask how much the engineered assets would be worth if sold as a company. Another is to quantify the value of the intellectual property as measured in patents, trademarks, and copyrights. A third is to look at the net sum of money that everyone using that engineered good is making.

Part of the point here is that engineering innovation isn't always directly monetized. Google could have developed its search engine technology and sold it as packaged software, or attempted to charge people for every click. Instead, Google stumbled on text-based advertising as it looked to monetize the value it had created through traffic. The core search algorithm helped acquire customers and page views that led to monetization through advertising.

Value creation and money making do not necessarily go hand-in-hand. To have a successful business you need to manage both. Think of value as potential energy and money-making as kinetic energy. If you aren't increasing the potential energy of your business, your growth options are limited. If you aren't turning potential energy into kinetic energy, then you're failing to convert your value into cash flow. Companies that create value but don't make money are generally bought for multiples far below their full potential. Companies that generate money but no additional value tend to stop growing. The best companies create an engine that generates more value and more money hand-in-hand.

Reuse by Copy vs Reuse by Reference

Copying code is a much-maligned strategy for reuse. Conventional wisdom seems to hold that any reusable bits should be abstracted behind a clean interface, and that consumers of that code should use those interfaces directly and not change the underlying structure. Complexity must be abstracted.

Everyone has had the experience of forking a code base, ending up with painful merge conflicts, and spending a huge amount of energy converging.

However, copying code has significant advantages as a reuse strategy. Convergence is rarely as technically painful as it is politically painful. Often too many interfaces are exposed by reference, leading to unnecessary complexity. Too often those interfaces are exposed without proper infrastructure to guarantee backwards compatibility, promote adoption, or track usage.

The simplest argument in favor of code-copying is that it allows you to use whatever code, tools, systems that you want to build your application. You’re not constrained in any way. So long as the application you’re building works, it doesn’t matter how clean it is.

I don't care if my TiVo is written in COBOL or Perl so long as it works. I don't care if they copied and hacked their Linux kernel. I don't care if the source code is littered with fragments of a dozen previous projects. As the end user I just care that the TiVo works.

HTML has been wildly successful in large part because of the ease of code copying. Everyone can view source, copy it, and end up with their own code.

So why reuse any software by reference? Ultimately it boils down to a cost consideration. Every line of code whose correct operation you are personally responsible for is a line of code that you need to support and maintain. If you copy code, you're responsible for any issues that arise in it (security, Y2K, daylight-savings-time). You're responsible for having some expertise in the code you copied.

The reason to use another piece of code by reference is to reduce your support cost and the expertise you need to have on your own team. Before you do that though, you need to be confident that your vendor has support infrastructure in place, is willing to guarantee backwards compatibility, and will treat your requirements with the appropriate priority.

The worst-case scenario is to reference a piece of software for which the vendor isn't willing to guarantee backwards compatibility. In that instance, all the advantage of by-reference goes out the window, as you're still stuck paying a support cost. Worse, every consumer of that software now has to pay that cost.

Done well, software-by-reference can create significant economies of scale. You benefit from updates that other customers requested far before you needed that capability. Bugs can be identified and fixed once, with the fix applied 100 or 1,000 times over, instead of being fixed independently 1,000 times.

Striking the right balance is difficult. The goal is to share any code where centralized support costs are less than distributed support costs. However, in order for centralized support costs to be less, the actual bits that the centralized team supports need to be generic. Any “customized” code needs to be maintained and supported by the consumer, not by the centralized support team.

Unfortunately, these cost analyses often miss key details; a rough sketch of the trade-off follows the two lists below.

The following costs must be factored in when considering vending or consuming a by-reference service:
* The cost of maintaining backwards compatibility
* The cost of customer engagement for requirements
* The cost of deprecating functionality, for both the centralized team and the decentralized teams
* The cost of training and documentation
* The cost of engaging clients in the upgrade process

The following costs must be factored in when considering vending or consuming a by-copy service:
* The cost of acquiring and continuously maintaining knowledge of that code
* The cost of identifying and fixing bugs in the copied code
* The cost of fixing externally-driven problems (security, Y2K, daylight-savings-time)
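To make the balance concrete, here is a rough sketch in Python. The cost categories mirror the two lists above, but every function name and number is hypothetical; this only illustrates the shape of the comparison, not a real cost model.

```python
# Hypothetical sketch of the by-reference vs. by-copy balance. The cost
# categories mirror the lists above; every number is invented for illustration.

def by_reference_cost(backwards_compat, requirements_engagement,
                      deprecation, training_docs, upgrade_engagement):
    """Annual cost carried largely by the centralized team (arbitrary units)."""
    return (backwards_compat + requirements_engagement + deprecation
            + training_docs + upgrade_engagement)

def by_copy_cost(expertise, bug_fixing, external_fixes, num_consumers):
    """Annual cost paid independently by every team that copies the code."""
    return (expertise + bug_fixing + external_fixes) * num_consumers

if __name__ == "__main__":
    # With few consumers, copying is often cheaper; as the number of consumers
    # grows, the economies of scale of by-reference tend to win.
    for consumers in (2, 10, 50):
        ref = by_reference_cost(30, 10, 5, 10, 15)
        copy = by_copy_cost(8, 5, 3, consumers)
        winner = "by-reference" if ref < copy else "by-copy"
        print(f"{consumers:>3} consumers: by-reference={ref}, by-copy={copy} -> {winner}")
```

The crossover point is the whole argument: below some number of consumers the centralized infrastructure isn't worth building, and above it copying quietly multiplies your support burden.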

Engineers and managers looking to solve a problem with a piece of code often make the false assumption that controlling your destiny through by-copy semantics is better than taking a by-reference dependency on someone else. This implies that the customer has no confidence in the vendor. My advice is that if by-reference is the best solution based on the cost-analysis, then find a vendor you can trust. By properly leveraging internal or external vendors a small team can add substantial value on top of by-reference solutions.

Vendors often believe they can provide a better solution centrally through by-reference. Sometimes that is the case, but sometimes the customer's requirements are sufficiently "on the fringe" relative to other customers that they truly won't get the support they need.

Both models work and are appropriate in different situations. The challenge is to find the balance and eliminate institutional biases while developing a proper cost model.

Tuesday, December 18, 2007

Data-Driven vs Metric-Driven

There is a fundamental difference between a data-driven and a metric-driven company. Data-driven companies and individuals tend to be skeptical of any new metric, because metrics mask relevant data, and that masking can lead to imprecision.

The Dow Jones Industrial Average is not a perfect indicator of economic health. It doesn’t speak to the balance of debt vs. free capital. It doesn’t speak to the value of the US dollar abroad. It doesn’t say anything about employment. It doesn’t even cover small-cap or mid-cap performance. That said, it is a reasonable metric for reporting on the overall health of the stock market.

Metric-driven companies focus on driving improvements in key metrics. Data-driven companies focus on completely understanding their business fundamentals. The question isn't "What is our per-development-hour professional-services cost, and can we reduce it?" but "What drives our per-development-hour professional-services cost?" Given those drivers, which can be directly influenced? What is the theoretical minimum cost? How far are we from that minimum? What are other companies achieving? How much does it cost to reduce per-development-hour costs by $1/hr? At what point does it cost more than $1 to make $1?
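As a hypothetical illustration of that line of questioning, a data-driven analysis might decompose the rate into drivers and then ask the break-even question explicitly. Every driver name and dollar figure below is invented.

```python
# Hypothetical decomposition of a per-development-hour professional-services
# cost into its drivers. All names and numbers are invented for illustration.
drivers = {
    "salary_and_benefits": 55.0,          # $/hr, floor largely set by the labor market
    "tooling_and_licenses": 8.0,          # $/hr
    "rework_from_defects": 12.0,          # $/hr, directly influenceable
    "idle_time_between_projects": 10.0,   # $/hr, directly influenceable
}

current = sum(drivers.values())
theoretical_minimum = drivers["salary_and_benefits"] + drivers["tooling_and_licenses"]
print(f"current: ${current:.0f}/hr, theoretical minimum: ${theoretical_minimum:.0f}/hr")

# "At what point does it cost more than $1 to make $1?" -- compare the cost of
# the next $1/hr reduction against the savings it would generate.
annual_dev_hours = 100_000        # hypothetical volume
savings_per_dollar_cut = annual_dev_hours * 1.0
cost_of_next_cut = 150_000        # hypothetical program cost to remove the next $1/hr
print("worth it" if cost_of_next_cut < savings_per_dollar_cut else "not worth it")
```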

Data-driven management is very time- and labor-intensive. Managing by metrics is a fine way to drive change without understanding all the data. That said, as you dig into the data, it often becomes clear that the obvious approach to driving down a metric is not the most fruitful.

Metric-driven management is better than gut-instinct-driven management, but the best decisions require a solid analysis of the data.

Failure Costs

Some of the worst decisions I've ever made, or seen made, occurred when the perceived cost of failure was high. Humans inherently overestimate their probability of success when the cost of failure is high, even when a subsequent failure carries an even higher cost.

If the cost of failure is higher than the acceptable threshold, and there exists some human-driven solution that might mitigate that failure, then the human will choose that solution independent of its probability of success or the cost of an additional failure.

Let's make this concrete. Let C be the cost of failure, and let X be the threshold above which the decider deems failure unacceptable. Let A be an action performed to avoid failure cost C, P the actual probability that A succeeds, P' the perceived probability that A succeeds, and Z the cost of failing at A.

So if C > X, the human will choose action A independent of P or Z. Humans typically do this by assigning P' a higher value than P and ignoring Z entirely.
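Here is a minimal sketch of the bias, with entirely hypothetical numbers. A rational decider would compare expected costs; the biased decider inflates P into P' and ignores Z once C crosses the threshold X.

```python
# Hypothetical illustration of the failure-cost bias described above.
# C: cost of the looming failure
# X: threshold above which failure feels "unacceptable"
# P: true probability that mitigating action A succeeds
# Z: total cost incurred if A itself fails (assumed to include the later,
#    larger damage of failing after having doubled down)

def rational_choice(C, P, Z):
    """Attempt A only if its expected cost beats simply accepting the failure."""
    expected_cost_of_attempt = (1 - P) * Z
    return "attempt A" if expected_cost_of_attempt < C else "accept failure now"

def biased_choice(C, X):
    """The observed behavior: once C > X, A is chosen regardless of P or Z."""
    return "attempt A" if C > X else "accept failure now"

C, X, P, Z = 1_000_000, 500_000, 0.2, 3_000_000   # all numbers hypothetical
print(rational_choice(C, P, Z))   # "accept failure now": 0.8 * 3M = 2.4M > 1M
print(biased_choice(C, X))        # "attempt A": C exceeds the threshold X
```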

Build vs Buy

I use the following decision criteria when evaluating build vs buy: does implementing this particular piece of technology create shareholder value?

In order for a piece of technology to create shareholder value it must meet the following preconditions:

a) It must be directly in support of the company’s core mission
b) It must differentiate the company in some way from its competitors
c) It must ship and launch in time to recognize that value
d) Its long-term incremental support costs must be less than the value it creates
e) All of these statements should remain true over a 5yr time window

A simple test is “if company X acquired us in 5yrs, would they consider this technology an asset or a support burden?”
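As a rough illustration only (the class and field names are mine, paraphrasing the preconditions above), the test can be written down as a simple checklist:

```python
# Hypothetical checklist form of the build-vs-buy preconditions above.
from dataclasses import dataclass, fields

@dataclass
class BuildCandidate:
    supports_core_mission: bool      # (a) directly supports the company's core mission
    differentiates: bool             # (b) differentiates the company from competitors
    ships_in_time: bool              # (c) ships and launches in time to recognize the value
    support_cost_below_value: bool   # (d) long-term incremental support cost < value created
    holds_for_five_years: bool       # (e) all of the above remain true over a 5-year window

def creates_shareholder_value(candidate: BuildCandidate) -> bool:
    """Build only if every precondition holds; otherwise buy, or invest in a vendor."""
    return all(getattr(candidate, f.name) for f in fields(candidate))

# Example: a proposed in-house build that isn't differentiating fails the test.
print(creates_shareholder_value(BuildCandidate(True, False, True, True, True)))  # False
```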

The question is not "Could we do it better?" Given enough resources and bandwidth, almost any engineering team could deliver something as good or better. But investing your resources in something that doesn't create shareholder value erodes shareholder value.

The question is not "Could we do it cheaper?" Evaluating this properly is tricky, though possible. If you fundamentally believe you can do it cheaper, then that does indeed create a differentiation that builds shareholder value.

Another common answer is that no one can meet our unique needs. The first question to ask there is "Why are our needs unique?" The more you can use standardized techniques for standard business practices, the lower your costs will be. There are times when it is worth reinventing an inefficient business from the bottom up; it takes cojones to say that everyone else who pioneered this path is wrong, but institutionalized inefficiencies do exist. The question then is: are you changing an industry by adopting techniques from another industry? If so, the opportunity to buy, rather than build, the appropriate pieces increases.

If there truly isn't a good solution in the marketplace, then it may make sense to invest. You aren't limited to investing your own people's talent. If there is an identified need in the market, there is quite probably someone already trying to build a solution: some individual, startup, or new product initiative at an existing company. Help them succeed. Invest in them. Champion their product. Become a reference customer.

The next answer is often "Vendors don't give us the support we need." That can be true; managing vendors properly is a skill. Vendors have to prioritize their customers, and if you're not high on your vendor's priority list, you're not going to get the attention. There are ways to combat this, though. One is to squeeze, but not screw over, your vendor in contract negotiations so they still feel an incentive to make you happy. You can be one of their higher-margin customers, if not a leading revenue customer, especially if you require very little support and hand-holding. Another technique is to maintain a very close working relationship with your vendor: regular contact and calls, demonstrated technical competency with the product, and a sincere interest in their success.

Vendors can also take their products in different directions that no longer meet your needs. This puts your business at risk. There are two primary ways to combat this: the first is to join standards bodies working on standardizing the interfaces that you rely upon with that vendor. The second and related strategy is to define your own abstraction so that you aren’t dependent upon that vendor’s unique API capabilities. These are both attempts to drive your vendor into a position as a commodity player in the market. This is good for your business; you need to manage the risk that technology dependencies create.

The strategy I outline here encourages you to build-as-little-as-you-can-possibly-get-away-with. Everything you build should add substantial shareholder value. Monetizing that value is a different part of the equation. As engineers, the best, and often primary, way we contribute to the company is by building lasting shareholder value. We do that best by pushing the technology envelope for the business.

Accurate Estimation

Accurate estimation should be a required college course for everyone. Estimates are a daily part of life at any corporation. At the highest levels, managing a company is an exercise in anticipating the best avenues for growth and applying the appropriate resources to that problem. Doing that properly requires estimates. The more precise the estimates are, the better the decisions will be.

So how do you properly estimate? The first thing to do is recognize estimates will always have an error margin. The challenge is not just to get a first-order estimate, but to understand what drives the error.

In my experience, there are three primary ways to derive a first-order estimate. The first is to do a bottoms-up analysis of all available data and make predictions about each piece. The second is to build a model based on key variables that you can measure and for which you have prior data. The third is to use gut instinct.

Very often the first two avenues prove either too much effort or too imprecise, so gut instinct is applied instead.

Another form of "gut instinct" that has proven to have some value is the prediction market, where the average of a large sample of individual votes is used as the estimate. This has the advantage of potentially illuminating bimodal distributions, which suggest there is a key binary driver in the prediction model.

In my experience, the most effective estimates are derived by combining the three primary methods. One person builds a model or runs the data through an existing prediction model. Another person builds a bottoms-up analysis of the prediction. A third person forms their own gut instinct and does a simple validation of it with other experienced individuals.

The rationalization process is the key to success in this approach. If one of the three methods is out of whack with the others (or worse, none of them are even close), the next step is to rationalize the estimates. If the model predicts low, why did the bottoms-up analysis predict higher, or why did gut instinct predict higher? If you can resolve the discrepancies, you often illuminate the key drivers that are unique to the particular value you're predicting. That learning can then be built back into the model and into the bottoms-up analysis for future estimates.

Now the challenge is to model the potential errors. Potential errors are stated as risks and assumptions: all predictions make certain assumptions, and all have certain risks. If you can quantify the potential impact of a missed assumption or a risk, and assign it a likelihood, you can build a confidence distribution with a Monte Carlo simulation. This is especially helpful when the models, analyses, or gut instincts are biased toward sunny-day scenarios.
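Here is a minimal sketch of that Monte Carlo step. The base estimate and the risk list are entirely hypothetical; each risk gets a likelihood and an impact, and the simulation turns the point estimate into a distribution.

```python
# Minimal Monte Carlo sketch: turn a point estimate plus a list of
# (likelihood, impact) risks into a confidence distribution.
import random

base_estimate = 40  # hypothetical base estimate, in engineer-weeks

risks = [
    (0.30, 8),   # e.g., a key dependency slips
    (0.15, 12),  # e.g., an assumed interface turns out not to fit
    (0.50, 3),   # e.g., routine scope creep
]

def simulate_once():
    total = base_estimate
    for likelihood, impact in risks:
        if random.random() < likelihood:
            total += impact
    return total

samples = sorted(simulate_once() for _ in range(10_000))
p50 = samples[len(samples) // 2]
p90 = samples[int(len(samples) * 0.9)]
print(f"P50: {p50} weeks, P90: {p90} weeks")  # report a range, not a single number
```

Quoting a P50 and a P90 together makes the sunny-day bias visible instead of hidden inside a single confident-sounding number.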

The final, and often overlooked, step in the estimation process is to track your predictions against your actual results. Measuring a large number of predictions and results lets you model your known error rate and ascertain whether your process is really improving.
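A small, hypothetical sketch of that tracking step: keep the (predicted, actual) pairs and compute the error rate over time.

```python
# Hypothetical tracking of predictions vs. actuals (units: engineer-weeks).
history = [
    (40, 52),
    (10, 11),
    (25, 31),
]

errors = [(actual - predicted) / predicted for predicted, actual in history]
mean_error = sum(errors) / len(errors)
print(f"mean relative error: {mean_error:+.0%}")
# A consistently positive number means estimates run optimistic; feed that
# correction back into the models and the bottoms-up analyses.
```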

If you can improve your estimates, you will execute more cleanly, deliver more consistently, operate more efficiently, and invest more wisely.

Engineering Manager’s Bill of Rights

As an engineering manager, you are a builder. You are first and foremost measured on the success of your team's designs, its implementation quality, and the predictability of its delivery in meeting the business objectives.

In order to do your job effectively and optimally, you and your management must defend a certain set of inalienable rights. To wit:

1. The right to your own commitments, estimates, and delivery dates
2. The right to the committed resources and budget
3. The right to define and design the solution to the requirements
4. The right to employ your own internal processes as necessary to deliver on the commitments
5. The right and responsibility to state your minimum quality bar for delivery
6. The right to reorganize
7. The right to transition staff from one task or role to another as appropriate to deliver the project
8. The right and responsibility to state your assumptions and risks
9. The right to restate commitments if priorities, requirements, resources or budget change due to external factors outside of your scope of control
10. The right to push the “stop” button in any launch

Without these rights, you have lost some of the flexibility you may need in order to deliver. You are no longer fully empowered. If you can't deploy your people as you see fit, but have to negotiate every responsibility change, your effectiveness is reduced. All of these rights are part of "committing": if you "commit" but aren't given the tools to deliver on those commitments, the commitment means little.

That said, there are certain rights that managers sometimes push for that are inappropriate. Most importantly, you do not have the right to privacy, isolation, or complete independence of design. Management or the customer may require review and you are required to respond.

You do not have the authority to ignore requirements. You do not have the authority to ignore changes in requirements or scope or budget levels. You do not have complete autonomy in your hiring or firing process. You do not have the authority to say a product should launch now, only that it is not ready yet.

In some cases, I've seen managers work diligently to defend their own rights while working equally hard to eliminate the rights of peer managers. "They can't take person X off this project, because we need A from that person." The push-back is that A is the commitment. So long as the peer manager is still committing to A, it is their prerogative whether to use person X or Y.

That said, in most companies everything is a little less formal. Waving a bill of rights around isn't a very persuasive way to get anything done. But the bill of rights gives you something to reflect upon from time to time. If commitments aren't being treated as sacrosanct, assumptions and risks are being ignored, timelines are being artificially shortened, or friction is increasing when you want to make staff transitions then it is time to evaluate whether you need to start taking a firmer position.

Friday, December 7, 2007

Stages of Operational Maturity

I've had the opportunity, directly or indirectly, to observe the evolution of a handful of companies from technology startup to major service provider. In my experience, operational maturity evolves through 12 stages.

1 – You have a handful of machines hacked together in some corner. One or two people, typically the engineers, support the whole thing. You don’t have vendor contracts in place; you probably bought these from Fry’s or CompUSA. Your production environment is probably shared with your corporate environment. Your network connection fails occasionally. Dev, Test, and Production builds are three different directories on disk. You hand-created all the configuration. There are probably quite literally dust and cobwebs on your hosts.

2 – You probably have 5-10 hosts now. You decided to start making manual backups when you think of it. You created a single host for the IT-related stuff that the company needs (email, intranet). You created a simple script to copy dev to test to production instances. You started to use the term scalability even though you’re not quite sure what it means. You may have started to look for a datacenter. You copy the config files and yell at anyone who touches them. You started writing a couple scripts to page your cell phone when something breaks. You’re less willing to let people touch things.

3 – You got datacenter space. You're probably still at 20-30 hosts, as managing any more seems like a nightmare. People are constantly frustrated that the hosts don't match up. Moving dev to test to production is a painful task that often breaks. You have a variety of hardware. You've started looking at new hardware types that will serve you better. Your phone has become a pager or a Blackberry. You're up a lot dealing with the system.

4 – You have enough hosts that you consider yourself “live” and running a “real system”. You’ve built some tools to automatically check whether hosts are in sync. You have a standard OS image that you use to build all hosts. You’ve begun to put some simple standard monitors in place. There is now an official operations function and perhaps even a separate IT function. Releases go out, typically in a big batch with lots of changes. You have a real network provider. You’re starting to consider whether you need multiple datacenters.

5 – You've invested in more and better tool automation. You may have a network-bootable OS image. Terms like N+1 redundancy, disaster recovery, fault isolation, and security begin to enter the lexicon. Emails go back and forth about standardizing monitoring. You want more QA than you have. System administrators are highly skilled, specialized individuals with top-to-bottom knowledge of the system, basking in the glow of a well-running system that they can jump in and fix quickly. You're pretty happy with where you're at; it looks like you could run this way for a while. You start to think about better cage design and start reading up on the latest deployment tools and configuration systems. You're starting to re-evaluate your OS decision and when you're going to upgrade. You've begun automating process restarts and other standard procedures to deal with memory leaks and other transient problems. Life is good.

6 – The development team is scaling up, as is the network. You're now very definitely multi-datacenter, as management knows that an earthquake would kill your business, plus investors and customers expect it. Things are starting to break. The monolithic deployment processes are creating a bottleneck for the organization. Releases are breaking. The number of steps it takes to perform an upgrade is killing your sysadmins or release engineers. Release management is becoming centralized, and there is a push for tighter release management controls and QA. Problems are becoming more difficult to diagnose quickly. You have your first major network failure, which hoses everything. You realize network configuration is a mess and you need a stronger network engineering team. Your datacenter provider, who said they had plenty of space, now tells you that either they no longer have space or they're considering closing the facility. Life sucks. You're trying to control and contain the growth and change to get back to stage 5.

7 – You give up on controlling and containing the growth. You realize deployment management is a nightmare and make a few aborted attempts to improve it. Eventually you manage to strike on a solution that enables rapid rollback, stores all host provisioning data and software packages centrally, and handles network booting and OS image installation easily. You work to transition ownership of development and staging environments to the development and QA teams. You attempt to decentralize ownership of releases and arrive at some truce between decentralized progress and centralized control. You tighten down monitoring.

8 – You're managing the growth effectively. You've now seen enough issues that system administration is an exciting, creative pursuit that produces heroes on a consistent basis. System administrators aren't touching configuration on any machine anymore. You can order the hardware you need. You've begun to automate some of the standard network administration operations. You have good subject-matter experts in many of the critical domains. Monitors are well fleshed out. You're starting to talk about your availability in "nines": 2 nines, 3 nines. You aspire to 5 nines and tell everyone you're going there. You have program management that now supports operations. You begin to talk about driving availability as a program. You have a standard failure-response model in place. You think you may be able to get out of reactive mode, but still find it hard to put together documents or otherwise say where you're going and what your requirements are.

9 – You’re now driving the business to meet your baseline SLA needs. If that’s 3 nines then you’re battening down processes, calling war team reviews, putting basic projections in place and trying to hold owners accountable. If that’s 4 or 5 nines, then you’re locking down change management processes, limiting the number of changes, growing QA, putting priority-based load-shedding in place, rate-capping transactions, and spending more on networking equipment designed to give you higher availability.

10 – Everyone realizes you now have a 'real' network. They begin to audit you. You deal with security audits, monitoring audits, asset management audits, ownership audits, escalation audits, and financial audits. Management is no longer worried that operations will fail and sink the company, because you've met their comfort-zone availability number. You may be at 3 nines, and management still says they want 5 nines, but in reality they're comfortable with the current availability. Your response plans are working. Management begins to worry about new metrics: new features, cost of operations, speed of roll-out and deployment. "Efficiency" and "innovation cycle" become the new buzzwords. Finance begins to interrogate your purchasing decisions. Engineering becomes frustrated that you can't support their hardware requirements as quickly as you used to. You audit whether you have too much redundancy. You build a third datacenter so you can go from 50% utilization of resources to 66% and still maintain N+1 redundancy. Your program management is absorbed with costs, delivery optimization, and basic build-out work. You're still not able to document where you're going and what your requirements for development teams should be. Thank goodness availability is no longer a worry so you can focus on everything else.

11 – You drive efficiency and begin to eliminate redundancy. You take away some of the bottlenecks to deployments that you conclude were giving a false sense of security, even though they were catching some things. You push hard on new projects. You're beginning to make a dent in speed, features, and cost. You try to move to just-in-time hardware purchases. You pass the pain to the engineering teams to model and drive down cost. You begin to walk that line between availability and efficiency. You are mostly fixing issues that came out of audits rather than the issues you know you still have. Buffers are gone, fixes aren't happening, and training programs aren't sufficient to get new folks fully up to speed. Then comes the big crash. Risk increased; there was no way to deliver on all the new priorities without taking some risk. Now the CEO is frustrated about availability again. He or she was enjoying not thinking about it anymore. The CFO is frustrated because, by his or her calculations, you should have reduced costs faster. Operations morale is in the tank, as no one is the hero anymore. Operations management underestimated how much it could get done and is frustrated. Engineering is frustrated that obvious stuff isn't happening correctly. Driving engineering teams is hard now, as you have limited political capital. Life sucks again. You're being driven and can barely find the time to drive. Key operations managers and executives leave.

12 – You start to get back into the driver's seat. Often new management comes in. They're given a reprieve and grace period from some of the goals and pressures that were on the previous management. Availability is first and foremost again for a time. You still have to deal with the CFO, but priorities are clearer. You begin to develop more robust process discipline. Holes in the previous toolset are fixed as new leadership applies its patterns to the system. Routing control is improved; separation of concerns is improved to facilitate engineering; different services are allowed to have different SLAs and support models, and processes adapt to support that; you finally write down the operational readiness requirements; tools are improved; the networking issues and complexities in switches, load balancers, etc. get some much-needed attention from a new set of eyes. You begin to put component-level SLAs and cost models in place and audit against them. You generalize processes across the board and begin to apply them more uniformly. Your disaster planning is institutionalized. Audits are a breeze. You begin to audit your vendors and apply your processes to them. You're driving again.

What happens next? I’ll tell you when I figure it out.