Friday, December 7, 2007

Stages of Operational Maturity

I’ve had the opportunity, directly or indirectly, to observe the evolution of a handful of companies from technology startup to major service provider. In my experience, operational maturity evolves through 12 phases.

1 – You have a handful of machines hacked together in some corner. One or two people, typically the engineers, support the whole thing. You don’t have vendor contracts in place; you probably bought these machines from Fry’s or CompUSA. Your production environment is probably shared with your corporate environment. Your network connection fails occasionally. Dev, Test, and Production builds are three different directories on disk. You hand-created all the configuration. There are probably, quite literally, dust and cobwebs on your hosts.

2 – You probably have 5-10 hosts now. You’ve started making manual backups when you think of it. You created a single host for the IT-related stuff the company needs (email, intranet). You created a simple script to copy dev to test to production instances. You started to use the term scalability even though you’re not quite sure what it means. You may have started to look for a datacenter. You copy the config files by hand and yell at anyone who touches them. You’ve written a couple of scripts to page your cell phone when something breaks. You’re less willing to let people touch things.
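
The paging script at this stage is rarely more sophisticated than a cron job that polls the site and emails a carrier’s SMS gateway. A minimal sketch in Python (the URL, addresses, and SMTP host are all hypothetical):

    # check_site.py -- run from cron every few minutes; pages a cell
    # phone via a carrier's email-to-SMS gateway when the site is down.
    import smtplib
    import urllib.request

    SITE = "http://www.example.com/"        # hypothetical health-check URL
    PAGER = "5551234567@sms.example.net"    # hypothetical SMS email gateway
    SENDER = "alerts@example.com"

    def site_is_up():
        try:
            with urllib.request.urlopen(SITE, timeout=10) as resp:
                return resp.getcode() == 200
        except Exception:
            return False

    if not site_is_up():
        msg = "Subject: SITE DOWN\n\n%s failed its health check" % SITE
        smtplib.SMTP("localhost").sendmail(SENDER, [PAGER], msg)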

3 – You got datacenter space. You’re probably still at 20-30 hosts, as managing any more seems like a nightmare. People are constantly frustrated that the hosts don’t match up. Moving dev to test to production is a painful task that often breaks. You have a variety of hardware. You started looking at new hardware types that will serve you better. Your phone has become a pager or a BlackBerry. You’re up a lot at night dealing with the system.

4 – You have enough hosts that you consider yourself “live” and running a “real system”. You’ve built some tools to automatically check whether hosts are in sync. You have a standard OS image that you use to build all hosts. You’ve begun to put some simple standard monitors in place. There is now an official operations function and perhaps even a separate IT function. Releases go out, typically in a big batch with lots of changes. You have a real network provider. You’re starting to consider whether you need multiple datacenters.
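
The sync-checking tools at this stage are usually as simple as checksumming a few critical files on every host and flagging mismatches. A rough sketch, assuming passwordless ssh is already set up (the host names and file paths are hypothetical):

    # drift_check.py -- flag hosts whose config files have drifted.
    import subprocess

    HOSTS = ["web01", "web02", "web03"]               # hypothetical hosts
    FILES = ["/etc/httpd/httpd.conf", "/etc/hosts"]   # files to compare

    def remote_checksum(host, path):
        out = subprocess.run(["ssh", host, "md5sum", path],
                             capture_output=True, text=True)
        return out.stdout.split()[0] if out.returncode == 0 else None

    for path in FILES:
        sums = {host: remote_checksum(host, path) for host in HOSTS}
        if len(set(sums.values())) > 1:
            print("DRIFT in %s: %s" % (path, sums))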

5 – You’ve invested in improved tool automation. You may have a network-bootable OS image. Terms like N+1 redundancy, disaster recovery, fault isolation, and security begin to enter the lexicon. Emails go back and forth about standardizing monitoring. You want more QA than you have. System administrators are highly skilled, specialized individuals with top-to-bottom knowledge of the system, basking in the glow of a well-running system that they can jump in and fix quickly. You’re pretty happy with where you are; it looks like you could run this way for a while. You start to think about better cage design and start reading up on the latest deployment tools and configuration systems. You’re starting to re-evaluate your OS decision and when you’re going to upgrade. You’ve begun automating process restarts and other standard procedures to deal with memory leaks and other transient problems. Life is good.
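
Those automated restarts are typically a watchdog that bounces a process when its memory crosses a limit, papering over the leak until someone fixes it. A minimal sketch (the process name, limit, and init script path are hypothetical):

    # restart_leaky.py -- cron job: restart a process whose resident
    # memory has grown past a limit (the classic leak band-aid).
    import subprocess

    PROCESS = "appserver"          # hypothetical process name
    LIMIT_KB = 2 * 1024 * 1024     # restart past 2 GB resident

    def rss_kb(name):
        # ps prints one RSS value (in KB) per matching process
        out = subprocess.run(["ps", "-C", name, "-o", "rss="],
                             capture_output=True, text=True)
        return sum(int(x) for x in out.stdout.split())

    if rss_kb(PROCESS) > LIMIT_KB:
        subprocess.run(["/etc/init.d/" + PROCESS, "restart"])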

6 – The development team is scaling up, as is the network. You’re now very definitely multi-datacenter: management knows that an earthquake would kill your business, and investors and customers expect it. Things are starting to break. The monolithic deployment processes are creating a bottleneck for the organization. Releases are breaking. The number of steps it takes to perform an upgrade is killing your sysadmins and release engineers. Release management is becoming centralized, and there is a push for tighter release management controls and QA. Problems are becoming more difficult to diagnose quickly. You have your first major network failure, which hoses everything. You realize network configuration is a mess and you need a stronger network engineering team. Your datacenter provider, who said they had plenty of space, now tells you that either they no longer have space or they’re considering closing the facility. Life sucks. You’re trying to control and contain the growth and change to get back to stage 5.

7 – You give up on controlling and containing the growth. You realize deployment management is a nightmare and make a few aborted attempts to improve it. Eventually you hit on a solution that enables rapid rollback, stores all host provisioning data and software packages centrally, and handles network booting and OS image installation easily. You work to transition ownership of the development and staging environments to the development and QA teams. You attempt to decentralize ownership of releases and arrive at a truce between decentralized progress and centralized control. You tighten down monitoring.
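
One common shape for the rapid-rollback piece is keeping every release on disk and pointing a symlink at the live one, so rolling back is just re-pointing the link. A sketch of that pattern, not any specific tool (the paths and release id are hypothetical):

    # activate.py -- atomic release activation via symlink swap;
    # rollback is just activating the previous release directory.
    import os

    RELEASES = "/opt/app/releases"   # hypothetical layout: one dir per release
    CURRENT = "/opt/app/current"     # the path the app server actually runs

    def activate(version):
        target = os.path.join(RELEASES, version)
        tmp = CURRENT + ".tmp"
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(target, tmp)
        os.rename(tmp, CURRENT)      # atomic on POSIX: no half-deployed state

    activate("r1432")                # hypothetical release id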

8 – You’re managing the growth effectively. You’ve now seen enough issues that system administration is an exciting, creative pursuit that produces heroes on a consistent basis. System administrators aren’t touching configuration on any machine anymore. You can order the hardware you need. You’ve begun to automate some of the standard network administration operations. You have good subject-matter experts in many of the critical domains. Monitors are well fleshed out. You’re starting to talk about your availability in “nines” – 2 nines, 3 nines. You aspire to 5 nines and tell everyone you’re going there. You have program management that now supports operations. You begin to talk about driving availability as a program. You have a standard failure-response model in place. You think you may be able to get out of reactive mode, but still find it hard to put together documents or otherwise say where you’re going and what your requirements are.
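
The nines arithmetic is worth writing down, because each added nine is ten times harder. A quick calculation of what each level actually permits:

    # nines.py -- downtime budget per year for each availability level.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for nines in range(2, 6):
        # N nines means a fraction 10**-N of the year may be downtime
        downtime = MINUTES_PER_YEAR * 10 ** -nines
        print("%d nines: %.1f minutes of downtime per year" % (nines, downtime))

Two nines allows roughly 5,256 minutes (about 3.7 days) of downtime per year; five nines allows about 5.3 minutes, which is why aspiring to it is easy and getting there isn’t.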

9 – You’re now driving the business to meet your baseline SLA needs. If that’s 3 nines then you’re battening down processes, calling war team reviews, putting basic projections in place and trying to hold owners accountable. If that’s 4 or 5 nines, then you’re locking down change management processes, limiting the number of changes, growing QA, putting priority-based load-shedding in place, rate-capping transactions, and spending more on networking equipment designed to give you higher availability.
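
Priority-based load-shedding at this stage is usually a simple admission check: as load climbs, progressively drop the least important traffic first. A sketch of the idea (the priority levels and thresholds are hypothetical):

    # shed.py -- admit or reject a request based on its priority
    # and how loaded the system currently is.
    THRESHOLDS = {0: 1.00,   # critical: never shed
                  1: 0.90,   # important: shed above 90% load
                  2: 0.80,
                  3: 0.70}   # best-effort: first to go

    def admit(priority, current_load):
        """priority: 0 (critical) .. 3 (best-effort);
        current_load: fraction of capacity in use, 0.0-1.0."""
        return current_load < THRESHOLDS[priority]

    # admit(3, 0.75) -> False: best-effort traffic is already being shed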

10 – Everyone realizes you now have a ‘real’ network. They begin to audit you. You deal with security audits, monitoring audits, asset management audits, ownership audits, escalation audits, and financial audits. Management is no longer worried that operations will fail and sink the company because you’ve met their comfort-zone availability number. You may be at 3 nines while management still says they want 5 nines, but in reality they’re comfortable with the current availability. Your response plans are working. Management begins to worry about new metrics -- new features, cost of operations, speed of roll-out and deployment. Efficiency or innovation-cycle time becomes the new buzzword. Finance begins to interrogate your purchasing decisions. Engineering becomes frustrated that you can’t support their hardware requirements as quickly as you used to. You audit whether you have too much redundancy. You build a third datacenter so you can go from 50% utilization of resources to 66% and still maintain N+1 redundancy. Your program management is absorbed in focusing on costs, optimizing delivery, and basic build-out work. You’re still not able to document where you’re going and what your requirements for development teams should be. Thank goodness availability is no longer a worry so you can focus on everything else.
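
The 50%-to-66% math falls out of N+1 directly: to survive the loss of one site, the remaining sites must absorb the full load, so each site can only be run at N/(N+1) of capacity in steady state. In code:

    # nplus1.py -- safe steady-state utilization under N+1 redundancy.
    for sites in (2, 3, 4):
        max_util = (sites - 1) / sites   # survivors must carry the full load
        print("%d sites: run each at %.0f%%" % (sites, 100.0 * max_util))
    # 2 sites -> 50%, 3 sites -> 67%, 4 sites -> 75%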

11 – You drive down cost and begin to eliminate redundancy. You take away some of the bottlenecks to deployments that you conclude were giving a false sense of security, even though they were catching some things. You push hard on new projects. You’re beginning to make a dent in speed, features, and cost. You try to move to just-in-time hardware purchases. You pass the pain to the engineering teams to model and drive down cost. You begin to walk the line between availability and efficiency. You are mostly fixing issues that came out of audits rather than the issues you know you still have. Buffers are gone, fixes aren’t happening, and training programs aren’t sufficient to get new folks fully up to speed. Then comes the big crash. Risk increased; there was no way to deliver on all the new priorities without taking some risk. Now the CEO is frustrated about availability again. He or she was enjoying not thinking about it anymore. The CFO is frustrated that by his or her calculations you should have reduced costs faster. Operations morale is in the tank as no one is the hero anymore. Operations management overestimated how much it could get done and is frustrated. Engineering is frustrated because obvious stuff isn’t happening correctly. Driving engineering teams is hard now, as you have limited political capital. Life sucks again. You’re being driven and can barely find the time to drive. Key operations managers and executives leave.

12 – You start to get back in the driver’s seat. Often new management comes in. They’re given a reprieve and grace period from some of the goals and pressures that were on the previous management. Availability is first-and-foremost again for a time. You still have to deal with the CFO, but priorities are clearer. You begin to develop more robust process discipline. Holes in the previous toolset are fixed as new leadership applies its patterns to the system. Routing control is improved; separation of concerns is improved to facilitate engineering; different services are allowed to have different SLAs, and support models and processes adapt to support that; you finally write down the operational readiness requirements; tools are improved; the networking issues and complexities in switches, load balancers, etc., get some much-needed attention from a new set of eyes. You begin to put component-level SLAs and cost models in place and audit against them. You generalize processes across the board and begin to apply them more uniformly. Your disaster planning is institutionalized. Audits are a breeze. You begin to audit your vendors and apply your processes to them. You’re driving again.

What happens next? I’ll tell you when I figure it out.