Thursday May 21, 2009
Configuring Web Clouds with Chef
I'm not generally passionate about network and system operations; I prefer to focus my attention and creativity on system and software architectures. However, infrastructure provisioning, application deployment, monitoring and maintenance are facts of life for online services. When those basic functions aren't working well, then I get passionate about them. When service continuity is impacted and operations staff are overworked, it really bothers me; it tells me that I or other developers I'm working with are doing a poor job of delivering resilient software. I've had many conversations with folks who've accepted as a given that development teams and operations teams have friction between them; some even suggest that they should. After all, so goes that line of thinking, the developers are graded on how rapidly they implement features and fix bugs whereas the operators are graded on service availability and performance. Well, you can sell that all you want but I won't buy it.
In my view, developers need to deliver software that can be operated smoothly and operators need to provide feedback on how smoothly the software is operating; dev and ops must collaborate. I accept as a given that developers
- Use source control
- Write unit tests (after the fact or before/during TDD style)
- Write functional and integration tests
- Maintain a build system for running test harnesses and packaging code
- Document internal architecture and operating interfaces
- Plan for change with respect to scale characteristics and functionality
Conversely, I accept as a given that operators
- Use configuration management
- Automate infrastructure provisioning, code deployment and rollback
- Monitor infrastructure and application metrics
I don't want to oversimplify; there's more to the obligations that dev and ops have to each other in order to collaborate effectively. What I've noticed, though, is that a lot of operators might be skilled at configuring specific server infrastructure or performing OS analysis, but configuration management and automation require really good tools that they lack. I've seen situations where the available tools are perceived as too complicated, and so tools are developed that usually consist of a lot of specialized shell scripts (or perhaps it's just plain old NIH). Cfengine is a good start, but the reports I have are that it's difficult to work with and, if you're not very careful, may automatically manage to misconfigure your systems. Puppet was developed to be a more powerful system for configuration management, but the feedback I've seen on it is that adding new functionality is hard; it has its own configuration language, and when you want to extend it you have to deal with a lot of complicated mechanisms. Chef was developed to answer that frustration; by making the configuration language a DSL on top of an already widely used scripting language (Ruby), Chef provides an easier way to extend it, and the Chef codebase itself is reportedly an order of magnitude smaller and simpler than Puppet's (caveat: I generally distrust SLoC metrics but just sayin').
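To make the "DSL on top of Ruby" point concrete, here is a minimal sketch of how an internal DSL can be built from plain Ruby blocks and instance_eval. This is a toy for illustration only and is not Chef's implementation: the Recipe and Resource classes and the single `action` setting are hypothetical names invented here, and the toy only records declarations rather than configuring anything.

```ruby
# Toy internal DSL, NOT Chef's code: it shows how ordinary Ruby blocks
# can read like configuration. All class names here are hypothetical.
class Resource
  attr_reader :type, :name, :attributes

  def initialize(type, name)
    @type = type
    @name = name
    @attributes = {}
  end

  # The only setting this toy understands inside a resource block.
  def action(value)
    @attributes[:action] = value
  end
end

class Recipe
  attr_reader :resources

  def initialize
    @resources = []
  end

  # Define one DSL method per resource type the toy supports.
  %i[package service].each do |type|
    define_method(type) do |name, &block|
      resource = Resource.new(type, name)
      # Run the block with the resource as self, so `action :install`
      # inside the block calls Resource#action.
      resource.instance_eval(&block) if block
      @resources << resource
      resource
    end
  end

  # Evaluate a block of declarations and return the collected recipe.
  def self.evaluate(&block)
    recipe = new
    recipe.instance_eval(&block)
    recipe
  end
end

recipe = Recipe.evaluate do
  package "apache2" do
    action :install
  end

  service "apache2" do
    action :start
  end
end

recipe.resources.each do |r|
  puts "#{r.type}[#{r.name}] action=#{r.attributes[:action]}"
end
```

Because the declarations are just Ruby, extending a DSL like this means writing another Ruby method rather than learning a separate extension mechanism, which is the appeal described above.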
So I've been giving Chef a test-drive for this infrastructure-on-EC2 management project that's been cooking. The system implemented the following use cases:
- Launch web app servers on EC2 with Apache, Passenger, RoR (+ other gems) and overlay a set of Rails apps out of git
- Launch a pair of reverse proxies (with HAProxy) in front of the app servers, and reconfigure them when the set of app servers is expanded or contracted
- Configure the proxy for failover with Heartbeat
- Add new Rails apps to the set of app servers
- Update/roll back Rails apps
The system is enabled through a combination of the EC2 API (via RightAWS) and Chef's REST API, with chef-deploy (think: Capistrano run by a system provisioning agent) augmenting Chef's functionality. So far, it seems to be working great!
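The proxy reconfiguration use case (regenerate the proxy config when app servers come and go) can be sketched roughly as follows. This is an assumption-laden illustration, not the code from the project: the AppServer struct, the `rails_apps` backend name, the IP addresses, and the render_backend and reconfigure? helpers are all hypothetical, and a real setup would pull the server list from the EC2 API or Chef rather than a literal array.

```ruby
# Hypothetical inventory record; a real system would build these from
# the EC2 API or Chef's node data instead of hard-coded values.
AppServer = Struct.new(:name, :ip, :port)

# Render an HAProxy backend section for the given app servers.
# Servers are sorted by name so the output is stable regardless of
# the order in which the inventory happens to list them.
def render_backend(servers)
  lines = ["backend rails_apps", "    balance roundrobin"]
  servers.sort_by(&:name).each do |s|
    lines << "    server #{s.name} #{s.ip}:#{s.port} check"
  end
  lines.join("\n") + "\n"
end

# A reload is only needed when the rendered config actually differs
# from what the proxy is currently running.
def reconfigure?(current_config, servers)
  render_backend(servers) != current_config
end

servers = [
  AppServer.new("app2", "10.0.0.12", 80),
  AppServer.new("app1", "10.0.0.11", 80)
]
config = render_backend(servers)
puts config

# Expanding the pool changes the rendered config, so a reload is due.
expanded = servers + [AppServer.new("app3", "10.0.0.13", 80)]
puts "reload needed: #{reconfigure?(config, expanded)}"
```

Comparing rendered output before touching the proxy keeps the operation idempotent: running it repeatedly against an unchanged set of app servers does nothing.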
There's a lot of energy in the Chef community (check out Casserole). Combine that with monitoring, log management and cloud technologies, and I think there's a lot of IT streamlining ahead. Perhaps the old days of labor- and communication-intensive operations will give way to a new era of autonomic computing. I'll post further about some of the mechanics of working with Ruby, Rails, Chef, EC2, chef-deploy and other tools in the weeks ahead (particularly now that EC2 has native load balancing, monitoring and auto-scaling capabilities). I'll also talk a bit about this stuff at a Velocity BoF. If you're thinking about attending Velocity, O'Reilly is offering 30% off to the first 30 people to register today with the code vel09d30 (no, I'm not getting any kinduva kickback from O'Reilly). And you can catch Infrastructure in the Cloud Era with Adam Jacob (Opscode) and Ezra Zygmuntowicz (EngineYard) to learn more about Chef and cloud management.
( May 21 2009, 12:30:07 PM PDT )