Tom Cook from Facebook gave an interesting talk at Velocity 2010 last week. I particularly enjoyed learning how their engineering and operations teams work with each other. I am a firm believer that excellent inter-team cooperation, and minimizing the need for communication (or making it really good), are crucial factors in making a company efficient and successful. My notes from the talk:
- A global changelog is maintained by engineers and IT whenever a change is made to the infrastructure (anyone can look at it; information is not segregated into exclusive mailing lists)
- Internal news portal to announce features, or critical outages
- IRC used heavily (again, anyone can read)
- Nagios and Ganglia are used for monitoring (had to extend Nagios)
- All of the Nagios alerts are fed into other tools to aggregate them (emails cause too much noise)
- Configuration management is implemented with CFEngine and synced every 15 minutes ("Puppet is a fantastic option", "Chef is newer and looks great")
- Configuration management should be implemented regardless of how many servers you run
- Very short dev cycles: code is pushed daily (mostly bugfixes), new features pushed every week
- Version control everything
- Testing performed by engineers, not by separate QA teams
- Ops people are integrated with engineering teams, and engineering is heavily involved in deploying code to prod
- Advantages: ops people know the applications well and can help troubleshoot problems, and engineering helps make the deployment successful (no "commit and quit")
- Instrument everything: hardware, software, OS, etc.
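
To make the alert aggregation point more concrete, here is a minimal sketch of the general idea: instead of sending one email per alert, group alerts by service and state and report a count. This is not Facebook's tooling (the talk did not show it) and not a Nagios API; the `Alert` record and the `aggregate` and `summarize` functions are names I made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert record; the fields are assumptions, not a Nagios format.
    host: str
    service: str
    state: str  # e.g. "WARNING" or "CRITICAL"


def aggregate(alerts):
    """Count alerts per (service, state) pair, most frequent first."""
    counts = Counter((a.service, a.state) for a in alerts)
    return counts.most_common()


def summarize(alerts):
    """Turn the aggregated counts into one summary line per group."""
    return [
        f"{service} {state}: {n} alert(s)"
        for (service, state), n in aggregate(alerts)
    ]


if __name__ == "__main__":
    sample = [
        Alert("web01", "http", "CRITICAL"),
        Alert("web02", "http", "CRITICAL"),
        Alert("db01", "disk", "WARNING"),
    ]
    for line in summarize(sample):
        print(line)
```

With the sample data, three individual alerts collapse into two summary lines, which is the kind of noise reduction the talk was after.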
Here is the video of the talk.