mrb's blog

Engineering & Operations at Facebook

Keywords: largescale sysadmin

There was an interesting talk at Velocity 2010 last week by Tom Cook from Facebook. I particularly enjoyed learning how their engineering and operations teams work with each other. I am a firm believer that excellent inter-team cooperation, and minimizing the need for communication (or making it really good), are crucial factors in making a company efficient and successful.

  • A global changelog is updated by engineers and IT whenever any change is made to the infrastructure (anyone can look into it; information is not segregated into exclusive mailing lists)
  • Internal news portal to announce features, or critical outages
  • IRC used heavily (again, anyone can read)
  • Nagios and Ganglia are used for monitoring (they had to extend Nagios)
  • All of the Nagios alerts are fed into other tools to aggregate them (emails cause too much noise)
  • Configuration management is implemented with CFengine, which syncs every 15 minutes ("Puppet is a fantastic option", "Chef is newer and looks great")
  • Configuration management should be implemented regardless of how many servers you run
  • Very short dev cycles: code is pushed daily (mostly bugfixes), new features pushed every week
  • Version control everything
  • Testing performed by engineers, not by separate QA teams
  • Ops people are integrated with engineering teams, and engineering is heavily involved in deploying code to prod
  • Advantages: ops people know the applications well and can help troubleshoot problems, and engineering helps make the deployment successful (no "commit and quit")
  • Instrument everything: hardware, software, OS, etc.
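
To make the alert-aggregation point concrete, here is a minimal Python sketch of the general idea: collapse many per-host alerts into one summary line per failing service instead of sending one email each. This is not Facebook's tooling and does not use any Nagios API; the `Alert` record and the `aggregate` function are hypothetical names I made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Alert(NamedTuple):
    # Hypothetical alert record; real monitoring data carries many more fields.
    host: str
    service: str
    state: str  # e.g. "WARNING" or "CRITICAL"


def aggregate(alerts):
    """Group alerts by (service, state) and count the distinct hosts affected."""
    groups = defaultdict(set)
    for alert in alerts:
        # A set drops repeated alerts from the same host.
        groups[(alert.service, alert.state)].add(alert.host)
    # One summary line per group, largest groups first, ties broken by name.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    return [
        f"{state} {service}: {len(hosts)} host(s)"
        for (service, state), hosts in ordered
    ]


if __name__ == "__main__":
    sample = [
        Alert("web1", "http", "CRITICAL"),
        Alert("web2", "http", "CRITICAL"),
        Alert("web1", "http", "CRITICAL"),  # repeat from the same host
        Alert("db1", "disk", "WARNING"),
    ]
    for line in aggregate(sample):
        print(line)
```

Sorting by the number of affected hosts puts the widest problem at the top, which is the kind of signal that gets buried when every alert arrives as a separate email.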

Here is the video of the talk.


darul wrote: Very instructive article about how to improve development, which team processes to use, etc. Many tools I had never heard of before... a lot of things I still have to learn. I'm going to watch the video soon. Thank you Marc for your relevant articles. 15 Jul 2010 16:54 UTC

mrb wrote: Happy you enjoyed it! Check out the other Velocity 2010 presentations as well. Some of them are also fascinating, e.g. James Hamilton's talk on datacenter infrastructures (Amazon). 16 Jul 2010 05:56 UTC