Velocity: John Allspaw @flickr, Capacity Planning

What can cause downtime:

  1. bugs
  2. edge cases
  3. security incidents
  4. real capacity problems

Deployment and management tricks from the HPC world: Ganglia, SystemImager

Gather metrics, of course, and build models from them, ideally out of live production data rather than artificial benchmarks.

fityk can replace Excel for curve fitting. [My guess is that R would work great for that too.]
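
[To make the curve-fitting idea concrete, here is a minimal sketch in plain Python rather than fityk or R. It fits a straight line to daily disk usage and estimates how many days remain before a capacity ceiling is hit. The numbers, the 200 TB ceiling and the function names are all made up for illustration; they are not Flickr data and not anything shown in the talk.]

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(slope, intercept, capacity, today):
    """Days left before the fitted line crosses `capacity`.

    Returns None when usage is flat or shrinking, since the line
    never reaches the ceiling in that case.
    """
    if slope <= 0:
        return None
    return (capacity - intercept) / slope - today


# Hypothetical measurements: day number and TB of storage in use.
days = [0, 1, 2, 3, 4]
used_tb = [100.0, 104.0, 108.0, 112.0, 116.0]

slope, intercept = fit_line(days, used_tb)
remaining = days_until_full(slope, intercept, capacity=200.0, today=days[-1])
print(f"growth: {slope:.1f} TB/day, days until full: {remaining:.0f}")
```

[A straight line is the simplest possible model. Real growth is often not linear, which is where a proper fitting tool earns its keep.]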

Some flickr stats: 12,629 nagios checks, 1,314 hosts, 6 data centers, 4 photo farms, 3.5-4.5 TB consumed per day.

[So flickr uses nagios + ganglia]

One key trick is to build kill switches into every feature, so that features can be turned off when load increases.
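
[A kill switch can be as small as a per-feature flag checked on every request. The sketch below is my own illustration, not Flickr's implementation: the feature names, the load thresholds and the `KillSwitch` class are all hypothetical. Each feature turns itself off above a load threshold, and an operator can also force it off by hand.]

```python
from dataclasses import dataclass


@dataclass
class KillSwitch:
    name: str
    max_load: float          # feature is disabled above this load value
    forced_off: bool = False  # manual override set by an operator

    def is_on(self, current_load: float) -> bool:
        return not self.forced_off and current_load <= self.max_load


# Hypothetical features: the more expensive one gets the lower threshold.
SWITCHES = {
    "related_photos": KillSwitch("related_photos", max_load=8.0),
    "live_stats": KillSwitch("live_stats", max_load=4.0),
}


def active_features(current_load):
    """Names of the features that should still be served at this load."""
    return sorted(name for name, sw in SWITCHES.items() if sw.is_on(current_load))


print(active_features(2.0))   # light load: everything is on
print(active_features(6.0))   # moderate load: live_stats is shed
print(active_features(10.0))  # heavy load: only core functionality remains
```

[The point of the design is that shedding optional features is a configuration change, not a deploy, so it can happen in the middle of an incident.]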

About alq

Devops entrepreneur