Velocity: John Allspaw @flickr, Capacity Planning

24 06 2008

What can cause downtime:

  1. bugs
  2. edge cases
  3. security incidents
  4. real capacity problems

Deployment and management tools borrowed from the HPC world: Ganglia, SystemImager

Gather metrics, of course, and build models from them, ideally from live production data rather than artificial benchmarks.

fityk can be used in place of Excel for curve fitting. [My guess is that R would work great for that too.]
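
[To make the curve-fitting idea concrete, here is a minimal sketch of my own, not something shown in the talk: a plain least-squares straight line fitted to daily disk-usage samples, then extrapolated to estimate when a capacity ceiling is reached. The sample numbers, the 200 TB ceiling and the function names are all made up for illustration; real data would come out of the metrics system, and a tool like fityk lets you try non-linear curves too.]

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(slope: float, intercept: float,
                    capacity: float, today: float) -> Optional[float]:
    """Days left before the fitted line crosses capacity (None if not growing)."""
    if slope <= 0:
        return None
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - today)


# Hypothetical measurements: day number and terabytes used on that day.
days = [0, 1, 2, 3, 4]
used_tb = [100.0, 104.0, 108.0, 112.0, 116.0]

slope, intercept = fit_line(days, used_tb)
remaining = days_until_full(slope, intercept, capacity=200.0, today=days[-1])
print(f"growth: {slope:.1f} TB/day, days until full: {remaining:.0f}")
```

[A straight line is only the simplest possible model; if the live data bends, fit a different curve rather than trusting the linear forecast.]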

Some Flickr stats: 12,629 Nagios checks, 1,314 hosts, 6 data centers, 4 photo farms, 3.5-4.5 TB consumed per day.

[So Flickr uses Nagios + Ganglia]

One key trick is to build a kill switch into every feature, so that features can be turned off when load increases.
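
[A rough sketch of what a kill switch could look like, under my own assumptions; the talk did not show Flickr's implementation. Each feature has an on/off flag that the page code checks before doing the work, and a small helper flips non-essential features off once load passes a per-feature threshold. The feature names, thresholds and the `KillSwitches` class are hypothetical. In a real deployment the flags would live in shared configuration so that every server sees the same state.]

```python
from typing import Dict, Iterable, List


class KillSwitches:
    """In-memory on/off flags, one per feature (hypothetical helper)."""

    def __init__(self, features: Iterable[str]) -> None:
        # Every known feature starts enabled.
        self._enabled: Dict[str, bool] = {name: True for name in features}

    def is_enabled(self, feature: str) -> bool:
        # Unknown features are reported as off, the safe default.
        return self._enabled.get(feature, False)

    def set(self, feature: str, enabled: bool) -> None:
        if feature not in self._enabled:
            raise KeyError(f"unknown feature: {feature}")
        self._enabled[feature] = enabled


def apply_load(switches: KillSwitches, load: float,
               off_above: Dict[str, float]) -> List[str]:
    """Disable each feature whose threshold is exceeded; re-enable the rest.

    Returns the features that are switched off at this load level.
    """
    turned_off = []
    for feature, threshold in off_above.items():
        enabled = load <= threshold
        switches.set(feature, enabled)
        if not enabled:
            turned_off.append(feature)
    return turned_off


# Hypothetical features; "upload" has no threshold, so it is never shed.
switches = KillSwitches(["upload", "related_photos", "activity_feed"])
off_above = {"related_photos": 0.70, "activity_feed": 0.85}

# Load is expressed here as a fraction of capacity (an assumption).
disabled_now = apply_load(switches, load=0.80, off_above=off_above)

if switches.is_enabled("related_photos"):
    print("render related photos")
else:
    print("skip related photos, switch is off")
```

[The point of the design is that the check is cheap and sits in front of the expensive work, so shedding a feature needs no deploy.]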

