What can cause downtime:
- bugs
- edge cases
- security incidents
- real capacity problems
Deployment and management tricks from the HPC world: Ganglia, SystemImager.
Gather metrics, of course, and build models from them, ideally out of live data rather than artificial benchmarks.
Fityk can replace Excel for curve fitting. [My guess is that R would work great for that too]
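[A minimal sketch of the curve-fitting idea, in plain Python rather than Fityk or R: fit a straight line to usage samples and project when a capacity limit is hit. The sample values, the 200 TB limit, and the names `fit_line` and `days_until` are all made up for illustration; real growth may not be linear.]

```python
# Hypothetical samples: (day index, disk used in TB).
# Made-up illustration values, not data from the talk.
samples = [(0, 100.0), (1, 104.0), (2, 108.0), (3, 112.0), (4, 116.0)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until(capacity, slope, intercept, today):
    """Days from `today` until the fitted line reaches `capacity`."""
    if slope <= 0:
        return None  # no growth: the fitted line never reaches capacity
    return (capacity - intercept) / slope - today


slope, intercept = fit_line(samples)
# Assumed capacity limit of 200 TB, measured from the last sample (day 4).
remaining = days_until(200.0, slope, intercept, today=4)
print(f"growth: {slope:.1f} TB/day, days until full: {remaining:.0f}")
```

[Feeding this from live measurements rather than benchmark numbers is the point of the advice above; the fit itself is the easy part.]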
Some Flickr stats: 12,629 Nagios checks, 1,314 hosts, 6 data centers, 4 photo farms, 3.5-4.5 TB consumed per day.
[So Flickr uses Nagios + Ganglia]
One key trick is to build kill switches into every feature, so that features can be turned off when load increases.
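[A rough sketch of what a kill switch could look like, assuming a single numeric load figure and one load threshold per feature. The class `KillSwitches`, the feature names, and the thresholds are hypothetical, not how Flickr actually does it.]

```python
class KillSwitches:
    """Per-feature switches: off when forced off, or when load is too high."""

    def __init__(self, thresholds):
        # thresholds: feature name -> load above which the feature turns off.
        # Features without a threshold are only turned off manually.
        self.thresholds = dict(thresholds)
        self.forced_off = set()

    def disable(self, feature):
        """Manually turn a feature off, regardless of load."""
        self.forced_off.add(feature)

    def enable(self, feature):
        """Undo a manual disable; the load threshold still applies."""
        self.forced_off.discard(feature)

    def is_enabled(self, feature, current_load):
        if feature in self.forced_off:
            return False
        limit = self.thresholds.get(feature)
        return limit is None or current_load <= limit


# Hypothetical features and thresholds, for illustration only.
switches = KillSwitches({"related_photos": 0.8, "search_suggest": 0.6})

if switches.is_enabled("related_photos", current_load=0.7):
    print("render related photos")
else:
    print("skip related photos")
```

[Each feature checks its switch before doing expensive work, so shedding load is a config change rather than a deploy.]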