Getting fast internet to an office in Manhattan: work from home

As Datadog is looking for new digs, I've had the pleasure of spending half a day on the phone with various internet providers to upgrade the T1 that is already in place. In this day and age, a T1 is still brandished in the richest city in the world as some serious connectivity. Imagine: 1.5 megabits per second shared between phone lines and data lines. Back when 33.6 kilobits per second was all the rage, it made sense to show up at work to get online. But now I get 30 Mb/s at home... So real estate developers and building management companies need to update their "I-sound-cool" tech lingo and understand that a T1 is not a strong selling point.

Which brings me to the second point: getting internet service from providers. The choices are cable companies, phone companies, wireless providers and internet providers. I'll start with the cable companies.

Cable companies do not service every building, even though they have what turns out to be a decent offering ($300/month for 50 Mb/s down, 5 Mb/s up). I called Time Warner and RCN and neither will service the building. Never mind that as a consumer I would be charged $80/month for the same bandwidth; I'm still kindly asked to look somewhere else.

Next stop: phone companies. Fiber-to-the-home (aka FIOS in Verizon jargon) would be more than adequate. But of course, there is no fiber in the building. The best I am offered is 7 Mb/s, assuming the office is not too far from the Verizon building. It's also only $90/month including a phone line, which at this point would only be used occasionally to send a fax.

Wireless providers are a bit more promising. There I have a choice between overpriced enterprise WiMax at $800/month for a paltry 8 Mb/s both ways, Verizon LTE at $10 per GB (and a decent 10-15 Mb/s) or cheaper Clear(wire) consumer access at $50/month for 4-5 Mb/s.

Last came the internet providers proper, who seem to be milking smaller businesses by offering 10 Mb/s at a whopping $1,300 per month. Considering that it's a 17-story building and that 1 Gb/s should cost between $5k and $10k per month, it should be possible to buy one connection and split it across all tenants, rather than having 15 companies each pay upwards of $1,000 to selfishly enjoy 10 Mb/s.
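To put the quotes side by side, here is a rough sketch of the arithmetic in Python. The prices and speeds are the ones quoted above; taking the midpoint of the Clear range and splitting a shared link evenly between 15 tenants are my own simplifying assumptions.

```python
# Monthly price (USD) and downstream bandwidth (Mb/s) as quoted in the post.
# Clear is quoted at 4-5 Mb/s; the midpoint is used here (an assumption).
offers = {
    "cable (business)": (300, 50),
    "phone company": (90, 7),
    "enterprise WiMax": (800, 8),
    "Clear consumer": (50, 4.5),
    "internet provider": (1300, 10),
}


def cost_per_mbps(price, mbps):
    """Monthly dollars paid for each Mb/s of downstream bandwidth."""
    return price / mbps


def shared_link(total_price, total_mbps, tenants):
    """Per-tenant price and bandwidth if one link is split evenly."""
    return total_price / tenants, total_mbps / tenants


# Cheapest bandwidth first.
for name, (price, mbps) in sorted(
    offers.items(), key=lambda item: cost_per_mbps(*item[1])
):
    print(f"{name:20s} ${cost_per_mbps(price, mbps):7.2f} per Mb/s")

# One 1 Gb/s link at $5k-10k per month, shared by 15 tenants.
for link_price in (5000, 10000):
    share_price, share_mbps = shared_link(link_price, 1000, 15)
    print(f"${link_price}/month link: ${share_price:.0f} per tenant "
          f"for about {share_mbps:.0f} Mb/s each")
```

Even at the high end of the price range, each tenant would pay less than the standalone 10 Mb/s quote and get several times the bandwidth.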

Bottom line: if you want cheap internet, rent an apartment!

Ignite talks are not easy

Most recently I was attending devopsdays in Boston to get feedback on our work at datadoghq.com. The feedback was more than encouraging but I'll keep that for later. I had volunteered to give a short presentation on the outdated assumptions behind systems and application monitoring. The format is simple: 5 minutes, 20 slides auto-advancing every 15 seconds. I knew it was difficult but I failed to step back and think about how these constraints would affect my message.

The most obvious trap I fell into is that I felt compelled to use all 20 slides with different content rather than repeating the same slide over and over again. I could have gotten away with 3 slides rehashing the same idea rather than 20 slides trying to convey 5 major ideas. So that was my biggest mistake: trying to be too dense, cramming in too much content without giving the audience a chance to digest it.

In hindsight I would have picked 1 idea and expounded on it over the whole 5 minutes. After all, in normal conversation it will easily take 5 minutes to convey one idea, complete with arguments and examples.

Sensing that I could have done a better job, I went out looking for references on presenting and stumbled upon The Naked Presenter by Garr Reynolds. I highly recommend the first chapter. It reminds the reader that a presentation is not about the tools or the format; it's about the audience and the message. Why is my audience here and how do I avoid wasting their time?

All that said, it's always good to get a chance to improve.

EC2 Micro-instances value analysis

"Evaluating Amazon’s EC2 Micro Instances" at DocumentCloud is an interesting benchmark using image processing. For highly-parallel jobs, it means using a ton of micro instances to get results for cheaper. The ability to act on this kind of decision with a simple API call did not exist 5 years ago... Now what we need is a simple way to make this kind of decision easily. And that's what DataDog is about... More details soon.

Cassandra training with Jon Ellis from Riptano

Riptano, a newly-formed venture, now offers training and commercial support for Cassandra, a key-value store of Facebook lineage. Cassandra's initial claim to fame is being the data store behind Facebook's inbox search. The training session started with a relatively high-level presentation of Cassandra's data model before jumping quickly into some real code from Twissandra, a simplified Twitter clone based on Django. From there we were introduced to super-columns and their limitations, i.e. their subcolumns are not indexed, so one should not pack too much into a super-column.

As the day progressed we started to get deeper into operations and internals, where the rubber usually meets the road, and Jon was obviously very well-acquainted with the subject matter. My suggestion would be to add more diagrams to the presentation materials to illustrate the numerous points made during the session. Overall, considering the relative paucity of documentation on Cassandra, Jon's in-depth session is a nice shortcut compared to scouring mailing lists and reading the source code to get a solid grasp of the topic. In the context of DataDog we use Cassandra to persist all inbound signals reliably and with little latency. But I'll save details for later...
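To make the super-column caveat concrete, here is a toy model in Python. It is not Cassandra's storage format or client API; the class, its fields and the example subcolumn names are all made up for illustration. The only thing it mimics is the point above: without an index on subcolumns, reading one subcolumn means decoding the whole super-column.

```python
import json


class ToySuperColumn:
    """Toy stand-in for a super-column: all subcolumns live in one blob."""

    def __init__(self, subcolumns):
        # Serialize every (name, value) pair together, with no per-name index.
        self._blob = json.dumps(sorted(subcolumns.items()))
        self.subcolumns_decoded = 0  # running count of work done by reads

    def get(self, name):
        # No index: the whole blob is decoded and scanned for a single name.
        pairs = json.loads(self._blob)
        self.subcolumns_decoded += len(pairs)
        for key, value in pairs:
            if key == name:
                return value
        raise KeyError(name)


small = ToySuperColumn({"text": "hello", "user": "alice"})
big = ToySuperColumn({f"field{i}": i for i in range(10_000)})

small.get("text")
big.get("field42")
print("decoded for one read (small):", small.subcolumns_decoded)
print("decoded for one read (big):  ", big.subcolumns_decoded)
```

In this model the cost of a single read grows with the size of the super-column, which is why packing too much into one is a bad idea.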

Interesting EC2 DNS bug

EC2's internal DNS servers don't get updated when you stop and restart EBS-backed instances. I came across this bug as I was trying to get the Scala offline compiler (fsc) to work on a restarted instance. fsc uses java.net.InetAddress.getLocalHost(), which triggers a DNS lookup. After some time spent reading the code, a tcpdump session convinced me that the machine thought it was something else (at least at the DNS level). Call it split personality. To reproduce:
  1. start an EBS-backed instance
  2. note its name and its internal IP (uname -n, ip addr)
  3. stop and restart the instance
  4. observe that its node name remains unchanged and its IP has changed, yet dig +short instance_dns_name returns the old IP, even hours after the restart
Annoying!
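The check in step 4 can be scripted. Below is a minimal sketch in Python; the function name and the sample host name and addresses are made up for illustration. Note that socket.gethostbyname follows the system resolver configuration, so it may consult /etc/hosts before DNS, unlike dig.

```python
import socket


def stale_dns_answer(name, local_ips, resolver=socket.gethostbyname):
    """Return the IP the resolver gives for `name` when it is not one of the
    machine's current addresses, or None when the two agree."""
    resolved = resolver(name)
    return None if resolved in set(local_ips) else resolved


# Simulated restart: the resolver still answers with the pre-restart address.
# The host name and both addresses are hypothetical.
old_ip, new_ip = "10.0.0.5", "10.0.0.9"
stale = stale_dns_answer(
    "ip-10-0-0-5.ec2.internal", [new_ip], resolver=lambda _name: old_ip
)
print("stale answer:", stale)
```

On a real instance you would drop the resolver argument, pass the name from uname -n and the addresses listed by ip addr, and treat any non-None result as the split personality described above.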

CMG'09: Solaris/Linux Performance Measurement and Tuning (part 2)

Adrian Cockcroft (Netflix). My notes:
  • Netflix releases every 2 weeks, first in beta, and tracks everything
  • Everything at Netflix (or in web-land in general) is instrumented, in libraries, so that instrumentation comes for free
  • Beware of kernel tweaks, good for older kernels, now a lot more auto-tuned
  • On Solaris, microstate data very useful
  • With Poisson arrivals, steady state and N identical servers, response time can be approximated as R = S / (1 - utilization^N), where S = service time and utilization = throughput * S
  • Issues with this simplistic model: bursty traffic, service time varies, the N servers don't process the same thing, virtual hardware makes it a lot harder to figure out
  • Measurement errors (especially around measuring time)
  • So don't bother about utilization
  • Load average on Linux is broken, it includes disk activity
  • I/O wait is fundamentally broken, the CPU never waits for I/O per se
  • Cockcroft Headroom Plots: 99th-%ile against response time
  • On Linux, the best way to track I/O per process is with SystemTap
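To get a feel for the response time approximation in these notes, here is a small Python sketch. It uses the form in which the approximation is usually quoted, R = S / (1 - U^N), with the exponent applied to the utilization so that adding servers lowers the response time. The function names are mine, and treating throughput as a per-server figure is an assumption.

```python
def utilization(throughput, service_time):
    """Utilization as in the notes: throughput * service time.
    Assumes throughput is per server, in requests per second."""
    return throughput * service_time


def response_time(service_time, util, servers):
    """Approximate response time R = S / (1 - U^N) for N identical servers."""
    if not 0 <= util < 1:
        raise ValueError("utilization must be in [0, 1)")
    if servers < 1:
        raise ValueError("need at least one server")
    return service_time / (1 - util ** servers)


service = 0.010  # 10 ms service time, an arbitrary example value
for servers in (1, 2, 4):
    for util in (0.5, 0.9, 0.99):
        r_ms = response_time(service, util, servers) * 1000
        print(f"N={servers} U={util:.2f} -> R ~ {r_ms:.1f} ms")
```

The numbers show the usual hockey stick: response time stays close to the service time until utilization gets high, and more servers push the knee further out. The caveats in the notes still apply, so this is a back-of-the-envelope tool at best.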

CMG'09: "How 'normal' is your IT Data?"

Dr. Mazda Marvasti. My notes on this very informative talk (the best I've seen today). The goal of the study was to evaluate the normal-distribution assumption built into the newer IT monitoring tools, which create dynamic thresholds for the various metrics they collect.
  • Analyzed 4 workloads: ad-serving on LAMP, bond processing, stock trades and some online application
  • Test for normal distribution: Kolmogorov-Smirnov as it makes no assumption on the data distributions
  • Used averaged shifted histograms for the test
  • Results: none of the basic metrics (OS, applications, business-oriented) are normally distributed, neither are their averages, when looking at blocks of 1 hour
  • For instance Monday 9am does not look at all like Tuesday 9am
  • Also Mondays 9am don't on average converge, meaning that their averages are not independent and/or not identically distributed
  • Business cycles matter very much in analysis, spectral analysis can help!
  • Correlations examined using Spearman's rank correlation coefficient (though results not presented).
  • Conclusion: go for non-parametric analysis, known distributions don't really apply
  • If you enable dynamic thresholds based on normal distribution assumptions, expect a 10x increase in the number of alerts -- though it's possible to mitigate this with use of topology rules (e.g. "don't alert me if event 1 and event 2 coöccur")
My take on this: IT data analysis is challenging. One question is: how much is it worth, i.e. at what scale do you get your money back (and more) by getting this type of fairly sophisticated analysis and what kind of return can you expect of it? While the answer depends on the nature of the business conducted, I'm curious to see whether it's bigger shops with expensive applications, cloud-scale companies or whether this is going to percolate toward the smaller web shops, integral to an Infrastructure-as-a-Service offering? Stay tuned...
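For readers who want to try the normality check from the notes on their own metrics, here is a rough sketch in plain Python. It computes only the Kolmogorov-Smirnov statistic (the largest gap between the empirical distribution and a normal fitted to the data), not a p-value; because the normal's parameters are estimated from the same data, textbook KS critical values do not apply directly. The synthetic samples are mine, not data from the talk.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic_vs_normal(samples):
    """Largest gap between the empirical CDF of `samples` and the CDF of a
    normal distribution with the same mean and standard deviation."""
    xs = sorted(samples)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, (i + 1) / n - cdf, cdf - i / n)
    return gap


rng = random.Random(0)
bell_shaped = [rng.gauss(50, 10) for _ in range(2000)]
long_tailed = [rng.expovariate(1.0) for _ in range(2000)]  # latency-like

d_normal = ks_statistic_vs_normal(bell_shaped)
d_skewed = ks_statistic_vs_normal(long_tailed)
print(f"KS statistic, bell-shaped sample: {d_normal:.3f}")
print(f"KS statistic, long-tailed sample: {d_skewed:.3f}")
```

A statistic close to zero means the fitted normal tracks the data; the long-tailed sample should score markedly worse, which is the kind of gap the study reports for real IT metrics.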

CMG'09: "How do you analyze 100,000s of servers?"

Charles Loboz (Microsoft)
  • No homogeneous software/hardware/applications
  • Access is often limited (e.g. Hotmail servers are off-limits)
  • In the old days, 1 server analyzed per day
  • Stopped using averages and stddev (because data are not normal)
  • Built 10-bin histograms for utilization
  • Even that is limited, because long tails are the ones triggering issues (e.g. bad queries triggering load, then all queries will pile up)
  • No one cares about utilization (except data geeks), only performance matters
  • Estimate utilization impact on performance with a "Performance Impact Factor" (PIF): a weighted average of the histogram bins, with heavy utilization weighted more to make long tails more obvious; computed separately for CPU, network and I/O
Recipe
  • Compute histograms
  • Compute PIFs for each server
  • Cross-tabulate PIFs to server names to tag servers as underused, overloaded, etc.
  • Store everything in a database
Pitfalls
  • PIF averages don't mean anything
  • PIF is good at spotting a "dead-cold" server, but it cannot tell you that you have an issue, only that you have to investigate
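The talk did not give the actual PIF weights, so the sketch below is only my reading of the recipe, in Python: a 10-bin utilization histogram and a weighted average whose weights grow with utilization. The quadratic weights and the two sample servers are invented for illustration.

```python
def utilization_histogram(samples, bins=10):
    """Fraction of samples in each equal-width bin over 0-100% utilization.
    Samples are expected to lie in [0, 100]."""
    counts = [0] * bins
    for sample in samples:
        index = min(int(sample * bins / 100), bins - 1)
        counts[index] += 1
    return [count / len(samples) for count in counts]


def performance_impact_factor(histogram, weights=None):
    """Weighted average of histogram bins, favoring heavy utilization."""
    if weights is None:
        # Hypothetical weights: grow quadratically with the bin's utilization.
        size = len(histogram)
        weights = [((i + 1) / size) ** 2 for i in range(size)]
    return sum(share * weight for share, weight in zip(histogram, weights))


# Two made-up servers with the same 30% average CPU utilization.
steady = [30] * 8
spiky = [20] * 7 + [100]

pif_steady = performance_impact_factor(utilization_histogram(steady))
pif_spiky = performance_impact_factor(utilization_histogram(spiky))
print(f"steady: mean={sum(steady) / len(steady):.0f}% PIF={pif_steady:.3f}")
print(f"spiky:  mean={sum(spiky) / len(spiky):.0f}% PIF={pif_spiky:.3f}")
```

Both servers have the same average, but the one that spends time pegged at 100% gets the higher score, which is the long-tail effect that averages hide. Tagging servers as underused or overloaded would then be a matter of picking thresholds on the per-server PIF.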