Cassandra training with Jon Ellis from Riptano

Riptano, a newly formed venture, now offers training and commercial support for Cassandra, a key-value store of Facebook lineage. Cassandra’s initial claim to fame is being the data store behind Facebook’s inbox.

The training session started with a relatively high-level presentation of Cassandra’s data model before jumping quickly into some real code from Twissandra, a simplified Twitter clone based on Django. From there we were introduced to super-columns and their limitations: their subcolumns are not indexed, so one should not pack too much into a single super-column.

As the day progressed we started to get deeper into operations and internals where the rubber usually meets the road and Jon was obviously very well-acquainted with the subject matter. My suggestion would be to add more diagrams to the presentation materials to illustrate the numerous points made during the session.

Overall, considering the relative paucity of documentation on Cassandra, Jon’s in-depth session is a nice shortcut compared with scouring mailing lists and reading the source code to get a solid grasp of the topic.

In the context of DataDog we use Cassandra to persist all inbound signals reliably and with little latency. But I’ll save the details for later…

Posted in technical

Interesting EC2 DNS bug

EC2’s internal DNS servers don’t get updates when you stop and restart EBS-backed instances.

I came across this bug as I was trying to get fsc, the Scala offline compiler, to work on a restarted instance. fsc uses java.net.InetAddress.getLocalHost(), which triggers a DNS lookup. After some time spent reading the code, a tcpdump session convinced me that the machine thought it was something else (at least at the DNS level). Call it split personality.

To reproduce:

  1. start an EBS-backed instance
  2. note its name and its internal ip (uname -n, ip addr)
  3. stop and restart the instance
  4. its node name remains unchanged, its ip has changed, yet dig +short instance_dns_name returns the old IP, even hours after the restart
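
The mismatch can be checked from the instance itself by comparing what the resolver returns with the address actually bound to the interface. The sketch below is a minimal illustration in Python, not the fsc code path. The helper names `resolved_ips` and `is_stale` and the injectable `resolver` parameter are assumptions made for this example, and `socket.gethostbyname_ex` may consult /etc/hosts before DNS depending on the resolver configuration.

```python
import socket


def resolved_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses the resolver reports for hostname."""
    _name, _aliases, addresses = resolver(hostname)
    return set(addresses)


def is_stale(hostname, actual_ip, resolver=socket.gethostbyname_ex):
    """True when the resolver no longer returns the address the host really has."""
    return actual_ip not in resolved_ips(hostname, resolver)
```

On a restarted instance one would call `is_stale(socket.gethostname(), "<ip shown by ip addr>")`; a result of `True` matches the behaviour described in step 4.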

Annoying!

Posted in amazon, cloud computing, linux, networks

CMG’09: Solaris/Linux Performance Measurement and Tuning (part 2)

Adrian Cockcroft (Netflix)

http://www.slideshare.net/adrianco/solaris-linux-performance-tools-and-tuning

My notes:

  • Netflix releases every 2 weeks, first in beta, and tracks everything
  • Everything at Netflix (or in web-land in general) is instrumented in libraries, so that instrumentation comes for free
  • Beware of kernel tweaks, good for older kernels, now a lot more auto-tuned
  • On Solaris, microstate data very useful
  • With Poisson arrivals, steady state and N identical servers, response time is approximately R = S / (1 - utilization^N), where S = service time and utilization = per-server throughput * S
  • Issues with this simplistic model: bursty traffic, service time varies, the N servers don’t process the same thing, and virtual hardware makes it a lot harder to figure out
  • Measurement errors (especially around measuring time)
  • So don’t bother about utilization
  • Load average on Linux is broken, it includes disk activity
  • I/O wait is fundamentally broken, the CPU never waits for I/O per se
  • Cockcroft Headroom Plots: 99th-%ile against response time
  • On Linux, the best way to track I/O per process is with SystemTap
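
To get a feel for the response-time approximation in the notes, here is a small sketch using the usual multi-server form R = S / (1 - U^N), with U the per-server utilization. The function name and the sample values are illustrative assumptions, not something taken from the slides.

```python
def response_time(service_time, utilization, servers=1):
    """Approximate mean response time R = S / (1 - U**N)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    if servers < 1:
        raise ValueError("servers must be >= 1")
    return service_time / (1 - utilization ** servers)


# Response time (in units of S) at a few utilization levels for 1, 2 and 4 servers.
for u in (0.5, 0.9, 0.99):
    print(u, [round(response_time(1.0, u, n), 2) for n in (1, 2, 4)])
```

The curve stays flat for a long time and then climbs sharply as utilization approaches 1, which is one reason utilization alone is a poor predictor of response time.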

Posted in conference, monitoring

CMG’09: “How ‘normal’ is your IT Data?”

Dr. Mazda Marvasti

My notes on this very informative talk (the best I’ve seen today). The goal of the study was to evaluate the normal-distribution assumption built into the newer IT monitoring tools, which create dynamic thresholds for the various metrics they collect.

  • Analyzed 4 workloads: ad-serving on LAMP, bond processing, stock trades and some online application
  • Test for normal distribution: Kolmogorov-Smirnov as it makes no assumption on the data distributions
  • Used average shifted histograms for the test
  • Results: none of the basic metrics (OS, applications, business-oriented) are normally distributed, neither are their averages, when looking at blocks of 1 hour
  • For instance Monday 9am does not look at all like Tuesday 9am
  • Also, Mondays at 9am don’t converge on average, meaning that their averages are not independent and/or not identically distributed
  • Business cycles matter very much in analysis, spectral analysis can help!
  • Correlations examined using Spearman’s ranked correlation coefficient (though results not presented).
  • Conclusion: go for non-parametric analysis, known distributions don’t really apply
  • If you enable dynamic thresholds based on normal distribution assumptions, expect a 10x increase in the number of alerts, though it’s possible to mitigate this with topology rules (e.g. “don’t alert me if event 1 and event 2 coöccur”)
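
To make the difference between the two kinds of threshold concrete, here is a minimal sketch comparing a threshold that assumes normality (mean plus k standard deviations) with a non-parametric one (an empirical percentile). The data are made up, and the function names, k = 3 and the 99th percentile are assumptions chosen for illustration; none of this comes from the talk.

```python
import statistics


def normal_threshold(samples, k=3.0):
    """Upper threshold assuming normality: mean + k standard deviations."""
    return statistics.mean(samples) + k * statistics.pstdev(samples)


def percentile_threshold(samples, pct=99):
    """Nearest-rank empirical percentile; no distribution is assumed."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling of pct*n/100
    return ordered[rank - 1]


def alerts(samples, threshold):
    """Count samples strictly above the threshold."""
    return sum(1 for s in samples if s > threshold)


# Made-up, skewed data: mostly small values with a heavy tail.
history = [1] * 95 + [50] * 5
print("normal-assumption alerts:", alerts(history, normal_threshold(history)))
print("percentile alerts:", alerts(history, percentile_threshold(history)))
```

On skewed data like this, the mean and standard deviation are dragged around by the tail, so the normal-based threshold fires on values that the percentile-based one treats as ordinary.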

My take on this: IT data analysis is challenging. One question is how much it is worth, i.e. at what scale do you get your money back (and more) from this type of fairly sophisticated analysis, and what kind of return can you expect from it? While the answer depends on the nature of the business conducted, I’m curious to see whether it stays with bigger shops with expensive applications and cloud-scale companies, or whether it percolates toward the smaller web shops as an integral part of an Infrastructure-as-a-Service offering.

Stay tuned…

Posted in conference, technical

CMG’09: “How do you analyze 100,000s of servers?”

Charles Loboz (Microsoft)

  • No homogeneous software/hardware/applications
  • Access is often limited (e.g. Hotmail servers are off-limits)
  • In the old days, 1 server analyzed per day
  • Stopped using averages and stddev (because data are not normal)
  • Built 10-bin histograms for utilization
  • Even that is limited, because long tails are the ones triggering issues (e.g. bad queries trigger load, then all queries pile up)
  • No one cares about utilization (except data geeks), only performance matters
  • Estimate the impact of utilization on performance with a “Performance Impact Factor” (PIF): a weighted average of the histogram bins, with heavy-utilization bins weighted more to make long tails more obvious; computed separately for CPU, network and I/O

Recipe

  • Compute histograms
  • Compute PIFs for each server
  • Cross-tabulate PIFs to server names to tag servers as underused, overloaded, etc.
  • Store everything in a database
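
The recipe above can be sketched as follows. The notes do not record the actual weights or cut-offs used in the talk, so the quadratic weighting, the normalization to [0, 1] and the `low`/`high` thresholds below are all hypothetical choices made for illustration.

```python
def utilization_histogram(samples_pct, bins=10):
    """Fraction of samples falling in each equal-width bin of 0-100% utilization."""
    counts = [0] * bins
    width = 100 / bins
    for u in samples_pct:
        index = min(int(u // width), bins - 1)  # 100% lands in the last bin
        counts[index] += 1
    return [c / len(samples_pct) for c in counts]


def pif(histogram):
    """Weighted average of the bins; higher-utilization bins weigh more (assumed quadratic)."""
    weights = [(i + 1) ** 2 for i in range(len(histogram))]
    return sum(h * w for h, w in zip(histogram, weights)) / weights[-1]


def tag(score, low=0.05, high=0.5):
    """Tag a server from its PIF; the thresholds are hypothetical."""
    if score < low:
        return "underused"
    if score > high:
        return "overloaded"
    return "normal"


# Made-up CPU utilization samples (in percent) for three servers.
servers = {
    "web01": [2, 3, 4, 2],
    "db01": [5, 15, 95, 100],
    "batch01": [40, 45, 50, 55],
}
for name, samples in servers.items():
    score = pif(utilization_histogram(samples))
    print(name, round(score, 3), tag(score))
```

The same computation would be repeated per resource (CPU, network, I/O) and the tags stored alongside the server name.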

Pitfalls

  • PIF averages don’t mean anything
  • PIF is good at spotting a “dead-cold” server, but it cannot tell you that you have an issue, only that you have to investigate

Posted in conference, distributed computing, monitoring, production

At CMG’09 today

On paper it looked like a scientific approach to performance management, born in the mainframe days when computers were expensive. Now it’s cloud-scale that matters (and an ailing world economy if you’re not a bank) so managing capacity rigorously (and in an automated fashion) makes sense.

So far no breakthrough, though; it’s a bit too applied for my taste. Let’s see what the next sessions hold in store.


Looking into system performance of an Oracle data warehouse

Introduction

This is the start of an ongoing investigation into the system performance of an Oracle 10.2 data warehouse while it is being loaded. The database server has 2 real storage volumes (called dw-clear and dw-encrypt) and 1 virtual one (dw-encrypt-u) used to decrypt data on the fly. Most of the data and the I/O are on the dw-clear volume.

System performance data have been collected via sadc -d to capture per-device statistics. The data are then extracted using sadf -d filename -- -d -b -d. The summary is available here as a CSV. It’s a large table of block I/O stats, CPU stats and per-device I/O stats, suitable for import into R.

The system characteristics are as follows.

  • Sun x4150 64GB RAM, 2×4 x5450, 1 4Gb/s QL2462 HBA with 2 ports.
  • 3 device-mapper devices, 2 using a round-robin multipath (v1, v2), 1 using an on-the-fly cipher to decode encrypted data (v3).
  • 3PAR S400 with 10k drives and 4Gb/s HBAs.
  • Out of the 64GB, 8GB are set aside as HugePages to serve as memory pages for the SGA.

The goal of this investigation is to understand what the bottleneck is in the processing and what can be done to remove it.

Let’s start with cpu utilization.

[Figure: Distribution of CPU time spent in userland when not idle]

Not terribly loaded (I’m filtering out the long idle portions by keeping only samples with user > 5). How about I/O?

[Figure: % of CPU spent waiting on IO]

Interesting, iowait is not negligible. Is it correlated with anything in particular? First of all, let’s see how iowait varies with device utilization of v1.

[Figure: iowait against v1 device utilization]

v1 is slowly but surely bringing iowait higher, to the point that more than one processor ends up waiting on I/O.
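
The relationship can also be checked numerically rather than only by eye. The sketch below is in Python rather than R, uses a tiny made-up semicolon-separated sample in place of the real extract, and assumes hypothetical column names (`user`, `iowait`, `v1_util`); the Pearson coefficient is my choice of measure, not something used in the original analysis.

```python
import csv
import io
import math

# Made-up sample standing in for the real extract; column names are hypothetical.
SAMPLE = """user;iowait;v1_util
2;0.1;1
20;1.0;10
35;2.0;20
50;3.0;30
"""


def load(text):
    """Parse semicolon-separated text into a list of dicts of floats."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [{key: float(value) for key, value in row.items()} for row in reader]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Same filter as in the text: drop the long idle portions (user <= 5).
busy = [row for row in load(SAMPLE) if row["user"] > 5]
coefficient = pearson([row["v1_util"] for row in busy],
                      [row["iowait"] for row in busy])
print(len(busy), round(coefficient, 3))
```

A coefficient close to 1 on the real data would support the reading of the plot; a weak one would suggest looking at the other devices.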

To be continued…

Posted in investigation, oracle, storage, sun, technical

Blog battle on the storage appliance front

Backblaze has started an interesting conversation by detailing how they get to $117,000 per PB, down to the type and number of SATA cards used in their design. A great PR move for a company in the crowded personal backup space. Of course, publishing comparisons with Dell, Sun, NetApp and EMC at 8x, 10x, 30x the price is a sure way to stir people’s emotions. The first to publish a lengthy response (that StorageMojo could find) is Joerg Moellenkamp in a blog post. It is laudable for pointing out design flaws, but the two products target fundamentally different markets. Sure, Sun’s hardware is a great piece of engineering, squarely aimed at the enterprise market. Which, incidentally, is not buying in droves, and Sun’s financials clearly reflect that. Backblaze took the Google route for storage and it’s hard to see, given the competitive pressure, how they would be better off spending their margin on Sun hardware. The era of gold-plated hardware is slowly drawing to a close and I can’t say I oppose that change.

Posted in storage, sun, technical

Netflix describes its culture


Catching up on Velocity 09

This year I could not attend Velocity so I decided to catch up via http://velocityconference.blip.tv. Here are a few notes on the sessions I have been able to see so far.

John Allspaw (Ops) & Paul Hammond (Dev): 10+ Deploys per day: Dev/Ops coöperation at Flickr

This is a topic dear to my heart: changing the culture shared (or not) by dev and ops.

  • Contrary to popular wisdom, ops’ real mission is not to keep the service stable per se, but to enable the business.
  • Business requires change
  • Build the tools and the culture that allow repeated change with minimal uncertainty.
  • Automate your infrastructure
  • Use one shared source control system between devs and ops, so that everyone on the team knows where to look
  • Reduce all manual steps down to one, that of deciding to build and deploy
  • Small frequent changes better than fewer large changes
  • Use “feature flags”, i.e., use code to enable features, rather than branches
  • Ship TRUNK so that everyone knows what gets released
  • Feature flags allow for private betas and reduce uncertainty
  • Dark launches: enable the feature to exercise the data path but don’t present the results to the end-user
  • Metrics, metrics, metrics
  • Add context to the metrics, such as the last time something was deployed
  • We use IRC and IM bots to bring system updates into the conversation between dev and ops in real time, then push the logs into a search engine
  • Develop respect and trust between devs and ops
  • Have a healthy attitude toward failure (don’t blame, fix the problem first)
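
Feature flags and dark launches from the list above can be sketched in a few lines. This is a minimal illustration, not Flickr’s implementation: the `FLAGS` layout, the flag name `new_search` and the two search functions are all hypothetical.

```python
# Hypothetical flag configuration, shipped with the code on trunk.
FLAGS = {
    "new_search": {"enabled": False, "beta_users": {"alice"}, "dark_launch": True},
}

dark_log = []  # results of dark-launched calls, kept for inspection only


def is_enabled(flag, user):
    """The feature is on for everyone, or for this user as a private beta."""
    config = FLAGS.get(flag, {})
    return config.get("enabled", False) or user in config.get("beta_users", set())


def is_dark(flag):
    """The new code path should be exercised without showing its results."""
    return FLAGS.get(flag, {}).get("dark_launch", False)


def old_search(query):
    return ["old:" + query]


def new_search(query):
    return ["new:" + query]


def search(query, user):
    if is_enabled("new_search", user):
        return new_search(query)
    if is_dark("new_search"):
        dark_log.append(new_search(query))  # exercise the data path, discard the result
    return old_search(query)
```

Both code paths live on trunk; switching the feature on for everyone is a configuration change rather than a branch merge.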

Posted in culture