What’s happened in the last four months

29 10 2008

I have been busy building a colocated platform for a client of ours. In early June, work started on a brand new development environment, which we built with VMware ESX, a host of Dell R805s and a 3PAR E200. In mid-August we were able to “move” into the “production” suite with racks and power. Three months later, we are ready to launch the full platform, which has left me with little time for non-essential things such as posting here.

Now that the launch is almost here, I’ll be happy to distill what I’ve learned in the process. Stay tuned.





A utility function to make a storage decision

11 07 2008

I am reproducing an internal twiki post that I put together when we decided to give the Sun Thumper (x4500) a try as an iSCSI target running ZFS.

Introduction

Inspired by a nice paper on utility, the following model aims to provide a quantitative justification for choosing a SAN solution. The approach relies on a “utility” function that takes a number of factors into account (explained below) and comes up with a one-dimensional measure (no unit).

Factors are:

  • Performance (measured in iops)
  • Capacity (here assumed to be usable, typically raid5 for dev/test, expressed in TB)
  • Availability (aggregate in %)
  • Power (in kWh)
  • Acquisition cost (in $, depreciated over n years)
  • Revenue (in $, if we can tie that to some business figure)
  • Management (in hours per TB)
  • Reliability (fractional and total, in %), a measure of the chances to lose some or all of the datasets

Definitions

Utility = Revenue - Cost(downtime) - Cost(data loss) - Cost(management) - Acquisition
Cost(downtime) = (1 - Availability) * SAN size * #developers per TB * hourly developer rate
Cost(data loss) = Hourly Failure * SAN size * restore hr per TB * (hourly IT rate * #IT staff per TB + hourly developer rate * #developers per TB)
and
Acquisition = capex and opex over the lifetime of the SAN

We will assume that revenue is 0 for development and testing. This is of course not true, but since every scenario yields the same basic functionality, revenue is the same for all of them and does not affect the comparison.

We also assume that we have a backup of everything. In case this is impossible, the restore time becomes that of recreating data from scratch.

The winner is the solution with the highest utility.
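
To make the definitions above concrete, here is a minimal Python sketch of the utility function. It is an illustration only: the class and field names (SanOption, Staffing, management_cost) and the `hours` multiplier are assumptions, not part of the original model. Cost(management) is not defined above, so the sketch simply takes it as a dollar amount. With hours=1 the two cost terms are exactly the formulas as written.

```python
from dataclasses import dataclass


@dataclass
class SanOption:
    """Inputs for one candidate SAN. Field names are illustrative."""
    name: str
    size_tb: float               # SAN size, in TB
    availability: float          # fraction of time available, e.g. 0.999
    hourly_failure: float        # "Hourly Failure" term from the post
    restore_hours_per_tb: float  # restore hr per TB
    acquisition: float           # capex and opex over the lifetime, in $
    management_cost: float = 0.0 # assumed to be given directly in $
    revenue: float = 0.0         # assumed 0 for dev/test, as in the post


@dataclass
class Staffing:
    """People-related inputs shared by all candidates."""
    developers_per_tb: float
    developer_rate: float        # $ per hour
    it_staff_per_tb: float
    it_rate: float               # $ per hour


def cost_downtime(option, staff):
    # (1 - Availability) * SAN size * #developers per TB * hourly developer rate
    return ((1 - option.availability) * option.size_tb
            * staff.developers_per_tb * staff.developer_rate)


def cost_data_loss(option, staff):
    # Hourly Failure * SAN size * restore hr per TB * (IT cost + developer cost)
    people = (staff.it_rate * staff.it_staff_per_tb
              + staff.developer_rate * staff.developers_per_tb)
    return (option.hourly_failure * option.size_tb
            * option.restore_hours_per_tb * people)


def utility(option, staff, hours=1.0):
    """Utility = Revenue - downtime - data loss - management - acquisition.

    `hours` scales the two hourly cost terms; it is an assumption of this
    sketch, since the post does not state a time horizon.
    """
    return (option.revenue
            - hours * cost_downtime(option, staff)
            - hours * cost_data_loss(option, staff)
            - option.management_cost
            - option.acquisition)


def best_option(options, staff, hours=1.0):
    """Return the candidate with the highest utility."""
    return max(options, key=lambda o: utility(o, staff, hours))
```

Calling best_option with one SanOption per candidate and a shared Staffing record returns the winner; the inputs are whatever figures you can defend for your own environment.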





Attempts at using a SunFire x4500 “Thumper” as an iSCSI SAN

6 07 2008

For development purposes I was looking for a “cheap” upgrade over our aging Apple XServeRAID Fiber Channel arrays. These served us well, if one excludes the lack of LUN masking in the later firmware versions, but we have consistently outgrown their native capacity. Besides, Apple (and its dismal enterprise support) has stopped selling them, so we have been left with no choice but to look elsewhere.

The basic requirements are:

  1. Fiber Channel or iSCSI target support for database workloads
  2. Ability to carve LUNs out of a pool that is large enough (at least 20 TB of raw storage)
  3. Ability to clone volumes
  4. Ability to take snapshots within seconds

There are a few candidates that fit the bill: EqualLogic, Sun Thumper, 3PAR, Compellent, the XServe replacement from Promise coupled with LVM, and various Overland devices.

The general hotness of ZFS and the “Try-n-buy” program from Sun made it easy to give their hardware and software a try. After all, Solaris is not that different from Linux (or should I say it the other way around), the hardware is dirt cheap (you can’t get much cheaper) and Sun’s expertise with Opteron-based hardware has been proven in-house by their wonderful SunFire x4600.

To make a long story short, the iSCSI target daemon on Solaris 10 is not stable enough for production use. We were plagued by numerous core dumps, which caused the iSCSI setup to flicker and the initiators to only moderately appreciate the frequent interruptions. Setting up OpenSolaris seemed to help a bit, but our trust in the stability of the code had suffered an irremediable blow (well, not quite irremediable, but at least for another year or so).

Pressed for time, I have decided to cut our losses and spend more money on a 48 TB 3PAR E200, vastly more expensive per TB but also known to work. I really wish the Thumper trial had been successful. I believe in OpenSolaris, and what I’ve seen of ZFS makes me green with envy (compared to ext3 + LVM), but I simply cannot justify downtime and/or hiring a Solaris core code guru to troubleshoot this mess.





Structure08, my impressions so far

25 06 2008

It’s a day packed with keynotes, panels and schmoozing, with some topics overlapping with Velocity, though at a much higher level. Some panels were interesting, such as “Harnessing explosive growth”, where the key points were:

  1. a proper architecture lets you scale [much like in traditional building]
  2. build kill switches in all your features
  3. get operations and development into a symbiotic relationship [Salesforce and Amazon do it]

Some other panels are clearly more about pushing a product (“The race to the next database”). The topic of processing data (possibly in the cloud) is of course crucial, yet concerns around switching costs, security and privacy are barely addressed. My take is that if you need to run analytics on your data sets and said data sets are huge, you need compute to be close (from a network distance perspective) to your data, which means that your data must be in the cloud. While I’m reluctant to go down that route right now, Greg Papadopoulos @sun made the compelling analogy that money storage is delegated to reputable third parties (called banks), so data is likely to follow the same path, i.e. the cloud is likely to become the most secure place to store data (or the most resilient with an acceptable security policy). Sun’s interesting take on cloud computing is Project Caroline, where infrastructure, including network bits, is driven by code in a way that’s presumably a bit cleaner than EC2 (which is quite bare).

Dr. Vogels’ presentation @amzn was inspiring despite containing little new information. It fits well into this type of conference, which acts as a reinforcement device to jumpstart a new industry.

Live coverage is at gigaom.





Velocity: John Allspaw @flickr, Capacity Planning

24 06 2008

What can cause downtime:

  1. bugs
  2. edge cases
  3. security incidents
  4. real capacity problems

Deployment and management tricks from the HPC world: Ganglia, SystemImager

Gather metrics of course, and build models, ideally out of live data, rather than artificial benchmarks.

fityk can replace Excel for curve fitting. [My guess is that R would work great for that too]
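
[The curve-fitting step can be sketched in a few lines of plain Python, as a least-squares straight line through observed usage. This is not fityk and not what flickr runs; the function names are hypothetical, and a straight line is assumed purely for simplicity.]

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_tb, capacity_tb):
    """Days after the last sample until the fitted line reaches capacity_tb.

    Returns None when the fitted usage is flat or shrinking.
    """
    slope, intercept = fit_line(days, used_tb)
    if slope <= 0:
        return None
    day_full = (capacity_tb - intercept) / slope
    # Clamp at zero: the line may already be above capacity.
    return max(0.0, day_full - days[-1])
```

[Feed it measurements taken from live systems rather than benchmark numbers, which is the point of the talk. A real growth curve may well not be linear, in which case a different model is needed.]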

Some flickr stats: 12,629 nagios checks, 1314 hosts, 6 data centers, 4 photo farms, 3.5-4.5 TB consumed per day.

[So flickr uses nagios + ganglia]

One key trick is to build kill switches in all the features so as to turn things off when load increases.
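
[A kill switch can be as simple as a lookup consulted before a feature runs. The sketch below is a hypothetical in-memory version; the class, the method names and the load-threshold idea are assumptions, not a description of how flickr implements it.]

```python
class KillSwitches:
    """Tracks which features are switched off, manually or under load."""

    def __init__(self):
        self._killed = set()      # features turned off by hand
        self._shed_above = {}     # feature -> load level above which it is shed

    def register(self, feature, shed_above=None):
        """Declare a feature, optionally with an automatic load threshold."""
        if shed_above is not None:
            self._shed_above[feature] = shed_above

    def kill(self, feature):
        """Turn a feature off regardless of load."""
        self._killed.add(feature)

    def restore(self, feature):
        """Undo a manual kill."""
        self._killed.discard(feature)

    def is_enabled(self, feature, load=0.0):
        """True if the feature should run at the given load level."""
        if feature in self._killed:
            return False
        threshold = self._shed_above.get(feature)
        return threshold is None or load <= threshold
```

[Request-handling code would call is_enabled("comments", load=current_load) and skip the feature when it returns False. In production the state would live somewhere every server can read it, which this sketch does not attempt.]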





Velocity: Adam Bechtel @yahoo, Performance plumbing

24 06 2008

When building a global network, you start building out knobs (usually implemented as routing policies):

cost, packet loss, latency, maintenance, diversity, isolation, “special”

[Really funny analogy between anycast and toilets, caching and water supply]

After having developed routing policies, you start looking into anycast. One of the first services to be anycast is DNS.

Anycast scaling: vip, ecmp

Anycast considerations: how to monitor services? how to control users? how to handle transient network events?





Velocity: Panel, a survival guide

24 06 2008

Panelists: presented by Adam Jacob (HJK Solutions), Shayan Zadeh (Zoosk, Inc.), Brian Moon (dealnews.com), Don MacAskill (SmugMug), John Allspaw (Flickr (Yahoo!)), Michael Halligan (BitPusher, LLC) and a gentleman (Fotolog)

Don MacAskill: Rafael Nadal started to win Roland-Garros and his fan club was there. He won the Open, which created a huge traffic spike. Comments had to be turned off for the site to survive. The next year, he won again and stats had to be turned off. For his third victory the servers did not collapse. This year he won and it did not even register.

John Allspaw: code gets pushed 20 to 30 times a day… Major events triggered traffic spikes.

Don would love to not operate a data center anymore, despite their expertise.

John: DB problems are hard [everyone in agreement, myself included]

[Discussion follows on scalability: do not optimize for scale too early]

Don: EC2 is not worth it for servers that run around the clock, but it can be if you’re good at shutting down instances that you don’t need.





Velocity: Sean Quinlan @google, Storage at scale

24 06 2008

Strategy: buy lots of commodity hardware, because problems tend to be too big for their problem space. Hardware reliability is not that useful either, because it’s expensive.

[Showing the same pictures over and over again, someone from Google PR, please authorize the release of newer pictures]

[A GFS description follows, nothing new so far, read the papers on the topic]

[A BigTable description follows, same deal]

I wish this talk had some new information…





Velocity: Rich Wolski @ucsb, EUCALYPTUS

24 06 2008

Eucalyptus is an open-source implementation (not production-ready) of a compute cloud that is API-compatible with EC2. In academia, sysadmin time is very expensive, so the roll-out has to be really simple. Eucalyptus currently uses Xen and includes a security layer that replaces Amazon’s use of the credit card authentication/authorization system.

Mention of ROCKS, a cluster deployment system.





Velocity: Brent Chapman @great circle, what can IT professionals learn from emergency services?

23 06 2008

Example: a car hits a fire hydrant. Lots of agencies are involved (fire department, ambulance, police, electrical company). How do they coördinate all that?

Incident Command System is the protocol used in pretty much all emergency situations (courses available here).

I’ll post a pointer to the slides; the example used in the talk is good. The Wikipedia article is supposedly good, and this article from ham radio operators is a good introduction.