A utility function to make a storage decision

11 07 2008

I am reproducing an internal twiki post that I put together when we decided to give the Sun Thumper (x4500) a try as an iSCSI target running ZFS.

Introduction

Inspired by a nice paper on utility, the following model aims to provide a quantitative justification for choosing a SAN solution. The approach relies on building a “utility” function that takes a number of factors into account (explained below) and reduces them to a single one-dimensional measure (no unit).

Factors are:

  • Performance (measured in IOPS)
  • Capacity (here assumed to be usable, typically RAID 5 for dev/test, expressed in TB)
  • Availability (aggregate in %)
  • Power (in kWh)
  • Acquisition cost (in $, depreciated over n years)
  • Revenue (in $, if we can tie that to some business figure)
  • Management (in hours per TB)
  • Reliability (fractional and total, in %), a measure of the chances to lose some or all of the datasets

Definitions

Utility = Revenue - Cost(downtime) - Cost(data loss) - Cost(management) - Acquisition
Cost(downtime) = (1 - Availability) * SAN size * #developers per TB * hourly developer rate
Cost(data loss) = Hourly Failure * SAN size * restore hr per TB * (hourly IT rate * #IT staff per TB +  hourly developer rate * #developers per TB)
and
Acquisition = capex and opex over the lifetime of the SAN

We will assume that revenue is 0 for development and testing. This is of course not true, but since every scenario yields the same basic functionality, revenue would be the same for all of them and does not affect the comparison, so this is a safe assumption to make.

We also assume that we have a backup of everything. In case this is impossible, the restore time becomes that of recreating data from scratch.

The winner is the solution with the highest utility.
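The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names are made up for this example, Cost(management) is taken as a dollar figure supplied directly (the definitions do not break it down further), "Hourly Failure" is used exactly as it appears in the formula, and the figures in the usage section are placeholders, not measurements from any real array.

```python
from dataclasses import dataclass, replace


@dataclass
class SanScenario:
    """Inputs for one candidate SAN. All names here are illustrative."""
    revenue: float                # $, assumed 0 for dev/test
    acquisition: float            # $, capex and opex over the SAN lifetime
    management_cost: float        # $, supplied directly (assumption)
    availability: float           # fraction, e.g. 0.999
    hourly_failure: float         # "Hourly Failure" term from the definitions
    san_size_tb: float            # SAN size in TB
    restore_hours_per_tb: float   # restore time in hours per TB
    developers_per_tb: float
    hourly_developer_rate: float  # $ per hour
    it_staff_per_tb: float
    hourly_it_rate: float         # $ per hour


def cost_downtime(s: SanScenario) -> float:
    # (1 - Availability) * SAN size * #developers per TB * hourly developer rate
    return ((1 - s.availability) * s.san_size_tb
            * s.developers_per_tb * s.hourly_developer_rate)


def cost_data_loss(s: SanScenario) -> float:
    # Hourly cost of the people tied up by a restore, per TB.
    people_cost_per_tb = (s.hourly_it_rate * s.it_staff_per_tb
                          + s.hourly_developer_rate * s.developers_per_tb)
    return (s.hourly_failure * s.san_size_tb
            * s.restore_hours_per_tb * people_cost_per_tb)


def utility(s: SanScenario) -> float:
    # Utility = Revenue - Cost(downtime) - Cost(data loss)
    #           - Cost(management) - Acquisition
    return (s.revenue - cost_downtime(s) - cost_data_loss(s)
            - s.management_cost - s.acquisition)


if __name__ == "__main__":
    # Placeholder numbers, for illustration only.
    cheap = SanScenario(
        revenue=0.0, acquisition=50_000.0, management_cost=10_000.0,
        availability=0.99, hourly_failure=0.001, san_size_tb=20.0,
        restore_hours_per_tb=4.0, developers_per_tb=2.0,
        hourly_developer_rate=50.0, it_staff_per_tb=0.5, hourly_it_rate=40.0,
    )
    # Same workload on a pricier but more available array.
    pricey = replace(cheap, acquisition=150_000.0, availability=0.9999)

    candidates = {"cheap": cheap, "pricey": pricey}
    for name, scenario in candidates.items():
        print(name, round(utility(scenario), 2))
    print("winner:", max(candidates, key=lambda n: utility(candidates[n])))
```

Keeping each cost term in its own function makes it easy to see which factor dominates the comparison for a given set of inputs.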





Attempts at using a SunFire x4500 “Thumper” as an iSCSI SAN

6 07 2008

For development purposes, I was looking for a “cheap” upgrade over our aging Apple XServeRAID Fiber Channel arrays. These served us well, if one excludes the lack of LUN masking in the later firmware versions, but we have consistently outgrown their native capacity. Besides, Apple (and its dismal enterprise support) has stopped selling them, so we have been left with no choice but to look elsewhere.

The basic requirements are:

  1. Fiber Channel or iSCSI target support for database workloads
  2. Ability to carve LUNs out of a pool that is large enough (at least 20 TB of raw storage)
  3. Ability to clone volumes
  4. Ability to take snapshots within seconds

There are a few candidates that fit this bill: EqualLogic, Sun Thumper, 3PAR, Compellent, the XServe replacement from Promise coupled with LVM, and various Overland devices.

The general hotness of ZFS and the “Try-n-buy” program from Sun made it acceptable to give their hardware and software a try. After all, Solaris is not that different from Linux (or should I say the other way around), the hardware is dirt cheap (you can’t get much cheaper), and Sun’s expertise with Opteron-based hardware has been proven in-house by their wonderful SunFire x4600.

To make a long story short, the iSCSI target daemon on Solaris 10 is not stable enough for production use. We were plagued with numerous core dumps, causing the iSCSI setup to flicker and initiators to moderately appreciate the frequent interruptions. Setting up OpenSolaris seemed to help a bit, but our trust in the stability of the code had suffered an irremediable blow (well, not quite irremediable, but at least for another year or so).

Pressed for time, I have decided to cut our losses and spend more money on a 48 TB 3PAR E200, vastly more expensive per TB but also known to work. I really wish the Thumper trial had been successful: I believe in OpenSolaris, and what I’ve seen of ZFS makes me green with envy (compared to ext3 + LVM), but I simply cannot justify downtime and/or hiring a Solaris core code guru to troubleshoot this mess.