Started a friendfeed webops public group

Feel free to join: http://friendfeed.com/web-ops

Posted in cloud computing

#structure09 Hosting on commodity hardware

I just got out of the panel on commodity hardware and did not get a chance to participate, so here’s my take on it.

The panel started with an opening question: Google, Amazon and the like run at huge scale on commodity hardware, yet enterprise vendors still push customized, and expensive, hardware. Why?

To me the answer is pretty obvious: enterprise hardware is, for the most part, sold to people who don’t know how to architect and design software on a commoditized stack. And by stack I mean everything from the server all the way up to the application code. Let’s be honest, look at most “enterprise” hardware/software literature: it’s just noise and a waste of both the writer’s and the reader’s time.

If you constrain yourself to buy servers that cost no more than $5k, buying high-end database software makes little sense. Rather you recognize that low-end compute is how you get economies of scale and you apply the same reasoning to your networking gear, storage systems, database software, load balancing software, etc.

Google, from its earlier papers, seems to be the first to have understood that, rejecting the usual marketing garbage from large vendors. And for that we should be grateful.

Posted in architecture, cloud computing, infrastructure

I love Amazon Web Services open pricing

I’ve just spent 2 hours crafting a spreadsheet to compare how much it would cost to set up a decent platform to deliver the kind of data services I manage vs. the same on EC2. Easy access to pricing is a key input that’s often hard to get from vendors without being subjected to the “custom solution” time-waste. Technology vendors: your customers, more often than not, know what they want. When I ask for a price list, don’t try to second-guess whether I’ve done my homework, just give me the price list. If I have questions regarding the “solution”, I’ll be more than happy to ask.

Posted in Uncategorized

How about sub-second queries in Hadoop?

Two observations from talking and listening to people during the Hadoop Summit. First, Hadoop is used quite often to process clickstream data (in all fairness, I missed the talk about Hadoop used for genomics). Second, and as a corollary of the first, sub-second queries in Hive or Pig are not quite there yet. Since a Hive query translates into map and reduce tasks, their scheduling, in addition to the sheer volume of data, determines response time. Pre-computing aggregates is a natural way to go, much like what is done for data warehouses.

Where these aggregates should be stored for consumption is a problem that could lead to hybrid solutions: process the data with Hadoop, then export the results to Postgres or Infobright to enjoy a more mature (but less scalable) run-time environment. You get multi-terabyte daily processing and sub-second analytics, all of it open source.
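To make the hybrid idea concrete, here is a minimal sketch, not something I have run at scale. It assumes a tiny made-up clickstream and uses Python's built-in sqlite3 as a stand-in for Postgres or Infobright. The table name `hits_hourly` and the helper functions are hypothetical. The batch step plays the role of the map/reduce job; the serving step is a plain keyed lookup.

```python
import sqlite3
from collections import Counter

# Hypothetical clickstream events: (hour bucket, page). In the setup described
# above the raw data would sit in HDFS and be far too large to scan per query.
EVENTS = [
    ("2009-06-10T14", "/home"),
    ("2009-06-10T14", "/home"),
    ("2009-06-10T14", "/pricing"),
    ("2009-06-10T15", "/home"),
]


def aggregate(events):
    """Batch step: count hits per (hour, page), the job a map/reduce pass would do."""
    return Counter(events)


def export(counts, conn):
    """Load the pre-computed aggregates into a small relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits_hourly ("
        "hour TEXT, page TEXT, hits INTEGER, PRIMARY KEY (hour, page))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO hits_hourly VALUES (?, ?, ?)",
        [(hour, page, n) for (hour, page), n in counts.items()],
    )
    conn.commit()


def hits(conn, hour, page):
    """Serving step: a primary-key lookup instead of a scan over raw events."""
    row = conn.execute(
        "SELECT hits FROM hits_hourly WHERE hour = ? AND page = ?", (hour, page)
    ).fetchone()
    return row[0] if row else 0


conn = sqlite3.connect(":memory:")  # stand-in for the real serving database
export(aggregate(EVENTS), conn)
print(hits(conn, "2009-06-10T14", "/home"))  # prints 2
```

The point of the split is that response time at query time depends only on the size of the aggregate table, not on the volume of raw events or on job scheduling.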

If you’ve done something like that, I’d be interested to know before I embark on a route where others have failed before.

Posted in architecture, database, distributed computing, infrastructure, rdbms, scalability, yahoo

Notes from the 2009 Hadoop Summit West

I just got back from Santa Clara where Yahoo and Cloudera were hosting the 2009 Hadoop Summit West on Wednesday, followed by a training session on Thursday. My interest was that of a prospective user: to gauge how real and mature Hadoop is.

The turn-out was more than decent, in the hundreds: a number of people from Yahoo, which runs the largest clusters so far, a few folks from Amazon and Facebook, some local universities, and a fair number of small companies that have deployed their own clusters (or are running on EC2).

The good news first: Hadoop is real and it’s getting real use. It’s clearly a promising platform with active use and development. The scaling model is fairly simple: buy more machines. The current sweet spot is dual quad-core hosts with 4x1TB drives and 16GB or so of ECC RAM. Decoupling storage from a central system (à la SAN) is the way to go. Some folks have tried, with some success, to hook up Thumpers to Niagara chips that run a lot of threads in parallel, but the TCO question is unclear.

Hence we can start with a handful of cheap machines and go from there. A few things to watch for: the secondary name node, for instance, is not there for backup but to persist the DFS layout structures that exist in RAM on the primary name node. It could have been implemented in a more robust fashion using a SQL database rather than requiring a re-implementation of redo logs and data files.
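For readers unfamiliar with the "redo logs and data files" pattern, here is a toy sketch of the general idea: an in-memory table made durable by an append-only edit log plus a periodic checkpoint. This is not Hadoop's implementation. The class `TinyNamespace`, the file names and the JSON format are all assumptions made for illustration.

```python
import json
import os
import tempfile


class TinyNamespace:
    """Toy in-memory file table persisted as a checkpoint plus an append-only edit log."""

    def __init__(self, directory):
        self.image_path = os.path.join(directory, "image.json")
        self.log_path = os.path.join(directory, "edits.log")
        self.files = {}
        self._recover()

    def _recover(self):
        # Load the last checkpoint, then replay every edit logged after it.
        if os.path.exists(self.image_path):
            with open(self.image_path) as f:
                self.files = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, edit):
        if edit["op"] == "add":
            self.files[edit["path"]] = edit["blocks"]
        elif edit["op"] == "delete":
            self.files.pop(edit["path"], None)

    def _log(self, edit):
        # Write the edit to disk before changing the in-memory state.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(edit) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(edit)

    def add(self, path, blocks):
        self._log({"op": "add", "path": path, "blocks": blocks})

    def delete(self, path):
        self._log({"op": "delete", "path": path})

    def checkpoint(self):
        # Fold the current state into a fresh image, then truncate the log.
        tmp = self.image_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.files, f)
        os.replace(tmp, self.image_path)
        open(self.log_path, "w").close()


workdir = tempfile.mkdtemp()
ns = TinyNamespace(workdir)
ns.add("/logs/a", [1, 2])
ns.checkpoint()
ns.add("/logs/b", [3])
ns.delete("/logs/a")

# Simulate a restart: a new instance rebuilds its state from disk alone.
recovered = TinyNamespace(workdir)
print(recovered.files)  # prints {'/logs/b': [3]}
```

Even this toy version has to get ordering and recovery right, which is the kind of work an off-the-shelf SQL database already does.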

That’s overall the negative point: applications built on the platform (such as Hive, HBase and Pig) are still pretty much works in progress, and they somewhat duplicate existing functionality. An air of Not Invented Here still pervades, but it’s a sign that the whole thing is still young. A vocal user base that meets regularly should help the project focus on the pieces that truly do not exist yet.

Posted in infrastructure, scalability, storage

Very interesting talk about SmugMug

A few key points: 2 ops people, automatic scaling, 1000s of cores on EC2, PBs of storage on S3.

http://mysqlconf.blip.tv/file/2037101

Posted in architecture, cloud computing, infrastructure, production, storage, technical

Tokyo train map meets Internet powerhouses


Web Trend Map 4 Final Beta

Originally uploaded by formforce

Cute.

Posted in architecture

A sensible approach to source code branching

Source code branching is one of the most contentious activities you can engage in at a software company. For some reason that eludes me, I keep hearing the same arguments over and over again about why we should not use branches and how branching is hard. It’s not, neither conceptually nor practically; it simply requires being methodical and overcoming a visceral fear of the *Merge*. It works more or less with all current tools, with CVS probably the hardest to deal with and the latest batch of distributed source control systems the easiest.

One of the primary problems that Feature Crews address is the difficulty of maintaining the integrity of very large code bases under development (imagine 1000 developers coding against a 10,000,000 line system). FC poses the problem as the tension between a) keeping the main branch as current as possible, and b) keeping the main branch as robust as possible. The FC solution is to make features an atomic transaction. A feature is either 0% complete or 100% complete, and a feature is not 100% complete until it can be demonstrated that it satisfies the same quality criteria as the rest of the main branch.

Here’s an excerpt from Lean Software Engineering. FC in this context means “feature crew”.

“Features-in-process are not allowed on the main branch. The FC alternative is branch-by-feature. A crew takes a branch when it takes possession of the feature kanban. The crew is responsible for forward-integrating any changes that are checked into main while their feature is in process. That is, if another crew integrates and breaks your feature-in-process, it’s your responsibility, not theirs. When your feature is finally complete AND you have integrated with all changes on main AND you pass all of the quality gates, THEN you can reverse integrate your feature into the main branch, and everybody else will have to forward integrate your changes.”

Here it is: use branches extensively, merge back and forth. It takes some time, a bit of practice, but it puts to rest these endless discussions about whether we should branch, when and what for.

Posted in technical

Robots

Somewhat unrelated to the topic of data crunching and computing, I wanted to mention an eye-opening book about robots: Wired for War by P.W. Singer.

Posted in Uncategorized

Fun but not practical: cloud computing steganography

Differential Power Analysis is a neat way to cryptanalyze smart cards, and it triggered an interesting countermeasure: keeping power consumption constant regardless of the computation performed. Moving to a bigger scale and assuming low cloud compute costs, one could hide sensitive data processing in one VM by running ninety-nine others on slightly different data, and silently discarding their results.
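As a rough illustration of the decoy idea, here is a minimal in-process sketch. No VMs are launched: each call to `job` stands in for one VM. The names `run_with_decoys`, `perturb` and `expensive_job` are hypothetical. It makes no claim about what an observer could actually infer; it only shows the bookkeeping of running one real job among ninety-nine decoys in random order and keeping a single result.

```python
import random


def run_with_decoys(job, real_input, make_decoy, n_decoys=99, rng=None):
    """Run `job` on the real input and on n_decoys perturbed inputs, in random order."""
    rng = rng or random.Random()
    # Index 0 is the real input; the rest are slightly different decoys.
    inputs = [real_input] + [make_decoy(real_input, rng) for _ in range(n_decoys)]
    order = list(range(len(inputs)))
    rng.shuffle(order)  # the real job should not always run first
    result = None
    for i in order:
        out = job(inputs[i])  # stand-in for launching one VM
        if i == 0:
            result = out  # keep only the real result
        # decoy outputs are dropped silently
    return result


def perturb(value, rng):
    """Hypothetical decoy generator: shift the real input by a small random offset."""
    return value + rng.randint(1, 1000)


def expensive_job(x):
    """Placeholder for the sensitive computation."""
    return x * x


print(run_with_decoys(expensive_job, 42, perturb, rng=random.Random(0)))  # prints 1764
```

The cost is obvious: a hundred times the compute for one useful answer, which is why this is fun rather than practical.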

Posted in architecture, cloud computing