How about sub-second queries in Hadoop?

Two observations from talking and listening to people at the Hadoop Summit. First, Hadoop is used quite often to process clickstream data (in all fairness, I missed the talk about Hadoop used for genomics). Second, and as a corollary of the first, sub-second queries in Hive or Pig are not quite there yet. Since a Hive query translates into map and reduce tasks, the scheduling of those tasks, in addition to the sheer volume of data, determines response time. Pre-computing aggregates is a natural way to go, much like what is done for data warehouses.

Where these aggregates should be stored for consumption is a problem that could lead to hybrid solutions: process the data with Hadoop, then export the results to Postgres or Infobright to enjoy a more mature (but less scalable) run-time environment. That would give multi-terabyte daily processing and sub-second analytics, all of it open source.
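To make the idea concrete, here is a minimal sketch of the two halves: a batch step that rolls raw clickstream events up into hourly counts, and a serving step that loads those counts into a relational table and answers queries from it. Everything in it is an assumption for illustration: the event records, the `hourly_hits` table and the function names are made up, the in-memory list stands in for the output of a Hadoop job, and SQLite (from the Python standard library) stands in for Postgres or Infobright so that the example runs on its own.

```python
import sqlite3
from collections import Counter

# Hypothetical clickstream records: (ISO timestamp, page).
# In a real setup these would come out of a Hadoop job, not a Python list.
EVENTS = [
    ("2024-01-01T10:05:11", "/home"),
    ("2024-01-01T10:17:42", "/home"),
    ("2024-01-01T10:48:03", "/pricing"),
    ("2024-01-01T11:02:30", "/home"),
]


def aggregate_hourly(events):
    """Batch step: roll raw events up to (hour, page) -> hit count."""
    counts = Counter()
    for timestamp, page in events:
        hour = timestamp[:13]  # keep "YYYY-MM-DDTHH", drop minutes and seconds
        counts[(hour, page)] += 1
    return counts


def export_aggregates(conn, counts):
    """Export step: load the small aggregate table into the relational store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hourly_hits ("
        "hour TEXT, page TEXT, hits INTEGER, PRIMARY KEY (hour, page))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO hourly_hits (hour, page, hits) VALUES (?, ?, ?)",
        [(hour, page, hits) for (hour, page), hits in counts.items()],
    )
    conn.commit()


def hits_for_page(conn, page):
    """Serving step: answer from the aggregates, never from the raw events."""
    row = conn.execute(
        "SELECT COALESCE(SUM(hits), 0) FROM hourly_hits WHERE page = ?",
        (page,),
    ).fetchone()
    return row[0]


# SQLite in memory is only a stand-in for Postgres or Infobright.
conn = sqlite3.connect(":memory:")
export_aggregates(conn, aggregate_hourly(EVENTS))
```

The query only ever touches the aggregate table, whose size depends on the number of distinct hours and pages rather than on the number of raw events. That is what would make sub-second responses plausible, at the price of only being able to answer the questions the aggregates were built for.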

If you’ve done something like that, I’d be interested to know before I embark on a route where others have failed before.

About alq

Devops entrepreneur