April 2009 Archives

April 20, 2009

Map-reduce, Hadoop and Clouds - When and When Not

Can you process the entire Facebook network from a PC in your garage?

Over a decade ago, before I started my graduate work, my then-to-be advisor gave me a little book and asked that I not return to him until I had read it cover to cover. This book, How not to get a PhD (apparently republished on a more positive note since 2002), was an invaluable source of wisdom and advice, not only before I started my grad work but also during it. It enumerated all the wrong reasons for wanting to do a PhD. Now, a decade after my degree was granted, rolled up and squirreled away, I still find the wisdom in that little book relevant, not so much in decisions about formal education as in real life.

In today's agile world, software projects typically demand strategic choices among a plethora of possible solutions to most problems. Some solutions are appropriate for some problems and some aren't. It's important to choose a solution for all the right reasons, and it's doubly important not to choose it for all the wrong reasons. Human factors complicate this issue, because a solution that is inappropriate for one problem may be a perfectly good solution to another. And these "other" problems are sometimes faced by well-known technology companies, who lend "celebrity" status to the solutions. This leads young engineers astray, coaxing them into thinking that just because Google or Microsoft has found great success with a particular paradigm, they ought to as well.

Further, within most companies, there are bound to be a number of non-engineering execs whose bandwidth is almost completely soaked up by non-engineering matters. It is consequently easy to get their buy-in for adopting such technologies using that unfortunately untrue magic phrase, an egregiously extant engineering enchantment and the mother of all modus ponens, if I can call it that: "If it is good enough for Google, it ought to be good enough for us". It behooves every company to have at least one theoretically and practically savvy engineer to protect against precisely this kind of thing, and to support decisions that are data-driven rather than driven by subjective matters such as coolness and esthetics.

At worst, data-driven decisions are likely to prevent general resentment among staff, since one person's sense of beauty may not always be another's. On the other hand, a truly creative idea or proposal is bound to bubble up and find general support in the numbers. Those who shirk experimentation, calculation and testing are the prophets of faith that every organization should strive to avoid hiring into engineering, in favor of those who support the voice of empirical reason.

In this article, I want to focus on two specific cool technologies that frequently go hand in hand: Google's Map-reduce (e.g. Hadoop) and Cloud Computing (e.g. EC2). For whatever reasons, possibly including those I've mentioned above, many engineers today are sorely tempted to use these sledgehammers to crack nuts. As an illustrative case in point, I'd like to take one of our typical problems and talk about how we address it at 33Across. Our customers often express amazement at how we manage to process their gargantuan data sets and produce results in short order, seemingly with ease. We are often asked probing questions that try to get us to divulge our core data-processing techniques. Needless to say, I won't be describing any of our proprietary procedures or secret sauces here. But a great deal can be said without going into such matters. Most of what we have done to get where we are is simply to follow published information, which you can occasionally find in any decent computing journal but most of the time in good Computing-101 texts. Before we dive into using a framework that has any overhead associated with it, we always do quick back-of-the-envelope calculations (not unlike Rapleaf's nice analysis of whether to host or cloud) to determine whether it will really be worth it.
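To give a flavor of what such a back-of-the-envelope calculation might look like, here is a minimal Python sketch. It is not our actual procedure, just one plausible way to ask "would this graph fit in the RAM of a single machine?" The function names, the compact adjacency-array layout, the 50% headroom factor, and above all the node and degree counts are assumptions made up for illustration; they are not real Facebook figures.

```python
def csr_graph_bytes(num_nodes, avg_degree, id_bytes=4, offset_bytes=8):
    """Rough memory estimate for a graph held as compact adjacency arrays.

    Assumed layout (hypothetical, for illustration only):
      - one flat array of neighbor IDs, id_bytes per entry
      - one offsets array with num_nodes + 1 entries, offset_bytes each
    avg_degree is the average number of neighbors listed per node, so an
    undirected edge is naturally counted once in each endpoint's list.
    """
    neighbor_entries = int(num_nodes * avg_degree)
    return neighbor_entries * id_bytes + (num_nodes + 1) * offset_bytes


def fits_in_ram(required_bytes, ram_bytes, headroom=0.5):
    """True if the data fits in the given fraction of RAM.

    headroom=0.5 is an arbitrary safety margin that leaves half of the
    memory for the OS and for working space.
    """
    return required_bytes <= ram_bytes * headroom


if __name__ == "__main__":
    # Placeholder inputs: substitute your own measurements.
    nodes = 200_000_000       # assumed number of users
    friends = 100             # assumed average friends per user
    ram = 128 * 2**30         # assumed machine with 128 GiB of RAM

    need = csr_graph_bytes(nodes, friends)
    print(f"estimated size: {need / 2**30:.1f} GiB")
    print("fits on one machine:", fits_in_ram(need, ram))
```

The point of the exercise is only that a few multiplications, done before any framework is installed, tell you whether you are facing a one-machine problem or a cluster problem.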


About April 2009

This page contains all entries posted to aBlog in April 2009. They are listed from oldest to newest.
