Can you process the entire Facebook network from a PC in your garage?
Over a decade ago, before I started my graduate work, my then-to-be
advisor gave me a little book and asked that I not return to him
until I had read it cover to cover. This book, How not to get a PhD
(apparently republished on a more positive note since 2002), was an
invaluable source of wisdom and
advice to me not only before I started my grad work, but even during
it. It enumerated all the wrong reasons why one would want to do a
PhD. Now, a decade since my degree was granted, rolled up and
squirreled away, I still find the wisdom in that little book
relevant. Not so much in making decisions about matters to do with
formal education, but in real life.
In today's agile world, software projects typically demand strategic
choices over a plethora of possible solutions to most problems. Some
solutions are appropriate for some problems and some aren't. It's
important to choose a solution for all the right reasons, and it's
doubly important to not choose it for all the wrong reasons. Human
factors complicate this issue because inappropriate solutions to some
problems are appropriate and good solutions to other problems. And
these "other" problems are sometimes faced by reputed technology
companies, who lend "celebrity" status to the solutions. This leads
the young engineer astray, and coerces them into thinking that just
because Google or Microsoft have found great success with a particular
paradigm, they ought to as well. Further, within most companies,
there are bound to be a number of non-engineering execs whose
bandwidth is almost completely soaked up by non-engineering matters.
It is consequently easy to get their buy-in for adopting such
technologies using that unfortunately untrue magic phrase, an
egregiously extant engineering enchantment and the mother of all
modus ponens, if I can call it that: "If it is good enough for
Google, it ought to be good enough for us". It behooves every company to have at least
one theoretically and practically savvy engineer to protect precisely
against this kind of thing happening. And to support decisions that
are data-driven, rather than driven by subjective matters such as
coolness and esthetics. At the very least, data-driven decisions are
likely to prevent general resentment among staff, since one
person's sense of beauty may not always be another's. On the other
hand, a truly creative idea or proposal is bound to bubble up and find general
support in the numbers. Those who shirk experimentation,
calculation and testing are the prophets of faith that every
organization should strive to avoid hiring into engineering, in
favor of those who support the voice of empirical reason.
In this article, I want to focus on two specific cool
technologies that frequently go hand-in-hand: Google's MapReduce
(e.g. Hadoop), and Cloud Computing (e.g. EC2). For whatever
reasons, possibly including those I've mentioned above, temptation is
high among many engineers today to use these sledgehammers to crack
nuts. As an
illustrative case in point, I'd like to take one of our typical
problems and talk about how we address it at 33Across. Our customers often
express amazement at how we manage to process their gargantuan data
sets and produce results in short order, seemingly with ease. We are
often asked probing questions that try to get us to divulge our core
data-processing techniques. Needless to say, I won't be describing
any of our proprietary procedures or secret sauces here. But a great
deal can be said without going into such matters. Most of what we
have done to get where we are is simply to follow published
information that you can occasionally find in any decent computing
journal, but most of the time in good Computing-101 texts. Before we dive into
using a framework that has any overhead associated with it, we always
do quick back-of-the-envelope calculations (not unlike Rapleaf's
nice analysis of whether to host or cloud) to determine if it will
really be worth it.
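
To give a flavor of what such a back-of-the-envelope calculation
might look like, here is a minimal sketch in Python. It is not our
actual procedure. The function name and every number in it (edge
count, bytes per edge, RAM size, disk throughput) are hypothetical
placeholders that you would replace with your own measurements. It
only asks two questions: does the data fit in one machine's memory,
and how long would a single sequential pass over it take?

```python
def single_machine_estimate(num_edges, bytes_per_edge, ram_bytes, disk_mb_per_s):
    """Rough sizing for one machine: total bytes, whether they fit in RAM,
    and the time for one sequential pass at the given disk throughput."""
    data_bytes = num_edges * bytes_per_edge
    fits_in_ram = data_bytes <= ram_bytes
    # Sequential scan time, taking 1 MB as 10**6 bytes.
    scan_seconds = data_bytes / (disk_mb_per_s * 1_000_000)
    return data_bytes, fits_in_ram, scan_seconds


# All inputs below are illustrative assumptions, not measurements.
data_bytes, fits_in_ram, scan_seconds = single_machine_estimate(
    num_edges=10_000_000_000,   # hypothetical number of graph edges
    bytes_per_edge=8,           # assumes two 4-byte node ids per edge
    ram_bytes=64 * 10**9,       # hypothetical 64 GB of RAM
    disk_mb_per_s=100,          # hypothetical sequential read speed
)

print(f"data size: {data_bytes / 10**9:.0f} GB")
print(f"fits in RAM: {fits_in_ram}")
print(f"one sequential pass: {scan_seconds / 60:.1f} minutes")
```

If the scan time comes out in minutes rather than days, the overhead
of a distributed framework deserves a hard second look before you
commit to it.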