Can you process the entire Facebook network from a PC in your garage?

Over a decade ago, before I started my graduate work, my then-to-be advisor gave me a little book and asked that I not return to him until I'd read it cover to cover. This book, How not to get a PhD (apparently re-published on a more positive note since 2002), was an invaluable source of wisdom and advice to me not only before I started my grad work, but throughout it. It enumerated all the wrong reasons one might want to do a PhD. Now, a decade since my degree was granted, rolled up and squirreled away, I still find the wisdom in that little book relevant. Not so much for decisions about formal education, but in real life.
In today's agile world, software projects typically demand strategic choices among a plethora of possible solutions to most problems. Some solutions are appropriate for some problems and some aren't. It's important to choose a solution for all the right reasons, and it's doubly important not to choose it for all the wrong reasons. Human factors complicate this, because inappropriate solutions to some problems are appropriate and good solutions to other problems. And these "other" problems are sometimes faced by reputed technology companies, which lend "celebrity" status to the solutions. This leads the young engineer astray, and coerces them into thinking that just because Google or Microsoft have found great success with a particular paradigm, they ought to as well. Further, within most companies, there are bound to be a number of non-engineering execs whose bandwidth is almost completely soaked up by non-engineering matters. It is consequently easy to get their buy-in for adopting such technologies using that unfortunately untrue magic phrase, an egregiously extant engineering enchantment and the mother of all modus ponens if I can call it that: "If it is good enough for Google, it ought to be good enough for us." It behooves every company to have at least one theoretically and practically savvy engineer to protect precisely against this kind of thing happening, and to support decisions that are data-driven, rather than driven by subjective matters such as coolness and aesthetics. At worst, data-driven decisions are likely to prevent general resentment among staff - one person's sense of beauty may not always be another's. On the other hand, a truly creative idea or proposal is bound to bubble up and find general support in the numbers. Those who shirk experimentation, calculation and testing are the prophets of faith that every organization should strive to avoid hiring into engineering, preferring instead those who heed the voice of empirical reason.
In this article, I want to focus on two specific cool technologies that frequently go hand-in-hand: Google's Map-reduce (e.g. Hadoop) and Cloud Computing (e.g. EC2). For whatever reasons, possibly including those I've mentioned above, the temptation is high among many engineers today to use these sledgehammers to crack nuts. As an illustrative case in point, I'd like to take one of our typical problems and talk about how we address it at 33Across. Our customers often express amazement at how we manage to process their gargantuan data sets and produce results in short order, seemingly with ease. We are often asked probing questions that try to get us to divulge our core data-processing techniques. Needless to say, I won't be describing any of our proprietary procedures or secret sauces here. But a great deal can be said without going into such matters. Most of what we have done to get where we are is simply to follow published information you can find occasionally in any decent computing journal, but most of the time in good Computing-101 texts. Before we dive into using a framework that has any overhead associated with it, we always do quick back-of-the-envelope calculations (not unlike Rapleaf's nice analysis of whether to host or cloud) to determine if it will really be worth it.
What does it take to store the entire useful Facebook graph in memory on a single machine? And what would the specifications of such a machine be? Obviously we aren't talking super-computers here, but something you can buy off the shelf from Dell, or your local Fry's. Here's how we do our back-of-the-envelope calculations - Say that you want to store the graph as a vector of nodes, where the i'th element of your vector is node i. Having the luxury of being able to scan and compute on edges sequentially in our algorithms means that we can simply store the edge-lists for our nodes as singly linked lists of destination nodes. Actually we use a map rather than a singly linked list, but to keep the argument simple, assume this data structure to be a list for now. Also ignore, for now, the quantities we compute on the edges and the nodes themselves. Consider only the graph, which consists of nodes, edges and the normalized weight of each edge. How much space does it take to store the Facebook graph in RAM? Well, 200 million is an order of magnitude less than 4 billion, which is approximately the largest value storable in an unsigned 32-bit (four-byte) integer data type (32 bits give 2^32 bit patterns), so we can comfortably assume that our graph nodes can be indexed by an unsigned 32-bit integer. This means that each element of our singly linked list, which ought to contain a destination node, a weight on the edge, and a pointer to its sibling, will occupy exactly 16 bytes (pointers on a 64-bit Dell PowerEdge are 8 bytes in length, and we use a single-precision floating point number (4 bytes) to store an edge's weight). We're talking about 200M × 20 edges = 4B edges in all. At 16 bytes per edge, we're looking at approximately 64 billion bytes (that's 64 GB of RAM).
Even if you add in the overhead of a sentinel node for each of our linked lists, and the constant cost of storing 200M start pointers in a vector, you're still looking at under 70 GB to store the entire graph in memory.
Now, I want to point out that today, computers with this much main memory are not only available, they're relatively inexpensive. While companies like Virident pride themselves on building eco-friendly green machines with gigantic amounts (over 256 GB per blade) of solid-state RAM with read latencies comparable to DIMMs, for only slightly more modest tasks than processing Facebook's entire graph, one can get away with far cheaper solutions. A dual-Xeon Quad-core Dell PowerEdge 1950 III, which comes in a slim 1U form factor, complete with cable management cages and rack-mount rails, can be purchased for about $3K, especially if you're on good terms with the company rep :-). Quoting from Dell's website, a "2-processor Quad Core Intel® Xeon® X5470, 2X6MB Cache, 3.33GHz, 1333MHz FSB" PowerEdge 1950 III (1U form factor) costs about $4K, and adding in 32 GB of RAM from Kingston still prices the machine at just slightly over $5K. There you have it - 8 computing cores and 32 GB of RAM - allowing for the OS and other house-keeping programs, that's still enough juice to load a graph with almost two billion edges (just under half the size of Facebook). Finally, if you really need it, you can, for not a whole lot more money, buy 2U servers with up to 192 GB of RAM and 16 cores, allowing you to store the entire Facebook graph and still leave much room to spare.
Now, it's all well and good that you can store such a graph in memory on a single machine in your garage, but how efficiently can you process it? Evidently very efficiently. Let's take edge weight normalization as an example. I pick this because it's an operation that needs to touch every edge on the graph, and touch each edge twice. It is not an operation that can be optimized much by most compilers. While you can calculate the number of nanoseconds it will take to perform each touch-and-go from published FLOPS specs on the hardware, we resort to extrapolating results from simple empirical tests. On the Dell we cited above, it takes about 25 nanoseconds for a single touch-and-go (accumulating the edge weight into a normalizer). A second touch-and-go will be required to divide by this normalizer, and so we're talking approximately 50 nanoseconds per edge. At this rate, normalizing a graph of 1B edges ought to take 50 × 10⁻⁹ s × 10⁹ edges = 50 seconds (less than a minute). Normalization can be efficiently parallelized over the nodes in the graph, and so if you run 8 concurrent threads, you're only looking at slightly over 6 seconds to normalize the entire graph. And that's less time than spent waiting for some social network pages to load!
Lest this article be misinterpreted as advocating against using the cloud or map-reduce, let me clarify by explicitly stating that's not the case. As Don Knuth piercingly observed, premature optimization is the root of all evil. Map-reduce is an optimization technique that buys efficiency through a clever divide-and-conquer approach. But to know what to divide, and how, and whether to at all, one needs to first measure, estimate and calculate using tried and trusted techniques. Most of the time, a bigger home is not the right answer to clutter in a small one. An engineer who cannot write efficient code on a single computer most likely won't be able to write efficient code in the cloud. Only, it will be much harder to tell. And so in summary, let me leave you with this parting thought - one must first learn to walk properly on the ground before trying to fly in the clouds.