« How to browse the Internet perfectly anonymously | Main | The kiss you can remember »

Calculating the median of a MySql table column

My staff are amazingly efficient at automating the various slices and dices of data for me. They like to do it in SQL scripts, MySql in specific, and as a consequence, the data I want to look at is automatically generated every night into various MySql tables. I, in turn, have my own scripts, and ye old console, to extract the information I want. However, one issue that bugged me early on was getting at the median of a set of numbers.

When analyzing data that has been collected for you, it is often useful to look at the median rather than the mean whenever you suspect the presence of outliers. As an example, a single McMansion valued at $12M can easily throw the mean home price significantly off in a small neighborhood. But the median is unlikely to change much. Simply put, the median is the middle number, or the average of the middle two numbers in an ordered list of numbers. How does one get it in MySql?

Looking for a simple answer to this turned out to be an interesting exercise; so interesting, in fact, that I had many times asked this question of potential hires (If you're a potential hire, note that I don't ask this question any more). More often than not, if a candidate doesn't know of a canonical answer, their approach to finding a solution tells me far more than just how to get the median in MySql.

On a Linux command line, you want something like:

  echo "SELECT MEDIAN(foo) FROM bar" | mysql -p mydb
analogous to the

  echo "SELECT AVG(foo) FROM bar" | mysql -p mydb
But alas, such a function is conspicuously missing in MySql. Part of the reason is probably to protect from gross inefficiencies injected by naive DBAs and programmers. If the column of interest is not indexed, then a median computation necessarily involves a sorting (which is log-linear in complexity). It is not uncommon for uninformed programmers to make unjustifiably liberal use of seemingly simple statements that hide an enormous amount of complexity. (For this reason, powerful high-level programming languages and tools that provide simple abstractions for complex algorithms should only ever be used by well-educated programmers who know what they are doing and what it will cost). But returning to the issue at hand, and knowing now that MySql doesn't provide a native function call for it, how does one calculate the median? There are several ways, and many good ones are described in the MySql Forum. But none is the pretty one-liner you want.

The answer to this question is what potentially distinguishes a good engineer or scientist from a good manager. A good manager is focused on getting the job done, and does not care too much about exactly how it is done as long as she is confident it's being done efficiently. A good scientist or engineer is naturally curious, and wants to solve the problem a particular way just to know whether it can be done or not. The good engineer will most likely come up with one of the many excellent answers on the MySql forum. A good manager will likely say, "Is there a reason it has to be done in MySql? 'Cos I don't know the answer off the top of my head, but I'm sure I can find it soon enough if I browse a few minutes. If you just want the median efficiently, and don't care how I get it, I'd probably just use some combination of MySql and Perl/php/whatever:

  echo SELECT foo FROM bar ORDER BY foo \
    | mysql -N -p mydb \
    | perl -e 'use POSIX qw(floor ceil);  
                @f = <>; 
                ($a,$b) = (floor(@f/2-0.5), ceil(@f/2-0.5));
                $median = ($f[$a]+$f[$b])/2; 
                print "$median\n";
      '
 
Here is the line-by-line breakdown:

  1. SELECT: The column must be ordered (doesn't matter which way). If the column is not indexed, it can be sorted externally using the Linux sort program with the same or similar complexity. Note the -N option to mysql which tells it to not print the column header.
  2. Perl: Perl and its POSIX floor() and ceil() functions calculate the middle two rows in such a way that they are the same for a table with an odd-number of rows. If the table has 5 rows for example, then the first number, obtaining by flooring 2.0 is 2 and the second number, obtained by lifting 2.0 is also 2, which is the middle row in a 5 row Perl array with rows (0,1,2,3,4). If the table had 4 rows, then the first number will be floor(1.5) = 1 and the second number will be ceil(1.5) which is 2. In either case, the average of the numbers in these two rows gives the correct median.
If you simply want to use the perl script as a unix command called median, you can grab the program from here for presorted input or here for unsorted input. With these tucked away in your personal bin directory, you can simply issue:

  echo "SELECT foo FROM bar order by foo" | mysql -p mydb -N | median
or

  echo "SELECT foo FROM bar" | mysql -p mydb -N | median-unsorted
If inside a Perl or Mod_perl script using Perl DBI, one could use the following fragment:

  use DBI;
  use POSIX qw(floor ceil);
  ...
  my $sth = dbh->prepare('SELECT foo FROM bar ORDER BY foo");
  $sth->execute();
  my @f = $sth->fetchrow_array();
  my($a,$b) = (floor(@f/2-0.5), ceil(@f/2-0.5));
  my $median = ($f[$a]+$f[$b])/2;
  ...
Of course, if the MySql table is indexed, one may also get at the median in a two step procedure completely inside of a MySql script, if that's what is needed - first selecting the total number of rows with "COUNT(*)" and then extracting from the ordered column the average from the middle or two middle rows since we know which these are, for example:

   SELECT AVG(foo) FROM (SELECT foo FROM bar ORDER BY foo limit m,n) as x
where you know m to identify the row number of the first of the 1 or 2 middle rows and n to be either 1 or 2 depending upon whether the total number of rows is odd or even respectively.

Finally, lest you worry about the memory requirements imposed by sucking in an entire column into a Perl or other such array let me reassure you that (a) Perl's implementation of in-memory arrays is very efficient, (b) you can easily read in arrays with millions of rows without taking up more than a few tens of megabytes of memory and (c) if, in fact, your table had many billions of rows, you should be able to get a very good approximation to your true median by sampling every nth row (Why?). This last point is especially relevant to scalable data analysis. You can calculate your answer as a median of medians by farming out computations using such paradigms as MapReduce.

TrackBack

TrackBack URL for this entry:
http://www.pandamatak.com/cgi-bin/mt/mt-tb.cgi/50

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About

This page contains a single entry from the blog posted on February 22, 2008 6:52 AM.

The previous post in this blog was How to browse the Internet perfectly anonymously.

The next post in this blog is The kiss you can remember.

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type 3.35