Looking for a simple answer to this turned out to be an interesting exercise; so interesting, in fact, that I had many times asked this question of potential hires (If you're a potential hire, note that I don't ask this question any more). More often than not, if a candidate doesn't know of a canonical answer, their approach to finding a solution tells me far more than just how to get the median in MySql.
On a Linux command line, you want something like:
echo "SELECT MEDIAN(foo) FROM bar" | mysql -p mydb
echo "SELECT AVG(foo) FROM bar" | mysql -p mydb
The answer to this question is what potentially distinguishes a good engineer or scientist from a good manager. A good manager is focused on getting the job done, and does not care too much about exactly how it is done as long as she is confident it's being done efficiently. A good scientist or engineer is naturally curious, and wants to solve the problem a particular way just to know whether it can be done or not. The good engineer will most likely come up with one of the many excellent answers on the MySql forum. A good manager will likely say, "Is there a reason it has to be done in MySql? 'Cos I don't know the answer off the top of my head, but I'm sure I can find it soon enough if I browse a few minutes. If you just want the median efficiently, and don't care how I get it, I'd probably just use some combination of MySql and Perl/php/whatever:
echo SELECT foo FROM bar ORDER BY foo \
| mysql -N -p mydb \
| perl -e 'use POSIX qw(floor ceil);
@f = <>;
($a,$b) = (floor(@f/2-0.5), ceil(@f/2-0.5));
$median = ($f[$a]+$f[$b])/2;
print "$median\n";
'
- SELECT: The column must be ordered (doesn't matter which way). If the column is not indexed, it can be sorted externally using the Linux sort program with the same or similar complexity. Note the -N option to mysql which tells it to not print the column header.
- Perl: Perl and its POSIX floor() and ceil() functions calculate the middle two rows in such a way that they are the same for a table with an odd-number of rows. If the table has 5 rows for example, then the first number, obtaining by flooring 2.0 is 2 and the second number, obtained by lifting 2.0 is also 2, which is the middle row in a 5 row Perl array with rows (0,1,2,3,4). If the table had 4 rows, then the first number will be floor(1.5) = 1 and the second number will be ceil(1.5) which is 2. In either case, the average of the numbers in these two rows gives the correct median.
echo "SELECT foo FROM bar order by foo" | mysql -p mydb -N | median
echo "SELECT foo FROM bar" | mysql -p mydb -N | median-unsorted
use DBI;
use POSIX qw(floor ceil);
...
my $sth = dbh->prepare('SELECT foo FROM bar ORDER BY foo");
$sth->execute();
my @f = $sth->fetchrow_array();
my($a,$b) = (floor(@f/2-0.5), ceil(@f/2-0.5));
my $median = ($f[$a]+$f[$b])/2;
...
SELECT AVG(foo) FROM (SELECT foo FROM bar ORDER BY foo limit m,n) as x
Finally, lest you worry about the memory requirements imposed by sucking in an entire column into a Perl or other such array let me reassure you that (a) Perl's implementation of in-memory arrays is very efficient, (b) you can easily read in arrays with millions of rows without taking up more than a few tens of megabytes of memory and (c) if, in fact, your table had many billions of rows, you should be able to get a very good approximation to your true median by sampling every nth row (Why?). This last point is especially relevant to scalable data analysis. You can calculate your answer as a median of medians by farming out computations using such paradigms as MapReduce.