On Data Analysis

Forget OLAP cubes and pivot tables! Most of the time, 90% of the insight can be gleaned from very simple data plots. To be sure, the remaining 10% does shed valuable new light, but it is also orders of magnitude more tedious to distill.

One of my favorite tools for quick sanity checks on data, and even for inferring high-level trends, is the histogram. It simply partitions your data points into a fixed number of buckets, each holding the points that fall within a given range. The resulting bucket sizes can then be eyeballed, often plotted as bars whose lengths are proportional to the number of elements in the corresponding buckets.
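To make the bucketing concrete, here is a minimal Python sketch. It is not the script this post refers to: the function names `histogram` and `render`, the choice to anchor fixed-width buckets at the smallest value, and the round-up rule for stars are all assumptions made for illustration.

```python
import math

def histogram(values, interval):
    """Count values into fixed-width buckets, the first starting at min(values).

    Assumes a non-empty list and interval > 0. Returns (start, end, count) tuples.
    """
    lo = min(values)
    nbuckets = int((max(values) - lo) / interval) + 1
    counts = [0] * nbuckets
    for v in values:
        # The bucket index is how many whole intervals v lies above the minimum.
        counts[int((v - lo) / interval)] += 1
    return [(lo + i * interval, lo + (i + 1) * interval, c)
            for i, c in enumerate(counts)]

def render(buckets, scale):
    """Format each bucket as a text bar; one '*' per `scale` items, rounded up."""
    lines = []
    for start, end, count in buckets:
        stars = "*" * math.ceil(count / scale)
        lines.append(f"{start:10.4f} - {end:10.4f} [{count:6d}]: {stars}")
    return lines

if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 0.9, 1.0]  # made-up sample points
    for line in render(histogram(data, interval=0.5), scale=2):
        print(line)
```

Rounding up means that any non-empty bucket shows at least one star, so sparse tails stay visible in the plot.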

When your data is generated by Unix/Linux scripts, as is typically the case with LAMP-based systems, migrating it into CSV tables to histogram in Excel, or even firing up your free copy of OpenOffice, is cumbersome overhead. A simple script will often suffice to generate compelling histograms.

Here is one such script that I wrote a long time ago. I have used it so many times that I feel someone else, somewhere else, is bound to benefit from it too.

As an example of its usage, here is the command line and output for generating a histogram over a set of 10K approximately Gaussian random numbers. (As a by-product, observe the cool trick exploiting the Central Limit Theorem to generate an approximately normally distributed random number using an awk one-liner. Normalized Gaussian randoms are useful for deliberately adding controlled noise to a process, for example in selecting the top 3 (with variety) out of a ranked list of 10 ads to show on a publication.)

$ gawk 'BEGIN {for(i=0;i<1e4;i++)print rand()+rand()+rand()}' \
   | histo -stars -scale 50 -interval 0.15
# NumSamples = 10000; Max = 2.94838; Min = 0.0360424
# Mean = 1.5030500454; Variance = 0.253799739614845; SD = 0.503785410283828
# Each * represents a count of 50
     0.0360 - 0.1860 [    13]: *
     0.1860 - 0.3360 [    58]: **
     0.3360 - 0.4860 [   125]: ***
     0.4860 - 0.6360 [   250]: *****
     0.6360 - 0.7860 [   374]: ********
     0.7860 - 0.9360 [   552]: ************
     0.9360 - 1.0860 [   758]: ****************
     1.0860 - 1.2360 [   942]: *******************
     1.2360 - 1.3860 [  1026]: *********************
     1.3860 - 1.5360 [  1130]: ***********************
     1.5360 - 1.6860 [  1145]: ***********************
     1.6860 - 1.8360 [  1006]: *********************
     1.8360 - 1.9860 [   854]: ******************
     1.9860 - 2.1360 [   652]: **************
     2.1360 - 2.2860 [   501]: ***********
     2.2860 - 2.4360 [   306]: *******
     2.4360 - 2.5860 [   174]: ****
     2.5860 - 2.7360 [   108]: ***
     2.7360 - 2.8860 [    25]: *
     2.8860 - 3.0360 [     1]: *
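
The one-liner sums three rand() values. Each is uniform on [0, 1) with mean 1/2 and variance 1/12, so the sum has mean 1.5 and variance 3/12 = 0.25 (SD 0.5), which agrees with the Mean, Variance, and SD reported above. Subtracting 1.5 and dividing by 0.5 therefore gives a roughly standard normal value, though one that can never fall outside ±3. The Python sketch below shows that normalization and the ad-selection idea. The names `approx_gaussian` and `noisy_top_k`, the seed, the scores, and the noise level are all hypothetical and chosen only for illustration.

```python
import random
import statistics

def approx_gaussian(rng, n_terms=3):
    """Approximate a standard normal by summing n_terms uniforms on [0, 1).

    The sum has mean n_terms/2 and variance n_terms/12, so subtracting the
    mean and dividing by the standard deviation standardizes it.
    """
    total = sum(rng.random() for _ in range(n_terms))
    return (total - n_terms / 2) / (n_terms / 12) ** 0.5

def noisy_top_k(scores, k, sigma, rng):
    """Return the indices of the k highest scores after adding noise of scale sigma."""
    noisy = [s + sigma * approx_gaussian(rng) for s in scores]
    order = sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)
    return order[:k]

if __name__ == "__main__":
    rng = random.Random(2008)  # arbitrary seed, used only for repeatability
    samples = [approx_gaussian(rng) for _ in range(10_000)]
    print("mean:", statistics.mean(samples), "sd:", statistics.pstdev(samples))

    # Hypothetical relevance scores for 10 ads, already ranked best-first.
    ad_scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
    print("top 3 with variety:", noisy_top_k(ad_scores, 3, sigma=0.15, rng=rng))
```

With `sigma` set to 0 the selection reduces to the plain top 3; larger values let lower-ranked ads appear more often.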

Enjoy!

About

This page contains a single entry from the blog posted on January 8, 2008 7:38 AM.