Ad Hoc Data Analysis From The Unix Command Line/Rewriting The Data With Inline perl

I'm reminded of the day my daughter came in, looked over my shoulder at some Perl 4 code, and said, 'What is that, swearing?'

—Larry Wall

Command Line perl

A tutorial on perl is beyond the scope of this document; if you don't know perl, you should learn at least a little bit. If you invoke perl as perl -n -e '# a perl statement', the -n option causes perl to wrap your -e argument in an implicit while loop like this:

while (<>) {
   # a perl statement
}

This loop reads standard input (or any files named on the command line) a line at a time into the variable $_, and then executes the statement(s) given by the -e argument. Given -p instead of -n, perl adds a print statement to the loop as well:

while (<>) {
   # a perl statement
   print $_;
}

Example - Using perl to create an indicator variable

Education level is recorded in columns 53-54 as an ordered set of categories, where 11 and above indicates a college degree. Let's condense this to a single indicator variable for completed college or not. The raw data:

$ cat pums_53.dat | grep "^P" | cut -c53-54 | head -5
12
11
06
03
08

And once passed through the perl script:

$ cat pums_53.dat | grep "^P" | cut -c53-54 | 
perl -ne 'print $_>=11?1:0,"\n"' | head -5
1
1
0
0
0

And the final result:

$ cat pums_53.dat | grep "^P" | cut -c53-54 |
perl -ne 'print $_>=11?1:0,"\n"' | sort | uniq -c
37507 0
21643 1

About 37% of Washingtonians have a college degree (21643 out of 37507 + 21643 = 59150 person records).

Example - computing conditional probability of membership in two sets

Let's look at the relationship between education level and whether or not people ride their bikes to work. People's mode of transportation to work is encoded as a series of categories in columns 191-192, where category 9 indicates a bicycle. We'll use an inline perl script to rewrite both education level and mode of transportation:

$ cat pums_53.dat | grep "^P" | cut -c53-54,191-192 | 
perl -ne 'print substr($_,0,2)>=11?1:0,substr($_,2,2)==9?1:0,"\n";' | 
sort | uniq -c 
37452 00
   55 01
21532 10
  111 11

55/(55+37452) = 0.15% of non college educated people ride their bike to work. 111/(111+21532) = 0.51% of college educated people ride their bike to work.

Sociological interpretation is left as an exercise for the reader.

Example - A histogram with custom bucket size

Suppose we wanted to take a look at the distribution of personal incomes. The normal trick of sort and uniq would work, but the personal income in the census data has resolution down to the $10 level, so the output would be very long and it would be hard to quickly see the pattern. We can use perl to round the income data down to the nearest $10,000 on the fly. Before the inline perl script:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | head -4
0018000
0004100
0004300
0005300

And after:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | 
perl -pe '$_=10000*int($_/10000)."\n"' | head -4
10000
0
0
0

And finally, the distribution (up to $100,000). The extra grep [0-9] ensures that blank records are not considered in the distribution.

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] |
perl -pe '$_=10000*int($_/10000)."\n"' | sort -n | uniq -c | head -12
   20 -10000
15193      0
 8038  10000
 6776  20000
 5436  30000
 3685  40000
 2370  50000
 1536  60000
  899  70000
  521  80000
  326  90000
  283 100000

Example - Finding the median (or any percentile) of a distribution

If we sorted all the incomes and had a way to pluck out the middle number, we could easily get the median. I'll give two ways to do this. The first uses cat -n, which prepends a line number to each line. We see that there are 46,359 non-blank records, so the 23179th one in sorted order is (to within one record) the median.

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | wc -l
46359 
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | sort | 
cat -n | grep "^ *23179"
23179 0019900

An even simpler method, using head and tail:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | sort |
head -23179| tail -1
0019900

The median income in Washington state in 2000 was $19,900.

Example - Finding the average of a distribution

What about the average? One way to compute the average is to accumulate a running sum with perl, and do the division by hand at the end:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | 
perl -ne 'print  $sum+=$_,"\n";' | cat -n | tail -1
46359 1314603988

$1314603988 / 46359 = $28357.0393666818

You could also get perl to do this division with an END block, which perl executes only after it has exhausted its input:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | 
perl -ne '$sum += $_; $count++; END {print $sum/$count,"\n";}' 
28357.0393666818
