Ad Hoc Data Analysis From The Unix Command Line/Printable version
The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_Unix_Command_Line
Preliminaries
Formatting
These typesetting conventions will be used when presenting example interactions at the command line:
$ command argument1 argument2 argument3
output line 1
output line 2
output line 3
[...]
The "$ " is the shell prompt. What you type is shown in boldface; command output is in regular type.
Example data
I will use the following sample files in the examples.
The Unix password file
The password file can be found in /etc/passwd. Every user on the system has one line (record) in the file. Each record has seven fields separated by colon (':') characters. The fields are username, encrypted password, userid, default group, gecos, home directory and default shell. We can look at the first few lines with the head command, which prints just the first few lines of a file. Correspondingly, the tail command prints just the last few lines.
$ head -5 /etc/passwd
root:x:0:0:root:/:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
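The tail end of the file can be inspected the same way; the records you see will depend on the accounts on your system, so no output is shown here:
$ tail -5 /etc/passwd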
Census data
The US Census releases Public Use Microdata Samples (PUMS) on its website. We will use the 1% sample of Washington state's data, the file pums_53.dat, which can be downloaded here.
$ head -2 pums_53.dat H000011715349 53010 99979997 70 15872 639800 120020103814700280300000300409 02040201010103020 0 0 014000000100001000 0100650020 0 0 0 0 0000 0 0 0 0 0 05000000000004400000000010 76703521100000002640000000000 P00001170100001401000010420010110000010147030400100012005003202200000 005301000 000300530 53079 53 7602 76002020202020202200000400000000000000010005 30 53010 70 9997 99970101006100200000001047904431M 701049-20116010 520460000000001800000 00000000000000000000000000000000000000001800000018000208
Important note: The format of this data file is described in an Excel spreadsheet that can be downloaded here.
Developer efficiency vs. computer efficiency
The techniques discussed here are usually extremely efficient in terms of developer time, but generally less efficient in terms of compute resources (CPU, I/O, memory). This kind of brute force and ignorance may be inelegant, but when you don't yet understand the scope of your problem, it is usually better to spend 30 seconds writing a program that will run for 3 hours than vice versa.
The online manual
The "man" command displays information about a given command (colloquially referred to as the command's "man page"). The online man pages are an extremely valuable resource; if you do any serious work with the commands presented here, you'll eventually read all their man pages top to bottom. In Unix literature the man page for a command (or function, or file) is typically referred to as command(n). The number "n" specifies a section of the manual to disambiguate entries which exist in multiple sections. So, passwd(1) is the man page for the passwd command, and passwd(5) is the man page for the passwd file. On a Linux system you ask for a certain section of the manual by giving the section number as the first argument as in "man 5 passwd". Here's what the man command has to say about itself:
$ man man
man(1)                                                              man(1)

NAME
       man - format and display the on-line manual pages
       manpath - determine user's search path for man pages

SYNOPSIS
       man [-acdfFhkKtwW] [--path] [-m system] [-p string] [-C config_file]
       [-M pathlist] [-P pager] [-S section_list] [section] name ...

DESCRIPTION
       man formats and displays the on-line manual pages. If you specify
       section, man only looks in that section of the manual. name is
       normally the name of the manual page, which is typically the name of
       a command, function, or file.
[...]
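A related trick worth knowing: man -k (equivalent to the apropos command) searches the one-line descriptions of all man pages for a keyword, which helps when you know what you want to do but not which command does it. The output varies from system to system, so none is shown here:
$ man -k passwd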
Standard Input, Standard Output, Redirection and Pipes
"This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."—Doug McIlroy, the inventor of Unix pipes
The commands I'm going to talk about here are called filters. Data passes through them and they modify it a bit on the way. These commands read data from their "standard input" and write data to their "standard output." By default, standard input is your keyboard and standard output is your screen. For example, the tr command is a filter that translates one set of characters to another. This invocation of tr turns all lower case characters to upper case characters:
$ tr "[:lower:]" "[:upper:]"
hello
HELLO
i feel like shouting
I FEEL LIKE SHOUTING
[ctrl-d]
Ctrl-d is how you tell the command that you're done entering input at the keyboard.
You can tell your shell to connect standard output to a file instead of your screen using the ">" operator. The term for this is "redirection". One would talk about "redirecting" tr's output to a file. Later you can use the cat command to write the file to your screen.
$ tr a-z A-Z > tr_output
this is a test
[ctrl-d]
$ cat tr_output
THIS IS A TEST
Many Unix commands that take a file as an argument will read from standard input if not given a file. For example, the grep command searches a file for a string and prints the lines that match. If I wanted to find my entry in the password file I might say:
$ grep jrauser /etc/passwd
jrauser:x:7777:100:John Rauser:/home/jrauser:/bin/bash
But I could also redirect a file to grep's standard input using the "<" operator. You can see that the "<" and ">" operators are like little arrows that indicate the flow of data.
$ grep jrauser < /etc/passwd
jrauser:x:7777:100:John Rauser:/home/jrauser:/bin/bash
You can use the pipe "|" operator to connect the standard output of one command to the standard input of the next. The cat command reads a file and writes it to its standard output, so yet another way to find my entry in the password file is:
$ cat /etc/passwd | grep jrauser
jrauser:x:7777:100:John Rauser:/home/jrauser:/bin/bash
For a slightly more interesting example, the mail command will send a message it reads from standard input. Let's send my entry in the password file to me in an email.
$ cat /etc/passwd | grep jrauser | mail -s "passwd entry" jrauser@example.com
Using output with headers
In many situations, you end up with output that has a first line that is a header describing the data, and subsequent lines that are the data. An example is ps:
$ ps | head -5
  PID TTY           TIME CMD
22313 ttys000    0:00.86 -bash
31537 ttys001    0:00.06 -bash
22341 ttys002    0:00.28 -bash
70093 ttys002    0:00.00 head -5
If you wish to manipulate the data but not the header, use tail with the -n switch to start at line 2. For example:
$ ps | tail -n +2 | grep bash | head -5
22313 ttys000    0:00.86 -bash
31537 ttys001    0:00.06 -bash
22341 ttys002    0:00.28 -bash
70120 ttys002    0:00.00 -bash
This output shows only "bash" processes (because of the grep).
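An equivalent way to drop the header is sed's 1d command, which deletes line 1 and passes everything else through unchanged:
$ ps | sed 1d | grep bash | head -5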
Counting Part 1 - grep and wc
- "90% of data analysis is counting" - John Rauser
...well, at least once you've figured out the right question to ask, which is, perhaps, the other 90%.
Example - Counting the size of a population
The simplest command for counting things is wc, which stands for word count. By default, wc prints the number of lines, words, and characters in a file.
$ wc pums_53.dat
  85025 1219861 25659175 pums_53.dat
Nearly always we just want to count the number of lines (records), which can be done by giving the -l option to wc.
$ wc -l pums_53.dat
85025 pums_53.dat
Example - Using grep to select a subset
So, recalling that this is a 1% sample, were there 8.5 million people in Washington as of the 2000 census? Nope: the census data has two kinds of records, one for households and one for persons. The first character of a record, an H or P, indicates which kind of record it is. We can grep for and count just the person records like this:
$ grep -c "^P" pums_53.dat
59150
The caret '^' means that the 'P' must occur at the beginning of the line. So there were about 5.9 million people in Washington State in 2000. Also interesting: the average household had 59,150/(85,025-59,150) = 2.3 people.
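As a quick sanity check on that arithmetic, you can count the household records directly instead of subtracting; assuming H and P are the only record types in the file, this should agree with 85,025 - 59,150 = 25,875:
$ grep -c "^H" pums_53.dat
25875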
Picking The Data Apart With cut
Fixed width data
How many households had just 1 person? Referring to the file layout, we see that the 106th and 107th characters of a household record indicate the number of people in the household. We can use the cut command to pull out just that bit of data from each record. The argument -c106-107 instructs cut to print the 106th through 107th characters of each line. The head command prints just the first few lines of a file (or its standard input).
$ grep "^H" pums_53.dat | cut -c106-107 | head -5
03
02
03
02
02
You can give cut a comma separated list to pull out multiple ranges. To see the household id along with the number of occupants of the household:
$ grep "^H" pums_53.dat | cut -c2-8,106-107 | head -5
000011703
000024602
000231203
000242102
000250202
The -c argument is used for working with so-called "fixed-width" data: data where the columns of a record are found at fixed character offsets from the beginning of the record. Fixed-width data abounds on a Unix system. ls -l writes its output in a fixed-width format:
$ ls -l /etc | head -5
total 6548
-rw-r--r--    1 root     root           46 Dec  4 12:23 adjtime
drwxr-xr-x    4 root     root         4096 Oct  8  2003 alchemist
-rw-r--r--    1 root     root         1048 Aug 31  2001 aliases
-rw-r--r--    1 root     root        12288 Oct  8  2003 aliases.db
As does ps:
$ ps -u
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
jrauser  26870  0.0  0.1  2576 1388 pts/0    S    09:45   0:00 /bin/bash
jrauser   8943  0.0  0.0  2820  880 pts/0    R    12:58   0:00 ps -u
Returning to the question of how many one-person households there are in Washington:
$ grep "^H" pums_53.dat | cut -c106-107 | grep -c 01
7192
7,192 households, or about 28%, have only one occupant.
Delimited data
In delimited data, elements of a record are separated by a special 'delimiter' character. In the password file, fields are delimited by colons:
$ head -5 /etc/passwd
root:x:0:0:root:/:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
The 7th column of the password file is the user's login shell. How many people use bash as their shell?
$ cut -d: -f7 /etc/passwd | grep -c /bin/bash
170
You can give either -c or -f a comma separated list, so to see a few users that use tcsh as their shell:
$ cut -d: -f1,7 /etc/passwd | grep /bin/tcsh | head -5
iglass:/bin/tcsh
svowell:/bin/tcsh
dsedaris:/bin/tcsh
skine:/bin/tcsh
jhitt:/bin/tcsh
Tricky delimiters
The space character is a common delimiter. Unfortunately, your shell probably discards all extra whitespace on the command line. You can sneak a space character past your shell by wrapping it in quotes, like: cut -d" " -f 5
The tab character is another common delimiter. It can be hard to spot, because on the screen it just looks like any other white space. The od (octal dump) command can give you insight into the precise formatting of a file. For instance I have a file which maps first names to genders (with 95% probability). When casually inspected, it looks like fixed width data:
$ head -5 gender.txt
AARON       M
ABBEY       F
ABBIE       F
ABBY        F
ABDUL       M
But on closer inspection there are tab characters delimiting the columns:
$ od -bc gender.txt | head
0000000 101 101 122 117 116 040 040 040 040 040 040 011 115 012 101 102
          A   A   R   O   N                          \t   M  \n   A   B
0000020 102 105 131 040 040 040 040 040 040 011 106 012 101 102 102 111
          B   E   Y                          \t   F  \n   A   B   B   I
0000040 105 040 040 040 040 040 040 011 106 012 101 102 102 131 040 040
          E                          \t   F  \n   A   B   B   Y
0000060 040 040 040 040 040 011 106 012 101 102 104 125 114 040 040 040
                              \t   F  \n   A   B   D   U   L
0000100 040 040 040 011 115 012 101 102 105 040 040 040 040 040 040 040
                  \t   M  \n   A   B   E
The first thing to do is read your system's manpage on "cut": it may already delimit by tab by default. If not, it requires a bit of trickery to get a tab character past your shell to the cut command. First, many shells have a feature called tab completion; when you hit tab they don't actually insert a tab, instead they attempt to figure out which file, directory or command you want and type that instead. In many shells you can overcome this special functionality by typing a control-v first. Whatever character you type after the control-v is literally inserted. Like a space character, you need to protect the tab character with quotes or the shell will discard it like any other white space separating pieces of the command line.
So to get the ratio of male first names to female first names I might run the following commands. Between the double quotes I typed control-v and then hit tab.
$ wc -l gender.txt
5017 gender.txt
$ cut -d" " -f2 gender.txt | grep M | wc -l
1051
$ cut -d" " -f2 gender.txt | grep F | wc -l
3966
Apparently there's much more variation in female names than male names.
If your system's cut command delimits on tab, the above command becomes simply cut -f2 gender.txt.
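If your shell is bash, ksh93 or zsh, ANSI-C quoting is another way to sneak a literal tab past the shell without the control-v trick; for example, this reproduces the male name count from above:
$ cut -d$'\t' -f2 gender.txt | grep -c M
1051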
Joining The Data with join
Please note: join assumes that the input data is sorted on the key on which the join is going to take place.
Delimited data
In delimited data, elements of a record are separated by a special 'delimiter' character. In CSV files, fields are delimited by commas. Here are two small example files:
$ cat j1
1,a
1,b
2,c
2,d
2,e
3,f
3,g
4,h
4,i
5,j
$ cat j2
1,A
1,B
1,C
2,D
2,E
4,F
4,G
5,H
6,I
6,J
$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2
1,a,A
1,a,B
1,a,C
1,b,A
1,b,B
1,b,C
2,c,D
2,c,E
2,d,D
2,d,E
2,e,D
2,e,E
3,f,
3,g,
4,h,F
4,h,G
4,i,F
4,i,G
5,j,H
6,,I
6,,J
Explanation of options:
"-t ,"          Input and output field separator is "," (for CSV)
"-a 1"          Output a line for every line of j1 not matched in j2
"-a 2"          Output a line for every line of j2 not matched in j1
"-o 0,1.2,2.2"  Output field format specification:
                0 denotes the match (join) field (needed when using "-a")
                1.2 denotes field 2 from file 1 ("j1")
                2.2 denotes field 2 from file 2 ("j2")
Using the "-a" option creates a full outer join as in SQL.
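For contrast, leaving off the -a flags gives an ordinary inner join: the unpairable lines (key 3 from j1 and key 6 from j2) are simply dropped. Using the same example files:
$ join -t , j1 j2 | head -6
1,a,A
1,a,B
1,a,C
1,b,A
1,b,B
1,b,C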
This command must be given two and only two input files.
Multi-file Joins
To join several files you can loop through them.
$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2 > J
File "J" is now the full outer join of "j1", "j2".
$ join -t , -a 1 -a 2 -o 0,1.2,1.3,2.2 J j3 > J.tmp && mv J.tmp J
and so on through j4, j5, ... Two details matter here: the output must go to a temporary file and then be moved back (redirecting straight onto "J" would truncate it before join reads it), and the -o list must grow at each step so that the columns already accumulated in "J" are carried along.
For many files this is best done with a loop. The sketch below assumes GNU join, whose -o auto option outputs every column without spelling out the list, and whose -e option fills the missing fields of unpairable lines (here with empty strings):
$ for i in j3 j4 j5 ; do join -t , -a 1 -a 2 -e '' -o auto J "$i" > J.tmp && mv J.tmp J ; done
Sorted Data Note
join assumes that the input data has been sorted by the field to be joined. See the section on sort for details.
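If the inputs weren't already in key order, a pre-sorting step might look like this (the .unsorted file names are hypothetical; -k1,1 restricts the sort key to the first comma-separated field):
$ sort -t, -k1,1 j1.unsorted > j1
$ sort -t, -k1,1 j2.unsorted > j2
$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2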
Credits: Some text adapted from Ted Harding's email to the R mailing list.
Counting Part 2 - sort and uniq
So far we've seen how to use cut, grep and wc to select and count records with certain qualities. But each set of records we'd like to count requires a separate command, as with counting the numbers of male and female names in the most recent example. Combining the uniq and sort commands allows us to count many groups at once.
uniq and sort
The uniq command squashes out contiguous duplicate lines. That is, it copies from its standard input to its standard output, but if a line is identical to the immediately preceding line, the duplicate line is not written. For example:
$ cat foo
a
a
a
b
b
a
a
a
c
$ uniq foo
a
b
a
c
Note that 'a' is written twice because uniq compares only to the immediately preceding line. If the data is sorted first, we get each distinct record just once:
$ sort foo | uniq
a
b
c
Finally, giving the -c option causes uniq to write counts associated with each distinct entry:
$ sort foo | uniq -c
      6 a
      2 b
      1 c
Sorting a CSV file by an arbitrary column is easy as well:
$ cat file.csv
a, 10, 0.5
b, 20, 0.1
c, 14, 0.01
d, 55, 0.23
e, 94, 0.78
f, 1, 0.34
g, 75, 1.0
h, 3, 2.0
i, 12, 1.5
$ sort -n -t"," -k 2 file.csv
f, 1, 0.34
h, 3, 2.0
a, 10, 0.5
i, 12, 1.5
c, 14, 0.01
b, 20, 0.1
d, 55, 0.23
g, 75, 1.0
e, 94, 0.78
$ sort -n -t"," -k 3 file.csv
c, 14, 0.01
b, 20, 0.1
d, 55, 0.23
f, 1, 0.34
a, 10, 0.5
e, 94, 0.78
g, 75, 1.0
i, 12, 1.5
h, 3, 2.0
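One subtlety: "-k 2" makes the sort key run from field 2 to the end of the line. With -n it happens not to matter here, but to sort on field 2 alone you would give an end field as well:
$ sort -n -t"," -k 2,2 file.csv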
Example - Creating a frequency table
The combination of sort and uniq -c is extremely powerful. It allows one to create frequency tables from virtually any record oriented text data. Returning to the name to gender mapping of the previous chapter, we could have gotten the count of male and female names in one command like this:
$ cut -d" " -f2 gender.txt | sort | uniq -c
   3966 F
   1051 M
Example - Creating another frequency table
And returning to the census data, we can now easily compute the complete distribution of occupants per household:
$ grep "^H" pums_53.dat | cut -c106-107 | sort | uniq -c
   1796 00
   7192 01
   7890 02
   3551 03
   3195 04
   1391 05
    518 06
    190 07
     79 08
     39 09
     14 10
     14 11
      3 12
      3 13
Example - Verifying a primary key
This is a good opportunity to point out a big benefit of being able to play with data in this fashion. It allows you to quickly spot potential problems in a dataset. In the above example, why are there 1,796 households with 0 occupants? As another example of quickly verifying the integrity of data, let's make sure that household id is truly a unique identifier:
$ grep "^H" pums_53.dat | cut -c2-8 | sort | uniq -c | grep -v "^ *1 " | wc -l
0
This grep invocation will print only lines that do not (because of the -v flag) begin with a series of spaces followed by a 1 (the count from uniq -c) followed by a tab (entered using the control-v trick). Since the number of lines written is zero, we know that each household id occurs once and only once in the file.
The technique of grepping uniq's output for lines with a certain count is generally useful. One other common application is finding the set of overlapping (duplicated) keys in a pair of files by grepping the output of uniq -c for lines that begin with a 2.
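A sketch of that idea, assuming two hypothetical files file1.keys and file2.keys that each list their keys one per line with no duplicates within a file:
$ cat file1.keys file2.keys | sort | uniq -c | grep "^ *2 " | head -5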
Example - A frequency table sorted by most common category
Throwing an extra sort on the end of the pipeline will sort the frequency table so that the most common class is at the top (or bottom). This is useful when data is categorical and does not have a natural order. You'll want to give sort the -n option so that it sorts the counts numerically instead of lexically, and I like to give the -r option to reverse the sort so that the output is sorted in descending order, but this is just a stylistic issue. For example, here is the distribution of household heating fuel from most common to least common:
$ grep "^H" pums_53.dat | cut -c132 | sort | uniq -c | sort -rn
  12074 3
   7007 1
   3161
   1372 6
   1281 4
    757 2
    170 8
     43 9
      6 5
      4 7
Type 3, electricity, is most common, followed by type 1, gas. Type 7 is solar power.
Converting the frequency table to proper CSV
The output of uniq -c is not in proper CSV form. This makes it necessary to convert the output if further operations on it are wanted. Here we use a bit of inline perl to rewrite the lines and reverse the order of the fields.
$ cut -d" " -f2 gender.txt | sort | uniq -c | perl -pe 's/^\s*([0-9]+) (\S+).*/$2, $1/'
F, 3966
M, 1051
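If your cut delimits on tab by default (as discussed in the previous chapter), awk offers an equivalent rewrite, since it splits on whitespace and can print the fields in any order:
$ cut -f2 gender.txt | sort | uniq -c | awk '{print $2 ", " $1}'
F, 3966
M, 1051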
Rewriting The Data With Inline perl
"I'm reminded of the day my daughter came in, looked over my shoulder at some Perl 4 code, and said, 'What is that, swearing?'"
—Larry Wall
Command Line perl
A tutorial on perl is beyond the scope of this document; if you don't know perl, you should learn at least a little bit. If you invoke perl like perl -n -e '#a perl statement', the -n option causes perl to wrap your -e argument in an implicit while loop like this:
while (<>) {
    # a perl statement
}
This loop reads standard input a line at a time into the variable $_, and then executes the statement(s) given by the -e argument. Given -p instead of -n, perl adds a print statement to the loop as well:
while (<>) {
    # a perl statement
    print $_;
}
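For instance, here is -p in action, turning the password file's colons into commas; the substitution runs on each line and the modified $_ is printed automatically:
$ head -2 /etc/passwd | perl -pe 's/:/,/g'
root,x,0,0,root,/,/bin/bash
bin,x,1,1,bin,/bin,/sbin/nologin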
Example - Using perl to create an indicator variable
Education level is recorded in columns 53-54 as an ordered set of categories, where 11 and above indicates a college degree. Let's condense this to a single indicator variable for completed college or not. The raw data:
$ cat pums_53.dat | grep "^P" | cut -c53-54 | head -5
12
11
06
03
08
And once passed through the perl script:
$ cat pums_53.dat | grep "^P" | cut -c53-54 | perl -ne 'print $_>=11?1:0,"\n"' | head -5
1
1
0
0
0
And the final result:
$ cat pums_53.dat | grep "^P" | cut -c53-54 | perl -ne 'print $_>=11?1:0,"\n"' | sort | uniq -c
  37507 0
  21643 1
About 36% of Washingtonians have a college degree.
Example - computing conditional probability of membership in two sets
Let's look at the relationship between education level and whether or not people ride their bikes to work. People's mode of transportation to work is encoded as a series of categories in columns 191-192, where category 9 indicates a bicycle. We'll use an inline perl script to rewrite both education level and mode of transportation:
$ cat pums_53.dat | grep "^P" | cut -c53-54,191-192 | perl -ne 'print substr($_,0,2)>=11?1:0,substr($_,2,2)==9?1:0,"\n";' | sort | uniq -c
  37452 00
     55 01
  21532 10
    111 11
55/(55+37452) = 0.15% of non-college-educated people ride their bike to work. 111/(111+21532) = 0.51% of college-educated people ride their bike to work.
Sociological interpretation is left as an exercise for the reader.
Example - A histogram with custom bucket size
Suppose we wanted to take a look at the distribution of personal incomes. The normal trick of sort and uniq would work, but the personal income in the census data has resolution down to the $10 level, so the output would be very long and it would be hard to quickly see the pattern. We can use perl to round the income data down to the nearest $10,000 on the fly. Before the inline perl script:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | head -4
0018000
0004100
0004300
0005300
And after:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | perl -pe '$_=10000*int($_/10000)."\n"' | head -4
10000
0
0
0
And finally, the distribution (up to $100,000). The extra grep [0-9] ensures that blank records are not considered in the distribution.
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | perl -pe '$_=10000*int($_/10000)."\n"' | sort -n | uniq -c | head -12
     20 -10000
  15193 0
   8038 10000
   6776 20000
   5436 30000
   3685 40000
   2370 50000
   1536 60000
    899 70000
    521 80000
    326 90000
    283 100000
Example - Finding the median (or any percentile) of a distribution
If we sorted all the incomes in order and had a way to pluck out the middle number, we could easily get the median. I'll give two ways to do this. The first uses cat -n. If given the -n option, cat prepends line numbers to each line. We see that there are 46,359 non-blank records, so the 23179th one in sorted order is the median.
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | wc -l
46359
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | sort | cat -n | grep "^ *23179"
23179  0019900
An even simpler method, using head and tail:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | sort | head -23179 | tail -1
0019900
The median income in Washington state in 2000 was $19,900.
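The same trick gives any percentile. For example, a sketch of the 90th percentile, letting the shell compute 90% of the 46,359 records:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | sort | head -$((46359 * 90 / 100)) | tail -1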
Example - Finding the average of a distribution
What about the average? One way to compute the average is to accumulate a running sum with perl, and do the division by hand at the end:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | perl -ne 'print $sum+=$_,"\n";' | cat -n | tail -1
46359  1314603988
$1314603988 / 46359 = $28357.0393666818, or about $28,357.
You could also get perl to do this division with an END block which perl will execute only after it has exhausted standard input:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | perl -ne '$sum += $_; $count++; END {print $sum/$count,"\n";}'
28357.0393666818
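The same computation is a natural fit for awk as well; awk's default output format rounds to six significant digits, but the value is the same:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | awk '{sum += $1; n++} END {print sum/n}'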
Quick Plotting With gnuplot
Example - creating a scatter plot
Does the early bird get the worm? Let's look at the relationship between the time a person leaves for work and their income. Income is recorded in columns 297-303, and the time a person leaves for work is recorded in columns 196-198, encoded in ten minute intervals. This pipeline extracts, cleans and formats the data:
$ cat pums_53.dat | grep "^P" | cut -c196-198,297-303 | grep -v "^000" | grep -v " $" | perl -pe 'substr($_,3,0)=" ";' > time_vs_income
The greps knock out records for which either field is null, and the perl script inserts a space between the two columns so gnuplot can parse the columns apart. Plotting in gnuplot is simple:
$ gnuplot

        G N U P L O T
        Linux version 3.7
        patchlevel 1
        last modified Fri Oct 22 18:00:00 BST 1999

Terminal type set to 'x11'
gnuplot> plot 'time_vs_income' with points
And the resulting plot:
Recall that 0 on the x-axis is midnight, and 20 is 200 minutes after midnight, or about 3:20am. Increased density in the beginning of the traditional 1st and 2nd shift periods is apparent. Folks who work regular business hours clearly have higher incomes. It would be interesting to compute the average income in each time bucket, but that makes a pretty hairy command line perl script. Here it is in all its gruesome glory:
$ cat pums_53.dat | grep "^P" | cut -c196-198,297-303 | grep -v "^000" | grep -v " $" | perl -ne '/(\d{3})(\d{7})/; $sum{$1}+=$2; $count{$1}++; END { foreach $k (keys(%count)) {print $k," ",$sum{$k}/$count{$k},"\n"}}' | sort -n > time_vs_avgincome
You can plot the result for yourself if you're curious.
Example - Creating a bar chart with gnuplot
Let's look at historic immigration rates among Washingtonians. Year of immigration is recorded in columns 78-81, and 0000 means the person is a native born citizen. We can apply the usual tricks with cut, grep, sort, and uniq, but it's a bit hard to see the patterns when scrolling back and forth in text output; it would be nicer if we could see a plot.
$ cat pums_53.dat | grep "^P" | cut -c78-81 | grep -v 0000 | sort | uniq -c | head -10
      2 1910
      7 1914
     12 1919
      7 1920
      6 1921
      5 1922
      7 1923
      5 1924
      8 1925
Gnuplot is a fine graphing tool for this purpose, but it wants the category label to come first, and the count to come second, so we need to write a perl script to reverse uniq's output and stick the result in a file. See perlrun(1) for details on the -a and -F options to perl.
$ cat pums_53.dat | grep "^P" | cut -c78-81 | grep -v 0000 | sort | uniq -c | perl -lape 'chomp $F[-1]; $_ = join " ", reverse @F' > year_of_immigration
Now we can make a bar chart from the contents of the file with gnuplot.
gnuplot> plot 'year_of_immigration' with impulses
Here's the graph gnuplot creates:
Be a bit careful interpreting this plot: only people who are still alive can be counted, so it naturally goes up and to the right (people who immigrated more recently have a better chance of still being alive). That said, there seems to have been an increase in immigration after the end of World War II, and also a spike after the end of the Vietnam war. I remain at a loss to explain the spike around 1980; consult your local historian.
Appendices
Appendix A: pcalc source code
A perl read-eval-print loop. This makes a very handy calculator on the command line. Example usage:
$ pcalc 1+2
3
$ pcalc "2*2"
4
$ pcalc 2*3
6
Source:
#!/opt/third-party/bin/perl
use strict;

if ($#ARGV >= 0) {
    eval_print(join(" ", @ARGV));
} else {
    use Term::ReadLine;
    my $term = new Term::ReadLine 'pcalc';
    while ( defined ($_ = $term->readline("")) ) {
        s/[\r\n]//g;
        eval_print($_);
        $term->addhistory($_) if /\S/;
    }
}

sub eval_print {
    my ($str) = @_;
    my $result = eval $str;
    if (!defined($result)) {
        print "Error evaluating '$str'\n";
    } else {
        print $result, "\n";
    }
}
Appendix B: Random unfinished ideas
Ideas too good to delete, but that aren't fleshed out.
Micro shell scripts from the command line
Example - which .so has the object I want?
Using backticks
Example - killing processes by name
kill `ps auxww | grep httpd | grep -v grep | awk '{print $2}'`
Example - tailing the most recent log file in one easy step
tail -f `ls -rt *log | tail -1`
James' xargs trick
James uses echo with xargs and feeds one xargs' output into another xargs in clever ways to build up complex command lines.
tee(1)
perl + $/ == agrep
editExample - Finding duplicate keys in two files