Ad Hoc Data Analysis From The Unix Command Line

      Once upon a time, I was working with a colleague who needed to do some quick data analysis to get a handle on the scope of a problem. He was considering importing the data into a database or writing a program to parse and summarize that data. Either of these options would have taken hours at least, and possibly days. I wrote this on his whiteboard:

      Your friends: cat, find, grep, wc, cut, sort, uniq

      Combined in pipelines, these simple commands can quickly answer the kinds of questions for which most people would turn to a database, if only the data were already in one. You can form and test hypotheses, often in seconds, about virtually any record-oriented data source.
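
To give a flavor of what combining these commands looks like, here is a minimal sketch. The file name events.tsv, its three tab-separated fields (date, user, action), and its contents are all made up for illustration; substitute your own record-oriented data.

```shell
# Hypothetical sample data: one record per line, tab-separated fields
# (date, user, action). Written to a scratch directory so nothing is clobbered.
tmpdir=$(mktemp -d)
printf '2013-01-05\talice\tlogin\n2013-01-05\tbob\tlogin\n2013-01-05\talice\tupload\n2013-01-06\talice\tlogin\n' > "$tmpdir/events.tsv"

# Question 1: how many records mention "login"?
# grep keeps the matching lines, wc -l counts them.
login_count=$(grep 'login' "$tmpdir/events.tsv" | wc -l)

# Question 2: which users appear most often?
# cut picks out field 2, sort groups identical values together,
# uniq -c counts each group, and sort -rn lists the largest count first.
top_users=$(cut -f 2 "$tmpdir/events.tsv" | sort | uniq -c | sort -rn)

echo "login records: $login_count"
echo "$top_users"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

Each question is a single pipeline, so changing the question usually means changing one command in the chain rather than starting over.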

      Intended audience

      You've logged in to a Unix box of some flavor and run some basic commands like ls, cd, and cat. If you don't know what the ls command does, you need a more basic introduction to Unix than I'm going to give here.

      Table of Contents

      1. Preliminaries
      2. Standard Input, Standard Output, Redirection and Pipes
      3. Counting Part 1 - grep and wc
      4. Picking The Data Apart With cut
      5. Joining The Data Together With join
      6. Counting Part 2 - sort and uniq
      7. Rewriting The Data With Inline perl
      8. Quick Plotting With gnuplot
      9. Appendices

      Last modified on 5 January 2013, at 23:52