An Awk Primer/Awk Program Files


Sometimes an Awk program needs to be used repeatedly. In that case, it's simple to execute the Awk program from a shell script. For example, consider an Awk script to print each word in a file on a separate line. This could be done with a script named "words" containing:

   awk '{c=split($0, s); for(n=1; n<=c; ++n) print s[n] }' $1

"Words" could then be made executable (using "chmod +x words") and the resulting shell "program" invoked just like any other command. For example, "words" could be invoked from the "vi" text editor as follows:

   :%!words

This would turn all the text into a list of single words.
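As a quick check outside the editor, the same one-liner can be run on a line of text from standard input. This is only a minimal sketch; the sample text and the variable name "result" are made up for illustration:

```shell
# Feed one line of sample text to the "words" one-liner on standard input.
# (The sample text and the variable name "result" are illustrative only.)
result=$(printf 'the quick brown fox\n' |
   awk '{c=split($0, s); for(n=1; n<=c; ++n) print s[n] }')
echo "$result"
```

Each of the four words should come out on its own line.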

For another example, consider the double-spacing program mentioned previously. This could be slightly changed to accept standard input, using a "-" as described earlier, then copied into a file named "double":

   awk '{print; if (NF != 0) print ""}' -

It could then be invoked from "vi" to double-space all the text in the editor.
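The "NF != 0" test means that blank lines already present in the input are passed through without an extra blank line being added after them. A small sketch with made-up sample text shows the effect:

```shell
# Sample input: "a", an existing blank line, then "b".
# (The sample text and the variable name "result" are illustrative only.)
result=$(printf 'a\n\nb\n' |
   awk '{print; if (NF != 0) print ""}' -)
echo "$result"
```

Here "a" and "b" should each be followed by a new blank line, while the blank line that was already there is printed once and not doubled.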

The next step would be to provide the reverse operation as well: to take a double-spaced file and return it to single-spaced, using a companion command named:

   undouble

The first part of the task is, of course, to design a way of stripping out the extra blank lines, without destroying the spacing of the original single-spaced file by taking out all the blank lines. The simplest approach would be to delete every other blank line in a continuous block of such blank lines. This won't necessarily preserve the original spacing, but it will preserve spacing in some form.

The method for achieving this is also simple, and involves using a variable named "skip". This variable is set to "1" every time a blank line is skipped, to tell the Awk program NOT to skip the next one. The scheme is as follows:

   BEGIN {set skip to 0}
   scan the input:
      if skip == 0    if line is blank
                         skip = 1
                      else
                         print the line
                      get next line of input
      if skip == 1    print the line
                      skip = 0
                      get next line of input

This translates directly into the following Awk program:

   BEGIN      {skip = 0}
   skip == 0  {if (NF == 0) 
                {skip = 1} 
               else 
                {print}; 
               next}
   skip == 1  {print; 
               skip = 0;
               next}

This program could be placed in a separate file, named, say, "undouble.awk", with the shell script "undouble" written as:

   awk -f undouble.awk
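To try the round trip, the program can be written to a file and loaded with "-f". The following is a sketch under a few assumptions: the scratch directory created with "mktemp -d", the sample text, and the variable names are not part of the original scripts.

```shell
# Write the program to a scratch file so that "awk -f" can load it.
# (The scratch directory and the sample text are assumptions for this sketch.)
dir=$(mktemp -d)
cat > "$dir/undouble.awk" <<'EOF'
BEGIN      {skip = 0}
skip == 0  {if (NF == 0) {skip = 1} else {print}; next}
skip == 1  {print; skip = 0; next}
EOF

# Double-space three lines, then pipe the result through undouble.awk.
result=$(printf 'one\ntwo\nthree\n' |
   awk '{print; if (NF != 0) print ""}' - |
   awk -f "$dir/undouble.awk")
echo "$result"
rm -r "$dir"
```

The three lines should come back single-spaced, since each blank line added by the first Awk command is skipped by the second.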

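It could also be embedded directly in the shell script, using single quotes to enclose the program and a backslash ("\") at the end of each line to continue it onto the next. (In Bourne-type shells such as "sh" or "bash", a single-quoted string may span several lines, so there the backslashes are optional.)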

   awk 'BEGIN      {skip = 0} \
        skip == 0  {if (NF == 0) \
                     {skip = 1}  \
                    else \
                     {print};  \
                    next} \
        skip == 1  {print; \
                    skip = 0; \
                    next}'

Remember that when "\" is used to embed an Awk program in a script file, the program appears as one line to Awk. A semicolon must be used to separate commands.

For a more sophisticated example, I have a problem that when I write text documents, occasionally I'll somehow end up with the same word typed in twice: "And the result was also also that ... " Such duplicate words are hard to spot on proofreading, but it is straightforward to write an Awk program to do the job. It scans through a text file looking for duplicates. If it finds one, it prints the duplicate word and the line it was found on; if it finds none, it prints "No duplicates found."

   BEGIN { dups=0; w="xy-zzy" }
         { for( n=1; n<=NF; n++) 
              { if ( w == $n ) { print w, "::", $0 ; dups = 1 } ; w = $n }
         } 
   END   { if (dups == 0) print "No duplicates found." }

The "w" variable stores each word in the file so that it can be compared to the next word in the file; "w" is initialized to "xy-zzy" since that is unlikely to be a word in the file. The "dups" variable is initialized to 0 and set to 1 if a duplicate is found; if it's still 0 at the end of the input, the program prints the "No duplicates found." message. As with the previous example, we could put this into a separate file or embed it into a script file.
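A quick trial run on the example sentence might look like the following sketch; the second input line and the variable name "result" are made up for illustration:

```shell
# Run the duplicate-word checker on two lines of sample text.
# (The sample text and the variable name "result" are illustrative only.)
result=$(printf 'And the result was also also that\nit worked.\n' |
   awk 'BEGIN { dups=0; w="xy-zzy" }
              { for( n=1; n<=NF; n++)
                   { if ( w == $n ) { print w, "::", $0 ; dups = 1 } ; w = $n }
              }
        END   { if (dups == 0) print "No duplicates found." }')
echo "$result"
```

This should report "also" along with the line containing it. Note that the comparison is an exact string match, so a pair such as "that that." would not be reported, because the punctuation makes the second word differ from the first. Since "w" carries over from one line to the next, a word repeated across a line break is reported as well.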

These last examples use variables to allow an Awk program to keep track of what it has been doing. Awk, as repeatedly mentioned, operates in a cycle: get a line, process it, get the next line, process it, and so on; to have an Awk program remember things between cycles, it needs to leave a little message for itself in a variable.

For example, say we want to match on a line whose first field has the value 1,000, but then print the next line rather than the matching line itself. We could do that as follows:

   BEGIN        {flag = 0}
   $1 == 1000   {flag = 1; 
                 next}
   flag == 1    {print; 
                 flag = 0;
                 next}

This program sets a variable named "flag" when it finds a line starting with 1,000, and then goes and gets the next line of input. The next line of input is printed, and then "flag" is cleared so the line after that won't be printed.

If we wanted to print the next five lines, we could do that in much the same way using a variable named, say, "counter":

   BEGIN         {counter = 0}
   $1 == 1000    {counter = 5;
                  next}
   counter > 0   {print; 
                  counter--;
                  next}

This program sets a variable named "counter" to 5 when it finds a line starting with 1,000; it then prints each of the following 5 lines of input, decrementing "counter" each time until it reaches zero.
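The counter program can be tried on a short list of lines. In this sketch the sample input (a line starting with 1000 followed by seven single-letter lines) and the variable name "result" are made up for illustration:

```shell
# One trigger line followed by seven more lines; only the first five
# lines after the trigger should be printed.
# (The sample input and the variable name "result" are illustrative only.)
result=$(printf '1000 start\na\nb\nc\nd\ne\nf\ng\n' |
   awk 'BEGIN         {counter = 0}
        $1 == 1000    {counter = 5; next}
        counter > 0   {print; counter--; next}')
echo "$result"
```

The lines "a" through "e" should be printed, and "f" and "g" dropped. Because the test for 1,000 comes first, another matching line that turns up inside the five-line window is not printed; it simply restarts the count at 5.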

This approach can be taken to as great a level of elaboration as needed. Suppose we have a list of, say, five different actions on five lines of input, to be taken after matching a line of input; we can then create a variable named, say, "state", that stores which item in the list to perform next. The scheme is generally as follows:

   BEGIN {set state to 0}
   scan the input:
      if match        set state to 1
                      get next line of input
      if state == 1   do the first thing in the list
                      state = 2
                      get next line of input
      if state == 2   do the second thing in the list
                      state = 3
                      get next line of input
      if state == 3   do the third thing in the list
                      state = 4
                      get next line of input
      if state == 4   do the fourth thing in the list
                      state = 5
                      get next line of input
      if state == 5   do the fifth (and last) thing in the list
                      state = 0
                      get next line of input

This is called a "state machine". In this case, it's performing a simple list of actions, but the same approach could also be used to perform a more complicated branching sequence of actions, such as we might have in a flowchart instead of a simple list.

We could assign state numbers to the blocks in the flowchart and then use if-then tests for the decision-making blocks to set the state variable to indicate which of the alternate actions should be performed next. However, few Awk programs require such complexities, and going into more elaborate examples here would probably be more confusing than it's worth. The essential thing to remember is that an Awk program can leave messages for itself in a variable on one line-scan cycle to tell it what to do on later line-scan cycles.
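As a small concrete case of the state-machine scheme, here is a sketch with only two actions instead of five. The input format is hypothetical: a line whose first field is "REC" is assumed to be followed by a name line and then a phone-number line, and the program joins the two. The marker "REC", the sample data, and the variable names are all invented for this illustration.

```shell
# Two-step state machine: after a "REC" line, save the next line as a name,
# then print it together with the line after that.
# (The "REC" marker and the sample data are hypothetical.)
result=$(printf 'junk\nREC\nAlice\n555-1234\nmore junk\nREC\nBob\n555-9876\n' |
   awk 'BEGIN        {state = 0}
        # match: arm the machine and fetch the next line
        $1 == "REC"  {state = 1; next}
        # first action: remember the name
        state == 1   {name = $0; state = 2; next}
        # second (and last) action: print the pair and reset
        state == 2   {print name ": " $0; state = 0; next}')
echo "$result"
```

Lines seen while "state" is 0 match none of the rules and are silently dropped, so only the two joined records should be printed.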