103.2 Process text streams using filters

Candidates should be able to apply filters to text streams.


Key Knowledge Areas

  • Send text files and output streams through text utility filters to modify the output using standard UNIX commands found in the GNU textutils package.

Text Processing Utilities

Linux has a rich assortment of utilities and tools for processing and manipulating text files. In this section we cover some of them.

cat - cat is short for concatenate and is a Linux command used to write the contents of one or more files to standard output. cat is usually used in combination with other commands to manipulate a file, or on its own to get a quick idea of a file's contents. The simplest form of the command is:

# cat /etc/aliases

cat can take several options, the most commonly used being -n and -b, which number all output lines and only the non-empty lines respectively.
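
As a rough illustration of the difference between the two options, the sketch below numbers a small throwaway file both ways. The file name and its contents are made up for the example.

```shell
# Build a small sample file in a temporary directory (hypothetical content).
tmpdir=$(mktemp -d)
printf 'first\n\nsecond\n' > "$tmpdir/sample.txt"

# -n numbers every line, including the empty one in the middle.
all_numbered=$(cat -n "$tmpdir/sample.txt")

# -b numbers only the non-empty lines; the empty line is passed through.
nonblank_numbered=$(cat -b "$tmpdir/sample.txt")

printf '%s\n---\n%s\n' "$all_numbered" "$nonblank_numbered"
rm -r "$tmpdir"
```

With -n the last line is numbered 3; with -b it is numbered 2, because the empty line is skipped by the counter.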


head and tail - The utilities head and tail are often used to examine log files. By default they output 10 lines of text. Here are the main usages.

List 20 first lines of /var/log/messages:

# head -n 20 /var/log/messages

# head -20 /var/log/messages


List 20 last lines of /etc/aliases:

# tail -20 /etc/aliases

The tail utility has an added option that allows one to list the end of a text starting at a given line.


List text starting at line 25 in /var/log/messages:

# tail -n +25 /var/log/messages

Finally, tail can follow a file as it grows using the -f option. This is most useful when you are watching live log files.
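
The three usages above can be tried safely on a generated file instead of a real log. This is only a sketch: the file numbers.txt is a stand-in created with seq, so each line's content equals its line number.

```shell
# Create a 30-line file where line N contains the number N.
tmpdir=$(mktemp -d)
seq 1 30 > "$tmpdir/numbers.txt"

# First three lines.
first_three=$(head -n 3 "$tmpdir/numbers.txt")

# Last three lines.
last_three=$(tail -n 3 "$tmpdir/numbers.txt")

# Everything from line 28 to the end (note the + sign).
from_line_28=$(tail -n +28 "$tmpdir/numbers.txt")

printf '%s\n' "$first_three" "$last_three" "$from_line_28"
rm -r "$tmpdir"
```

In a 30-line file, tail -n 3 and tail -n +28 select the same lines: the first counts back from the end, the second counts forward from the start.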


wc - The wc utility counts the number of bytes, words, and lines in files. Several options allow you to control wc's output.

Options for wc
-l count number of lines
-w count number of words
-c count number of bytes
-m count number of characters
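
A minimal sketch of these options on a two-line sample file (the file name and text are invented for the example). Reading from standard input with < keeps the file name out of wc's output.

```shell
tmpdir=$(mktemp -d)
printf 'one two\nthree\n' > "$tmpdir/words.txt"

# Count lines, words and bytes separately.
lines=$(wc -l < "$tmpdir/words.txt")
words=$(wc -w < "$tmpdir/words.txt")
bytes=$(wc -c < "$tmpdir/words.txt")

echo "lines=$lines words=$words bytes=$bytes"
rm -r "$tmpdir"
```

The byte count includes the two newline characters, which is why it comes to 14 rather than 12.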


nl - The nl utility numbers the lines of a file. By default its output is essentially the same as that of cat -b: only non-empty lines are numbered.

Number all lines including blanks:

# nl -ba /etc/lilo.conf


Number only lines with text:

# nl -bt /etc/lilo.conf
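
Since /etc/lilo.conf may not exist on a given system, the same two commands can be tried on a small generated file. The sketch below assumes a hypothetical three-line file with an empty line in the middle.

```shell
tmpdir=$(mktemp -d)
printf 'alpha\n\nbeta\n' > "$tmpdir/sample.txt"

# -ba: number all lines, including the empty one.
numbered_all=$(nl -ba "$tmpdir/sample.txt")

# -bt: number only lines that contain text (this is also nl's default).
numbered_text=$(nl -bt "$tmpdir/sample.txt")

printf '%s\n---\n%s\n' "$numbered_all" "$numbered_text"
rm -r "$tmpdir"
```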


expand/unexpand - The expand command is used to replace TABs with spaces. One can also use unexpand for the reverse operations.
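
A short sketch of the round trip, assuming tab stops every 4 columns (set with -t; the default is 8). The -a option tells unexpand to convert runs of blanks anywhere in the line, not only leading ones.

```shell
# A line with one TAB between "a" and "b".
tabbed=$(printf 'a\tb')

# expand pads with spaces up to the next tab stop (column 4 here).
spaced=$(printf '%s\n' "$tabbed" | expand -t 4)

# unexpand turns the run of spaces before the tab stop back into a TAB.
restored=$(printf '%s\n' "$spaced" | unexpand -a -t 4)

printf '[%s]\n' "$spaced"
```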

od - A number of tools are available for displaying the contents of binary files. The most common ones are od (octal dump) and hexdump.
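
As a small sketch of how od is typically used, the commands below dump a few bytes fed in through a pipe. -An suppresses the address column, -c shows printable characters and backslash escapes, and -tx1 shows each byte in hexadecimal. The exact column spacing may differ between od implementations.

```shell
# Show each byte as a character; TAB and newline appear as \t and \n.
dump_chars=$(printf 'A\tB\n' | od -An -c)

# Show each byte as a two-digit hexadecimal value (41 42 0a).
dump_hex=$(printf 'AB\n' | od -An -tx1)

printf '%s\n%s\n' "$dump_chars" "$dump_hex"
```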

split - The split tool can split a file into smaller files using criteria such as size or number of lines. For example, we can split /etc/passwd into smaller files containing 5 lines each:

# split -l 5 /etc/passwd

This will create files called xaa, xab, xac, xad ... Each file contains at most 5 lines; only the last one may hold fewer. It is possible to give a more meaningful prefix for the file names (other than x), such as passwd-5, on the command line:

# split -l 5 /etc/passwd passwd-5

This creates files with the same contents as the ones above (xaa, xab, xac, xad ...), but the names are now passwd-5aa, passwd-5ab, passwd-5ac, passwd-5ad ...
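
The behaviour can be checked on a throwaway file rather than /etc/passwd. In this sketch the 12-line input and the part- prefix are arbitrary choices made for the example.

```shell
tmpdir=$(mktemp -d)
seq 1 12 > "$tmpdir/input.txt"

# Split into pieces of at most 5 lines, named part-aa, part-ab, part-ac.
(cd "$tmpdir" && split -l 5 input.txt part-)

piece_count=$(ls "$tmpdir"/part-* | wc -l)
last_piece_lines=$(wc -l < "$tmpdir/part-ac")

# Concatenating the pieces in name order restores the original file.
rejoined_matches=no
cat "$tmpdir"/part-* | cmp -s - "$tmpdir/input.txt" && rejoined_matches=yes

echo "pieces=$piece_count last=$last_piece_lines rejoined=$rejoined_matches"
rm -r "$tmpdir"
```

Twelve lines give two full pieces of 5 lines and a final piece with the remaining 2.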


Erasing consecutive duplicate lines

The uniq tool will send to stdout only one copy of consecutive identical lines.

Consider the following example:

# uniq > /tmp/UNIQUE

line 1

line 2

line 2

line 3

line 3

line 3

line 1

^D


The file /tmp/UNIQUE has the following content:

# cat /tmp/UNIQUE

line 1

line 2

line 3

line 1


NOTE: From the example above we see that when using uniq, non-consecutive identical lines are still printed to STDOUT. Usually the input is sorted first so that identical lines all appear together.

# sort | uniq > /tmp/UNIQUE
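
The same input as in the interactive example can be fed through a pipe to compare uniq with and without sorting. This is a sketch only; the -c option, which prefixes each line with its number of occurrences, is shown as an extra.

```shell
input=$(printf 'line 1\nline 2\nline 2\nline 3\nline 3\nline 3\nline 1\n')

# Without sorting, only adjacent duplicates collapse, so "line 1" appears twice.
adjacent_only=$(printf '%s\n' "$input" | uniq)

# After sorting, every distinct line appears exactly once.
fully_unique=$(printf '%s\n' "$input" | sort | uniq)

# -c adds a count of occurrences in front of each line.
counts=$(printf '%s\n' "$input" | sort | uniq -c)

printf '%s\n---\n%s\n---\n%s\n' "$adjacent_only" "$fully_unique" "$counts"
```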

cut - The cut utility can extract a range of characters or fields from each line of a text. The -c option is used to cut based on character positions.

Syntax:

cut -c {range1,range2}

Example:

# cut -c5-10,15- /etc/passwd


The example above outputs characters 5 to 10 and 15 to the end of the line for each line in /etc/passwd. One can also specify the field delimiter of a file (a space, a comma, etc.) as well as the fields to output. These options are set with the -d and -f flags respectively.

Syntax:

cut -d {delimiter} -f {fields}

Example:

# cut -d: -f 1,7 --output-delimiter=" " /etc/passwd

This outputs fields 1 and 7 of /etc/passwd delimited with a space. The default output-delimiter is the same as the original input delimiter. The --output-delimiter option allows you to change this.
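
Both modes can be tried on a single passwd-style record. The record below is invented for the example, and --output-delimiter is a GNU cut extension that may not be available in other implementations.

```shell
# A made-up line in /etc/passwd format.
record='alice:x:1000:1000:Alice Example:/home/alice:/bin/bash'

# Character mode: positions 1-5 and 9-12.
chars=$(printf '%s\n' "$record" | cut -c1-5,9-12)

# Field mode: fields 1 and 7, still separated by the input delimiter.
fields=$(printf '%s\n' "$record" | cut -d: -f1,7)

# Same fields, but separated by a space on output (GNU cut only).
spaced=$(printf '%s\n' "$record" | cut -d: -f1,7 --output-delimiter=' ')

printf '%s\n%s\n%s\n' "$chars" "$fields" "$spaced"
```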

paste/join - The simpler of the two utilities is paste, which merges the corresponding lines of two files side by side, separated by a TAB.

Syntax:

paste text1 text2

With join you can further specify which fields you are considering.

Syntax:

join -j1 {field_num} -j2 {field_num} text1 text2 or

join -1 {field_num} -2 {field_num} text1 text2

A line is sent to stdout only if the specified fields match. Both files must be sorted on the join field: join reads them in step, one line at a time, so with unsorted input it can miss matches that appear later in a file.
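
A minimal sketch of the difference between the two tools, using two invented files that share a numeric key in their first field. Both files are already sorted on that key.

```shell
tmpdir=$(mktemp -d)
printf '1 alice\n2 bob\n3 carol\n' > "$tmpdir/names.txt"
printf '1 admin\n3 guest\n' > "$tmpdir/roles.txt"

# paste glues line N of each file together with a TAB, whatever the content.
side_by_side=$(paste "$tmpdir/names.txt" "$tmpdir/roles.txt")

# join pairs lines whose first fields are equal; key 2 has no partner and is dropped.
joined=$(join -1 1 -2 1 "$tmpdir/names.txt" "$tmpdir/roles.txt")

printf '%s\n---\n%s\n' "$side_by_side" "$joined"
rm -r "$tmpdir"
```

Note that paste puts "2 bob" next to "3 guest" because it only looks at line positions, whereas join matches on the key.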

sort - By default, sort will arrange a text in alphabetical order. To perform a numerical sort use the -n option.
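
The difference matters as soon as the numbers have different lengths. In the sketch below, LC_ALL=C is set only to make the character-by-character ordering predictable across locales.

```shell
numbers=$(printf '10\n9\n100\n2\n')

# Default sort compares character by character, so "100" comes before "2".
alphabetical=$(printf '%s\n' "$numbers" | LC_ALL=C sort)

# -n compares the lines as numbers.
numerical=$(printf '%s\n' "$numbers" | sort -n)

printf '%s\n---\n%s\n' "$alphabetical" "$numerical"
```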

Formatting output with fmt and pr

fmt is a simple text formatter that reformats text into lines of a specified length.

By default fmt joins short lines and refills each paragraph into lines of up to 75 characters. The line length can be changed with the options below.

fmt options

-w number of characters per line

-s split long lines but do not refill

-u place one space between each word and two spaces at the end of a sentence
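
A brief sketch of -w and -s on invented sample text. The exact places where fmt breaks a line are chosen by fmt itself, so the example only relies on the lines being no longer than the requested width.

```shell
text='the quick brown fox jumps over the lazy dog'

# Refill into lines of at most 20 characters.
wrapped=$(printf '%s\n' "$text" | fmt -w 20)

# Length of the longest output line, measured with awk.
longest=$(printf '%s\n' "$wrapped" | awk '{ if (length($0) > max) max = length($0) } END { print max }')

# By default short lines of one paragraph are joined...
refilled=$(printf 'short\nlines\n' | fmt)

# ...while -s only splits long lines and leaves short ones alone.
split_only=$(printf 'short\nlines\n' | fmt -s)

printf '%s\n(longest line: %s)\n' "$wrapped" "$longest"
```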

Long files can be paginated to fit a given size of paper with the pr utility. Text is broken into pages of a specified length and page headers are added. One can control the page length (default is 66 lines) and page width (default 72 characters) as well as the number of columns.


pr can also produce multi-column output.

When text is output in multiple columns, the page width is divided evenly between the columns and each line is truncated to fit its column. This means that characters are dropped unless the original text is edited to avoid this.
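
The column behaviour can be sketched with short generated input. The -t option, used here to keep the output small, suppresses the page header and trailing blank lines; -2 asks for two columns and -w sets the page width. The widths chosen are arbitrary.

```shell
# Six short lines laid out in two columns: the text fills the first
# column from top to bottom, then continues in the second.
two_columns=$(seq 1 6 | pr -2 -t)

# Two 26-character lines squeezed into a 20-character page: each line
# is cut to fit its column, so the end of each alphabet is lost.
truncated=$(printf 'abcdefghijklmnopqrstuvwxyz\nABCDEFGHIJKLMNOPQRSTUVWXYZ\n' | pr -2 -t -w 20)

printf '%s\n---\n%s\n' "$two_columns" "$truncated"
```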


tr - The tr utility translates one set of characters into another.

Example changing uppercase letters into lowercase:

# tr 'A-Z' 'a-z' < file.txt


Replacing delimiters in /etc/passwd:

# tr ':' ' ' < /etc/passwd
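
Both translations can also be tried on text supplied through a pipe, which avoids touching any real file. The sample strings below are invented for the sketch.

```shell
# Map every uppercase letter to its lowercase counterpart.
lowered=$(printf 'Hello World\n' | tr 'A-Z' 'a-z')

# Replace each colon with a space.
spaced=$(printf 'root:x:0:0\n' | tr ':' ' ')

printf '%s\n%s\n' "$lowered" "$spaced"
```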


NOTE: tr takes at most two character sets as arguments and no file name. It will not read from a file; it only reads standard input, which is why the examples above redirect the file with <.


sed - sed stands for stream editor and is used to manipulate text streams. It is most commonly used to transform text generated by other commands in bash scripts. sed is a complex tool that can take some time to master. Its most common use case is to find and replace text in an input stream. sed's output is written to standard output and the original file is left untouched, so the output needs to be redirected to a file to keep the changes.

The command:

# sed 's/linux/Linux/g' readme.txt > ReadMe.txt

will replace every occurrence of the word linux with Linux in the text read from readme.txt and write the result to ReadMe.txt. The g at the end of the command makes the replacement global, so sed processes the entire line and does not stop at the first occurrence of the word linux. For more information on sed refer to section 103.7.
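
The effect of the g flag, and the fact that the input file is not modified, can be sketched on a throwaway file. The file names below are placeholders for the example; the output name differs from the input by more than letter case so the sketch also behaves on case-insensitive file systems.

```shell
tmpdir=$(mktemp -d)
printf 'linux is linux\n' > "$tmpdir/readme.txt"

# Without g, only the first match on each line is replaced.
first_only=$(sed 's/linux/Linux/' "$tmpdir/readme.txt")

# With g, every match on each line is replaced.
every_match=$(sed 's/linux/Linux/g' "$tmpdir/readme.txt")

# Redirect the output to keep the result; the input file stays as it was.
sed 's/linux/Linux/g' "$tmpdir/readme.txt" > "$tmpdir/readme.new"
original=$(cat "$tmpdir/readme.txt")

printf '%s\n%s\n%s\n' "$first_only" "$every_match" "$original"
rm -r "$tmpdir"
```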



Used files, terms and utilities:

  • cat
  • cut
  • expand
  • fmt
  • head
  • od
  • join
  • nl
  • paste
  • pr
  • sed
  • sort
  • split
  • tail
  • tr
  • unexpand
  • uniq
  • wc

