Guide to Unix/Commands/Text Processing

Unix supports multiple text processing commands.

awk

awk is a powerful text-processing tool using regular expressions, providing capabilities beyond those of cut and sed. You can learn more in the AWK and An Awk Primer Wikibooks.

Oneliner examples:

echo abcd |awk '/b.*d/'
- Outputs lines matching a regular expression, like the grep command.
echo abcd |awk '/b.*d/ {print $0}'
- Does the same as above, with an explicit print statement. $0 stands for the entire line.
echo ab cd |awk '/b.*d/ {print $2}'
- For lines matching a regular expression, outputs the second field. Uses a sequence of whitespace as a field separator by default. Thus, outputs "cd".
echo abcd,e |awk -F, '/b.*d/ {print $2}'
- For lines matching a regular expression, outputs the second field, using comma as the field separator due to the -F option. Thus, outputs "e".
echo abcd,e |awk '{print toupper($0)}'
- Outputs all the lines in uppercase. For lowercase, use "tolower".
echo a b c d | awk '{print $NF, $(NF-1)}'
- Outputs the last field and next-to-last field; NF is the number of fields.
echo ab cd | awk 'NF>=2'
- Outputs the lines where the number of fields is 2 or more.
echo ab cd | awk '{print NF}'
- For each line, outputs the number of fields in it.
echo ab cd | awk 'length($2) > 1'
- Outputs all lines such that the length of the 2nd field is greater than one.
echo ab cd | awk '$2 ~ /cd/'
- Outputs all lines whose 2nd field matches the regular expression.
echo ab cd | awk '$1 ~ /ab/ && $2 ~ /cd/'
- Like above, but with two subconditions connected by "&&". Supports some other logical operators known from the C programming language.
cat file.txt | awk 'NR >= 100 && NR <= 500'
- Outputs the lines (records) whose line numbers are in the specified range. Thus, acts as a line number filter.
cat file.txt | awk 'NR == 1 || /reg.*pattern/'
- Outputs the first line and lines matching the regular expression.
echo abcd | awk '{gsub(/ab/,"ef");print}'
- Replacement, also known as substitution, as in the sed command; "g" in gsub stands for global. Thus, outputs "efcd".
awk 'BEGIN{for(i=1;i<6;i++) print sqrt(i)}'
- Outputs the square roots of integers in 1, ..., 5. The use of BEGIN{} makes sure the code gets executed regardless of there being any input lines fed to awk. The for loop uses the familiar C language syntax, and sqrt is one of multiple math functions available.
awk 'BEGIN{printf "%i %.2f\n", 1, (2+3)/7}'
- Function printf familiar from the C language is supported, as a statement not requiring surrounding brackets around the arguments.
awk 'BEGIN{for(i=1;i<9;i++) a[i]=i^2; for(i in a) print i, a[i]}'
- Outputs the integers 1 to 8 and their squares with the help of awk associative arrays, also known as maps or dictionaries. The order of the output is indeterminate, as is usual with associative arrays. Awk has no direct analogue of the arrays and lists known from other programming languages.
cat file.txt | awk '{sum+=$1} END{print sum}'
- Outputs the sum of the values in 1st field (column) using END keyword.
awk 'BEGIN{system("dir")}'
- Runs an external command, dir; disabled in sandbox mode.
awk 'function abs(x) {return x < 0 ? -x : x} BEGIN{print abs(-4)}'
- Defines and uses absolute value function. Shows use of the ternary operator known from the C language.
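
The associative arrays and the END block from the examples above combine naturally into per-key totals. The following is a minimal sketch, assuming made-up sample data held in a shell variable named sales (a hypothetical name); the trailing sort is there only because the order of "for (name in total)" is indeterminate.

```shell
# Hypothetical sample data: a name and an amount on each line.
sales='alice 10
bob 5
alice 7
bob 1'

# total[$1] accumulates the 2nd field for each distinct 1st field.
# END runs once after the last input line; sort makes the order stable.
totals=$(printf '%s\n' "$sales" |
  awk '{total[$1] += $2} END {for (name in total) print name, total[name]}' |
  sort)

printf '%s\n' "$totals"
```

With this input, the result is one line per name: "alice 17" and "bob 6".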

Links:

awk, opengroup.org
awk man page, man.cat-v.org
The GNU Awk User’s Guide, gnu.org

comm

Outputs lines common to two files or unique to them, provided the files are sorted. If the files are not sorted, the output is indeterminate. Options control the manner of identification, e.g. outputting only common lines.

Examples:

seq 1 5 > file1; seq 3 7 > file2
- Outputs sequences 1...5 and 3...7 of integers to files to support the following examples of comm usage.
comm file1 file2
- If the files are sorted, outputs three columns, 1st column with lines unique to file1, 2nd column with lines unique to file2 and 3rd column with lines common to both files. The columns are tab-separated by default, but the tab is only used in the lead indents, and each line of the output has only one column filled.
comm -23 file1 file2
- If the files are sorted, outputs lines in file1 that are not in file2. Thus, performs set difference on lines. The switches indicate columns to be omitted from output.
comm -12 file1 file2
- If the files are sorted, outputs lines that are in both files. Thus, performs set intersection on lines.
printf "1\n1\n" > file1; printf "1\n" > file2; comm file1 file2
- The 1st line is treated as common to both files, while the 2nd line of file1 is treated as unique to file1. Thus, comm does not compare line content alone: each duplicate line is a distinct item that needs its own match in the other file.
seq 1 5 > file1; seq 3 7 | comm file1 -
- Uses dash (-) to indicate standard input.
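
Since comm requires sorted input, unsorted lists are usually sorted on the way in. The sketch below assumes two small made-up lists written to a scratch directory (the file names a.txt and b.txt are hypothetical, and mktemp is assumed to be available, which it is on most systems although it is not part of POSIX).

```shell
# Scratch directory for the hypothetical input files.
dir=$(mktemp -d)

# comm needs sorted input, so each list is sorted before it is stored.
printf 'pear\napple\nfig\n' | sort > "$dir/a.txt"
printf 'fig\nkiwi\napple\n' | sort > "$dir/b.txt"

only_a=$(comm -23 "$dir/a.txt" "$dir/b.txt")   # lines only in a.txt
only_b=$(comm -13 "$dir/a.txt" "$dir/b.txt")   # lines only in b.txt
common=$(comm -12 "$dir/a.txt" "$dir/b.txt")   # lines in both files

printf 'only in a: %s\n' "$only_a"
printf 'only in b: %s\n' "$only_b"
printf 'in both:\n%s\n' "$common"
```

Here "pear" is unique to the first list, "kiwi" to the second, and "apple" and "fig" are common to both.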

Links:

comm, opengroup.org
comm man page, man.cat-v.org
7.4 comm in GNU Coreutils manual, gnu.org

csplit

Splits input into output files. The split can be driven by the number of lines and by a regex match.
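
A regex-driven split might look like the sketch below. It assumes a made-up input file book.txt and the output prefix "part" (both hypothetical), and runs in a scratch directory because csplit writes its pieces into the current directory.

```shell
# Work in a scratch directory, since csplit writes to the current directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: a preface followed by two chapters.
printf 'Preface\nChapter 1\nalpha\nChapter 2\nbeta\n' > book.txt

# -s suppresses the byte counts that csplit otherwise prints.
# -f part sets the output name prefix, replacing the default "xx".
# '/^Chapter/' splits before the next matching line; '{1}' repeats that once more.
csplit -s -f part book.txt '/^Chapter/' '{1}'

first=$(cat part00)    # everything before the first chapter
second=$(cat part01)   # chapter 1
third=$(cat part02)    # chapter 2
printf '%s\n---\n' "$first" "$second" "$third"
```

The pieces are numbered from 00, so three files result: part00 with the preface, and part01 and part02 with one chapter each.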

Links:

csplit, opengroup.org
5.4 csplit in GNU Coreutils manual, gnu.org

cut

Outputs selected columns ("fields") from lines in text files, with a specifiable column separator. See also the Cut Wikibook.

Examples:

cut -f1 file.txt
- Outputs the 1st field of each line, using tab as the field separator.
echo a:b | cut -d: -f2
- Outputs the 2nd field of each line, using colon as the field separator.
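echo a b c | cut -d" " -f1,3
- Outputs the 1st and 3rd fields, using space as the field separator. Thus, outputs "a c".
echo a b c d e | cut -d" " -f1-3,5
- Outputs the 1st through 3rd fields and the 5th field. Thus, outputs "a b c e".
echo a b c | cut -d" " -f3,2,1
- Outputs "a b c", since cut emits the selected fields in input order, disregarding the reversed order after -f.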
echo a b c d | cut -d" " -f2-
- Outputs the 2nd and every later field, so "b c d".
echo abcd | cut -c3,4
- Selects characters instead of fields. Thus, outputs "cd".
echo abcdefgh | cut -c1-3,6-8
- Outputs abcfgh

Links:

cut, opengroup.org
cut man page, man.cat-v.org
8.1 cut in GNU Coreutils manual, gnu.org

expand

Converts tabs to spaces, with tab stops every 8 columns by default. See also unexpand.
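
A tab is replaced by as many spaces as are needed to reach the next tab stop, which is not always a fixed count. A small sketch, using made-up input and the -t option to change the tab stop spacing:

```shell
# A tab advances to the next tab stop, so "a" plus a tab becomes "a" plus 7 spaces.
default_stops=$(printf 'a\tb\n' | expand)

# -t 4 places tab stops every 4 columns instead: "a" plus 3 spaces.
narrow_stops=$(printf 'a\tb\n' | expand -t 4)

# Print the results in brackets so the spacing is visible.
printf '[%s]\n[%s]\n' "$default_stops" "$narrow_stops"
```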

Links:

expand, opengroup.org
9.2 expand in GNU Coreutils manual, gnu.org

fmt

Formats text, including reflowing paragraphs to a specific maximum number of characters per line. Does not seem to be covered by POSIX.
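
Reflowing to a maximum width might be sketched as follows, assuming the -w option, which both the GNU and FreeBSD versions accept. The sample sentence is made up, and the exact line breaks can differ between implementations, so no particular output is claimed here.

```shell
# Hypothetical one-line paragraph, 64 characters long.
text='the quick brown fox jumps over the lazy dog and keeps on running'

# -w 30 asks for lines of at most 30 characters; words are kept whole.
wrapped=$(printf '%s\n' "$text" | fmt -w 30)

printf '%s\n' "$wrapped"
```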

Links:

4.1 fmt in GNU Coreutils manual, gnu.org
fmt man page, freebsd.org

fold

Limits the maximum length of a line by inserting line breaks at a given width (80 columns by default). Unlike fmt, it does not reflow paragraphs or join short lines.
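
The difference shows up in a short sketch with made-up input: a plain fold breaks wherever the width is reached, while -s breaks after the last blank that still fits.

```shell
# Hard wrap: break after every 4th character, even inside a word.
hard=$(printf 'abcdefghij\n' | fold -w 4)

# -s breaks after the last blank within the width, keeping words whole.
# The blank stays at the end of the line it terminates.
soft=$(printf 'alpha beta gamma\n' | fold -s -w 7)

printf '%s\n' "$hard" "$soft"
```

The first command yields "abcd", "efgh" and "ij"; the second yields one word per line, with the first two lines keeping their trailing space.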

Links:

fold, opengroup.org
4.3 fold in GNU Coreutils manual, gnu.org

iconv

Converts between character encodings.

Examples:

iconv -f ISO-8859-2 -t UTF-8 < in.txt > out.txt
- Converts from (-f) ISO-8859-2 to (-t) UTF-8.

Links:

iconv, opengroup.org
iconv in libiconv documentation, gnu.org
software/libiconv, gnu.org - lists supported encodings
iconv, man7.org
iconv, freebsd.org
iconv, wikipedia.org

join

Combines lines from files based on their fields, assuming the files are sorted on the fields used for joining.
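
By default the join field is the 1st field of each file, and only keys present in both files produce output. A minimal sketch, assuming two made-up files (names.txt and colors.txt are hypothetical names) that are already sorted on that field:

```shell
# Scratch directory for the hypothetical input files.
dir=$(mktemp -d)

# Both files are sorted on their 1st field, the join key.
printf '1 apple\n2 banana\n3 cherry\n' > "$dir/names.txt"
printf '1 red\n3 dark-red\n4 green\n' > "$dir/colors.txt"

# Keys 2 and 4 appear in only one file, so they are dropped.
joined=$(join "$dir/names.txt" "$dir/colors.txt")

printf '%s\n' "$joined"
```

The result has two lines, "1 apple red" and "3 cherry dark-red".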

Links:

join, opengroup.org
join man page, man.cat-v.org
8.3 join in GNU Coreutils manual, gnu.org

nl

Adds line numbers.
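
By default, nl numbers only non-empty lines; the -b a option numbers every line. The sketch below uses made-up three-line input held in a variable named input (hypothetical); the exact padding around the numbers may vary between implementations.

```shell
# Hypothetical input with an empty line in the middle.
input='first

third'

# Default: only non-empty lines receive a number.
numbered=$(printf '%s\n' "$input" | nl)

# -b a numbers all lines, empty ones included.
numbered_all=$(printf '%s\n' "$input" | nl -b a)

printf '%s\n' "$numbered" "$numbered_all"
```

In the first result "third" gets number 2, in the second it gets number 3.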

Links:

nl, opengroup.org
3.3 nl in GNU Coreutils manual, gnu.org

paste

For multiple files, joins lines corresponding by line number as if each file were a column of a table and each file line a row of the table.
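
A minimal sketch of this column-wise merge, assuming two made-up files (letters.txt and numbers.txt are hypothetical names). The columns are separated by a tab unless -d names another delimiter.

```shell
# Scratch directory for the hypothetical input files.
dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$dir/letters.txt"
printf '1\n2\n3\n' > "$dir/numbers.txt"

# Default: corresponding lines are joined with a tab.
table=$(paste "$dir/letters.txt" "$dir/numbers.txt")

# -d, uses a comma as the delimiter instead.
csv=$(paste -d, "$dir/letters.txt" "$dir/numbers.txt")

printf '%s\n' "$table" "$csv"
```

The comma-separated result is "a,1", "b,2" and "c,3", one pair per line.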

Links:

paste, opengroup.org
paste man page, man.cat-v.org
8.2 paste in GNU Coreutils manual, gnu.org

pr

Formats input for printing, including pagination with header and footer.
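
Besides pagination, pr can lay input out in columns. The sketch below assumes the -t option to omit the page header and trailer, which makes the output usable in a pipeline; the exact spacing between columns depends on the implementation and page width, so it is not reproduced here.

```shell
# -3 arranges the input in three columns, filled top to bottom.
# -t omits the page header and trailer, leaving only the data.
columns=$(seq 1 6 | pr -3 -t)

printf '%s\n' "$columns"
```

With six input lines and three columns, each column holds two values, so the first row reads 1, 3, 5 and the second 2, 4, 6.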

Links:

pr, opengroup.org
pr man page, man.cat-v.org
4.2 pr in GNU Coreutils manual, gnu.org

sed

sed, a stream editor, is noted for its text replacement capability with regular expression support, but can do more. You can learn more in the Sed Wikibook.

Oneliner examples of substitution:

sed "s/concieve/conceive/" myfile.txt
- Replaces the first occurrence of "concieve" on each line.
sed "s/concieve/conceive/g" myfile.txt
- Replaces all occurrences, because of "g" at the end.
sed "s/concieve/conceive/g;s/recieve/receive/g" myfile.txt
- Does two replacements.
echo "abccbd" | sed "s/a\([bc]*\)d/\1/g"
- Outputs "bccb". Uses \( and \) to mark a group and \1 to refer to the group in the replacement part.
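- Grouping and back-references of this kind are part of POSIX basic regular expressions, so this should work in any POSIX-conforming sed, not only in GNU sed.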
echo "abccbd" | sed -r "s/a([bc]*)d/\1/g"
- In GNU sed, it does the same thing as the previous example, just that the use of -r to switch on extended regular expressions has obviated the need to place backslash before "(" to indicate grouping.
- The -r switch is available in GNU sed, and unavailable in the original Unix sed.
echo "a b" | sed -r "s/a\s*b/ab/g"
- In GNU sed, outputs "ab". Uses "\s", a GNU extension, to denote whitespace, and "*" to let the preceding element be repeated any number of times. Neither of the two depends on extended regex, so in GNU sed the expression should also work without -r.
sed "s/\x22/'/g" myfile.txt
- In GNU sed, replaces each quotation mark with a single quote. \x22 refers to the character whose hexadecimal ASCII value is 22, which is the quotation mark.
echo Hallo | sed "s/hallo/hello/gi"
- Ignores character case, because of "i" at the end. Does not preserve capitalization, outputting "hello" rather than "Hello".
echo a2 | sed "s/[[:alpha:]]/z/g"
- Outputs z2, using "[[:alpha:]]", which stands for any letter. Notice that the character class is listed as "[:alpha:]" in manuals, with single "[".

Links:

sed, opengroup.org
sed man page, man.cat-v.org
sed, a stream editor - GNU manual, gnu.org

sort

Sorts lines in files, outputting the sorted lines and leaving the input intact.

Examples:

sort file.txt
- Sorts the file alphabetically.
sort file.txt file2.txt
- Sorts the lines of two files alphabetically, outputting a single sorted stream of lines from the two files.
cat file.txt | sort
- Sorts the input stream created by cat. Thus, equivalent to sort file.txt.
sort -n file.txt
- Sorts the file numerically. Thus, 12 comes after 2, which it does not alphabetically.
sort -r file.txt
- Sorts the file in the reverse order. Thus, b comes before a.
sort -k5,5 file.txt
- Sorts the file by the 5th field (column) via -k.
sort -t, -k5,5 file.txt
- As above, using comma (,) as the field separator via -t.
sort -k5,5 -k3,3 file.txt
- Sorts the file first by the 5th field, then by the 3rd field.
sort -k5,5 -k3,3n file.txt
- As above, but when sorting by the 3rd field, do so numerically via appended "n".
sort -k5 file.txt
- Sorts the file by a key that starts at the 5th field and runs to the end of the line, ignoring fields 1 through 4 for sorting purposes.
sort -u file.txt
- Sorts the file, removing duplicate lines, thereby ensuring each output line is unique.
sort -u -k5,5 file.txt
- Sorts the file by the 5th field, keeping only one line from each set of lines having the same key, where the key is the 5th field.
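
The effect of the "n" modifier on a key is easiest to see side by side. A short sketch, assuming made-up comma-separated rows in a variable named rows (hypothetical):

```shell
# Hypothetical data: a name and a count, separated by a comma.
rows='pear,3
apple,12
fig,7'

# Alphabetical comparison of the 2nd field: "12" sorts before "3".
by_text=$(printf '%s\n' "$rows" | sort -t, -k2,2)

# Numeric comparison of the 2nd field: 3, 7, 12.
by_number=$(printf '%s\n' "$rows" | sort -t, -k2,2n)

printf '%s\n' "$by_text" "$by_number"
```

The alphabetical sort puts "apple,12" first, while the numeric sort puts it last.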

Links:

sort, opengroup.org
sort man page, man.cat-v.org
7.1 sort in GNU Coreutils manual, gnu.org

spell

Performs spell checking. Seems absent from POSIX.

Links:

spell man page, man.cat-v.org
GNU spell project, savannah.gnu.org

tr

Performs a character-by-character mapping or "translation", and more. Yields greater brevity than sed for some tasks.

Examples:

echo "a:b:c:d" | tr : \\n
- Splits into multiple lines by colon (:). The colon will not be in the output.
echo "a b c d" | tr " " \\n
- Splits into multiple lines by space.
echo "abba" | tr ab cd
- Replaces a with c and b with d. Thus, yields cddc.
echo "a,b:c,d:e" | tr ,: :,
- Swaps commas with colons. Thus, yields a:b,c:d,e.
echo "a b c d" | tr -d " "
- Removes spaces from the input, outputting abcd. -d stands for delete.
echo "a,b,c:d:e" | tr -dc ,:
- Keeps only the commas and colons. -c stands for complement. Thus, yields ,,::.
echo "a,,,b,c::d" | tr -s ,:
- Replaces sequences of commas with a single comma, and sequences of colons with a single colon. -s stands for squeeze. Thus, yields a,b,c:d.

Links:

tr, opengroup.org
tr man page, man.cat-v.org
9.1 tr in GNU Coreutils manual, gnu.org

unexpand

Converts spaces to tabs, defaulting to 8 spaces per tab.
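
By default only blanks at the beginning of a line are converted; the -a option extends the conversion to blanks elsewhere on the line. A small sketch with made-up input, where printf '%8s' '' is used simply to produce a run of 8 spaces:

```shell
# Eight leading spaces reach the first tab stop, so they become one tab.
leading=$(printf '%8sx\n' '' | unexpand)

# Blanks after the first non-blank character are left alone by default...
inner_default=$(printf 'a%7sb\n' '' | unexpand)

# ...and are converted only with -a.
inner_all=$(printf 'a%7sb\n' '' | unexpand -a)

# od -c makes tabs (\t) and spaces visible.
printf '%s\n' "$leading" "$inner_default" "$inner_all" | od -c
```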

Links:

unexpand, opengroup.org
9.3 unexpand in GNU Coreutils manual, gnu.org

uniq

Outputs a single line from each block of adjacent identical lines, and more. Ideally used with sorted input. You can learn more in the Uniq Wikibook.

Examples:

uniq file.txt
- From each run of one or more adjacent identical lines, outputs only one line.
sort file.txt | uniq
- Sorts and then, of identical lines, outputs only one of them. Sorting ensures that originally non-adjacent lines that were identical become adjacent.
sort file.txt | uniq -u
- Outputs singleton blocks only.
sort file.txt | uniq -d
- Outputs a single line per multiline block, filtering out singletons.
sort file.txt | uniq -c
- Precedes each line with the block size.
sort file.txt | uniq -c | sort
- Precedes each line with the block size, and sorts the result by the block size. The alphabetical sort tends to work here because uniq pads the counts with leading spaces to align them; sort -n is the more robust choice.
sort file.txt | uniq -u -d
- In GNU uniq, outputs nothing, since each of the two options acts as a filter, and combined they filter out everything.
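
A common use of the counting option is a frequency table, most frequent first. The sketch below assumes made-up input in a variable named words (hypothetical), sorts the counts numerically in reverse with sort -rn, and uses awk only to strip the padding that uniq -c puts before each count.

```shell
# Hypothetical input: one word per line, unsorted.
words='b
a
b
c
b
a'

# sort groups identical lines, uniq -c counts each group,
# sort -rn orders the groups by count, highest first,
# and awk normalizes the spacing to "count word".
counts=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn | awk '{print $1, $2}')

printf '%s\n' "$counts"
```

For this input the table reads "3 b", "2 a", "1 c".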

Links:

uniq, opengroup.org
uniq man page, man.cat-v.org
7.3 uniq in GNU Coreutils manual, gnu.org