/unix-text-commands

Unix Text Processing Command Reference

Unix Text Processing Command Reference

Nathan Schneider, 2013-01-29

This is intended as a quick reference for text processing commands built into Unix. It is terse and not necessarily comprehensive—YMMV.

Suggestions? Contact the author or submit a pull request.

Notes about the commands below:

  • None of these commands actually modify the input files; rather, they manipulate input text and produce output text, typically writing to standard output.

  • ALLCAPS indicates a metavariable.

  • The descriptions are selective. For more comprehensive documentation of options, see the command’s man page. For tutorials and examples, search the Web.

Some tutorials and references:

More powerful tools for advanced text processing operations:

No input stream

Repeats a line (by default, y) infinitely.

  • yes LINE | head -n 10 repeats LINE 10 times.

One or more input files/streams

Concatenate the input files together in sequence.

  • -s: suppress/squeeze multiple consecutive blank lines
  • -n: number all lines (cf. nl)
  • -b: number non-blank lines

Like cat, but for gzipped files.

Like cat, in reverse: lines are printed in reverse order. (GNU but not BSD.)

  • To print the last line of (contiguous) groups sharing all but the first 2 fields in common: tac FILE | uniq -f 2 | tac

Counts lines/words/characters in the specified file(s), individually and in total. By default, displays lines, then words, then characters.

  • -l: count lines
  • -w: count words
  • -c: count characters

Typically a single input stream

Encoding

Determines the encoding of a text file, or indicates that the argument is a directory or pipe.

Converts the encoding of a text file.

  • iconv -f ISO-8859-1 -t UTF-8 FILE converts from ISO-8859-1 to UTF-8

Filtering/extracting by position

Extracts fields from a file, based on delimiters or character positions.

  • cut -f1 FILE retrieves the first (tab-separated) column from the file
  • cut -d' ' -f1,3 FILE retrieves the first and third space-separated tokens from each line
  • cut -d'
    ' -f20-30 FILE (with a line break) supposedly retrieves the 20th-30th lines of the file, though this doesn’t seem to work in OS X. Equivalently: head -n 30 | tail -n 20
  • -s to omit lines without any delimiter

Extracts a certain amount of text from the beginning or end of a file.

If multiple files are matched by the argument(s), a header indicating the filename will be displayed.

  • -n N: number of lines to retrieve (default: 10)
  • -c N: number of characters (bytes) to retrieve
  • tail -n +N, tail -c +N: N indicates an offset relative to the beginning of the file; the rest of the file after that offset will be extracted
  • head -n -N, head -c -N: offset relative to the end of the file (GNU but not BSD implementation)
  • head -n 100 FILE | tail -n 1 retrieves the 100th line of the file
  • tail -f FILE monitors the end of the file, writing to stdout as the file is appended to

Filtering/extracting by content

Filters out duplicate lines of input.

  • -c: prefix each line with a count
  • -i: case-insensitive
  • -f N: ignore the first N (whitespace-separated) fields of each line
  • -s N: ignore the first N characters of each line
  • -w N: ignore all but the first N characters of each line
  • Note: If some parts of the line are ignored, the kept and discarded lines may differ. The first line with a given “key” will be the one that is kept.
  • other options for filtering repeated or non-repeated lines

grep + friends

Searches text by regular expression.

  • -i: case-insensitive
  • -o: only show the matched part of the line (if multiple matches on an input line, these will be on separate output lines)
  • -w: match only whole words
  • -l: list only files in which matches were found
  • -r: recursive
  • -n: include matching lines and line numbers
  • -v (--invert-match): filter out matches
  • -c (--count): give counts of the matches within each file instead of the matches themselves
  • -E or egrep: extended regex syntax: unescaped +, (, and ) serve as operators
  • -F or fgrep: literal string matching (no regexes)
  • -H: suppress filename when displaying matches
  • zgrep searches zip files
  • bzgrep searches bz files

Augmenting

Adds line numbers to a file. (Cf. cat -n.) Options control formatting and counting of the line numbers, including:

  • -v STARTNUM: initial counter value (default: 1)
  • -i INCREMENT (default: 1)
  • -w WIDTH: number of characters to be occupied by line numbers (default: 6)
  • -s SEP: separator to follow every line number (default: tab)
  • -b t: number non-empty lines (default); -b a: number all lines; -b pREGEX: number lines matching the regular expression pattern REGEX

Reordering

Sorts the input lines.

  • PUNCTUATION/SPECIAL CHARS MAY BE IGNORED depending on the value of the LC_COLLATE environment variable
  • -f: ignores case
  • -n: “string numerical value”
  • -g: general numeric sort
  • -i: ignores nonprinting characters
  • -r: reverse
  • -u: unique
  • -k: sort key (field offsets)
  • -t: field delimiter
  • To sort by the first column, keeping only the last record for each: tac FILE | sort -k1,1 -u

Randomly permutes the input lines. (GNU but not BSD)

rev

Reverses each line of the file.

Rearranging (changing spacing)

expand, unexpand

Convert tabs to spaces and vice versa.

Wrap input text so no line is more than a specified number of characters wide.

  • -w WIDTH: wrap lines in a file to be no more than WIDTH characters long (default: 80)
  • -s: breaks lines at word spaces

column

Formats text into columns (by default, based on whitespace delimiters).

Dividing

split FILE OUTPREFIX breaks up the input into smaller files by size.

Output files are named by the specified prefix and some number of lowercase alphabetic characters (configurable with -a; defaults to 2, i.e. aa, ab, etc.). In some implementations -d can be provided to request decimal rather than alphabetic suffixes. One of the following may be provided to determine the splitting behavior:

  • -l NUMLINES: number of lines in each output file (default: 1000)
  • -b BYTESIZE: size of each output file; BYTESIZE can even be in kilobytes (10k) or megabytes (10m)

Breaks up the input into smaller files by content.

Main arguments are the file, followed by one or more patterns indicating split points. Each pattern may be a line number, a regexp (optionally with a line offset), or a number of lines followed by {REPEATS} to indicate REPEATS blocks of the specified number of lines. The line matching the pattern begins a new output file. Output files are numbered with decimal digits.

  • -f OUTPREFIX (default: xx)
  • -n NUMDIGITS (default: 2)

Replacing

Translates characters, e.g. lowercasing text in a file or replacing newlines with spaces.

Text substitution by regular expression matching.

  • sed 's/K\.? ?V?\.? ?/K/g' FILE replaces all matches of the pattern in the file
  • sed '/^$/d' FILE filters out blank lines of the file
  • Depending on the implementation, sed may or may not not support backslash-denoted characters and character classes such as \t, \s, and [[:space:]]. (\t and \s are supported in GNU but not BSD implementations.) Tabs and newlines can be entered as literals.
  • Unless -E is specified, the plus operator and parenthesized subexpressions MUST HAVE BACKSLASH ESCAPES when using sed/grep: e.g. \(.*\).\+

Two input streams

Call them FILE1 and FILE2, respectively.

Compare two files line-by-line. (Cf. diff3 for three files.) Can also compare directories.

  • -i: ignore case
  • -w: ignore whitespace
  • -B: ignore blank lines
  • -I REGEX: ignore differences among lines that all match the regular expression pattern REGEX
  • -x REGEX: exclude files that match the pattern
  • -r: recursively compare subdirectories
  • -s: report identical files (file comparison only)
  • -u: unified diff display format
  • -y: side-by-side display (cf. sdiff)
  • --suppress-common-lines

Other options are useful when comparing code, e.g. -p, -F, -D, and -E.

sdiff

Compare files side-by-side, as with diff -y. Many of the diff options are available; -s is short for --suppress-common-lines.

Displays lines common or unique to two SORTED files, organized into columns according to their commonality.

  • -1 to suppress the first column (lines only in FILE1)
  • -2 to suppress the second column (lines only in FILE2)
  • -3 to suppress the third column (lines common to both files)
  • -i for case-insensitive comparison
  • if the files are sorted, join -t'
    ' FILE1 FILE2 is much faster than comm -1 -2 FILE1 FILE2

Merges the lines of two sorted text files based on the presence of a common field.

  • -t CHAR to indicate a delimiter (by default, it is any sequence of whitespace). For instance, join -t'
    ' FILE1 FILE2 lists all common lines, assuming the two files are sorted.

Typically two or more input streams

Combines/merges lines of multiple files.

  • paste FILE1 FILE2 … joins corresponding lines with tabs (vertically)
  • paste -s FILE1 FILE2 … joins horizontally, with corresponding lines of the input files displayed as columns
  • paste -d'\n' FILE1 FILE2 … interleaves lines of the files

Three input streams

diff3

Like diff, but for three files.