/ioccc-ag

Best Utility (grep) - IOCCC 1993 Winner

Primary LanguageCBSD Zero Clause License0BSD

ag - Anthony's Grep

1. SYNOPSIS

ag pattern [file...]

2. DESCRIPTION

The AG utility shall search input files, selecting lines matching a pattern. An input line shall be selected if the pattern, treated as a subset of Extended Regular Expressions as described by POSIX.2 Standard section 2.8.4, matches any part of the line; a null pattern shall match every line. Each selected input line shall be written to the standard output.

Regular expression matching shall be based on text lines, which are separated by a <newline> character. A regular expression cannot contain a <newline> and there is no way for a pattern to match a <newline> found in the input.

3. OPTIONS

None.

4. OPERANDS

  • pattern Specify one pattern to be used during the search for input. See Extended Regular Exprssions below on what constitutes a valid pattern.

  • file A pathname of a file to be searched for the pattern. If no file operands are specified or the file operand is dash (-), the standard input shall be used.

5. EXTERNINAL INFLUENCES

5.1. STANDARD INPUT

The standard input shall be used only if no file operands are specifed. See Input Files.

5.2. INPUT FILES

The input files shall be text files.

5.3. ENVIRONMENT VARIABLES

No environment variables are used.

5.4. ASYNCHRONOUS EVENTS

Default.

6. EXTERNAL EFFECTS

6.1. STANDARD OUTPUT

For each selected input line from a file, a single output line shall be written:

"%s:%s\n", file, line

Standard input will have the filename dash (-).

6.2. STANDARD ERROR

Standard error is used for diagnostic messages only.

6.3. OUTPUT FILES

None.

7. EXTENDED DESCRIPTION

7.1. EXTENDED REGULAR EXPRESSIONS

7.1.1. MATCHING A SINGLE CHARACTER

An ERE ordinary character, a special character preceded by a backslash, or a period shall match a single character. A bracket expression shall match a single character. An ERE matching a single character enclosed in parentheses shall match the same as the ERE without parentheses would have matched.

7.1.1.1. ERE ORDINARY CHARACTERS

An oridinary character is an ERE that matches itself. An ordinary character is any character in the supported character set, except for the ERE special characters listed in 7.1.1.2. The interpretation of an ordinary character preceded by a backslash (\) is the ordinary character.

7.1.1.2. ERE SPECIAL CHARACTERS

An ERE special character has special properties in certain contexts. Outside of those contexts, or when preceded by a backslash, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they have their special meaning are:

. [ \ ( The period, left-bracket, backslash, and left- parenthesis are special except when used in a bracket expression. Outside a bracket expression, a left-parenthesis immediately followed by a right-parenthesis will be treated as a null expression.

) The right-parenthesis is special when matched with a preceding left-parenthesis, both outside of a bracket expression.

* ? The asterisk and question-mark are special except when used in a bracket expression. Results are undefined if this character appears first in an ERE, or immediately following a vertial-line, circumflex, or left-parenthesis.

| The vertical-line is special except when used in a bracket expression. A vertical-line appearing first or last in an ERE or immediately following a vertical-line or left-parenthesis, or preceding a right-parenthesis, produces a null expression.

^ The circumflex shall be special when used as an anchor or, as the first character of a bracket expression.

$ The dollar-sign shall be special when used as an anchor.

7.1.1.3 PERIODS IN ERE

A period (.), when used outside of a bracket expression, is an ERE that shall match any character in the supported character set except NUL.

7.1.2. ERE BRACKET EXPRESSION

A bracket expression (an expression enclosed in square brackets, []), is an ERE that matches a single character contained in the nonempty set of characters represented by the bracket expression.

The following rules and definitions apply to bracket expressions:

  1. A bracket expression is either a matching list expression or a nonmatching list expression. It consists of one or more characters and/or range expressions. The right-bracket (]) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list [after an initial circumflex (^), if any]. Otherwise, it shall terminate the bracket expression. The special characters period (.), asterisk (*), left-bracket ([), and backslash (\) shall lose their special meaning within a bracket expression.

  2. A matching list expression specifies a list that shall match any one of the expressions represented in the list. The first character of the list shall not be the circumflex. For example, [abc] is an ERE that matches any of a, b, or c.

  3. A nonmatching list expression begins with a circumflex (^), and specifies a list that shall match any character except for the expressions represented in the list after the leading circumflex. For example, [^abc] is an ERE that matches any character except a, b, or c. The circumflex shall have this special meaning only when it occurs first in the list, immediately following the left-bracket.

  4. A range expression represents the set of characters that fall between two elements in the current collation sequence, inclusive. It shall be expressed as the starting point and the ending point separated by a hyphen (-).

    In the following, all examples assume the collation sequence specified for the POSIX Locale (a.k.a ASCII Character Set), unless another collation sequence is sepecifically defined by the hardware.

    The starting range point and the ending range point shall be a collating element (character). The ending range point shall collate higher than the starting range point; otherwise the expression shall be treated as invalid.

    The interpretation of a range expression where the ending range point is the starting range point of a subsequent range expression is undefined.

    The hyphen character shall be treated as itself if it occurs first (after an initial ^, if any) or last in the list, or as an ending range point in a range expression. As examples, the expressions [-ac] and [ac-] are equivalent and match any of the characters a, c, or -; the expressions [^-ac] and [^ac-] are equivalent and match ant character except a, c, -; the expression [%--] matches any of the characters between % and - inclusive; the expression [--@] matches any of the characters between - and @ inclusive; and the expression [a--@] is invalid, because the letter a follows the symbol - in the POSIX Locale.

7.1.3. ERE MATCHING MULTIPLE CHARACTERS

The following rules shall be used to construct EREs matching multiple characters from EREs matching a single character:

  1. A concatenation of EREs shall match the concatenation of the character sequences matched by each component of the ERE. A concatenation of EREs enclosed in parentheses shall match whatever the concatenation without parentheses matches. For example, both the ERE cd and the ERE (cd) are matched by the third and forth character of the string abcdefabcdef.

  2. When an ERE matching a single character, or a concatention of EREs enclosed in parentheses is followed by the special character asterisk (*), together with that asterisk it shall match what zero or more consecutive occurences of the ERE would match. For example, the ERE b*c matches the first character in the string cabbbcde, and the ERE b*cd matches the third through seventh characters in the string cabbbcdebbbbbbcdbc. And, [ab]* and [ab][ab] are equivalent when matching the string ab.

  3. When an ERE matching a single character or an ERE enclosed in parentheses is followed by the special character question-mark (?), together with that question mark it shall match what zero or one consecutive occurences of the ERE would match. For example, the ERE b?c matches the second character in the string acabbbcde.

The behaviour of multiple adjacent duplication symbols (*, ?) produces undefined results.

7.1.4. ERE ALTERATION

Two EREs separated by the special character vertical-line (|) shall match a string that is matched by either. For example, the ERE a((bc)|d) matches the string abc and the string ad. Single characters, or expressions matching single characters, separated by the vertical-line and enclosed in parentheses, shall be treated as an ERE matching a single character.

7.1.5. ERE PRECEDENCE

The order of precedence shall be as shown in the table below, from high to low:

escaped characters			\<special character>
bracket expression			[ ]
grouping				( )
single-character-ERE duplication	* ?
concatenation
anchoring				^ $
alteration				|

For example, the ERE abba|cde matches either the string abba or the string cde (because concatenation has a higher order of precedence than alteration).

7.1.6. ERE EXPRESSION ANCHORING

An ERE can be limited to matching strings that begin or end a line; this is called anchoring. The circumflex and dollar-sign special characters shall be considered ERE anchors in the following contexts:

  1. A circumflex (^) outside a bracket expression shall anchor the (sub)expression it begins to the beginning of a string; such a (sub)expression can match only a sequence starting at the first character of a string. For example, the EREs ^ab and (^ab) match ab in the string abcdef, but fail to match in the string cdefab, and the ERE a^b is valid, but can never match because the character a prevents the expression ^b from matching the first character of a string.

  2. A dollar-sign ($) outside a bracket expression shall anchor the (sub)expression it ends to the end of a string; such a (sub)expression can match only a sequence ending at the last character of a string. For example, the EREs ef$ and (ef$) match ef in the string abcdef, but fail to match in the string cdefab, and the ERE e$f is valid, but can never match because the character f prevents the expression e$ from matching the last character of a string.

8. EXIT STATUS

The following exit codes will be returned:

  • 0 Successful match.
  • 1 No match.
  • 2 Usage error.
  • 3 General error.

9. INSTALLATION

AG can be built on any system providing at least K&R C. It has been tested on

  • SunOS with GCC
  • ATARI Mega ST with Sozobon C
  • PC clone with Turbo C and WatCom C
  • Interactive UNIX System V/386 release 3.2

For most machines, the compile command line should be

cc -O -o ag ag.c

PC class machines may require that medium or large memory model be used and that ARRAY be set to 256. For Watcom C the command line would be

wcl -ml -DARRAY=256 ag.c

The value ARRAY represents the size of the compiled pattern array and the size of the array used to record matched states. The default value choosen should handle most patterns. ARRAY can be overridden on the command line with -DARRAY=nn, where 255 < nn < INT_MAX.

10. REFERENCES

  • Aho, Sethi, and Ullman, "Compilers - Principles, Techniques, and Tools", Addison-Wesley, 86, ISBN 0-201-10088-6, chapter 3

  • Kernighan and Plauger, "Software Tools in Pascal". Addison-Wesley, 81, ISBN 0-201-10342-7, chapter 5

  • Webb Miller, "A Software Tools Sampler", Prentice Hall, 87, ISBN 0-13-822305-X, chaper 4

  • POSIX.2 Standard, 2.8.4 Extended Regular Expressions, 4.28 Grep

11. BUGS

The expression (a*)* compiles but loops forever.

There is no check for trailing backslash (\) in the pattern.

There is no check for unbalanced brackets. Omitting a closing bracket will generate a "Patern too long" error, which is not the real error.

12. NOTICES

Public Domain 1992, 1993 by Anthony Howe. No warranty.