cgrep [ -help ]
cgrep [ -version ]
In addition, cgrep allows a regular expression to be used to define a search universe and reports elements of the search universe that contain (or alternatively do not contain) occurrences of the pattern. Useful search universes include email messages, news articles and similar components of structured documents.
The behavior of cgrep differs substantially from that of grep(1) and other related utilities. Those utilities perform matching only within a line, and only lines containing the pattern may be reported. The approach taken by cgrep increases the usefulness of regular expression search, allowing matching across lines and allowing non-text and binary files to be searched. Examples are given toward the end of this man page to illustrate some of the possibilities.
Regular expressions are written in a notation based on that of egrep(1) and POSIX 1003.2 (excluding internationalization features) with some additions. These additions include an intersection operator, escape sequences for non-printable characters, and a macro facility.
cgrep begins its execution by reading and processing macros defined in the file $HOME/.cgreprc. It then processes its command line arguments, reading and processing in order any macro definition files specified on the command line. Finally, each input file is read and searched, with occurrences of the pattern reported as they are encountered. As each occurrence of the pattern is printed it may be optionally delimited with user-defined start and end tags. If no input file is specified, standard input is read.
In addition to the standard POSIX 1003.2 operators, we accept '&' for the intersection of two regular expressions. The precedence of the intersection operator is the same as that of union ('|'). The union and intersect operators associate left to right.
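Most regular expression libraries have no intersection operator. The following Python sketch illustrates the semantics only, assuming that the intersection of two expressions denotes the strings matched in full by both. The helper name matches_intersection and the sample strings are hypothetical; this is not how cgrep is implemented.

```python
import re

def matches_intersection(text, *patterns):
    """Return True if text, taken as a whole, is matched by every pattern."""
    return all(re.fullmatch(p, text, flags=re.DOTALL) for p in patterns)

# Both conditions must hold for the same string.
print(matches_intersection("Cowan on brewpubs", r".*Cowan.*", r".*brewpubs.*"))  # True
print(matches_intersection("Cowan on lagers", r".*Cowan.*", r".*brewpubs.*"))    # False
```

Because both operands must match the same span, conditions that apply anywhere in the text are normally wrapped in `.*` on both sides.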
The characters `<' and `>' may be used to match the beginning and end of file respectively.
We make one addition to the character classes defined by POSIX 1003.2: Within a bracket expression, the sequence `[:print:]' matches any printable character. Character class membership is based on the ctype(3) macros.
Escape sequences for non-printable characters follow the syntax of ANSI C, including the sequences for hexadecimal and octal constants. Escape sequences undefined by ANSI C represent the literal character following the '\'. In particular, an escape consisting of a `\' followed by any punctuation character may be used to represent the literal punctuation mark, avoiding any special meaning of the character.
Support for macros is provided. Macro calls come in two flavors: fast and tedious. A fast call consists of an `@' character followed by a single alphabetic character. A tedious macro call has the form:
[@name(parameter0, parameter1, ...)]
where each of the up to 9 parameters is a regular expression.
If the macro requires no parameters, the parenthesized parameter list
is omitted completely, leaving only [@name].
Be careful not to put any extra whitespace in the parameter list;
this extra whitespace will be counted as part of the parameter.
An un-parameterized macro definition has the form:
name=regular-expression
and a parameterized macro definition has the form:
name#n=regular-expression
where the number of parameters is indicated by a single digit following the
`#' character.
Within the body of a parameterized macro, the actual parameters may be
referenced as `#1' through `#9'.
A macro name must start with an alphabetic character, and may include
only alphanumeric characters and the character `_'.
Be careful not to put any extra whitespace after the '='; this whitespace counts
as part of the regular expression.
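The definition and call syntax above can be illustrated with a small Python sketch that expands tedious macro calls by textual substitution. This is an assumption about the semantics, not cgrep's actual implementation: it ignores fast calls and nested calls, splits parameters on every comma, and does not parenthesize the substituted body. The macros word and pair and the function names are hypothetical.

```python
import re

# A tedious call: [@name] or [@name(p1,p2,...)], parameters without parentheses.
CALL = re.compile(r"\[@([A-Za-z][A-Za-z0-9_]*)(?:\(([^()]*)\))?\]")

def parse_definitions(text):
    """Read 'name=regex' and 'name#n=regex' lines into {name: (n, body)}."""
    macros = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        head, body = line.split("=", 1)  # whitespace after '=' stays in the body
        name, _, count = head.partition("#")
        macros[name] = (int(count) if count else 0, body)
    return macros

def expand(pattern, macros):
    """Replace each tedious macro call in pattern with its body."""
    def substitute(call):
        name, raw = call.group(1), call.group(2)
        arity, body = macros[name]  # KeyError for an undefined macro
        params = raw.split(",") if raw is not None else []
        if len(params) != arity:
            raise ValueError(f"{name} expects {arity} parameter(s)")
        # '#1' through '#9' refer to the actual parameters.
        return re.sub(r"#([1-9])", lambda m: params[int(m.group(1)) - 1], body)
    return CALL.sub(substitute, pattern)

macros = parse_definitions("word=[[:alpha:]]+\npair#2=#1[[:space:]]+#2")
print(expand("[@word]", macros))                  # [[:alpha:]]+
print(expand("^[@pair(United,States)]$", macros)) # ^United[[:space:]]+States$
```

Because parameters are split verbatim, a call written as `[@pair(United, States)]` would pass " States" with its leading space, which is the whitespace pitfall described above.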
cgrep '^.*United[[:space:]]*States.*$' constitution.txt
will print the lines of text that contain it.
The command
cgrep -list 'the\nthe' *.txt
checks for a typing error that's hard to spot visually and prints the names of
the files that contain it.
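The difference from line-oriented matching can be seen in a short Python sketch. It uses Python's re module as a stand-in for both tools, and the sample text is hypothetical.

```python
import re

text = "Paris in the\nthe spring\n"

# Line-at-a-time search, in the style of grep(1): no single line contains a newline.
per_line = [line for line in text.splitlines() if re.search(r"the\nthe", line)]

# Whole-input search: the newline is an ordinary character in the pattern.
whole = re.findall(r"the\nthe", text)

print(per_line)  # []
print(whole)     # ['the\nthe']
```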
The command
cgrep -insensitive -U '/\*.*\*/' POSIX cgrep.c
prints all the comments in the C source file cgrep.c
that contain the string ``posix'' in any combination of lower and
upper case letters (under some mild assumptions).
The command
cgrep '[^[:print:]][[:print:]]{4,}[\n\0]' a.out
reports strings of four or more printable characters ending in a newline or
null character that appear in the executable file
a.out.
Each match is printed on a separate line.
If the
-binary
flag were specified, the resulting matches would be run together without
separating newlines.
Each match is started by an unprintable character and may contain
superfluous null characters.
The output could be piped to
cgrep -binary '[[:print:]\n]'
to strip these unprintable characters
(or the
tr(1)
command could be used for the same purpose).
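For comparison, the same search can be sketched in Python over a bytes object. Python's re module has no POSIX character classes, so the ASCII range 0x20 to 0x7e is assumed to stand in for [[:print:]]. The function name printable_runs and the sample bytes are hypothetical. Unlike the cgrep command, the sketch returns only the printable run, dropping the leading unprintable byte and the terminator.

```python
import re

# ASCII printable range, used here in place of [[:print:]].
PRINTABLE = rb"\x20-\x7e"
STRINGS = re.compile(rb"[^" + PRINTABLE + rb"]([" + PRINTABLE + rb"]{4,})[\n\x00]")

def printable_runs(data):
    """Return runs of four or more printable bytes ending in newline or null."""
    return [m.group(1) for m in STRINGS.finditer(data)]

sample = b"\x7fELF\x02\x01\x00/lib/ld.so\x00\x01ab\x00\x03hello world\n"
print(printable_runs(sample))  # [b'/lib/ld.so', b'hello world']
```

The runs "ELF" and "ab" are skipped because they are shorter than four characters.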
As a final example,
cgrep
may be used
to search a mail file and extract mail based on
patterns in the sender or subject lines, or in other parts of the header
and body.
Standard macros for handling mail may be defined in the
$HOME/.cgreprc
file:
Mail=^From .*(^From |>)
From#1=^From:[^$]*#1
Re#1=^Subject:[^$]*#1
The command
cgrep -U '[@Mail]' '(.*[@From([Cc]owan)].*)&(.*[@Re(brewpubs)].*)' mbox
would then extract all mail messages in the file mbox that are from Cowan and are on the subject of brewpubs. It's then necessary to pipe the output through
sed '/^From $/d'
or equivalently
cgrep -V '^.*$' '^From $'
to strip out the extra characters needed to detect the end of each mail message and create a validly formatted mail file.
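The idea of a search universe can be approximated in Python by first splitting the input into messages and then testing each message against both conditions. This is a sketch under simplifying assumptions, not a reproduction of cgrep's behavior: every line starting with "From " is assumed to begin a new message, the header tests are not restricted to the header block, and the mailbox contents are hypothetical.

```python
import re

mbox = (
    "From cowan@example.org Mon Jan  1 00:00:00 1995\n"
    "From: D. Cowan\n"
    "Subject: brewpubs in Waterloo\n"
    "\n"
    "Any recommendations?\n"
    "From clarke@example.org Tue Jan  2 00:00:00 1995\n"
    "From: C. Clarke\n"
    "Subject: brewpubs\n"
    "\n"
    "Try the one downtown.\n"
)

# Universe: each message starts at a line beginning with "From ".
messages = [m for m in re.split(r"(?m)^(?=From )", mbox) if m]

# Pattern: both header conditions must hold within the same message.
sender = re.compile(r"(?m)^From:.*[Cc]owan")
subject = re.compile(r"(?m)^Subject:.*brewpubs")
selected = [m for m in messages if sender.search(m) and subject.search(m)]

print(len(messages), len(selected))  # 2 1
```

Splitting before matching means no trailing "From " fragment is left in the output, so the clean-up step is not needed in this version.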
POSIX 1003.2, section 2.8 (Regular Expression Notation).
Charles L. A. Clarke and Gordon V. Cormack. On the use of Regular Expressions for Searching Text. University of Waterloo Computer Science Department Technical Report number CS-95-07, University of Waterloo, Waterloo, Ontario N2L 3G7, Canada. February 1995. ftp://plg.uwaterloo.ca/pub/mt/TechReports/CS-95-07/regexp.ps
The syntax for macros is ugly. An undefined macro is reported as a syntax error.
This man page needs to be extended with a complete and precise description of the regular expression format.
The software is an alpha release. Report bugs to mt@plg.uwaterloo.ca.