gptx - GNU permuted index generator
This is the 0.2 alpha release of gptx, the GNU version of a
permuted index generator. This software has the main goal of providing
an almost compatible replacement for ptx, able to handle small
files quickly, while providing a platform for more development.
This version reimplements and extends standard ptx. Among other
things, it can produce a readable KWIC (keywords in their context)
without the need for nroff; there is also an option to produce
TeX compatible output. This version does not yet handle huge input
files, that is, those files which do not fit in memory all at once.
Please note that an overall renaming of all options is foreseeable. In fact, GNU ptx specifications are not frozen yet.
ptx mode is different.
This tool reads a text file and essentially produces a permuted index, with each keyword in its context. The calling sketch is one of:
gptx [option]... [input]... >output
or:
ptx [option]... [input [output]]
These are two different versions of one program. Using ptx
instead of gptx implies built-in ptx compatibility mode,
disallowing extensions, introducing some limitations, and changing
several of the program's default option values. This documentation
describes both modes of operation. See section ptx compatibility mode
for an explicit list of differences.
As usual, each option is represented by a hyphen followed by a single letter. Some options require a parameter in the form of a decimal number or a file name, in which case the parameter follows the option after some whitespace. Option letters may be grouped and tied together as a string which follows only one hyphen; if one or several of them require parameters, the parameters should follow the combined options in the order of appearance of the individual letters in the string. Individual options are explained below.
When not in ptx compatibility mode, there may be zero, one
or several parameters after the options. If there are no parameters,
the program reads the standard input. If there are one or several
parameters, they give the names of input files, which are all read in
turn, as if all the input files were concatenated. However, there is a
full contextual break between each file; and when automatic referencing
is requested, file names and line numbers refer to individual text input
files. In all cases, the program produces the permuted index onto the
standard output.
When in ptx compatibility mode, besides the options, there may be
zero, one or two parameters. If there are no parameters, the program
reads the standard input and produces the permuted index onto the
standard output. If there is only one parameter, it names the text file
to be read instead of the standard input. If two parameters are given,
they give respectively the name of the file to read and the name of the
file to produce. Be careful to note that, in this case, the
contents of the file given by the second parameter are destroyed; this
behaviour is dictated by compatibility; GNU standards discourage output
parameters not introduced by an option.
Note that for any file named as the value of an option or as an input text file, a single dash - may be used, in which case standard input is assumed. However, it would not make sense to use this convention more than once per program invocation.
-C
As it is set up now, the program assumes that the input file is coded using 8-bit ISO 8859-1 code, also known as the Latin-1 character set, unless it is compiled for MS-DOS, in which case it uses the character set of the IBM-PC. Compared to 7-bit ASCII, the set of characters which are letters is then different; this fact alters the behaviour of regular expression matching. Thus, the default regular expression for a keyword allows foreign or diacriticized letters. Keyword sorting, however, is still crude; it obeys the underlying character set ordering quite blindly.
-f
-b file
-W for describing
which characters make up words. This option introduces the name of a
file which contains a list of characters which cannot be part of
one word; this file is called the Break file. Any character which
is not part of the Break file is a word constituent. If both options
-b and -W are specified, then -W has precedence and -b is ignored.
In normal mode, the only way to avoid newline as a break character is to
write all the break characters in the file with no newline at all, not
even at the end of the file. In ptx compatibility mode, spaces,
tabs and newlines are always considered as break characters even if not
included in the Break file.
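The Break file rule can be illustrated with a short Python sketch. This is not the program's own code; the function name split_words and the sample break set are hypothetical and only show how "every character not in the Break file is a word constituent" plays out.

```python
def split_words(text, break_chars):
    """Split text into words, treating every character in break_chars
    as a separator; any other character is a word constituent."""
    words, current = [], []
    for ch in text:
        if ch in break_chars:
            # A break character closes the word in progress, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


# Hypothetical Break file contents: space, comma, period and newline.
sample_breaks = set(" ,.\n")
print(split_words("Hello, world.\nPermuted index", sample_breaks))
```

Note that a hyphen or an apostrophe stays inside a word here unless it is listed among the break characters.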
-i file
-S option.
If not specified, there might be a default Ignore file. Default Ignore
files are not necessarily the same in normal mode or in ptx
compatibility mode. Unless changed by the local installation, there is
no default Ignore file in normal mode, and the Ignore file is
/usr/lib/eign in ptx compatibility mode. If you want to
deactivate a default Ignore file, use /dev/null instead.
-o file
-S option.
There is no default for the Only file. When there are both an
Only file and an Ignore file, a word is considered as a keyword
only if it is given in the Only file and not given in the Ignore file.
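The interaction between the Only file and the Ignore file can be sketched as follows. This is an illustration under assumptions, not the program's implementation: the function is_keyword is hypothetical, and both files are modelled as plain sets of words.

```python
def is_keyword(word, only=None, ignore=frozenset()):
    """Return True if word may be used as a keyword.

    only=None models the case where no Only file was given;
    ignore defaults to an empty set (no Ignore file).
    """
    if only is not None and word not in only:
        return False
    return word not in ignore


only_words = {"index", "keyword", "the"}
ignore_words = {"the", "of"}
for w in ("index", "the", "context"):
    print(w, is_keyword(w, only_words, ignore_words))
```

With both sets in effect, "index" qualifies, "the" is rejected by the Ignore set, and "context" is rejected because it is absent from the Only set.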
-r
-S.
Using this option, the program does not try very hard to remove
references from contexts in output, but it succeeds in doing so
when the context ends exactly at the newline. If option -r
is used with the default value of -S, or when in ptx
compatibility mode, this condition is always met and references are
completely excluded from the output contexts.
-S regexp
ptx compatibility mode or if the -r option is used, end of lines are
used; in this case, the regexp used is very simple:
\n
In normal mode and if the -r option is not used, by default, end of
sentences are used; the precise regexp is imported from GNU Emacs:
[.?!][]\"')}]*\\($\\|\t\\|  \\)[ \t\n]*
An empty REGEXP is equivalent to completely disabling end of line or end of sentence recognition. In this case, the whole file is considered to be a single big line or sentence. The user might want to disallow all truncation flag generation as well, through option -F "". On regular expression writing and usage, see section Syntax of Regular Expressions.
When the keywords happen to be near the beginning of the input line or
sentence, this often creates an unused area at the beginning of the
output context line; when the keywords happen to be near the end of the
input line or sentence, this often creates an unused area at the end of
the output context line. The program tries to fill those unused areas
by wrapping around context in them; the tail of the input line or
sentence is used to fill the unused area on the left of the output line;
the head of the input line or sentence is used to fill the unused area
on the right of the output line.
This option is not available when the program is operating in ptx
compatibility mode.
-W regexp
ptx compatibility mode, a word is anything which
ends with a space, a tab or a newline; the regexp used is [^ \t\n]+.
In normal mode, a word is a sequence of letters; the
regexp used is \w+.
An empty REGEXP is equivalent to not using this option, letting the
default apply. On regular expression writing and usage, see
section Syntax of Regular Expressions.
This option is not available when the program is operating in ptx
compatibility mode.
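The difference between the two default word regexps can be seen with Python's re module. This is only an approximation: Python's \w also matches digits and the underscore, which may not agree exactly with the regexp library used by the program. The constant names below are hypothetical.

```python
import re

# Default word regexp in ptx compatibility mode: anything up to
# a space, a tab or a newline.
PTX_MODE_WORD = re.compile(r"[^ \t\n]+")

# Default word regexp in normal mode (approximated by Python's \w).
NORMAL_MODE_WORD = re.compile(r"\w+")

line = "GNU's ptx-like tool\tv0.2"
print(PTX_MODE_WORD.findall(line))
print(NORMAL_MODE_WORD.findall(line))
```

In the first case punctuation stays attached to the word ("GNU's", "ptx-like"); in the second, punctuation splits the text into shorter words.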
Output format is mainly controlled by the -O and -T options,
described in the table below. However, when neither -O nor -T is
selected, and if we are not running in ptx compatibility mode, the
program chooses an output format suited for a dumb terminal. This is
the default format when working in normal mode.
Each keyword occurrence is output to the center of one line, surrounded
by its left and right contexts. Each field is properly justified, so
the concordance output can readily be observed. As a special feature,
if automatic references are selected by option -A and are output
before the left context, that is, if option -R is not
selected, then a colon is added after the reference; this nicely
interfaces with GNU Emacs next-error processing. In this default
output format, each white space character, like newline and tab, is
merely changed to exactly one space, with no special attempt to compress
consecutive spaces. This might change in the future. Except for those
white space characters, every other character of the underlying set of
256 characters is transmitted verbatim.
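The general shape of this default output can be sketched in Python. The sketch is deliberately minimal and is not the program's algorithm: it ignores references, truncation flags and the wrap-around filling of unused areas, and the function name kwic and the half_width parameter are hypothetical.

```python
import re


def kwic(text, half_width=20):
    """Return one line per keyword occurrence, sorted by keyword,
    with every keyword starting in the same column."""
    # Each newline or tab becomes exactly one space; consecutive
    # spaces are not compressed.
    flat = re.sub(r"[\t\n]", " ", text)
    entries = []
    for m in re.finditer(r"\w+", flat):
        # Left context is right-justified so keywords line up.
        left = flat[:m.start()][-half_width:].rjust(half_width)
        # Keyword followed by its right context, cut at half_width.
        right = flat[m.start():m.start() + half_width]
        entries.append((m.group().lower(), left + right))
    entries.sort(key=lambda e: e[0])
    return [line for _, line in entries]


for line in kwic("the cat sat", half_width=8):
    print(line)
```

Each keyword begins at column half_width, which is what makes the concordance easy to scan by eye.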
Output format is further controlled by the following options.
-g number
-w number
-R. If this option is not
selected, that is, when references are output before the left context,
the output maximum width takes into account the maximum length of all
references. If this option is selected, that is, when references are
output after the right context, the output maximum width does not take
into account the space taken by references, nor the gap that precedes
them.
-A
-A and -r are selected, then
the input reference is still read and skipped, but the automatic
reference is used at output time, overriding the input reference.
This option is not available when the program is operating in ptx
compatibility mode.
-R
-R is not used, any
reference produced by the effect of options -r or -A is
given to the far right of output lines, after the right context. In
default output format, when option -R is specified, references
are rather given to the beginning of each output line, before the left
context. For any other output format, option -R is almost
ignored, except for the fact that the width of references is not
taken into account in total output width given by -w whenever
-R is selected.
This option is not explicitly selectable when the program is operating
in ptx compatibility mode. However, in this case, it is always
implicitly selected.
-F string
-S. But there is a maximum
allowed output line width, changeable through option -w, which is
further divided into space for various output fields. When a field has
to be truncated because it cannot extend to the beginning or the end of
the current line and still fit in its allotted space, a truncation
occurs. By default, the string used is a single slash, as in -F /.
string may have more than one character, as in -F ... .
Also, in the particular case where string is empty (-F ""),
truncation flagging is disabled, and no truncation marks are appended in
this case.
This option is not available when the program is operating in ptx
compatibility mode.
-O
nroff or troff processing. Each output line will look like:
.xx "tail" "before" "keyword_and_after" "head" "ref"
so it will be possible to write an `.xx' roff macro to take care of
the output typesetting. This is the default output format when working
in ptx compatibility mode.
In this output format, each non-graphical character, like newline and
tab, is merely changed to exactly one space, with no special attempt to
compress consecutive spaces. Each quote character " is doubled
so it will be correctly processed by nroff or troff. All
characters having their eighth bit set are turned into spaces in this
version. It is expected that diacriticized characters will be
correctly expressed in roff terms if I learn how to do this. So,
let me know how to improve this special character processing.
This option is not available when the program is operating in ptx
compatibility mode. In fact, it then becomes the default and sole output
format.
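The character handling described for this output format can be sketched in Python. This is an illustration only, not the program's code; the helper names roff_field and roff_line are hypothetical, and "non-graphical" is approximated with Python's str.isprintable.

```python
def roff_field(s):
    """Prepare one field for inclusion in a quoted roff macro argument."""
    out = []
    for ch in s:
        if ord(ch) > 127 or not ch.isprintable():
            # Eighth-bit and non-graphical characters (tab, newline...)
            # each become exactly one space.
            out.append(" ")
        elif ch == '"':
            # A double quote is doubled inside a quoted argument.
            out.append('""')
        else:
            out.append(ch)
    return "".join(out)


def roff_line(tail, before, keyword_and_after, head, ref, macro="xx"):
    """Build one output line in the five-field layout shown above."""
    fields = (tail, before, keyword_and_after, head, ref)
    return "." + macro + " " + " ".join('"%s"' % roff_field(f) for f in fields)


print(roff_line("", 'say "hi"', "index\tentry", "", "f:1"))
```

The printed line can then be handed to a user-written `.xx' macro for typesetting.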
-T
\xx {tail}{before}{keyword}{after}{head}{ref}
so it will be possible to write a \xx definition to take
care of the output typesetting. Note that when references are not being
produced, that is, neither option -A nor option -r is
selected, the last parameter of each \xx call is inhibited.
In this output format, some special characters, like $, %,
&, # and _ are automatically protected with a
backslash. Curly brackets {, } are also protected with a
backslash, but also enclosed in a pair of dollar signs to force
mathematical mode. The backslash itself produces the sequence
\backslash{}. Circumflex and tilde diacritics produce the
sequence ^\{ } and ~\{ } respectively. Other
diacriticized characters of the underlying character set produce an
appropriate TeX sequence as far as possible. The other non-graphical
characters, like newline and tab, and all other characters which are
not part of ASCII, are merely changed to exactly one space, with no
special attempt to compress consecutive spaces. Let me know how to
improve this special character processing for TeX.
This option is not available when the program is operating in ptx
compatibility mode.
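Part of the protection scheme described for TeX output can be sketched in Python. This is a simplified illustration, not the program's code: the function name tex_escape is hypothetical, circumflex and tilde are passed through unchanged, and every non-ASCII character is turned into a space instead of being mapped to a TeX sequence.

```python
TEX_SIMPLE = set("$%&#_")


def tex_escape(s):
    """Protect characters that are special to TeX (simplified)."""
    out = []
    for ch in s:
        if ch in TEX_SIMPLE:
            # Protected with a single backslash.
            out.append("\\" + ch)
        elif ch in "{}":
            # Backslash-protected and forced into math mode.
            out.append("$\\" + ch + "$")
        elif ch == "\\":
            out.append("\\backslash{}")
        elif ch in "\n\t" or ord(ch) > 127:
            # Simplification: one space per such character.
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


print(tex_escape("a_b {c} 50%"))
```

A fuller version would add a table mapping diacriticized characters to their TeX sequences before falling back to a space.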
Regular expressions have a syntax in which a few characters are special constructs and the rest are ordinary. An ordinary character is a simple regular expression which matches that character and nothing else. The special characters are `$', `^', `.', `*', `+', `?', `[', `]' and `\'; no new special characters will be defined. Any other character appearing in a regular expression is ordinary, unless a `\' precedes it.
For example, `f' is not a special character, so it is ordinary, and therefore `f' is a regular expression that matches the string `f' and no other string. (It does not match the string `ff'.) Likewise, `o' is a regular expression that matches only `o'.
Any two regular expressions a and b can be concatenated. The result is a regular expression which matches a string if a matches some amount of the beginning of that string and b matches the rest of the string.
As a simple example, we can concatenate the regular expressions `f' and `o' to get the regular expression `fo', which matches only the string `fo'. Still trivial. To do something nontrivial, you need to use one of the special characters. Here is a list of them.
Note: for historical compatibility, special characters are treated as ordinary ones if they are in contexts where their special meanings make no sense. For example, `*foo' treats `*' as ordinary since there is no preceding expression on which the `*' can act. It is poor practice to depend on this behavior; better to quote the special character anyway, regardless of where it appears.
For the most part, `\' followed by any character matches only that character. However, there are several exceptions: characters which, when preceded by `\', are special constructs. Such characters are always ordinary when encountered on their own. Here is a table of `\' constructs.
Here is a complicated regexp, used by Emacs to recognize the end of a sentence together with any whitespace that follows. It is given in Lisp syntax to enable you to distinguish the spaces from the tab characters. In Lisp syntax, the string constant begins and ends with a double-quote. `\"' stands for a double-quote as part of the regexp, `\\' for a backslash as part of the regexp, `\t' for a tab and `\n' for a newline.
"[.?!][]\"')]*\\($\\|\t\\|  \\)[ \t\n]*"
This contains four parts in succession: a character set matching period, `?' or `!'; a character set matching close-brackets, quotes or parentheses, repeated any number of times; an alternative in backslash-parentheses that matches end-of-line, a tab or two spaces; and a character set matching whitespace characters, repeated any number of times.
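The same four-part structure can be tried out in Python. The pattern below is a hand translation into Python's re syntax, not something taken from the program: grouping uses (?:...) instead of backslash-parentheses, the closing bracket is escaped inside the character set, and re.MULTILINE is assumed so that $ matches at each end of line. The name SENTENCE_END is hypothetical.

```python
import re

# [.?!]        sentence-ending punctuation
# [\]"')]*     any number of closing brackets, quotes or parentheses
# (?:$|\t|  )  end of line, a tab, or two spaces
# [ \t\n]*     any further whitespace
SENTENCE_END = re.compile(r"""[.?!][\]"')]*(?:$|\t|  )[ \t\n]*""", re.MULTILINE)

text = 'Is it done?  Yes.")\nNext one'
print(SENTENCE_END.split(text))

# A period followed by a single space does not end a sentence.
print(SENTENCE_END.search("Dr. Who"))
```

Requiring two spaces, a tab or an end of line after the punctuation is what keeps abbreviations such as "Dr." from being taken as sentence ends.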
ptx compatibility mode
This section outlines the differences between this program and standard
ptx. For someone used to standard ptx, here are some
points worth noticing when not using ptx compatibility mode:
troff or nroff. By default, output is rather formatted for a dumb terminal. troff or nroff output may still be selected through option -O.
-R option is used, the maximum reference width is subtracted from the total output line width. In ptx compatibility mode, the width of references is not taken into account in the output line width computations.
ptx compatibility mode. However, standard ptx does not accept 8-bit characters, a few control characters are rejected, and the tilde ~ is condemned.
ptx compatibility mode. However, standard ptx processes only the first 200 characters in each line.
ptx compatibility mode, the break characters default to space, tab and newline only.
ptx compatibility mode. Even in ptx mode, there are some slight disposition glitches this program does not completely reproduce, even if it comes quite close.
ptx compatibility mode is not the same as in normal mode. In default installation, default Ignore files are `/usr/lib/eign' in ptx compatibility mode, and nothing in normal mode.
ptx disallows specifying both the Ignore file and the Only file at the same time. This version allows both, and specifying an Only file does not inhibit processing the Ignore file.
This software is meant to evolve towards a concordance package for GNU, which should ideally be able to tackle true, real, big concordance jobs, while staying fast and easy for little jobs. Several packages of this kind are awfully slow; I'm trying to keep speed in mind. I am interested in interactive query, but postpone burdening myself too much too soon about it.
Here is a What To Do Next list, in expected execution order.
sort has been released recently, and could evolve with gptx.