The mkid program builds the ID database. To do this it must scan each of the files included in the database. This takes some time, but once the work is done the query programs run very rapidly.
The mkid program knows how to scan a variety of files. For example, it knows how to skip over comments and strings in a C program, only picking out the identifiers used in the code.
Identifiers are not the only thing included in the database. Numbers are also scanned and included in the database indexed by their binary value. Since the same number can be written many different ways (47, 0x2f, 057 in a C program for instance), this feature allows you to find hard coded uses of constants without regard to the radix used to specify them.
All the places in this document where identifiers are written about should really mention identifiers and numbers, but that gets fairly clumsy after a while, so you should always keep in mind that numbers are included in the database as well as identifiers.
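As a quick illustration of why indexing numbers by value is useful, the following shell sketch (not part of mkid, and relying only on the standard printf utility accepting C-style constants) shows that the three spellings mentioned above denote the same number:

```sh
# Print three spellings of the same constant as decimal values.
# printf treats a leading 0x as hexadecimal and a leading 0 as octal.
values=$(printf '%d\n' 47 0x2f 057)
echo "$values"
```

Each of the three output lines is 47, which is why a single database entry can cover all three forms.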
mkid [-v] [-Sscanarg] [-aarg-file] [-] [-fout-file] [-u] [files...]
-v
-Sscanarg
The -S option is used to specify arguments to the various language scanners. See section Scanner Arguments, for details.
-aarg-file
-
A - by itself means read arguments from stdin.
-fout-file
The default database name is ID (in the current directory), but you may specify any name. The file names stored in the database are relative to the directory containing the database, so if you move the database after creating it, you may have trouble finding files unless they remain in the same relative position.
-u
The -u option updates an existing database by rescanning any files that have changed since the database was written. Unfortunately you cannot incrementally add new files to a database.
files
Scanner arguments all start with -S. Scanner arguments are used to tell mkid which language scanner to use for which files, to pass language specific options to the individual scanners, and to get some limited online help about scanner options.
Mkid usually determines which language scanner to use on a file by looking at the suffix of the file name. The suffix starts at the last `.' in a file name and includes the `.' and all remaining characters (for example the suffix of `fred.c' is `.c'). Not all files have a suffix, and not all suffixes are bound to a specific language by mkid. If mkid cannot determine what language a file is, it will use the language bound to the `.default' suffix. The plain text scanner is normally bound to `.default', but the -S option can be used to change any language bindings.
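The suffix rule above can be sketched in a few lines of shell. This is only an illustration of the rule as described, not mkid's actual code: suffix_of is a hypothetical helper, and stripping leading directory components before looking for the `.' is an assumption.

```sh
# Hypothetical helper: print the suffix that would be looked up for a
# file name. Names without a `.' fall back to the `.default' binding.
suffix_of() {
    base=${1##*/}    # assumption: ignore leading directory components
    case $base in
        *.*) printf '.%s\n' "${base##*.}" ;;   # text after the last `.'
        *)   printf '%s\n' .default ;;
    esac
}

suffix_of fred.c
suffix_of src/parse.tab.y
suffix_of Makefile
```

The three calls print `.c', `.y', and `.default' respectively. A suffix that is present but not bound to any language would also end up using the `.default' binding, which this sketch does not model.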
There are several different forms for scanner options:
-S.<suffix>=<language>
Mkid determines which language scanner to use on a file by examining the file name suffix. The `.' is part of the suffix and must be specified in this form of the -S option. For example `-S.y=c' tells mkid to use the `c' language scanner for all files ending in the `.y' suffix.
-S.<suffix>=?
Mkid has several built in suffixes it already recognizes. Passing a `?' will cause it to print the language it will use to scan files with that suffix.
-S?=<language>
-S?=?
mkid
.
-S<language>-<arg>
This form of the -S option is used to pass arbitrary arguments to the language scanners.
-S<language>?
-S<new language>/<builtin language>/<filter command>
If you run mkid -S?=? you will find bindings for a number of languages; unfortunately pascal, though mentioned in the list, is not actually supported. The supported languages are documented below (1).
The C scanner is probably the most popular. It scans identifiers out of C programs, skipping over comments and strings in the process. The normal `.c' and `.h' suffixes are automatically recognized as C language, as well as the more obscure `.y' (yacc) and `.l' (lex) suffixes.
The -S options recognized by the C scanner are:
-Sc-s<character>
$ in identifiers, so you could say -Sc-s$ to accept that dialect).
-Sc-u
-Sc+u
The plain text scanner is designed for scanning documents. This is typically the scanner used when adding custom scanners, and several custom scanners are built in to mkid and defined in terms of filters and the text scanner. A troff scanner runs deroff over the file then feeds the result to the text scanner. A compressed man page scanner runs pcat piped into col -b, and a TeX scanner runs detex.
Options:
-Stext+a<character>
-Stext-a<character>
-Stext+s<character>
-Stext-s<character>
Assemblers come in several flavors, so there are several options to control scanning of assembly code:
-Sasm-c<character>
-Sasm+u
-Sasm-u
-Sasm+a<character>
-Sasm-a<character>
-Sasm+p
-Sasm-p
-Sasm+C
-Sasm-C
There are two ways to add new scanners to mkid. The first is to modify the code in `getscan.c' and add a new `scan-*.c' file with the code for your scanner. This is not too hard, but it requires relinking and installing a new version of mkid, which might be inconvenient, and would lead to the proliferation of mkid versions.
The second technique uses the -S<lang>/<lang>/<filter> form of the -S option to specify a new language scanner. In this form the first language is the name of the new language to be defined, the second language is the name of an existing language scanner to be invoked on the output of the filter command specified as the third component of the -S option.
The filter is an arbitrary shell command. Somewhere in the filter string, a %s should occur. This %s is replaced by the name of the source file being scanned, the shell command is invoked, and whatever comes out on stdout is scanned using the builtin scanner.
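To make the %s substitution concrete, here is a minimal shell sketch that mimics the steps just described. It is not how mkid is implemented internally; run_filter is a hypothetical helper, and the tr command merely stands in for a real filter.

```sh
# Hypothetical helper: replace %s in a filter string with a file name,
# run the result through the shell, and pass along its stdout.
run_filter() {
    filter=$1
    file=$2
    # Using the filter as a printf format substitutes the %s. This
    # simple approach assumes the filter holds no other % or backslash
    # sequences and that the file name needs no quoting.
    cmd=$(printf "$filter" "$file")
    sh -c "$cmd"
}

tmp=$(mktemp)
printf 'Hello World\n' > "$tmp"
out=$(run_filter 'tr A-Z a-z < %s' "$tmp")
echo "$out"
rm -f "$tmp"
```

Here the captured output is the lowercased file contents, which is the text a builtin scanner would then see.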
For example, no scanner is provided for texinfo files (like this one). If I wished to index the contents of this file, but avoid indexing the texinfo directives, I would need a filter that stripped out the texinfo directives, but left the remainder of the file intact. I could then use the plain text scanner on the remainder. A quick way to specify this might be:
'-Stexinfo/text/sed s,@[a-z]*,,g < %s'
This defines a new language scanner (texinfo) defined in terms of a sed command to strip out texinfo directives (at signs followed by letters). Once the directives are stripped, the remaining text is run through the plain text scanner.
This is just an example; to do a better job I would actually need to delete some lines (such as those beginning with @end) as well as deleting the @ directives embedded in the text.
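The effect of the sed filter can be checked on its own, without mkid. The sketch below runs the filter from the text over a small invented sample, then tries one possible refinement that first deletes lines beginning with @end. The sample text and the refined command are illustrations only, not a complete texinfo filter.

```sh
# A small stand-in for a texinfo file (contents invented for this demo).
sample='@chapter Overview
Run @code{mkid} first.
@example
mkid *.c
@end example'

# The filter from the text: delete each @ followed by lowercase letters.
simple=$(printf '%s\n' "$sample" | sed 's,@[a-z]*,,g')

# One possible refinement: drop lines starting with @end beforehand.
better=$(printf '%s\n' "$sample" | sed -e '/^@end/d' -e 's,@[a-z]*,,g')

printf '%s\n' "$simple"
printf '%s\n' "$better"
```

With the simple filter, the last line of the sample survives as a stray " example", which would then be indexed as a word. The refined version removes that line entirely.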
The simplest example of mkid is something like:
mkid *.[chy]
This will build an ID database indexing all the identifiers and numbers in the `.c', `.h', and `.y' files in the current directory. Because those suffixes are already known to mkid as C language files, no other special arguments are required.
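Note that the *.[chy] pattern is expanded by the shell before mkid ever sees it. The following sketch, which uses invented file names in a scratch directory and does not run mkid, shows which names such a pattern selects:

```sh
# Build a scratch directory with a few empty files (names invented here).
dir=$(mktemp -d)
touch "$dir/main.c" "$dir/defs.h" "$dir/gram.y" "$dir/notes.txt"

# Expand the same pattern inside that directory; the subshell keeps the
# caller's working directory unchanged.
matched=$(cd "$dir" && echo *.[chy])
echo "$matched"

rm -rf "$dir"
```

Only the `.c', `.h', and `.y' files are listed; notes.txt is never passed along.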
From a simple example, let's go to a more complex one. Suppose you want to build a database indexing the contents of all the man pages. Since mkid already knows how to deal with `.z' files, let's assume your system is using the compress program to store compressed cattable versions of the man pages. The compress program creates files with a .Z suffix, so mkid will have to be told how to scan `.Z' files. The following code shows how to combine the find command with the special scanner arguments to mkid to generate the required ID database:
cd /usr/catman
find . -name '*.Z' -print | mkid '-Sman/text/uncompress -c < %s' -S.Z=man -
This example first switches to the `/usr/catman' directory where the compressed man pages are stored. The find command then finds all the `.Z' files under that directory and prints their names. This list is piped into the mkid program. The - argument by itself (at the end of the line) tells mkid to read arguments (in this case the list of file names) from stdin. The first -S argument defines a new language (man) in terms of the uncompress utility and the existing text scanner. The second -S argument tells mkid to treat all `.Z' files as language man. In practice, you might find the mkid arguments need to be even more complex, something like:
mkid '-Sman/text/uncompress -c < %s | col -b' -S.Z=man -
This will take the additional step of getting rid of any underlining and backspacing which might be present in the compressed man pages.