Glimpseindex provides three indexing options: a tiny index (2-3% of the total size of all files), a small index (7-8%), and a medium-size index (20-30%). Search times are normally better with larger indexes. To index all your files, run glimpseindex ~ for a tiny index (where ~ stands for the home directory), glimpseindex -o ~ for a small index, and glimpseindex -b ~ for a medium-size index.
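The percentages above give a quick way to estimate how much disk space each kind of index needs. The following is a minimal sketch in Bourne shell; estimate_index_mb is a hypothetical helper (not part of glimpse), and the result is only as accurate as the quoted percentages.

```sh
# Rough index-size estimate from the percentages quoted above.
# estimate_index_mb is a hypothetical helper, not part of glimpse.
estimate_index_mb() {
    total_mb=$1   # total size of the files to be indexed, in MB
    echo "tiny (no option): $((total_mb * 2 / 100))-$((total_mb * 3 / 100)) MB"
    echo "small (-o): $((total_mb * 7 / 100))-$((total_mb * 8 / 100)) MB"
    echo "medium (-b): $((total_mb * 20 / 100))-$((total_mb * 30 / 100)) MB"
}

estimate_index_mb 500
```

For 500MB of files this reports roughly 10-15MB, 35-40MB, and 100-150MB respectively.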
Mail glimpse-request@cs.arizona.edu to be added to the glimpse mailing list. Mail glimpse@cs.arizona.edu to report bugs, ask questions, discuss tricks for using glimpse, etc. (this is a moderated mailing list with very little traffic, mostly announcements). An HTML version of these manual pages can be found at http://glimpse.cs.arizona.edu/glimpseindexhelp.html. See also the glimpse home pages at http://glimpse.cs.arizona.edu/
glimpseindex -o ~
The index consists of several files (described in detail below), all with the prefix .glimpse_, stored in the user's home directory (unless otherwise specified with the -H option). Files with one of the following suffixes are not indexed: ".o", ".gz", ".Z", ".z", ".hqx", ".zip", ".tar" (unless the -z option is used, see below). In addition, glimpseindex attempts to determine whether a file is a text file and does not index files that it thinks are not text files. Numbers are not indexed unless the -n option is used. It is possible to prevent specified files from being indexed by adding their names to the .glimpse_exclude file (described below).
The -o option builds a larger index than the default (typically about 7-8% vs. 2-3% without -o), allowing for a faster search (1-5 times faster). The -b option builds an even larger index and allows an even faster search some of the time (-b is helpful mostly when large files are present).
There is an incremental indexing option -f, which updates an existing index by determining which files have been created or modified since the index was built and adding them to the index (see -f). Glimpseindex is reasonably fast, taking about 20 minutes to index 100MB from scratch (on a SUN Sparc 5) and 2-4 minutes to update an existing index. (Your mileage may vary.) It is also possible to extend the index by adding a specific file (the -a option).
Once an index is built, searching for a pattern is as easy as saying
glimpse pattern
(See man glimpse for all glimpse's options and features.)
Glimpse does not automatically index files. You have to tell it to do it. This can be done manually, but a better way is to set it to run every night. It is probably a good idea to run glimpseindex manually for the first time to be sure it works properly. The following is a simple script to run glimpseindex every night. We assume that this script is stored in a file called glimpse.script:
glimpseindex -o -t -w 5000 ~ >& .glimpse_out
at -m 0300 glimpse.script
(It might be useful to collect all the outputs of glimpseindex by changing >& to >>& so that the file .glimpse_out maintains a history. In this case the file must be created before the first time >>& is used. If you use ksh, replace '>& .glimpse_out' with '> .glimpse_out 2>&1'.)
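For users of sh or ksh, the indexing line of the script above might be written as follows. This is a sketch, not a tested replacement: INDEXER and LOG are hypothetical variables introduced here so that the command and the log file can be overridden, and the at rescheduling line is left out.

```sh
#!/bin/sh
# Bourne-shell sketch of the indexing step in glimpse.script.
# INDEXER and LOG are hypothetical variables, not glimpse settings.
INDEXER=${INDEXER:-glimpseindex}
LOG=${LOG:-$HOME/.glimpse_out}

run_nightly_index() {
    # Create the log if it does not exist yet, then append to it.
    # '>> file 2>&1' is the sh/ksh counterpart of csh's '>>& file'.
    touch "$LOG"
    "$INDEXER" -o -t -w 5000 "$HOME" >> "$LOG" 2>&1
}

# Uncomment to run the indexing step when this file is executed:
# run_nightly_index
```

Appending keeps a history of every run in one file, at the cost of a log that grows until it is trimmed by hand.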
Glimpseindex stores the names of all the files that it indexed in the file .glimpse_filenames. Each file is listed by its full path name as obtained at the time the files were indexed. For example, /usr1/udi/file1. Glimpse uses this full name when it performs the search, so the name must match the current name. This may become a problem when the indexing and the search are done from different machines (e.g., through NFS), which may cause the path names to be different. For example, /tmp_mnt/R/xxx/xxx/usr1/udi/file1. (The same is true for several other .glimpse files. See below.)
Glimpseindex does not follow symbolic links unless they are explicitly included in the .glimpse_include file (described below).
Glimpseindex makes an effort to identify non-text files such as binary files, compressed files, uuencoded files, postscript files, binhex files, etc. These files are automatically not indexed. In addition, all files whose names end with `.o', `.gz', `.Z', `.z', `.hqx', `.zip', or `.tar' will not be indexed (unless they are specifically included in .glimpse_include - see below).
The options for glimpseindex are as follows:
-z: If the file .glimpse_filters contains a line with the pattern "*.Z" and the command "uncompress <" (separated by tabs), then before indexing any file that matches the pattern "*.Z" (same syntax as the one for .glimpse_exclude) the command listed is executed first (assuming input is from stdin, which is why uncompress needs <) and its output (assuming it goes to stdout) is indexed. The file itself is not changed (i.e., it stays compressed). Then if glimpse -z is used, the same program is used on these files on the fly. Any program can be used (we run 'exec'). For example, one can filter out parts of files that should not be indexed. Glimpseindex tries to apply all filters in .glimpse_filters in the order they are given. For example, if you want to uncompress a file and then extract some part of it, put the decompression command (the example above) first and then another line that specifies the extraction. Note that this can slow down the search because the filters need to be run before files are searched.
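Because the fields of a filter line must be separated by tabs, it is easy to get the file wrong in an editor that expands tabs to spaces. The sketch below writes the "*.Z" filter with printf so that the tab is explicit. INDEX_DIR and write_filters are hypothetical names, and the two-field layout (pattern, tab, command) is assumed from the description above.

```sh
# Write a .glimpse_filters file containing one filter line.
# INDEX_DIR is a hypothetical variable naming the directory that holds the index.
INDEX_DIR=${INDEX_DIR:-$HOME}

write_filters() {
    # Fields are separated by a tab (\t): file-name pattern, then the command.
    printf '*.Z\tuncompress <\n' > "$INDEX_DIR/.glimpse_filters"
}
```

After calling write_filters, the index would be rebuilt with the -z option so that the filter takes effect.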
All files used by glimpse are located at the directory(ies) where the index(es) is (are) stored and have .glimpse_ as a prefix. The first two files (.glimpse_exclude and .glimpse_include) are optionally supplied by the user. The other files are built and read by glimpse.
.glimpse_exclude
contains a list of files that glimpseindex is explicitly told to ignore. In general, the syntax of .glimpse_exclude/include is the same as that of agrep (or any other grep). The lines in the .glimpse_exclude file are matched against the file names, and if they match, the files are excluded. Notice that agrep matches parts of the string; e.g., agrep /ftp/pub will match /home/ftp/pub and /ftp/pub/whatever. So, if you want to exclude /ftp/pub/core, you just list it, as is, in the .glimpse_exclude file. If you put "/home/ftp/pub/cdrom" in .glimpse_exclude, every file name that matches that string will be excluded, meaning all files below it. You can use ^ to indicate the beginning of a file name, and $ to indicate the end of one, and you can use * and ? in the usual way. For example, /ftp/*html will exclude /ftp/pub/foo.html, but will also exclude /home/ftp/pub/html/whatever; if you want to exclude only files that start with /ftp and end with html, use ^/ftp*html$ instead. Notice that putting a * at the beginning or at the end is redundant (in fact, in this case glimpseindex will remove the * when it does the indexing). No other meta characters are allowed in .glimpse_exclude (e.g., don't use .* or # or |). Lines with * or ? must have no more than 30 characters. Notice that, although the index itself will not be indexed, the list of file names (.glimpse_filenames) will be indexed unless it is explicitly listed in .glimpse_exclude.
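The matching rules for exclusion lines (match anywhere unless anchored with ^ or $, with * and ? as wildcards) can be tried out before indexing. The function below is only an approximation written with shell patterns, not the matcher glimpseindex actually uses; is_excluded is a hypothetical name, and other meta characters are not handled.

```sh
# is_excluded PATTERN PATH: succeed if PATH would match PATTERN under the
# rules described above. A hypothetical helper, not glimpseindex's own code.
is_excluded() {
    pat=$1
    path=$2
    case $pat in
        '^'*) pat=${pat#'^'} ;;   # ^ anchors the match at the start
        *)    pat="*$pat" ;;      # otherwise the match may start anywhere
    esac
    case $pat in
        *'$') pat=${pat%'$'} ;;   # $ anchors the match at the end
        *)    pat="$pat*" ;;      # otherwise the match may end anywhere
    esac
    # Unquoted $pat is used as a shell pattern, so * and ? act as wildcards.
    case $path in
        $pat) return 0 ;;
    esac
    return 1
}
```

For example, is_excluded '/ftp/*html' /home/ftp/pub/html/whatever succeeds, while is_excluded '^/ftp*html$' /home/ftp/pub/html/whatever fails.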
.glimpse_filters
See the description above for the -z option.
.glimpse_include
contains a list of files that glimpseindex is explicitly told to include in the index even though they may look like non-text files. Symbolic links are followed by glimpseindex only if they are specifically included here. The syntax is the same as the one for .glimpse_exclude (see there). If a file is in both .glimpse_exclude and .glimpse_include it will be excluded unless -i is used.
.glimpse_filenames
contains the list of all indexed file names, one per line. This is an ASCII file that can also be searched with agrep to look up a file name, which gives a fast substitute for the find command. For example,
glimpse 'count#\.c$' ~/.glimpse_filenames
will output the names of all (indexed) .c files that have 'count' in their name (including anywhere on the path from the index). Setting the following alias in the .login file may be useful:
alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'
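The alias above is for csh. Under sh or ksh the same thing can be done with a function; the following is a sketch. GLIMPSE and FILENAMES are hypothetical variables introduced here so that the command and the list file can be overridden.

```sh
# Bourne-shell counterpart of the csh findfile alias.
# GLIMPSE and FILENAMES are hypothetical variables, not glimpse settings.
GLIMPSE=${GLIMPSE:-glimpse}
FILENAMES=${FILENAMES:-$HOME/.glimpse_filenames}

findfile() {
    # The pattern is the first argument; -h is passed exactly as in the alias.
    "$GLIMPSE" -h "$1" "$FILENAMES"
}
```

It would typically go in .profile, and is then used as findfile 'count#\.c$'.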
.glimpse_index
contains the index. The index consists of lines, each starting with a word followed by a list of block numbers (unless the -o or -b options are used, in which case each word is followed by an offset into the file .glimpse_partitions where all pointers are kept). The block/file numbers are stored in binary form, so this is not an ASCII file.
.glimpse_messages
contains the output of the -w option (see above).
.glimpse_partitions
contains the partition of the indexed space into blocks and, when the index is built with the -o or -b options, some part of the index. This file is used internally by glimpse and it is a non-ASCII file.
.glimpse_statistics
contains some statistics about the makeup of the index. Useful for some advanced applications and customization of glimpse.
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
Any string can serve as an attribute name. Glimpse "pattern;type=Directory-Listing" will search for "pattern" only in files whose type is "Directory-Listing". The file itself is considered to be one "object" and its name/url appears as the first attribute with an "@" prefix; e.g., @FILE { http://xxx... } The scope of Boolean operations changes from records (lines) to whole files when structured queries are used in glimpse (since individual query terms can look at different attributes and they may not be "covered" by the record/line). Note that glimpse can only search for patterns in the value parts of the SOIF file: there are some attributes (like the TTL, MD5, etc.) that are interpreted by Harvest's internal routines. See http://harvest.cs.colorado.edu/harvest/user-manual/ for more detailed information on the SOIF format.
U. Manber and S. Wu, "GLIMPSE: A Tool to Search Through Entire File Systems," Usenix Winter 1994 Technical Conference, San Francisco (January 1994), pp. 23-32. Also, Technical Report #TR 93-34, Dept. of Computer Science, University of Arizona, October 1993 (a postscript file is available by anonymous ftp at cs.arizona.edu:reports/1993/TR93-34.ps).
S. Wu and U. Manber, "Fast Text Searching Allowing Errors," Communications of the ACM 35 (October 1992), pp. 83-91.
The index of glimpse is word based. A pattern that contains more than one word cannot be found in the index. The way glimpse overcomes this weakness is by splitting any multi-word pattern into its set of words and looking for all of them in the index. For example, glimpse 'linear programming' will first consult the index to find all files containing both linear and programming, and then apply agrep to find the combined pattern. This is usually an effective solution, but it can be slow for cases where both words are very common, but their combination is not.
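The two-step strategy described above can be imitated with ordinary grep, which may help in seeing why common words make it slow. In this sketch grep -l stands in for the index lookup (glimpse itself consults its index and then runs agrep); two_phase_search is a hypothetical name, and the -r and -H options are assumed to be available, as they are in GNU and BSD grep.

```sh
# two_phase_search DIR WORD1 WORD2
# Phase 1 stands in for the index lookup: find files containing both words.
# Phase 2 searches only those files for the combined pattern.
two_phase_search() {
    dir=$1
    w1=$2
    w2=$3
    grep -rliw -- "$w1" "$dir" | while IFS= read -r f; do
        if grep -qiw -- "$w2" "$f"; then
            grep -H -- "$w1 $w2" "$f"
        fi
    done
}
```

Every file that survives phase 1 is read again in phase 2, so when both words are common the second phase does most of the work even if the phrase itself is rare.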
The index of glimpse stores all patterns in lower case. When glimpse searches the index it first converts all patterns to lower case, finds the appropriate files, and then searches the actual files using the original patterns. So, for example, glimpse ABCXYZ will first find all files containing abcxyz in any combination of lower and upper cases, and then search these files directly, so only the right cases will be found. One problem with this approach is discovering misspellings that are caused by wrong cases. For example, glimpse -B abcXYZ will first search the index for the best match to abcxyz (because the pattern is converted to lower case); it will find that there are matches with no errors, and will go to those files to search them directly, this time with the original upper cases. If the closest match is, say, AbcXYZ, glimpse may miss it, because it doesn't expect an error. Another problem is speed. If you search for "ATT", it will look in the index for "att". Unless you use -w to match the whole word, glimpse may have to search all files containing, for example, "Seattle", which has "att" in it.
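The effect of whole-word matching on the number of files that must be searched can be seen with a small experiment. Here grep -i stands in for the lower-case index lookup; this is an illustration only, and candidate_files is a hypothetical name.

```sh
# candidate_files PATTERN DIR [-w]
# Lists the files a case-folded lookup would select. Pass -w as the third
# argument to restrict the lookup to whole words.
candidate_files() {
    grep -rli ${3:-} -- "$1" "$2" | sort
}
```

With one file containing "ATT" and another containing "Seattle", candidate_files att DIR lists both, while candidate_files att DIR -w lists only the first.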
There is no size limit for simple patterns and simple patterns with Boolean AND. More complicated patterns are currently limited to approximately 30 characters. Lines are limited to 1024 characters. Records are limited to 48K, and may be truncated if they are larger than that. The limit of record length can be changed by modifying the parameter Max_record in agrep.h.
Each line in .glimpse_exclude or .glimpse_include that contains a * or a ? must not exceed 30 characters in length.
Glimpseindex does not index words longer than 64 characters.
A medium-size index (-b) may actually lead to slower query times if the files are all very small.
Under -b, it may be impossible to make the stop list empty. Glimpseindex uses the "sort" routine, and all occurrences of a word appear at some point on one line. Sort limits the size of lines it can handle (the value depends on the platform; ours is 16KB). If the lines are too big, the word is added to the stop list.
Please send bug reports or comments to glimpse@cs.arizona.edu.