A Library of awk Functions

This chapter presents a library of useful awk functions. The sample programs presented later (see section Practical awk Programs) use these functions. The functions are presented here in a progression from simple to complex.

The section Extracting Programs from Texinfo Source Files presents a program that you can use to extract the source code for these example library functions and programs from the Texinfo source for this book. (This has already been done as part of the gawk distribution.)

If you have written one or more useful, general purpose awk functions, and would like to contribute them for a subsequent edition of this book, please contact the author. See section Reporting Problems and Bugs, for information on doing this. Don't just send code, as you will be required to either place your code in the public domain, publish it under the GPL (see section GNU GENERAL PUBLIC LICENSE), or assign the copyright in it to the Free Software Foundation.

Simulating gawk-specific Features

The programs in this chapter and in section Practical awk Programs, freely use features that are specific to gawk. This section briefly discusses how you can rewrite these programs for different implementations of awk.

Diagnostic error messages are sent to `/dev/stderr'. Use `| "cat 1>&2"' instead of `> "/dev/stderr"', if your system does not have a `/dev/stderr', or if you cannot use gawk.
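
As a minimal sketch of the portable alternative, the helper below sends its message to standard error through a pipe to cat. The function name warn is hypothetical; it is not part of the library presented in this chapter.

```awk
# warn -- print a message on standard error without /dev/stderr
# (warn is a hypothetical helper name used only for this sketch)
function warn(msg)
{
    print msg | "cat 1>&2"
}

BEGIN {
    warn("warning: something looks wrong")
    # close the pipe so that cat flushes the message promptly
    close("cat 1>&2")
}
```

Closing the pipe after each message keeps the diagnostic from being delayed until the program exits, at the cost of starting a new cat process for each message.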

A number of programs use nextfile (see section The nextfile Statement) to skip any remaining input in the input file. The section Implementing nextfile as a Function shows you how to write a function that will do the same thing.

Finally, some of the programs choose to ignore upper-case and lower-case distinctions in their input. They do this by assigning one to IGNORECASE. You can achieve the same effect by adding the following rule to the beginning of the program:

# ignore case
{ $0 = tolower($0) }

Also, verify that all regexp and string constants used in comparisons only use lower-case letters.
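
The following sketch shows why the constants must be lower-case once the input has been folded. The helper name matches_nocase is hypothetical and is used here only so that the idea can be tried from a BEGIN rule without any input data.

```awk
# matches_nocase -- fold the string to lower case, then match
# (hypothetical helper; the regexp must itself be all lower-case)
function matches_nocase(str, regexp)
{
    return (tolower(str) ~ regexp)
}

BEGIN {
    print matches_nocase("Hello, World", "world")    # prints 1
    print matches_nocase("Hello, World", "World")    # prints 0
}
```

The second call fails because the folded text no longer contains an upper-case `W', which is exactly the mistake the advice above is meant to prevent.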

Implementing nextfile as a Function

The nextfile statement presented in section The nextfile Statement, is a gawk-specific extension. It is not available in other implementations of awk. This section shows two versions of a nextfile function that you can use to simulate gawk's nextfile statement if you cannot use gawk.

Here is a first attempt at writing a nextfile function.

# nextfile -- skip remaining records in current file

# this should be read in before the "main" awk program

function nextfile()    { _abandon_ = FILENAME; next }

_abandon_ == FILENAME  { next }

This file should be included before the main program, because it supplies a rule that must be executed first. This rule compares the current data file's name (which is always in the FILENAME variable) to a private variable named _abandon_. If the file name matches, then the action part of the rule executes a next statement, to go on to the next record. (The use of `_' in the variable name is a convention. It is discussed more fully in section Naming Library Function Global Variables.)

The use of the next statement effectively creates a loop that reads all the records from the current data file. Eventually, the end of the file is reached, and a new data file is opened, changing the value of FILENAME. Once this happens, the comparison of _abandon_ to FILENAME fails, and execution continues with the first rule of the "real" program.

The nextfile function itself simply sets the value of _abandon_ and then executes a next statement to start the loop going.(16)

This initial version has a subtle problem. What happens if the same data file is listed twice on the command line, one right after the other, or even with just a variable assignment between the two occurrences of the file name?

In such a case, this code will skip right through the file a second time, even though it should stop when it gets to the end of the first occurrence. Here is a second version of nextfile that remedies this problem.

# nextfile -- skip remaining records in current file
# correctly handle successive occurrences of the same file
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# May, 1993

# this should be read in before the "main" awk program

function nextfile()   { _abandon_ = FILENAME; next }

_abandon_ == FILENAME {
      if (FNR == 1)
          _abandon_ = ""
      else
          next
}

The nextfile function has not changed. It sets _abandon_ equal to the current file name and then executes a next statement. The next statement reads the next record and increments FNR, so FNR is guaranteed to have a value of at least two. However, if nextfile is called for the last record in the file, then awk will close the current data file and move on to the next one. Upon doing so, FILENAME will be set to the name of the new file, and FNR will be reset to one. If this next file is the same as the previous one, _abandon_ will still be equal to FILENAME. However, FNR will be equal to one, telling us that this is a new occurrence of the file, and not the one we were reading when the nextfile function was executed. In that case, _abandon_ is reset to the empty string, so that further executions of this rule will fail (until the next time that nextfile is called).

If FNR is not one, then we are still in the original data file, and the program executes a next statement to skip through it.

An important question to ask at this point is: "Given that the functionality of nextfile can be provided with a library file, why is it built into gawk?" The question matters because adding features for little reason leads to larger, slower programs that are harder to maintain.

The answer is that building nextfile into gawk provides significant gains in efficiency. If the nextfile function is executed at the beginning of a large data file, awk still has to scan the entire file, splitting it up into records, just to skip over it. The built-in nextfile can simply close the file immediately and proceed to the next one, saving a lot of time. This is particularly important in awk, since awk programs are generally I/O bound (i.e. they spend most of their time doing input and output, instead of performing computations).

Assertions

When writing large programs, it is often useful to be able to know that a condition or set of conditions is true. Before proceeding with a particular computation, you make a statement about what you believe to be the case. Such a statement is known as an "assertion." The C language provides an <assert.h> header file and corresponding assert macro that the programmer can use to make assertions. If an assertion fails, the assert macro arranges to print a diagnostic message describing the condition that should have been true but was not, and then it kills the program. In C, using assert looks like this:

#include <assert.h>

int myfunc(int a, double b)
{
     assert(a <= 5 && b >= 17);
     ...
}

If the assertion failed, the program would print a message similar to this:

prog.c:5: assertion failed: a <= 5 && b >= 17

The ANSI C language makes it possible to turn the condition into a string for use in printing the diagnostic message. This is not possible in awk, so this assert function also requires a string version of the condition that is being tested.

# assert -- assert that a condition is true. Otherwise exit.
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# May, 1993

function assert(condition, string)
{
    if (! condition) {
        printf("%s:%d: assertion failed: %s\n",
            FILENAME, FNR, string) > "/dev/stderr"
        _assert_exit = 1
        exit 1
    }
}

END {
    if (_assert_exit)
        exit 1
}

The assert function tests the condition parameter. If it is false, it prints a message to standard error, using the string parameter to describe the failed condition. It then sets the variable _assert_exit to one, and executes the exit statement. The exit statement jumps to the END rule. If the END rule finds _assert_exit to be true, then it exits immediately.

The purpose of the END rule with its test is to keep any other END rules from running. When an assertion fails, the program should exit immediately. If no assertions fail, then _assert_exit will still be false when the END rule is run normally, and the rest of the program's END rules will execute. For all of this to work correctly, `assert.awk' must be the first source file read by awk.

You would use this function in your programs this way:

function myfunc(a, b)
{
     assert(a <= 5 && b >= 17, "a <= 5 && b >= 17")
     ...
}

If the assertion failed, you would see a message like this:

mydata:1357: assertion failed: a <= 5 && b >= 17

There is a problem with this version of assert that it may not be possible to work around. An END rule is automatically added to the program calling assert. Normally, if a program consists of just a BEGIN rule, the input files and/or standard input are not read. However, now that the program has an END rule, awk will attempt to read the input data files, or standard input (see section Startup and Cleanup Actions), most likely causing the program to hang, waiting for input.

Just a note on programming style. You may have noticed that some of the BEGIN and END rules in this chapter use backslash continuation, with the open brace on a line by itself. This is so that they more closely resemble the way functions are written. Many of the examples in this chapter and the next one use this style. You can decide for yourself if you like writing your BEGIN and END rules this way, or not.

Translating Between Characters and Numbers

One commercial implementation of awk supplies a built-in function, ord, which takes a character and returns the numeric value for that character in the machine's character set. If the string passed to ord has more than one character, only the first one is used.

The inverse of this function is chr (from the function of the same name in Pascal), which takes a number and returns the corresponding character.

Both functions can be written very nicely in awk; there is no real reason to build them into the awk interpreter.

# ord.awk -- do ord and chr
#
# Global identifiers:
#    _ord_:        numerical values indexed by characters
#    _ord_init:    function to initialize _ord_
#
# Arnold Robbins
# arnold@gnu.ai.mit.edu
# Public Domain
# 16 January, 1992
# 20 July, 1992, revised

BEGIN    { _ord_init() }

function _ord_init(    low, high, i, t)
{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") {    # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        # ascii, mark parity
        low = 128
        high = 255
    } else {        # ebcdic(!)
        low = 0
        high = 255
    }

    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}

Some explanation of the numbers used by _ord_init is worthwhile. The most prominent character set in use today is ASCII. Although an eight-bit byte can hold 256 distinct values (from zero to 255), ASCII only defines characters that use the values from zero to 127.(17) At least one computer manufacturer that we know of uses ASCII, but with mark parity, meaning that the leftmost bit in the byte is always one. What this means is that on those systems, characters have numeric values from 128 to 255. Finally, large mainframe systems use the EBCDIC character set, which uses all 256 values. While there are other character sets in use on some older systems, they are not really worth worrying about.

function ord(str,    c)
{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
}

function chr(c)
{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
}

#### test code ####
# BEGIN    \
# {
#    for (;;) {
#        printf("enter a character: ")
#        if (getline var <= 0)
#            break
#        printf("ord(%s) = %d\n", var, ord(var))
#    }
# }

An obvious improvement to these functions would be to move the code for the _ord_init function into the body of the BEGIN rule. It was written this way initially for ease of development.

There is a "test program" in a BEGIN rule, for testing the function. It is commented out for production use.
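
For a quick, non-interactive check, the following standalone sketch restates the two functions in reduced form and converts in both directions. It assumes a plain ASCII system and only fills the table for the values 1 through 127, so it is a simplification of the library file above, not a replacement for it.

```awk
# Reduced ASCII-only sketch of ord and chr (assumes an ASCII system)
function _ord_init(    i)
{
    for (i = 1; i <= 127; i++)
        _ord_[sprintf("%c", i)] = i
}

function ord(str)
{
    # only the first character is of interest
    return _ord_[substr(str, 1, 1)]
}

function chr(c)
{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
}

BEGIN {
    _ord_init()
    printf("ord(\"A\") = %d\n", ord("A"))          # ord("A") = 65
    printf("chr(97) = %s\n", chr(97))              # chr(97) = a
    printf("round trip: %s\n", chr(ord("z")))      # round trip: z
}
```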

Merging an Array Into a String

When doing string processing, it is often useful to be able to join all the strings in an array into one long string. The following function, join, accomplishes this task. It is used later in several of the application programs (see section Practical awk Programs).

Good function design is important; this function needs to be general, but it should also have a reasonable default behavior. It is called with an array and the beginning and ending indices of the elements in the array to be merged. This assumes that the array indices are numeric--a reasonable assumption since the array was likely created with split (see section Built-in Functions for String Manipulation).

# join.awk -- join an array into a string
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# May 1993

function join(array, start, end, sep,    result, i)
{
    if (sep == "")
       sep = " "
    else if (sep == SUBSEP) # magic value
       sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}

An optional additional argument is the separator to use when joining the strings back together. If the caller supplies a non-empty value, join uses it. If it is not supplied, it will have a null value. In this case, join uses a single blank as a default separator for the strings. If the value is equal to SUBSEP, then join joins the strings with no separator between them. SUBSEP serves as a "magic" value to indicate that there should be no separation between the component strings.
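
The three separator cases can be seen side by side in a short test. The sketch below repeats the join function so that it can be run on its own; the sample string and the array name parts are illustrative only.

```awk
function join(array, start, end, sep,    result, i)
{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP)    # magic value: no separator
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}

BEGIN {
    n = split("usr local bin", parts, " ")
    print join(parts, 1, n)            # usr local bin
    print join(parts, 1, n, "/")       # usr/local/bin
    print join(parts, 1, n, SUBSEP)    # usrlocalbin
}
```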

It would be nice if awk had an assignment operator for concatenation. The lack of an explicit operator for concatenation makes string operations more difficult than they really need to be.

Turning Dates Into Timestamps

The systime function built in to gawk returns the current time of day as a timestamp in "seconds since the Epoch." This timestamp can be converted into a printable date of almost infinitely variable format using the built-in strftime function. (For more information on systime and strftime, see section Functions for Dealing with Time Stamps.)

An interesting but difficult problem is to convert a readable representation of a date back into a timestamp. The ANSI C library provides a mktime function that does the basic job, converting a canonical representation of a date into a timestamp.

It would appear at first glance that gawk would have to supply a mktime built-in function that was simply a "hook" to the C language version. In fact though, mktime can be implemented entirely in awk.

Here is a version of mktime for awk. It takes a simple representation of the date and time, and converts it into a timestamp.

The code is presented here intermixed with explanatory prose. In section Extracting Programs from Texinfo Source Files, you will see how the Texinfo source file for this book can be processed to extract the code into a single source file.

The program begins with a descriptive comment and a BEGIN rule that initializes a table _tm_months. This table is a two-dimensional array that has the lengths of the months. The first index is zero for regular years, and one for leap years. The values are the same for all the months in both kinds of years, except for February; thus the use of multiple assignment.

# mktime.awk -- convert a canonical date representation
#                into a timestamp
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# May 1993

BEGIN    \
{
    # Initialize table of month lengths
    _tm_months[0,1] = _tm_months[1,1] = 31
    _tm_months[0,2] = 28; _tm_months[1,2] = 29
    _tm_months[0,3] = _tm_months[1,3] = 31
    _tm_months[0,4] = _tm_months[1,4] = 30
    _tm_months[0,5] = _tm_months[1,5] = 31
    _tm_months[0,6] = _tm_months[1,6] = 30
    _tm_months[0,7] = _tm_months[1,7] = 31
    _tm_months[0,8] = _tm_months[1,8] = 31
    _tm_months[0,9] = _tm_months[1,9] = 30
    _tm_months[0,10] = _tm_months[1,10] = 31
    _tm_months[0,11] = _tm_months[1,11] = 30
    _tm_months[0,12] = _tm_months[1,12] = 31
}

The benefit of merging multiple BEGIN rules (see section The BEGIN and END Special Patterns) is particularly clear when writing library files. Functions in library files can cleanly initialize their own private data and also provide clean-up actions in private END rules.

The next function is a simple one that computes whether a given year is or is not a leap year. If a year is evenly divisible by four, but not evenly divisible by 100, or if it is evenly divisible by 400, then it is a leap year. Thus, 1904 was a leap year, 1900 was not, but 2000 will be.

# decide if a year is a leap year
function _tm_isleap(year,    ret)
{
    ret = (year % 4 == 0 && year % 100 != 0) ||
            (year % 400 == 0)

    return ret
}

This function is only used a few times in this file, and its computation could have been written in-line (at the point where it's used). Making it a separate function made the original development easier, and also avoids the possibility of typing errors when duplicating the code in multiple places.
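
A short check of the leap year rule, using the years mentioned above plus one ordinary year, might look like this. The function is repeated so that the sketch runs by itself; the BEGIN rule and its variable names are illustrative only.

```awk
# decide if a year is a leap year
function _tm_isleap(year,    ret)
{
    ret = (year % 4 == 0 && year % 100 != 0) ||
            (year % 400 == 0)

    return ret
}

BEGIN {
    n = split("1900 1904 1993 2000", years, " ")
    for (i = 1; i <= n; i++)
        printf("%d: %s\n", years[i],
               _tm_isleap(years[i]) ? "leap year" : "regular year")
}
```

This prints "regular year" for 1900 and 1993, and "leap year" for 1904 and 2000.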

The next function is more interesting. It does most of the work of generating a timestamp, which is converting a date and time into some number of seconds since the Epoch. The caller passes an array (rather imaginatively named a) containing six values: the year including century, the month as a number between one and 12, the day of the month, the hour as a number between zero and 23, the minute in the hour, and the seconds within the minute.

The function uses several local variables to precompute the number of seconds in an hour, seconds in a day, and seconds in a year. Often, similar C code simply writes out the expression in-line, expecting the compiler to do constant folding. E.g., most C compilers would turn `60 * 60' into `3600' at compile time, instead of recomputing it every time at run time. Precomputing these values makes the function more efficient.

# convert a date into seconds
function _tm_addup(a,    total, yearsecs, daysecs,
                         hoursecs, i, j)
{
    hoursecs = 60 * 60
    daysecs = 24 * hoursecs
    yearsecs = 365 * daysecs

    total = (a[1] - 1970) * yearsecs

    # extra day for leap years
    for (i = 1970; i < a[1]; i++)
        if (_tm_isleap(i))
            total += daysecs

    j = _tm_isleap(a[1])
    for (i = 1; i < a[2]; i++)
        total += _tm_months[j, i] * daysecs

    total += (a[3] - 1) * daysecs
    total += a[4] * hoursecs
    total += a[5] * 60
    total += a[6]

    return total
}

The function starts with a first approximation of all the seconds between Midnight, January 1, 1970,(18) and the beginning of the current year. It then goes through all those years, and for every leap year, adds an additional day's worth of seconds.

The variable j holds one if the current year is a leap year, and zero if it is not. For every month in the current year prior to the current month, it adds the number of seconds in the month, using the appropriate entry in the _tm_months array.

Finally, it adds in the seconds for the number of days prior to the current day, and the number of hours, minutes, and seconds in the current day.

The result is a count of seconds since January 1, 1970. This value is not yet what is needed though. The reason why is described shortly.
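
To make the arithmetic concrete, here is a compact, standalone restatement of the month table, the leap year test, and _tm_addup, applied to one date. It is a sketch for experimentation: the table is filled with a loop instead of the explicit assignments shown earlier, and the array names days and when are illustrative.

```awk
BEGIN {
    # month lengths: first index 0 = regular year, 1 = leap year
    n = split("31 28 31 30 31 30 31 31 30 31 30 31", days, " ")
    for (m = 1; m <= n; m++)
        _tm_months[0, m] = _tm_months[1, m] = days[m]
    _tm_months[1, 2] = 29
}

function _tm_isleap(year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0)
}

# convert a date into seconds, ignoring the time zone
function _tm_addup(a,    total, yearsecs, daysecs, hoursecs, i, j)
{
    hoursecs = 60 * 60
    daysecs = 24 * hoursecs
    yearsecs = 365 * daysecs

    total = (a[1] - 1970) * yearsecs

    # extra day for each leap year before the current year
    for (i = 1970; i < a[1]; i++)
        if (_tm_isleap(i))
            total += daysecs

    # whole months before the current month
    j = _tm_isleap(a[1])
    for (i = 1; i < a[2]; i++)
        total += _tm_months[j, i] * daysecs

    total += (a[3] - 1) * daysecs
    total += a[4] * hoursecs
    total += a[5] * 60
    total += a[6]

    return total
}

BEGIN {
    split("1993 5 23 15 35 10", when, " ")
    printf("%d\n", _tm_addup(when))    # prints 738171310
}
```

The result breaks down as 8543 whole days (23 years, six leap days, 120 days for January through April, and 22 days of May) plus 56110 seconds for the time of day.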

The main mktime function takes a single character string argument. This string is a representation of a date and time in a "canonical" (fixed) form. This string should be "year month day hour minute second".

# mktime -- convert a date into seconds,
#            compensate for time zone

function mktime(str,    res1, res2, a, b, i, j, t, diff)
{
    i = split(str, a, " ")    # don't rely on FS

    if (i != 6)
        return -1

    # force numeric
    for (j in a)
        a[j] += 0

    # validate
    if (a[1] < 1970 ||
        a[2] < 1 || a[2] > 12 ||
        a[3] < 1 || a[3] > 31 ||
        a[4] < 0 || a[4] > 23 ||
        a[5] < 0 || a[5] > 59 ||
        a[6] < 0 || a[6] > 61 )
            return -1

    res1 = _tm_addup(a)
    t = strftime("%Y %m %d %H %M %S", res1)

    if (_tm_debug)
        printf("(%s) -> (%s)\n", str, t) > "/dev/stderr"

    split(t, b, " ")
    res2 = _tm_addup(b)

    diff = res1 - res2

    if (_tm_debug)
        printf("diff = %d seconds\n", diff) > "/dev/stderr"

    res1 += diff

    return res1
}

The function first splits the string into an array, using spaces and tabs as separators. If there are not six elements in the array, it returns an error, signaled as the value -1. Next, it forces each element of the array to be numeric, by adding zero to it. The following `if' statement then makes sure that each element is within an allowable range. (This checking could be extended further, e.g., to make sure that the day of the month is within the correct range for the particular month supplied.) All of this is essentially preliminary set-up and error checking.

Recall that _tm_addup generated a value in seconds since Midnight, January 1, 1970. This value is not directly usable as the result we want, since the calculation does not account for the local timezone. In other words, the value represents the count in seconds since the Epoch, but only for UTC (Coordinated Universal Time). If the local timezone is east or west of UTC, then some number of hours should be either added to, or subtracted from the resulting timestamp.

For example, Atlanta, Georgia (USA), is normally five hours west of (behind) UTC. It is only four hours behind UTC if daylight savings time is in effect. If you are calling mktime in Atlanta, with the argument "1993 5 23 18 23 12", the result from _tm_addup will be for 6:23 p.m. UTC, which is only 2:23 p.m. in Atlanta. It is necessary to add another four hours' worth of seconds to the result.

How can mktime determine how far away it is from UTC? This is surprisingly easy. The returned timestamp represents the time passed to mktime as UTC. This timestamp can be fed back to strftime, which will format it as a local time; i.e. as if it already had the UTC difference added in to it. This is done by giving "%Y %m %d %H %M %S" to strftime as the format argument. It returns the computed timestamp in the original string format. The result represents a time that accounts for the UTC difference. When the new time is converted back to a timestamp, the difference between the two timestamps is the difference (in seconds) between the local timezone and UTC. This difference is then added back to the original result. An example demonstrating this is presented below.
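
The compensation step is simple arithmetic once the two sums are known. The sketch below works it through with fixed numbers instead of calling strftime, so it runs in any awk. The variable local_offset is an assumption standing in for what strftime does when it formats the timestamp as local time; here it is set for a zone four hours behind UTC.

```awk
BEGIN {
    res1 = 738171310               # _tm_addup result for "1993 5 23 15 35 10"
    local_offset = -4 * 60 * 60    # assumed: local time is four hours behind UTC

    # strftime would format res1 as the local time 11:35:10; adding
    # that date up again gives a value shifted by the local offset
    res2 = res1 + local_offset

    diff = res1 - res2
    printf("diff = %d seconds\n", diff)        # diff = 14400 seconds
    printf("timestamp = %d\n", res1 + diff)    # timestamp = 738185710
}
```

For a zone east of UTC, local_offset would be positive, diff would come out negative, and the final timestamp would be earlier than the first sum.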

Finally, there is a "main" program for testing the function.

BEGIN  {
    if (_tm_test) {
        printf "Enter date as yyyy mm dd hh mm ss: "
        getline _tm_test_date
    
        t = mktime(_tm_test_date)
        r = strftime("%Y %m %d %H %M %S", t)
        printf "Got back (%s)\n", r
    }
}

The entire program uses two variables that can be set on the command line to control debugging output and to enable the test in the final BEGIN rule. Here is the result of a test run. (Note that debugging output is to standard error, and test output is to standard output.)

$ gawk -f mktime.awk -v _tm_test=1 -v _tm_debug=1
-| Enter date as yyyy mm dd hh mm ss: 1993 5 23 15 35 10
error--> (1993 5 23 15 35 10) -> (1993 05 23 11 35 10)
error--> diff = 14400 seconds
-| Got back (1993 05 23 15 35 10)

The time entered was 3:35 p.m. (15:35 on a 24-hour clock), on May 23, 1993. The first line of debugging output shows the entered time, treated as UTC and then formatted by strftime as local time; the local time zone is four hours behind UTC. The second line shows that the difference is 14400 seconds, which is four hours. (The difference is only four hours, since daylight savings time is in effect during May.) The final line of test output shows that the timezone compensation algorithm works; the returned time is the same as the entered time.

This program does not solve the general problem of turning an arbitrary date representation into a timestamp. That problem is very involved. However, the mktime function provides a foundation upon which to build. Other software can convert month names into numeric months, and AM/PM times into 24-hour clocks, to generate the "canonical" format that mktime requires.

Managing the Time of Day

The systime and strftime functions described in section Functions for Dealing with Time Stamps, provide the minimum functionality necessary for dealing with the time of day in human readable form. While strftime is extensive, the control formats are not necessarily easy to remember or intuitively obvious when reading a program.

The following function, gettimeofday, populates a user-supplied array with pre-formatted time information. It returns a string with the current time formatted in the same way as the date utility.

# gettimeofday -- get the time of day in a usable format
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain, May 1993
#
# Returns a string in the format of output of date(1)
# Populates the array argument time with individual values:
#    time["second"]       -- seconds (0 - 59)
#    time["minute"]       -- minutes (0 - 59)
#    time["hour"]         -- hours (0 - 23)
#    time["althour"]      -- hours (1 - 12)
#    time["monthday"]     -- day of month (1 - 31)
#    time["month"]        -- month of year (1 - 12)
#    time["monthname"]    -- name of the month
#    time["shortmonth"]   -- short name of the month
#    time["year"]         -- year within century (0 - 99)
#    time["fullyear"]     -- year with century (19xx or 20xx)
#    time["weekday"]      -- day of week (Sunday = 0)
#    time["altweekday"]   -- day of week (Monday = 1)
#    time["weeknum"]      -- week number, Sunday first day
#    time["altweeknum"]   -- week number, Monday first day
#    time["dayname"]      -- name of weekday
#    time["shortdayname"] -- short name of weekday
#    time["yearday"]      -- day of year (1 - 366)
#    time["timezone"]     -- abbreviation of timezone name
#    time["ampm"]         -- AM or PM designation

function gettimeofday(time,    ret, now, i)
{
    # get time once, avoids unnecessary system calls
    now = systime()

    # return date(1)-style output
    ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)

    # clear out target array
    for (i in time)
        delete time[i]

    # fill in values, force numeric values to be
    # numeric by adding 0
    time["second"]       = strftime("%S", now) + 0
    time["minute"]       = strftime("%M", now) + 0
    time["hour"]         = strftime("%H", now) + 0
    time["althour"]      = strftime("%I", now) + 0
    time["monthday"]     = strftime("%d", now) + 0
    time["month"]        = strftime("%m", now) + 0
    time["monthname"]    = strftime("%B", now)
    time["shortmonth"]   = strftime("%b", now)
    time["year"]         = strftime("%y", now) + 0
    time["fullyear"]     = strftime("%Y", now) + 0
    time["weekday"]      = strftime("%w", now) + 0
    time["altweekday"]   = strftime("%u", now) + 0
    time["dayname"]      = strftime("%A", now)
    time["shortdayname"] = strftime("%a", now)
    time["yearday"]      = strftime("%j", now) + 0
    time["timezone"]     = strftime("%Z", now)
    time["ampm"]         = strftime("%p", now)
    time["weeknum"]      = strftime("%U", now) + 0
    time["altweeknum"]   = strftime("%W", now) + 0

    return ret
}

The string indices are easier to use and read than the various formats required by strftime. The alarm program presented in section An Alarm Clock Program, uses this function.

The gettimeofday function is presented above as it was written. A more general design for this function would have allowed the user to supply an optional timestamp value that would have been used instead of the current time.
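
One way such a design might look is sketched below. The function name gettime_at and the parameter name when are hypothetical, and only a few of the array elements are filled in to keep the sketch short. Like gettimeofday, it relies on gawk's systime and strftime functions.

```awk
# gettime_at -- like gettimeofday, with an optional timestamp
# (hypothetical name; fills in only a subset of the elements)
function gettime_at(time, when,    i)
{
    # an omitted argument compares equal to the null string
    if (when == "")
        when = systime()

    # clear out target array
    for (i in time)
        delete time[i]

    time["fullyear"] = strftime("%Y", when) + 0
    time["month"]    = strftime("%m", when) + 0
    time["monthday"] = strftime("%d", when) + 0
    time["hour"]     = strftime("%H", when) + 0

    # return date(1)-style output
    return strftime("%a %b %d %H:%M:%S %Z %Y", when)
}

BEGIN {
    print gettime_at(now)             # the current time
    gettime_at(then, 738171310)       # a fixed timestamp in May 1993
    printf("%d-%02d\n", then["fullyear"], then["month"])    # 1993-05
}
```

Testing the parameter against the null string, instead of against zero, lets a caller pass zero to ask for the Epoch itself.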

Noting Data File Boundaries

The BEGIN and END rules are each executed exactly once, at the beginning and end respectively of your awk program (see section The BEGIN and END Special Patterns). We (the gawk authors) once had a user who mistakenly thought that the BEGIN rule was executed at the beginning of each data file and the END rule was executed at the end of each data file. When informed that this was not the case, the user requested that we add new special patterns to gawk, named BEGIN_FILE and END_FILE, that would have the desired behavior. He even supplied us the code to do so.

However, after a little thought, I came up with the following library program. It arranges to call two user-supplied functions, beginfile and endfile, at the beginning and end of each data file. Besides solving the problem in only nine(!) lines of code, it does so portably; this will work with any implementation of awk.

# transfile.awk
#
# Give the user a hook for filename transitions
#
# The user must supply functions beginfile() and endfile()
# that each take the name of the file being started or
# finished, respectively.
#
# Arnold Robbins, arnold@gnu.ai.mit.edu, January 1992
# Public Domain

FILENAME != _oldfilename \
{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

END   { endfile(FILENAME) }

This file must be loaded before the user's "main" program, so that the rule it supplies will be executed first.

This rule relies on awk's FILENAME variable that automatically changes for each new data file. The current file name is saved in a private variable, _oldfilename. If FILENAME does not equal _oldfilename, then a new data file is being processed, and it is necessary to call endfile for the old file. Since endfile should only be called if a file has been processed, the program first checks to make sure that _oldfilename is not the null string. The program then assigns the current file name to _oldfilename, and calls beginfile for the file. Since, like all awk variables, _oldfilename will be initialized to the null string, this rule executes correctly even for the first data file.

The program also supplies an END rule, to do the final processing for the last file. Since this END rule comes before any END rules supplied in the "main" program, endfile will be called first. Once again the value of multiple BEGIN and END rules should be clear.

This version has the same problem as the first version of nextfile (see section Implementing nextfile as a Function). If the same data file occurs twice in a row on the command line, then endfile and beginfile will not be executed at the end of the first pass and at the beginning of the second pass. The following version solves the problem.

# ftrans.awk -- handle data file transitions
#
# user supplies beginfile() and endfile() functions
#
# Arnold Robbins, arnold@gnu.ai.mit.edu. November 1992
# Public Domain

FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}

END  { endfile(_filename_) }

In section Counting Things, you will see how this library function can be used, and how it simplifies writing the main program.

Processing Command Line Options

Most utilities on POSIX compatible systems take options or "switches" on the command line that can be used to change the way a program behaves. awk is an example of such a program (see section Command Line Options). Often, options take arguments, data that the program needs to correctly obey the command line option. For example, awk's `-F' option requires a string to use as the field separator. The first occurrence on the command line of either `--' or a string that does not begin with `-' ends the options.

Most Unix systems provide a C function named getopt for processing command line arguments. The programmer provides a string describing the one letter options. If an option requires an argument, it is followed in the string with a colon. getopt is also passed the count and values of the command line arguments, and is called in a loop. getopt processes the command line arguments for option letters. Each time around the loop, it returns a single character representing the next option letter that it found, or `?' if it found an invalid option. When it returns -1, there are no options left on the command line.

When using getopt, options that do not take arguments can be grouped together. Furthermore, options that take arguments require that the argument be present. The argument can immediately follow the option letter, or it can be a separate command line argument.

Given a hypothetical program that takes three command line options, `-a', `-b', and `-c', and `-b' requires an argument, all of the following are valid ways of invoking the program:

prog -a -b foo -c data1 data2 data3
prog -ac -bfoo -- data1 data2 data3
prog -acbfoo data1 data2 data3

Notice that when the argument is grouped with its option, the rest of the command line argument is considered to be the option's argument. In the above example, `-acbfoo' indicates that all of the `-a', `-b', and `-c' options were supplied, and that `foo' is the argument to the `-b' option.

getopt provides four external variables that the programmer can use.

optind
The index in the argument value array (argv) where the first non-option command line argument can be found.
optarg
The string value of the argument to an option.
opterr
Usually getopt prints an error message when it finds an invalid option. Setting opterr to zero disables this feature. (An application might wish to print its own error message.)
optopt
The letter representing the command line option. While not usually documented, most versions supply this variable.

The following C fragment shows how getopt might process command line arguments for awk.

int
main(int argc, char *argv[])
{
    ...
    /* print our own message */
    opterr = 0;
    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) {
        switch (c) {
        case 'f':    /* file */
            ...
            break;
        case 'F':    /* field separator */
            ...
            break;
        case 'v':    /* variable assignment */
            ...
            break;
        case 'W':    /* extension */
            ...
            break;
        case '?':
        default:
            usage();
            break;
        }
    }
    ...
}

As a side point, gawk actually uses the GNU getopt_long function to process both normal and GNU-style long options (see section Command Line Options).

The abstraction provided by getopt is very useful, and would be quite handy in awk programs as well. Here is an awk version of getopt. This function highlights one of the greatest weaknesses in awk, which is that it is very poor at manipulating single characters. Repeated calls to substr are necessary for accessing individual characters (see section Built-in Functions for String Manipulation).

The discussion walks through the code a bit at a time.

# getopt -- do C library getopt(3) function in awk
#
# arnold@gnu.ai.mit.edu
# Public domain
#
# Initial version: March, 1991
# Revised: May, 1993

# External variables:
#    Optind -- index of ARGV for first non-option argument
#    Optarg -- string value of argument to current option
#    Opterr -- if non-zero, print our own diagnostic
#    Optopt -- current option letter

# Returns
#    -1     at end of options
#    ?      for unrecognized option
#    <c>    a character representing the current option

# Private Data
#    _opti  index in multi-flag option, e.g., -abc

The function starts out with some documentation: who wrote the code, and when it was revised, followed by a list of the global variables it uses, what the return values are and what they mean, and any global variables that are "private" to this library function. Such documentation is essential for any program, and particularly for library functions.

function getopt(argc, argv, options,    optl, thisopt, i)
{
    optl = length(options)
    if (optl == 0)        # no options given
        return -1

    if (argv[Optind] == "--") {  # all done
        Optind++
        _opti = 0
        return -1
    } else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) {
        _opti = 0
        return -1
    }

The function first checks that it was indeed called with a string of options (the options parameter). If options has a zero length, getopt immediately returns -1.

The next thing to check for is the end of the options. A `--' ends the command line options, as does any command line argument that does not begin with a `-'. Optind is used to step through the array of command line arguments; it retains its value across calls to getopt, since it is a global variable.

The regexp used, /^-[^: \t\n\f\r\v\b]/, is perhaps a bit of overkill; it checks for a `-' followed by anything that is not whitespace and not a colon. If the current command line argument does not match this pattern, it is not an option, and it ends option processing.
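
To see what the regexp accepts, here is a small sketch that applies it to a handful of sample arguments. The argument list and the classify_args shell function are assumptions invented for the illustration; only the regexp itself comes from the function above.

```sh
classify_args() {
    awk 'BEGIN {
        # Sample arguments (made up for this sketch), separated by spaces.
        n = split("-a -cbARG -- - -: bax", arg, " ")
        for (i = 1; i <= n; i++) {
            if (arg[i] ~ /^-[^: \t\n\f\r\v\b]/)
                verdict = "looks like an option"
            else
                verdict = "ends option processing"
            print arg[i] " => " verdict
        }
    }'
}
classify_args
```

A lone `-' and `-:' do not match, and neither does `bax'. Notice that `--' does match the regexp, which is why getopt has to test for it separately before applying the pattern.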

    if (_opti == 0)
        _opti = 2
    thisopt = substr(argv[Optind], _opti, 1)
    Optopt = thisopt
    i = index(options, thisopt)
    if (i == 0) {
        if (Opterr)
            printf("%c -- invalid option\n",
                                  thisopt) > "/dev/stderr"
        if (_opti >= length(argv[Optind])) {
            Optind++
            _opti = 0
        } else
            _opti++
        return "?"
    }

The _opti variable tracks the position in the current command line argument (argv[Optind]). In the case that multiple options were grouped together with one `-' (e.g., `-abx'), it is necessary to return them to the user one at a time.

If _opti is equal to zero, it is set to two, the index in the string of the next character to look at (we skip the `-', which is at position one). The variable thisopt holds the character, obtained with substr. It is saved in Optopt for the main program to use.

If thisopt is not in the options string, then it is an invalid option. If Opterr is non-zero, getopt prints an error message on the standard error that is similar to the message from the C version of getopt.

Since the option is invalid, it is necessary to skip it and move on to the next option character. If _opti is greater than or equal to the length of the current command line argument, then it is necessary to move on to the next one, so Optind is incremented and _opti is reset to zero. Otherwise, Optind is left alone and _opti is merely incremented.

In any case, since the option was invalid, getopt returns `?'. The main program can examine Optopt if it needs to know what the invalid option letter actually was.

    if (substr(options, i + 1, 1) == ":") {
        # get option argument
        if (length(substr(argv[Optind], _opti + 1)) > 0)
            Optarg = substr(argv[Optind], _opti + 1)
        else
            Optarg = argv[++Optind]
        _opti = 0
    } else
        Optarg = ""

If the option requires an argument, the option letter is followed by a colon in the options string. If there are remaining characters in the current command line argument (argv[Optind]), then the rest of that string is assigned to Optarg. Otherwise, the next command line argument is used (`-xFOO' vs. `-x FOO'). In either case, _opti is reset to zero, since there are no more characters left to examine in the current command line argument.
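
The two string operations involved can be tried in isolation. In the sketch below, takes_argument and optarg_for are hypothetical helper names that do not appear in getopt itself; they simply repeat the colon test and the "rest of the word, else the next word" choice described above.

```sh
show_optargs() {
    awk '
    # True if letter is followed by a colon in the options string.
    function takes_argument(options, letter,    i) {
        i = index(options, letter)
        return i > 0 && substr(options, i + 1, 1) == ":"
    }
    # Pick the option argument: the rest of the current word if there is
    # any, otherwise the following command line word.
    function optarg_for(current, following, pos,    rest) {
        rest = substr(current, pos + 1)
        return (length(rest) > 0) ? rest : following
    }
    BEGIN {
        print takes_argument("ab:c", "a"), takes_argument("ab:c", "b")
        print optarg_for("-bfoo", "data1", 2)      # argument attached
        print optarg_for("-b", "foo", 2)           # argument is next word
        print optarg_for("-acbfoo", "data1", 4)    # grouped after -a and -c
    }'
}
show_optargs
```

The first line of output should be `0 1', since only `b' is followed by a colon, and each of the remaining three lines should be `foo'.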

    if (_opti == 0 || _opti >= length(argv[Optind])) {
        Optind++
        _opti = 0
    } else
        _opti++
    return thisopt
}

Finally, if _opti is either zero or greater than or equal to the length of the current command line argument, it means this element in argv is through being processed, so Optind is incremented to point to the next element in argv. If neither condition is true, then only _opti is incremented, so that the next option letter can be processed on the next call to getopt.

BEGIN {
    Opterr = 1    # default is to diagnose
    Optind = 1    # skip ARGV[0]

    # test program
    if (_getopt_test) {
        while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
            printf("c = <%c>, optarg = <%s>\n",
                                       _go_c, Optarg)
        printf("non-option arguments:\n")
        for (; Optind < ARGC; Optind++)
            printf("\tARGV[%d] = <%s>\n",
                                    Optind, ARGV[Optind])
    }
}

The BEGIN rule initializes both Opterr and Optind to one. Opterr is set to one, since the default behavior is for getopt to print a diagnostic message upon seeing an invalid option. Optind is set to one, since there's no reason to look at the program name, which is in ARGV[0].

The rest of the BEGIN rule is a simple test program. Here is the result of two sample runs of the test program.

$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
-| c = <a>, optarg = <>
-| c = <c>, optarg = <>
-| c = <b>, optarg = <ARG>
-| non-option arguments:
-|         ARGV[3] = <bax>
-|         ARGV[4] = <-x>

$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
-| c = <a>, optarg = <>
error--> x -- invalid option
-| c = <?>, optarg = <>
-| non-option arguments:
-|         ARGV[4] = <xyz>
-|         ARGV[5] = <abc>

The first `--' terminates the arguments to awk, so that it does not try to interpret the `-a' etc. as its own options.

Several of the sample programs presented in section Practical awk Programs, use getopt to process their arguments.

Reading the User Database

The `/dev/user' special file (see section Special File Names in gawk) provides access to the current user's real and effective user and group id numbers, and if available, the user's supplementary group set. However, since these are numbers, they do not provide very useful information to the average user. There needs to be some way to find the user information associated with the user and group numbers. This section presents a suite of functions for retrieving information from the user database. See section Reading the Group Database, for a similar suite that retrieves information from the group database.

The POSIX standard does not define the file where user information is kept. Instead, it provides the <pwd.h> header file and several C language subroutines for obtaining user information. The primary function is getpwent, for "get password entry." The "password" comes from the original user database file, `/etc/passwd', which kept user information, along with the encrypted passwords (hence the name).

While an awk program could simply read `/etc/passwd' directly (the format is well known), because of the way password files are handled on networked systems, this file may not contain complete information about the system's set of users.

To be sure of being able to produce a readable, complete version of the user database, it is necessary to write a small C program that calls getpwent. getpwent is defined to return a pointer to a struct passwd. Each time it is called, it returns the next entry in the database. When there are no more entries, it returns NULL, the null pointer. When this happens, the C program should call endpwent to close the database. Here is pwcat, a C program that "cats" the password database.

/*
 * pwcat.c
 *
 * Generate a printable version of the password database
 *
 * Arnold Robbins
 * arnold@gnu.ai.mit.edu
 * May 1993
 * Public Domain
 */

#include <stdio.h>
#include <stdlib.h>    /* for exit() */
#include <pwd.h>

int
main(int argc, char **argv)
{
    struct passwd *p;

    while ((p = getpwent()) != NULL)
        printf("%s:%s:%ld:%ld:%s:%s:%s\n",
            p->pw_name, p->pw_passwd, (long) p->pw_uid,
            (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);

    endpwent();
    exit(0);
}

If you don't understand C, don't worry about it. The output from pwcat is the user database, in the traditional `/etc/passwd' format of colon-separated fields. The fields are:

Login name
The user's login name.
Encrypted password
The user's encrypted password. This may not be available on some systems.
User-ID
The user's numeric user-id number.
Group-ID
The user's numeric group-id number.
Full name
The user's full name, and perhaps other information associated with the user.
Home directory
The user's login, or "home" directory (familiar to shell programmers as $HOME).
Login shell
The program that will be run when the user logs in. This is usually a shell, such as Bash (the GNU Bourne-Again shell).

Here are a few lines representative of pwcat's output.

$ pwcat
-| root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
-| nobody:*:65534:65534::/:
-| daemon:*:1:1::/:
-| sys:*:2:2::/:/bin/csh
-| bin:*:3:3::/bin:
-| arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
-| miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
...
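
Because the fields are separated by colons, output in this format is easy to take apart in awk by setting the field separator. The following sketch is only an illustration: summarize_users is a made-up name, and the input lines are copied from the sample output above rather than read from a real pwcat.

```sh
summarize_users() {
    # -F: makes the colon the field separator, matching the pwcat format.
    awk -F: '{
        printf "%s uid=%d home=%s shell=%s\n",
               $1, $3, $6, ($7 == "" ? "(none)" : $7)
    }'
}

printf '%s\n' \
    'root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh' \
    'nobody:*:65534:65534::/:' \
    'arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh' |
summarize_users
```

For the `nobody' entry, whose last field is empty, the sketch prints `(none)' for the login shell.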

With that introduction, here is a group of functions for getting user information. There are several functions here, corresponding to the C functions of the same names.

# passwd.awk -- access password file information
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# May 1993

BEGIN {
    # tailor this to suit your system
    _pw_awklib = "/usr/local/libexec/awk/"
}

function _pw_init(    oldfs, oldrs, olddol0, pwcat)
{
    if (_pw_inited)
        return
    oldfs = FS
    oldrs = RS
    olddol0 = $0
    FS = ":"
    RS = "\n"
    pwcat = _pw_awklib "pwcat"
    while ((pwcat | getline) > 0) {
        _pw_byname[$1] = $0
        _pw_byuid[$3] = $0
        _pw_bycount[++_pw_total] = $0
    }
    close(pwcat)
    _pw_count = 0
    _pw_inited = 1
    FS = oldfs
    RS = oldrs
    $0 = olddol0
}

The BEGIN rule sets a private variable to the directory where pwcat is stored. Since it is used to help out an awk library routine, we have chosen to put it in `/usr/local/libexec/awk'. You might want it to be in a different directory on your system.

The function _pw_init keeps three copies of the user information in three associative arrays. The arrays are indexed by user name (_pw_byname), by user-id number (_pw_byuid), and by order of occurrence (_pw_bycount).

The variable _pw_inited is used for efficiency; _pw_init only needs to be called once.

Since this function uses getline to read information from pwcat, it first saves the values of FS, RS, and $0. Doing so is necessary, since these functions could be called from anywhere within a user's program, and the user may have his or her own values for FS and RS.

The main part of the function uses a loop to read database lines, split the line into fields, and then store the line into each array as necessary. When the loop is done, _pw_init cleans up by closing the pipeline, setting _pw_inited to one, and restoring FS, RS, and $0. The use of _pw_count will be explained below.
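
The save-and-restore idiom can be tried on its own, without pwcat. In this sketch, load_table, the table array, and the echo command that stands in for pwcat are all assumptions made for the illustration, and RS is left alone for brevity.

```sh
lookup_demo() {
    awk '
    # Read "key:value" lines from a command without disturbing the caller.
    function load_table(    oldfs, olddol0, cmd) {
        oldfs = FS
        olddol0 = $0
        FS = ":"
        cmd = "echo k1:v1; echo k2:v2"    # stand-in for pwcat
        while ((cmd | getline) > 0)
            table[$1] = $2
        close(cmd)
        FS = oldfs      # restore FS first ...
        $0 = olddol0    # ... so that $0 is re-split with the caller FS
    }
    { load_table(); print $2, table["k2"] }
    '
}
echo "alpha beta gamma" | lookup_demo
```

The main rule still sees `beta' as its second field after the call, even though getline replaced $0 twice inside the function; the sketch should print `beta v2'. Restoring FS before assigning to $0 matters, since the assignment splits the record again using whatever value FS has at that moment.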

function getpwnam(name)
{
    _pw_init()
    if (name in _pw_byname)
        return _pw_byname[name]
    return ""
}

The getpwnam function takes a user name as a string argument. If that user is in the database, it returns the appropriate line. Otherwise it returns the null string.
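
A caller will usually want individual fields rather than the whole line, and split does that job. The sketch below shows one way a caller might do it. To stay self-contained it replaces the library function with a stand-in that knows a single record taken from the sample pwcat output; show_user and the pw array are names invented here.

```sh
show_user() {
    awk -v user="$1" '
    # Stand-in for the library getpwnam(): knows one fixed sample record.
    function getpwnam(name) {
        if (name == "arnold")
            return "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
        return ""
    }
    BEGIN {
        entry = getpwnam(user)
        if (entry == "") {
            print "no such user: " user
            exit 1
        }
        split(entry, pw, ":")    # pw[1] is the login name, pw[3] the uid, ...
        printf "%s: uid=%d, full name=%s, home=%s\n",
               pw[1], pw[3], pw[5], pw[6]
    }'
}
show_user arnold
```

Testing the return value against the null string, as done here, is how a caller tells a missing user from a real entry.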

function getpwuid(uid)
{
    _pw_init()
    if (uid in _pw_byuid)
        return _pw_byuid[uid]
    return ""
}

Similarly, the getpwuid function takes a user-id number argument. If that user number is in the database, it returns the appropriate line. Otherwise it returns the null string.

function getpwent()
{
    _pw_init()
    if (_pw_count < _pw_total)
        return _pw_bycount[++_pw_count]
    return ""
}

The getpwent function simply steps through the database, one entry at a time. It uses _pw_count to track its current position in the _pw_bycount array.

function endpwent()
{
    _pw_count = 0
}

The endpwent function resets _pw_count to zero, so that subsequent calls to getpwent will start over again.

A conscious design decision in this suite is that each subroutine calls _pw_init to initialize the database arrays. The overhead of running a separate process to generate the user database, and the I/O to scan it, will only be incurred if the user's main program actually calls one of these functions. If this library file is loaded along with a user's program, but none of the routines are ever called, then there is no extra run-time overhead. (The alternative would be to move the body of _pw_init into a BEGIN rule, which would always run pwcat. This simplifies the code but runs an extra process that may never be needed.)

In turn, calling _pw_init is not too expensive, since the _pw_inited variable keeps the program from reading the data more than once. If you are worried about squeezing every last cycle out of your awk program, the check of _pw_inited could be moved out of _pw_init and duplicated in all the other functions. In practice, this is not necessary, since most awk programs are I/O bound, and it would clutter up the code.

The id program in section Printing Out User Information, uses these functions.

Reading the Group Database

Much of the discussion presented in section Reading the User Database, applies to the group database as well. Although there has traditionally been a well known file, `/etc/group', in a well known format, the POSIX standard only provides a set of C library routines (<grp.h> and getgrent) for accessing the information. Even though this file may exist, it likely does not have complete information. Therefore, as with the user database, it is necessary to have a small C program that generates the group database as its output.

Here is grcat, a C program that "cats" the group database.

/*
 * grcat.c
 *
 * Generate a printable version of the group database
 *
 * Arnold Robbins, arnold@gnu.ai.mit.edu
 * May 1993
 * Public Domain
 */

#include <stdio.h>
#include <stdlib.h>    /* for exit() */
#include <grp.h>

int
main(int argc, char **argv)
{
    struct group *g;
    int i;

    while ((g = getgrent()) != NULL) {
        printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
                                      (long) g->gr_gid);
        for (i = 0; g->gr_mem[i] != NULL; i++) {
            printf("%s", g->gr_mem[i]);
            if (g->gr_mem[i+1] != NULL)
                putchar(',');
        }
        putchar('\n');
    }
    endgrent();
    exit(0);
}

Each line in the group database represents one group. The fields are separated with colons, and represent the following information.

Group Name
The name of the group.
Group Password
The encrypted group password. In practice, this field is never used. It is usually empty, or set to `*'.
Group ID Number
The numeric group-id number. This number should be unique within the file.
Group Member List
A comma-separated list of user names. These users are members of the group. Most Unix systems allow users to be members of several groups simultaneously. If your system does, then reading `/dev/user' will return those group-id numbers in $5 through $NF. (Note that `/dev/user' is a gawk extension; see section Special File Names in gawk.)

Here is what running grcat might produce:

$ grcat
-| wheel:*:0:arnold
-| nogroup:*:65534:
-| daemon:*:1:
-| kmem:*:2:
-| staff:*:10:arnold,miriam,andy
-| other:*:20:
...

Here are the functions for obtaining information from the group database. There are several, modeled after the C library functions of the same names.

# group.awk -- functions for dealing with the group file
# Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
# May 1993

BEGIN    \
{
    # Change to suit your system
    _gr_awklib = "/usr/local/libexec/awk/"
}

function _gr_init(    oldfs, oldrs, olddol0, grcat, n, a, i)
{
    if (_gr_inited)
        return

    oldfs = FS
    oldrs = RS
    olddol0 = $0
    FS = ":"
    RS = "\n"

    grcat = _gr_awklib "grcat"
    while ((grcat | getline) > 0) {
        if ($1 in _gr_byname)
            _gr_byname[$1] = _gr_byname[$1] "," $4
        else
            _gr_byname[$1] = $0
        if ($3 in _gr_bygid)
            _gr_bygid[$3] = _gr_bygid[$3] "," $4
        else
            _gr_bygid[$3] = $0

        n = split($4, a, "[ \t]*,[ \t]*")
        for (i = 1; i <= n; i++)
            if (a[i] in _gr_groupsbyuser)
                _gr_groupsbyuser[a[i]] = \
                    _gr_groupsbyuser[a[i]] " " $1
            else
                _gr_groupsbyuser[a[i]] = $1

        _gr_bycount[++_gr_count] = $0
    }
    close(grcat)
    _gr_count = 0
    _gr_inited++
    FS = oldfs
    RS = oldrs
    $0 = olddol0
}

The BEGIN rule sets a private variable to the directory where grcat is stored. Since it is used to help out an awk library routine, we have chosen to put it in `/usr/local/libexec/awk'. You might want it to be in a different directory on your system.

These routines follow the same general outline as the user database routines (see section Reading the User Database). The _gr_inited variable is used to ensure that the database is scanned no more than once. The _gr_init function first saves FS, RS, and $0, and then sets FS and RS to the correct values for scanning the group information.

The group information is stored in several associative arrays. The arrays are indexed by group name (_gr_byname), by group-id number (_gr_bygid), and by position in the database (_gr_bycount). There is an additional array indexed by user name (_gr_groupsbyuser), whose elements are space separated lists of the groups that each user belongs to.

Unlike the user database, it is possible to have multiple records in the database for the same group. This is common when a group has a large number of members. Such a pair of entries might look like:

tvpeople:*:101:johny,jay,arsenio
tvpeople:*:101:david,conan,tom,joan

For this reason, _gr_init looks to see if a group name or group-id number has already been seen. If it has, then the user names are simply concatenated onto the previous list of users. (There is actually a subtle problem with the code presented above. Suppose that the first time there were no names. This code adds the names with a leading comma. It also doesn't check that there is a $4.)
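
One possible way around the leading comma is sketched below. It is not part of the library: merge, byname, and merge_groups are hypothetical names, and the sketch assumes that every record has all four fields, so that a record with no members ends in a colon.

```sh
merge_groups() {
    awk -F: '
    # Hypothetical helper: append members to a stored record without
    # producing a leading or doubled comma.
    function merge(record, members) {
        if (members == "")
            return record
        if (record ~ /:$/)        # stored member list is still empty
            return record members
        return record "," members
    }
    {
        if ($1 in byname)
            byname[$1] = merge(byname[$1], $4)
        else
            byname[$1] = $0
    }
    END {
        print byname["tvpeople"]
    }'
}

merge_groups <<'EOF'
tvpeople:*:101:
tvpeople:*:101:johny,jay,arsenio
tvpeople:*:101:david,conan,tom,joan
EOF
```

Even though the first record for the group has no members, the merged line should come out as `tvpeople:*:101:johny,jay,arsenio,david,conan,tom,joan', with no stray comma after the last colon.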

Finally, _gr_init closes the pipeline to grcat, restores FS, RS, and $0, initializes _gr_count to zero (it is used later), and makes _gr_inited non-zero.

function getgrnam(group)
{
    _gr_init()
    if (group in _gr_byname)
        return _gr_byname[group]
    return ""
}

The getgrnam function takes a group name as its argument, and if that group exists, it is returned. Otherwise, getgrnam returns the null string.

function getgrgid(gid)
{
    _gr_init()
    if (gid in _gr_bygid)
        return _gr_bygid[gid]
    return ""
}

The getgrgid function is similar; it takes a numeric group-id, and looks up the information associated with that group-id.

function getgruser(user)
{
    _gr_init()
    if (user in _gr_groupsbyuser)
        return _gr_groupsbyuser[user]
    return ""
}

The getgruser function does not have a C counterpart. It takes a user name, and returns the list of groups that have the user as a member.
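
The list that getgruser returns is built up one group record at a time. The following sketch repeats that bookkeeping on a few lines copied from the sample grcat output, so that it can be run without grcat itself; groups_for, grcat_sample, and the groups and member arrays are names made up for the illustration.

```sh
groups_for() {
    awk -F: -v user="$1" '
    {
        # Split the member list, allowing blanks around the commas.
        n = split($4, member, "[ \t]*,[ \t]*")
        for (i = 1; i <= n; i++) {
            if (member[i] in groups)
                groups[member[i]] = groups[member[i]] " " $1
            else
                groups[member[i]] = $1
        }
    }
    END {
        if (user in groups)
            print groups[user]
    }'
}

# Stand-in for grcat, using lines from the sample output.
grcat_sample() {
    printf '%s\n' 'wheel:*:0:arnold' 'nogroup:*:65534:' \
        'staff:*:10:arnold,miriam,andy'
}
grcat_sample | groups_for arnold
```

For `arnold' the sketch should print `wheel staff', in the order the groups appear in the input. A user who is in no group's member list produces no output at all, which corresponds to getgruser returning the null string.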

function getgrent()
{
    _gr_init()
    if (++_gr_count in _gr_bycount)
        return _gr_bycount[_gr_count]
    return ""
}

The getgrent function steps through the database one entry at a time. It uses _gr_count to track its position in the list.

function endgrent()
{
    _gr_count = 0
}

endgrent resets _gr_count to zero so that getgrent can start over again.

As with the user database routines, each function calls _gr_init to initialize the arrays. Doing so only incurs the extra overhead of running grcat if these functions are used (as opposed to moving the body of _gr_init into a BEGIN rule).

Most of the work is in scanning the database and building the various associative arrays. The functions that the user calls are themselves very simple, relying on awk's associative arrays to do the work.

The id program in section Printing Out User Information, uses these functions.

Naming Library Function Global Variables

Due to the way the awk language evolved, variables are either global (usable by the entire program), or local (usable just by a specific function). There is no intermediate state analogous to static variables in C.

Library functions often need to have global variables that they can use to preserve state information between calls to the function. For example, getopt's variable _opti (see section Processing Command Line Options), and the _tm_months array used by mktime (see section Turning Dates Into Timestamps). Such variables are called private, since the only functions that need to use them are the ones in the library.

When writing a library function, you should try to choose names for your private variables so that they will not conflict with any variables used by either another library function or a user's main program. For example, a name like `i' or `j' is not a good choice, since user programs often use variable names like these for their own purposes.

The example programs shown in this chapter all start the names of their private variables with an underscore (`_'). Users generally don't use leading underscores in their variable names, so this convention immediately decreases the chances that the variable name will be accidentally shared with the user's program.

In addition, several of the library functions use a prefix that helps indicate what function or set of functions uses the variables. For example, _tm_months in mktime (see section Turning Dates Into Timestamps), and _pw_byname in the user database routines (see section Reading the User Database). This convention is recommended, since it even further decreases the chance of inadvertent conflict among variable names. Note that this convention can be used equally well both for variable names and for private function names too.

While I could have re-written all the library routines to use this convention, I did not do so, in order to show how my own awk programming style has evolved, and to provide some basis for this discussion.

As a final note on variable naming, if a function makes global variables available for use by a main program, it is a good convention to start that variable's name with a capital letter. For example, getopt's Opterr and Optind variables (see section Processing Command Line Options). The leading capital letter indicates that it is global, while the fact that the variable name is not all capital letters indicates that the variable is not one of awk's built-in variables, like FS.

It is also important that all variables in library functions that do not need to save state are in fact declared local. If this is not done, the variable could accidentally be used in the user's program, leading to bugs that are very difficult to track down.

function lib_func(x, y,    l1, l2)
{
    ...
    use variable some_var  # some_var could be local
    ...                   # but is not by oversight
}
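
Here is a small, runnable sketch of the kind of bug this causes. The functions sum_bad and sum_good, and the counters in the BEGIN rule, are invented for the demonstration; the only difference between the two functions is whether `i' appears in the parameter list.

```sh
local_demo() {
    awk '
    # Buggy: i is missing from the parameter list, so it is global and
    # is shared with any caller that also uses i.
    function sum_bad(n,    total) {
        for (i = 1; i <= n; i++)
            total += i
        return total
    }
    # Fixed: i is declared as an extra (local) parameter.
    function sum_good(n,    total, i) {
        for (i = 1; i <= n; i++)
            total += i
        return total
    }
    BEGIN {
        for (i = 1; i <= 3; i++) {
            last = sum_bad(5)
            bad_passes++
        }
        for (i = 1; i <= 3; i++) {
            last = sum_good(5)
            good_passes++
        }
        print "loop calling sum_bad ran " bad_passes " time(s)"
        print "loop calling sum_good ran " good_passes " time(s)"
    }'
}
local_demo
```

The first loop runs only once, because sum_bad leaves the caller's `i' at six. The second loop runs three times as intended.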

A different convention, common in the Tcl community, is to use a single associative array to hold the values needed by the library function(s), or "package." This significantly decreases the number of actual global names in use. For example, the functions described in section Reading the User Database, might have used PW_data["inited"], PW_data["total"], PW_data["count"] and PW_data["awklib"], instead of _pw_inited, _pw_total, _pw_count, and _pw_awklib.
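
A minimal sketch of this style follows. It is not how the library in this chapter is written: pw_init, pw_add, and pw_next are hypothetical names, and the records are added by hand instead of being read from pwcat. The point is only that PW_data is the single global name the package uses for its state.

```sh
package_demo() {
    awk '
    # Every piece of private state is an element of the one array PW_data.
    function pw_init() {
        if (PW_data["inited"])
            return
        PW_data["total"] = 0
        PW_data["count"] = 0
        PW_data["inited"] = 1
    }
    # Hypothetical loader; the real library reads its records from pwcat.
    function pw_add(line) {
        pw_init()
        PW_data["entry", ++PW_data["total"]] = line
    }
    # Same stepping logic as getpwent, using array elements for the state.
    function pw_next() {
        pw_init()
        if (PW_data["count"] < PW_data["total"])
            return PW_data["entry", ++PW_data["count"]]
        return ""
    }
    BEGIN {
        pw_add("root:*:0:1:Operator:/:/bin/sh")
        pw_add("daemon:*:1:1::/:")
        while ((entry = pw_next()) != "")
            print entry
    }'
}
package_demo
```

The sketch prints the two records in the order they were added. The stored records share the array with the bookkeeping values by using a two-part subscript, so a user's program only has to avoid the one name PW_data.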

The conventions presented in this section are exactly that: conventions. You are not required to write your programs this way; we merely recommend that you do so.

