The problems with FORTRAN 66 Hollerith data are well known. The KARxxx
routines largely removed them, but once Hollerith support is no longer
available, FORTRAN 77 CHARACTER data will have to be used.
In the view of the author, the definition of CHARACTER data in the 1977
FORTRAN Standard was very poorly done, and has done significant harm to
FORTRAN software portability. This is a strong statement, and it bears
some explanation.
First of all, the Hollerith data type is dropped from the 1977
Standard. This means that a very large body of existing FORTRAN
software which uses character data, even in an at-present widely
portable fashion, may require extensive changes to run with a FORTRAN
77 compiler, unless manufacturers can be pressed to continue support of
character data stored in Hollerith constants and variables.
The 1977 Standard prohibits all storage equivalencing, either via
COMMON and EQUIVALENCE statements, or by FUNCTION or SUBROUTINE
argument associations, between CHARACTER data and all other FORTRAN
data types. This is in sharp contrast to the usual lax implementations
of FORTRAN for all other data types. This was necessary to enable
FORTRAN 77 to support character strings of indefinite length, so that
declarations of the form
      SUBROUTINE A (B)
      CHARACTER B*(*)
could be permitted, allowing CHARACTER variables to inherit a string
length from a calling program. This forces a compiler to generate code
to pass to a called routine the address of a string descriptor
containing size information as well as the actual address of the character
data. Also, on word-addressed machines, CHARACTER data may begin in
the middle of a word, so storage equivalencing could be problematic.
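
The descriptor mechanism just described can be sketched in C. This is
an illustration only: the names str_desc, make_desc, and f77_len are
hypothetical, and the layout shown is an assumption, not that of any
particular compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical string descriptor; real compilers choose their own layout. */
struct str_desc {
    const char *addr;   /* address of the first character */
    size_t      len;    /* declared length of the string  */
};

/* What a caller might build before passing a CHARACTER*n actual argument. */
struct str_desc make_desc(const char *data, size_t declared_len)
{
    struct str_desc d;
    d.addr = data;
    d.len = declared_len;
    return d;
}

/* What LEN(B) amounts to inside a routine declaring CHARACTER B*(*):
   the length comes from the descriptor, not from the character data. */
size_t f77_len(const struct str_desc *b)
{
    assert(b != NULL);
    return b->len;
}
```

Because the called routine receives the descriptor rather than the bare
data address, the same storage cannot simply be reinterpreted as INTEGER
or REAL, which is consistent with the prohibition on equivalencing.
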
Second, standardized library support of character data in the form of
useful utility routines is non-existent in the 1977 Standard, apart
from the ICHAR and CHAR functions for converting between INTEGER and
CHARACTER form.
Third, null character strings, that is, strings of zero length, are not
permitted. Null strings are in fact quite useful, and indeed, even
necessary in some applications. In particular, a null string cannot be
simulated by any string of non-zero length.
Fourth, the 1977 Standard does not specify the character set to be
used. The fact that many manufacturers employ their private versions
of character sets, each with its own special character repertoire and
collating sequence, only continues to perpetrate additional machine
dependence upon FORTRAN users.
Fifth, the 1977 Standard in allowing declarations of the form
CHARACTER*n did not specify what minimum 'n' should be supported by a
standard conforming compiler. One might hope that this would not be
less than the number of characters that could reside in the host
machine's (possibly virtual) address space. At the least, one might
conclude that an assignment of the form "A='long string'" spanning the
maximum of 19 continuation lines would be accepted.
Alas, few compilers permit even this much. String length limits of
128, 256, and 512 are common, and only a few (e.g. ElXsi and DEC-20)
set the limit at the machine address space size.
Interestingly, the 1977 Standard clearly states that a CHARACTER*n
argument passed to a subprogram can be legally received as an array of
n CHARACTER*1 values, and vice versa. Since none of the compilers seem
to put a limit on array sizes, it is odd that they do so on string
lengths. The reason of course is the peculiar requirement of the
Standard that the LEN() function be able to return the declared length
of its argument string; no such function is provided for obtaining the
declared dimension of an array. Most implementations therefore
represent a CHARACTER variable by a string descriptor containing a
length field and an address field, and both of these have fixed sizes
allotted to them. It seems foolish that although most architectures
now require 24 or more bits for the address field, only 7, 8, 9 or 16
should be allocated for the length field to "save storage".
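
The effect of the length field width can be made concrete with a small
C sketch. The function name max_len_for_field is hypothetical, and the
sketch assumes the field holds the length itself as an unsigned binary
number; a compiler could equally store the length minus one.

```c
#include <assert.h>
#include <stdint.h>

/* Largest string length representable in an unsigned length field of
   the given width in bits (valid for 1 <= bits <= 63). */
uint64_t max_len_for_field(unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    return (UINT64_C(1) << bits) - 1;
}
```

Under that assumption, fields of 7, 8, 9, and 16 bits cap the string
length at 127, 255, 511, and 65535 characters, while a 24-bit field
matching the address width would allow 16777215.
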
Sixth, although the 1977 Standard removed many of the unreasonable
restrictions on where expressions could be permitted in FORTRAN source
code, it introduced a new one in the form of prohibiting taking a
substring of a constant or an expression!
If one examines string support and typical use thereof in languages
like PL/1 and C, two characteristics become evident. First of all,
strings whose length can vary dynamically (up to some compile time
limit set by the user, not by the compiler) are supported, and the null
string is legal. Having varying length strings without a null string
is like having integers without a zero; how else can something be
initialized to empty? FORTRAN 77, Pascal, Modula/2, and Ada all make
the mistake of requiring fixed length strings, and in Pascal and Ada,
because of their strong typing, strings of different lengths have
different types, and are therefore not conformant.
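
A varying-length string with a user-chosen compile-time limit can be
sketched in C as follows. The names vstr, VSTR_MAX, vstr_clear, and
vstr_append are hypothetical, and truncating on overflow is a choice
made for this sketch, not a claim about how any language behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VSTR_MAX 80   /* compile-time limit chosen by the user */

struct vstr {
    size_t len;              /* current length, 0 .. VSTR_MAX */
    char   text[VSTR_MAX];   /* only the first len characters matter */
};

/* Initialize to the null (empty) string. */
void vstr_clear(struct vstr *s)
{
    s->len = 0;
}

/* Append n characters from src, truncating at the capacity.
   Returns the number of characters actually appended. */
size_t vstr_append(struct vstr *s, const char *src, size_t n)
{
    size_t room;

    assert(s->len <= VSTR_MAX);
    room = VSTR_MAX - s->len;
    if (n > room)
        n = room;
    memcpy(s->text + s->len, src, n);
    s->len += n;
    return n;
}
```

The empty string plays the role of zero here: vstr_clear gives a value
to which anything can be appended without first removing a placeholder.
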
Second, individual characters can be processed as equivalents of small
integers equal to their position in the host character set. Thus, in
C, one can convert a lower case letter to upper case by adding the
expression 'A' - 'a', without having to know precisely what the
equivalent integers are. In addition, printable representations of
commonly used non-printable characters, such as backspace, tab,
newline, carriage return, formfeed, and so on, are provided, so that
one can easily construct strings which span lines or contain control
characters. The integer equivalents make it possible to index arrays
by character values, making for efficient lookup. C in particular
makes good use of this in its standard library for determining whether
characters are letters, digits, printable, upper case or lower case,
etc.
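
Both techniques can be sketched in a few lines of C. The function and
table names below are hypothetical, and this is not the implementation
used by the C standard library. The range test in to_upper_letter also
assumes that the letters of each case are contiguous, as they are in
ASCII.

```c
#include <assert.h>
#include <limits.h>

/* Convert a lower case letter to upper case by adding 'A' - 'a'. */
int to_upper_letter(int c)
{
    assert('z' - 'a' == 25);          /* contiguous-letters assumption */
    if (c >= 'a' && c <= 'z')
        return c + ('A' - 'a');
    return c;
}

/* Table indexed by character value: nonzero for decimal digits. */
static unsigned char is_digit_table[UCHAR_MAX + 1];

/* Fill the table once before calling is_digit_char. */
void init_digit_table(void)
{
    const char *p;

    for (p = "0123456789"; *p != '\0'; p++)
        is_digit_table[(unsigned char)*p] = 1;
}

int is_digit_char(char c)
{
    /* The cast keeps the index non-negative even where char is signed. */
    return is_digit_table[(unsigned char)c];
}
```

The classification costs a single array access, whatever integer values
the host character set assigns to the digits.
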
FORTRAN 77 has the ICHAR() function which is supposed to return an
integer ordinal greater than or equal to 0, representing the position of the
argument character in the host character set. However, the Standard
defines only letters, digits, and thirteen special characters, a total
of only 49. This means that a processor is free to implement whatever
it likes for arguments to ICHAR() which are not among these. It could
even legally raise a fatal error in such a case. Most implementations
do in fact return an integer for all possible characters which can be
stored in the host CHARACTER storage unit, but the sign of the integer
is not guaranteed to be positive.
On older architectures with 36 (IBM 70xx, Univac 1108), 48 (Burroughs),
or 60 (CDC) bit words, 6-bit characters were common, and an even number
fit into the host words. This only permits 64 different characters,
which is not enough to have both letter cases. The ISO/ASCII character
set has 128 different character values and can represent both upper
case and lower case letters. On the 36-bit DEC-10 and -20 machines,
these are stored as five 7-bit characters per word, with one unused
bit. On the 36-bit Univac 11xx machines, newer compilers store four
9-bit characters per word, with the two high order bits of each
character unused. Most newer architectures are based on an address
unit of an 8-bit byte, or have a word size which is a multiple of this
(e.g. 64-bit Cray words). The EBCDIC character set used by IBM has 256
characters to make complete use of the byte storage unit. With the
ASCII character set, however, only 7 bits are required, and something
has to be done about the extra bit in an 8-bit byte. Prime makes it
one and treats the byte as an unsigned integer, so their ASCII ordinals
go from 128 to 255 (a violation of the ANSI and ISO Standards, I
believe). Other machines ignore it, and still others use it as a sign
bit. In the latter case, ICHAR() can return values 0 .. 127 when the
high bit is zero, then -128 .. -1 when it is set.
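
The three treatments of the eighth bit can be modelled in C to show what
an ICHAR()-like function would return under each. All function names
are hypothetical. The sketch works on plain integer byte values 0 .. 255
and does not depend on the host's own char type.

```c
#include <assert.h>

/* High bit forced to one and the byte read as unsigned, as described
   for Prime: a 7-bit ASCII code maps to 128 .. 255. */
int ordinal_high_bit_set(int ascii)
{
    assert(ascii >= 0 && ascii <= 127);
    return ascii + 128;
}

/* High bit ignored: any byte maps to 0 .. 127. */
int ordinal_high_bit_ignored(int byte)
{
    assert(byte >= 0 && byte <= 255);
    return byte % 128;
}

/* High bit used as a sign bit: 0 .. 127, then -128 .. -1. */
int ordinal_signed(int byte)
{
    assert(byte >= 0 && byte <= 255);
    return (byte < 128) ? byte : byte - 256;
}

/* Map any of the above results to 0 .. 255 so that it is safe to use
   as an array index. */
int safe_index(int ordinal)
{
    assert(ordinal >= -128 && ordinal <= 255);
    return (ordinal < 0) ? ordinal + 256 : ordinal;
}
```

For the byte value 193 (ASCII 'A' with the high bit set), the three
conventions give 193, 65, and -63 respectively. Only after a
normalization step such as safe_index can the result be used as a
subscript.
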
In summary then, one cannot be sure in FORTRAN 77 whether CHARACTER
data can be used to access every bit in memory (it cannot on the
DEC-10 and -20, or on any machine which ignores the high-order bits),
or whether ICHAR() can be used to obtain an integer which can
confidently be used as an array index.