The String class is designed to extend GNU C++ to support string processing capabilities similar to those in languages like Awk. The class provides facilities that ought to be convenient and efficient enough to be useful replacements for char*-based processing via the C string library (e.g., strcpy, strcmp, etc.) in many applications. Many details about String representations are described in the Representation section.
A separate SubString class supports substring extraction and modification operations. This is implemented in such a way that user programs never directly construct or represent substrings, which are only used indirectly via String operations.
Another separate class, Regex, is also used indirectly via String operations in support of regular expression searching, matching, and the like. The Regex class is based entirely on the GNU Emacs regex functions. See section `Syntax of Regular Expressions' in the GNU Emacs Manual for a full explanation of regular expression syntax. (For implementation details, see the internal documentation in files `regex.h' and `regex.c'.)
Strings are initialized and assigned as in the following examples:
String x; String y = 0; String z = "";
String x = "Hello"; String y("Hello");
String x = 'A'; String y('A');
String u = x; String v(x);
String u = x.at(1,4); String v(x.at(1,4));
String x("abc", 2);
String x = dec(20);
As in the last example, a String may be initialized or assigned the result of any char* function.
There are no directly accessible forms for declaring SubString variables.
The declaration Regex r("[a-zA-Z_][a-zA-Z0-9_]*"); creates a compiled regular expression suitable for use in the String operations described below (in this case, one that matches any C++ identifier). The first argument may also be a String.
Be careful in distinguishing the role of backslashes in quoted GNU C++ char* constants versus those in Regexes. For example, a Regex that matches either one or more tabs or all strings beginning with "ba" and ending with any number of occurrences of "na" could be declared as
Regex r = "\\(\t+\\)\\|\\(ba\\(na\\)*\\)";
Note that only one backslash is needed to signify the tab, but two are needed for the parenthesization and the vertical bar, since the GNU C++ lexical analyzer decodes and strips backslashes before they are seen by Regex.
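The effect of the lexical analyzer can be checked without the String library at all. The following is a minimal sketch in standard C++: std::string stands in for the char* constant, no Regex is constructed, and the function names are hypothetical. It inspects what is left of the literal once the compiler has processed the escapes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// The same literal as in the Regex declaration above, held in a plain
// std::string so that its contents can be inspected. This is a stand-in:
// the Regex class itself is not used here.
inline std::string pattern_after_lexing() {
    return "\\(\t+\\)\\|\\(ba\\(na\\)*\\)";
}

// Hypothetical helper: number of occurrences of character c in s.
inline std::size_t count_char(const std::string& s, char c) {
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), c));
}

// Usage example: what a regex engine would actually receive.
inline void check_backslash_counts() {
    const std::string p = pattern_after_lexing();
    assert(p.size() == 21);              // 21 characters survive lexing
    assert(count_char(p, '\t') == 1);    // "\t" became one real tab
    assert(count_char(p, '\\') == 7);    // each "\\" became one backslash
    assert(p[0] == '\\' && p[1] == '('); // the pattern starts with \(
}
```

Each doubled backslash in the source text reaches the regex engine as a single backslash, which is what the grouping and alternation operators of the Emacs-style syntax require.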
There are three additional optional arguments to the Regex constructor that are less commonly useful:
fast (default 0)
fast may be set to true (1) if the Regex should be "fast-compiled". This causes an additional compilation step that is generally worthwhile if the Regex will be used many times.
bufsize (default max(40, length of the string))
transtable (default none == 0)
As a convenience, several Regexes are predefined and usable in any program. Here are their declarations from `String.h'.
extern Regex RXwhite;      // = "[ \n\t]+"
extern Regex RXint;        // = "-?[0-9]+"
extern Regex RXdouble;     // = "-?\\(\\([0-9]+\\.[0-9]*\\)\\|
                           //    \\([0-9]+\\)\\|
                           //    \\(\\.[0-9]+\\)\\)
                           //    \\([eE][---+]?[0-9]+\\)?"
extern Regex RXalpha;      // = "[A-Za-z]+"
extern Regex RXlowercase;  // = "[a-z]+"
extern Regex RXuppercase;  // = "[A-Z]+"
extern Regex RXalphanum;   // = "[0-9A-Za-z]+"
extern Regex RXidentifier; // = "[A-Za-z_][A-Za-z0-9_]*"
Most String class capabilities are best shown via example. The examples below use the following declarations.
String x = "Hello";
String y = "world";
String n = "123";
String z;
const char* s = ",";
String lft, mid, rgt;
Regex r = "e[a-z]*o";
Regex r2("/[a-z]*/");
char c;
int i, pos, len;
double f;
String words[10];
words[0] = "a";
words[1] = "b";
words[2] = "c";
The usual lexicographic relational operators (==, !=, <, <=, >, >=) are defined. A functional form compare(String, String) is also provided, as is fcompare(String, String), which compares Strings without regard for upper vs. lower case.
All other matching and searching operations are based on some form of the (non-public) match and search functions. match and search differ in that match attempts to match only at the given starting position, while search starts at the position and then proceeds left or right looking for a match. As seen in the following examples, the optional second argument, startpos, to functions using match and search specifies the starting position of the search: if non-negative, it results in a left-to-right search starting at position startpos; if negative, a right-to-left search starting at position x.length() + startpos. In all cases, the index returned is that of the beginning of the match, or -1 if there is no match.
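The startpos convention can be illustrated with a rough analogue built on std::string. This is only a sketch under stated assumptions: index_sketch is a hypothetical name, std::string replaces String, and for a negative startpos it assumes that a match may begin no later than position x.length() + startpos. The actual search function may treat multi-character patterns differently in the right-to-left case.

```cpp
#include <cassert>
#include <string>

// Sketch of the startpos convention using std::string (an assumption; the
// String class itself is not used). Returns the index where the match
// begins, or -1 if there is no match.
inline int index_sketch(const std::string& x, const std::string& pat,
                        int startpos = 0) {
    const int len = static_cast<int>(x.length());
    std::string::size_type found;
    if (startpos >= 0) {
        if (startpos > len) return -1;
        // Left-to-right search beginning at startpos.
        found = x.find(pat, static_cast<std::string::size_type>(startpos));
    } else {
        const int from = len + startpos;
        if (from < 0) return -1;
        // Right-to-left search. Assumption: the match may begin no later
        // than position `from`.
        found = x.rfind(pat, static_cast<std::string::size_type>(from));
    }
    return found == std::string::npos ? -1 : static_cast<int>(found);
}

// Usage example with the string "Hello" (indexes count from 0).
inline void check_index_sketch() {
    const std::string x = "Hello";
    assert(index_sketch(x, "lo") == 3);    // leftmost "lo"
    assert(index_sketch(x, "l", 2) == 2);  // search rightward from 2
    assert(index_sketch(x, "l", -1) == 3); // search leftward from 5 - 1
    assert(index_sketch(x, "l", -3) == 2); // search leftward from 5 - 3
    assert(index_sketch(x, "z") == -1);    // no match
}
```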
Three String functions serve as front ends to search and match. index performs a search, returning the index; matches performs a match, returning nonzero (actually, the length of the match) on success; and contains is a boolean function performing either a search or a match, depending on whether an index argument is provided:
x.index("lo")
x.index("l", 2)
x.index("l", -1)
x.index("l", -3)
pos = r.search("leo", 3, len, 0)
Searches for Regex r in the char* string "leo" of length 3, starting at position 0, also placing the length of the match in reference parameter len.
x.contains("He")
x.contains("el", 1)
The second argument to contains, if present, means to match the substring only at that position, and not to search elsewhere in the string.
x.contains(RXwhite);
RXwhite is a global whitespace Regex.
x.matches("lo", 3)
x.matches(r)
int f = x.freq("l")
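The difference between the two forms of contains (search when no position is given, match at one position otherwise) can be sketched with std::string. The names below are hypothetical, only plain substrings are handled (no Regex), and negative positions are simply rejected here, which is a simplification of this sketch rather than a statement about the String class.

```cpp
#include <cassert>
#include <string>

// Search form: true if pat occurs anywhere in x.
inline bool contains_sketch(const std::string& x, const std::string& pat) {
    return x.find(pat) != std::string::npos;
}

// Match form: true only if pat occurs exactly at position pos.
// Simplification of this sketch: negative positions are not supported.
inline bool contains_sketch(const std::string& x, const std::string& pat,
                            int pos) {
    if (pos < 0 || static_cast<std::string::size_type>(pos) > x.size()) {
        return false;
    }
    // compare() looks only at the characters starting at pos; it does not
    // search elsewhere in x.
    return x.compare(static_cast<std::string::size_type>(pos),
                     pat.size(), pat) == 0;
}

// Usage example with the string "Hello".
inline void check_contains_sketch() {
    const std::string x = "Hello";
    assert(contains_sketch(x, "He"));      // found by searching
    assert(contains_sketch(x, "el", 1));   // "el" sits at position 1
    assert(!contains_sketch(x, "el", 0));  // present, but not at position 0
    assert(contains_sketch(x, "lo", 3));
    assert(!contains_sketch(x, "lo", 4));  // only "o" remains at position 4
}
```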
Substrings may be extracted via the at, before, through, from, and after functions. These behave as either lvalues or rvalues.
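As a rough analogue in standard C++, the rvalue use of these functions corresponds to copying out part of a std::string, and the lvalue use corresponds to replacing that part in place. The sketch below makes several assumptions: all names are hypothetical, only plain substring arguments are handled, and an empty result is returned when the pattern is absent. It does not reproduce the SubString mechanism itself.

```cpp
#include <cassert>
#include <string>

typedef std::string::size_type pos_t;

// Everything before the first occurrence of pat (empty if pat is absent).
inline std::string before_sketch(const std::string& x, const std::string& pat) {
    const pos_t p = x.find(pat);
    return p == std::string::npos ? std::string() : x.substr(0, p);
}

// Everything up to and including the first occurrence of pat.
inline std::string through_sketch(const std::string& x, const std::string& pat) {
    const pos_t p = x.find(pat);
    return p == std::string::npos ? std::string() : x.substr(0, p + pat.size());
}

// The first occurrence of pat and everything after it.
inline std::string from_sketch(const std::string& x, const std::string& pat) {
    const pos_t p = x.find(pat);
    return p == std::string::npos ? std::string() : x.substr(p);
}

// Everything after the first occurrence of pat.
inline std::string after_sketch(const std::string& x, const std::string& pat) {
    const pos_t p = x.find(pat);
    return p == std::string::npos ? std::string() : x.substr(p + pat.size());
}

// Lvalue analogue of x.before(pat) = repl: replace the leading part in place.
// Returns false and leaves x unchanged if pat is absent.
inline bool assign_before_sketch(std::string& x, const std::string& pat,
                                 const std::string& repl) {
    const pos_t p = x.find(pat);
    if (p == std::string::npos) return false;
    x.replace(0, p, repl);
    return true;
}

// Usage example with the string "Hello".
inline void check_substring_sketch() {
    std::string x = "Hello";
    assert(before_sketch(x, "o") == "Hell");
    assert(through_sketch(x, "el") == "Hel");
    assert(from_sketch(x, "el") == "ello");
    assert(after_sketch(x, "Hel") == "lo");
    assert(assign_before_sketch(x, "ll", "Bri"));
    assert(x == "Brillo");
}
```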
z = x.at(2, 3)
x.at(2, 2) = "r"
x.at("He") = "je";
x.at("l", -1) = "i";
z = x.at(r)
z = x.before("o")
x.before("ll") = "Bri";
z = x.before(2)
z = x.after("Hel")
z = x.through("el")
z = x.from("el")
x.after("Hel") = "p";
z = x.after(3)
z = " ab c"; z = z.after(RXwhite)
x[0] = 'J';
common_prefix(x, "Help")
common_suffix(x, "to")
z = x + s + ' ' + y.at("w") + y.after("w") + ".";
x += y;
cat(x, y, z)
cat(z, y, x, x)
y.prepend(x);
z = replicate(x, 3);
z = join(words, 3, "/")
z = "this string has five words"; i = split(z, words, 10, RXwhite);
int nmatches = x.gsub("l","ll")
z = x + y; z.del("loworl");
z = reverse(x)
z = upcase(x)
z = downcase(x)
z = capitalize(x)
x.reverse(), x.upcase(), x.downcase(), x.capitalize()
cout << x
cout << x.at(2, 3)
cin >> x
x.length()
s = (const char*)x
This conversion extracts the char* char array. The coercion is useful for sending a String as an argument to any function expecting a const char* argument (like atoi and File::open). This operator must be used with care, since the conversion returns a pointer to String internals without copying the characters: the resulting (char*) is only valid until the next String operation, and you must not modify it. (The conversion is defined to return a const value so that GNU C++ will produce warning and/or error messages if changes are attempted.)
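The same care applies to the comparable facility in standard C++, which makes for a convenient illustration. The sketch below uses std::string::c_str() as a stand-in for the String conversion operator (an assumption; the names are hypothetical). It shows the two safe patterns: use the pointer immediately, or copy the characters before the source string changes.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Pass the internal character array straight to a C function. The pointer
// is used immediately and is not kept.
inline int to_int_sketch(const std::string& n) {
    return std::atoi(n.c_str());
}

// Keep the characters across later modifications by copying them first.
inline std::string owned_copy_sketch(const std::string& x) {
    const char* p = x.c_str();  // points into x; nothing is copied yet
    return std::string(p);      // copies the characters
}

// Usage example.
inline void check_conversion_sketch() {
    assert(to_int_sketch("123") == 123);

    std::string x = "Hello";
    const std::string kept = owned_copy_sketch(x);
    x += " world";              // may invalidate any pointer taken earlier
    assert(kept == "Hello");    // the copy is unaffected
    assert(x == "Hello world");
}
```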