Suppose that four elementary steps are selected at path optimization
time. Then recode will split itself into four different tasks
interconnected with pipes, logically equivalent to:
     step1 <input | step2 | step3 | step4 >output
The main driver constructs, while initializing all conversion modules, a table giving all the conversion routines available (single steps) and, for each, the starting charset and the ending charset. If we consider these charsets as being the nodes of a directed graph, each single step may be considered as an oriented arc from one node to the other. A cost is attributed to each arc: for example, a high penalty is given to single steps which are prone to losing characters, a low penalty is given to those which need to study more than one input character to produce an output character, etc.
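The table of arcs described above can be pictured with a small C sketch. This is only an illustration: the structure layout, the function names and the penalty values are assumptions made for the example, not the ones recode actually uses.

```c
#include <stddef.h>

/* One single step, seen as an arc of the charset graph.
   Hypothetical layout, for illustration only. */
struct step_arc {
    const char *before;  /* starting charset (tail of the arc) */
    const char *after;   /* ending charset (head of the arc) */
    int cost;            /* penalty used when comparing routes */
};

/* Assumed penalty values: a base cost for any arc, a high penalty for
   steps prone to losing characters, a low penalty for steps which must
   study several input characters to produce one output character. */
int step_cost(int loses_characters, int multi_char_input)
{
    int cost = 1;
    if (loses_characters)
        cost += 10;
    if (multi_char_input)
        cost += 2;
    return cost;
}

/* Build one table entry from the properties of a single step. */
struct step_arc make_step_arc(const char *before, const char *after,
                              int loses_characters, int multi_char_input)
{
    struct step_arc arc;
    arc.before = before;
    arc.after = after;
    arc.cost = step_cost(loses_characters, multi_char_input);
    return arc;
}
```

With these assumed values, a lossy step always costs more than a step that merely needs to look at several input characters, which is the ordering the text describes.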
Given a starting code and a goal code, recode computes the most
economical route through the elementary recodings, that is, the best
sequence of conversions that will transform the input charset into the
final charset. To speed up execution, recode looks for subsequences of
conversions which are simple enough to be merged; it then dynamically
creates new single steps and, of course, uses them.
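The search for the most economical route can be sketched as a shortest path computation over the arcs. The text does not say which algorithm recode uses, so the sketch below substitutes a plain Bellman-Ford relaxation. Charsets are assumed to be numbered from 0, costs are assumed to be small and non-negative, and all names (struct arc, cheapest_route, MAX_NODES) are hypothetical.

```c
#include <limits.h>

#define MAX_NODES 16   /* assumed upper bound on the number of charsets */

/* One single step: an arc between two charset numbers, with its cost. */
struct arc {
    int from;
    int to;
    int cost;
};

/* Return the total cost of the cheapest route from start to goal,
   or -1 if the goal cannot be reached.  On return, prev_arc[n] holds
   the index of the arc used to reach charset n on the best route found
   (or -1); prev_arc must have room for n_nodes entries. */
int cheapest_route(const struct arc *arcs, int n_arcs, int n_nodes,
                   int start, int goal, int *prev_arc)
{
    int dist[MAX_NODES];

    if (n_nodes <= 0 || n_nodes > MAX_NODES
        || start < 0 || start >= n_nodes || goal < 0 || goal >= n_nodes)
        return -1;

    for (int i = 0; i < n_nodes; i++) {
        dist[i] = INT_MAX;
        prev_arc[i] = -1;
    }
    dist[start] = 0;

    /* Relax every arc until nothing improves (at most n_nodes - 1 passes). */
    for (int pass = 0; pass < n_nodes - 1; pass++) {
        int changed = 0;
        for (int a = 0; a < n_arcs; a++) {
            int from = arcs[a].from;
            int to = arcs[a].to;
            if (from < 0 || from >= n_nodes || to < 0 || to >= n_nodes)
                continue;               /* ignore malformed arcs */
            if (dist[from] == INT_MAX)
                continue;               /* tail not reached yet */
            if (dist[from] + arcs[a].cost < dist[to]) {
                dist[to] = dist[from] + arcs[a].cost;
                prev_arc[to] = a;
                changed = 1;
            }
        }
        if (!changed)
            break;
    }
    return dist[goal] == INT_MAX ? -1 : dist[goal];
}
```

Following prev_arc backwards from the goal gives the sequence of single steps. Merging adjacent simple steps would then be a separate pass over that sequence.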
A double step is a sequence of two single steps, the output of the
first being the special charset rfc1345 (which is not directly
available to the user), the input of the second single step being also
rfc1345. A special machinery dynamically produces efficient,
reversible, mergeable single steps out of these double steps.
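To give an idea of what turning a double step into a single step involves, here is a minimal sketch that composes two tables through a pivot. It is not the actual machinery of recode, which is not described here. It assumes that pivot codes are small non-negative integers and that both halves are plain lookup tables; NO_CODE and merge_double_step are hypothetical names.

```c
#define NO_CODE (-1)   /* assumed marker: no equivalent character */

/* Compose "charset A to pivot" with "pivot to charset B" into one
   direct 256-entry table.

   to_pivot[c]    pivot code for byte c of charset A, or NO_CODE
   from_pivot[p]  byte of charset B for pivot code p, or NO_CODE
   n_pivot        number of entries in from_pivot
   merged[c]      receives the byte of charset B for byte c, or NO_CODE

   Returns the number of bytes of charset A which have no equivalent
   in charset B. */
int merge_double_step(const int to_pivot[256], const int *from_pivot,
                      int n_pivot, int merged[256])
{
    int lost = 0;

    for (int c = 0; c < 256; c++) {
        int p = to_pivot[c];
        if (p >= 0 && p < n_pivot && from_pivot[p] != NO_CODE) {
            merged[c] = from_pivot[p];
        } else {
            merged[c] = NO_CODE;
            lost++;
        }
    }
    return lost;
}
```

The merged table lets each character be recoded with a single lookup instead of two, and the count of lost bytes is the kind of information a cost function could use.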
The main part of recode is written in C, as are most single steps.
A few single steps need to recognize sequences of multiple characters;
they are often better written in flex.
It is easy for a programmer to add a new charset to recode. All it
requires is making a few functions kept in a single `.c' file,
adjusting `Makefile.in', and remaking recode.
One of the functions should convert from any previous charset to the new one. Any previous charset will do, but try to select it so you will not lose too much information while converting. The other function should convert from the new charset to any older one. You do not have to select the same old charset as the one you selected for the previous routine. Once again, select any charset for which you will not lose too much information while converting.
If, for either of these two functions, you have to read multiple bytes
of the old charset before recognizing the character to produce, you
might prefer programming it in flex in a separate `.l' file.
Prototype your C or flex files after one of those which exist already,
so as to keep the sources uniform. Besides, at make time, all `.l'
files are automatically merged into a single big one by the script
`mergelex.awk', which requires sources to follow some rules. Mimetism
is a simple approach which relieves me of explaining all these rules!
Each of your source files should have its own initialization function,
named module_charset, which is meant to be executed quickly, once,
prior to any recoding. It should declare the name of your charsets and
the single steps (or elementary recodings) you provide, by calling
declare_step one or more times. Besides the charset names,
declare_step expects a description of the recoding quality (see
`recode.h') and two functions you also provide.
The first such function has the purpose of allocating structures,
preconditioning conversion tables, etc. It is also the usual way of
further modifying the STEP structure. This function is executed only
if and when the single step is retained in an actual recoding
sequence. If you do not need such delayed initialization, merely use
NULL for the function argument.
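The shape of such a module can be sketched as follows. Be warned that this is a mock-up: the real declarations live in `recode.h' and are not reproduced here, so the STEP layout, the argument order of declare_step, the quality constant, the return convention and the toy charset name foo are all assumptions made for the sake of a self-contained example.

```c
#include <stdio.h>
#include <stddef.h>

/* Stand-in for the STEP structure (hypothetical layout). */
typedef struct step {
    const char *before;
    const char *after;
    int quality;
    void (*init)(struct step *);                          /* delayed init */
    int (*recode_file)(const struct step *, FILE *, FILE *);
} STEP;

enum { QUALITY_REVERSIBLE = 1 };   /* hypothetical quality description */

#define MAX_STEPS 8
static STEP step_table[MAX_STEPS];
static int step_count;

/* Stand-in for declare_step: records one single step in the table. */
static void declare_step(const char *before, const char *after, int quality,
                         void (*init)(STEP *),
                         int (*recode_file)(const STEP *, FILE *, FILE *))
{
    if (step_count < MAX_STEPS) {
        STEP *step = &step_table[step_count++];
        step->before = before;
        step->after = after;
        step->quality = quality;
        step->init = init;
        step->recode_file = recode_file;
    }
}

/* Toy recoding of a whole file: flips the high bit of every byte.
   It is its own inverse, so it serves for both directions.
   Returning 0 for success is an assumed convention. */
static int file_flip_high_bit(const STEP *step, FILE *in, FILE *out)
{
    int c;

    (void) step;
    while ((c = getc(in)) != EOF)
        putc(c ^ 0x80, out);
    return 0;
}

/* Module initialization for the toy charset foo: quick, executed once.
   No delayed initialization is needed, hence NULL. */
static void module_foo(void)
{
    declare_step("latin1", "foo", QUALITY_REVERSIBLE, NULL,
                 file_flip_high_bit);
    declare_step("foo", "latin1", QUALITY_REVERSIBLE, NULL,
                 file_flip_high_bit);
}
```

Calling module_foo once registers two arcs, one in each direction. An existing module in the distribution remains the better model for the real calling conventions.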
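The second function executes the elementary recoding on a whole file.
There are a few cases when you can spare writing this function:
* If the step merely copies its input onto its output, use the
  predefined function file_one_to_one, but have a delayed
  initialization for presetting the STEP field one_to_one to the
  predefined value one_to_same.
* If the step is entirely driven by a table recoding one character
  into another, use the predefined function file_one_to_one, but have
  a delayed initialization for presetting the STEP field one_to_one
  with your table.
* If the step is entirely driven by a table recoding one character
  into a string, use the predefined function file_one_to_many, but
  have a delayed initialization for presetting the STEP field
  one_to_many with your table.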
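The table-driven case might look like the sketch below. Only the field name one_to_one comes from the text. Its type, the reduced STEP structure, and the helper buffer_one_to_one (which stands in for file_one_to_one but works on a memory buffer so the example stays short) are assumptions.

```c
#include <stddef.h>

/* Reduced stand-in for STEP; the type of the field is assumed. */
typedef struct step {
    const unsigned char *one_to_one;   /* 256-entry recoding table */
} STEP;

static unsigned char toy_table[256];

/* Delayed initialization: builds the table only when the step is
   retained, then presets the one_to_one field with it.  The toy table
   maps ASCII lower case letters to upper case and leaves every other
   byte unchanged. */
static void init_toy_step(STEP *step)
{
    for (int c = 0; c < 256; c++)
        toy_table[c] = (unsigned char) c;
    for (int c = 'a'; c <= 'z'; c++)
        toy_table[c] = (unsigned char) (c - 'a' + 'A');
    step->one_to_one = toy_table;
}

/* What a generic one-to-one recoder boils down to: one table lookup
   per input byte, with no code specific to the charsets involved. */
static void buffer_one_to_one(const STEP *step, unsigned char *buffer,
                              size_t length)
{
    for (size_t i = 0; i < length; i++)
        buffer[i] = step->one_to_one[buffer[i]];
}
```

Since the generic routine only needs the table, the module author writes the delayed initialization and nothing else.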
If you have a recoding table handy in a suitable format but do not use
one of the predefined recoding functions, it is still a good idea to use
a delayed initialization to save it anyway, because the recode option
-h will take advantage of this information when available.
Finally, edit `Makefile.in' to add the source file name of your
routines to the C_STEPS or L_STEPS macro definition, depending on
whether your routines are written in C or in flex. For C files only,
also modify the STEPOBJS macro definition.