word2x Sources

General layout

The top-level directory contains the front end and small things that do not (yet) merit a directory of their own. Things that have not yet moved out of this directory are polluting it too :-)

The inc subdirectory contains all the include files and is in imminent danger of losing some of them. In particular, header files specific to individual reader stages are likely to move into the reader subdirectory.

compat contains objects that implement things that word2x wants but that are not present in various vendors' C libraries. lib contains objects used by various bits of word2x, some of which are now deemed not general purpose enough to stay in the library and are migrating to the source files that need them.

The doc directory contains some developer-oriented documentation. samples contains various Word documents, emailed to me by various people having trouble converting them. devel_progs contains a few goodies you might find useful (in particular there is a hex dump program that you might prefer to od).

RCS directories contain RCS sources, some of which are none too flattering about the ability of yours truly :-)

Reader stages

The reader directory contains sources for reading a Word file and turning it into a stream of tokens. reader.cc implements a few shared reader functions, argparse.cc is covered below, and tok_misc.cc contains the code to print tokens in a programmer-friendly manner.

junk_filter.cc is a class derived from streambuf for stripping junk out of Word documents and reducing them to text. collect.cc is a token source that combines entire paragraphs from the input stream, currently usually generated by junk_filter.cc.
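
To make the streambuf idea concrete, here is a minimal sketch of a filtering stream buffer in modern C++. It is not the code in junk_filter.cc: the class name printable_filter, the helper strip_junk, and the rule for what counts as junk (anything that is not printable, a newline or a tab) are all assumptions made for illustration.

```cpp
#include <cctype>
#include <istream>
#include <iterator>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical stand-in for a junk filter: wraps another streambuf and
// silently drops every byte that is not printable text, '\n' or '\t'.
class printable_filter : public std::streambuf {
public:
    explicit printable_filter(std::streambuf *src) : src_(src) {}

protected:
    // Called whenever the one-character get area has been used up.
    int_type underflow() override {
        for (;;) {
            int_type c = src_->sbumpc();
            if (traits_type::eq_int_type(c, traits_type::eof()))
                return traits_type::eof();
            unsigned char uc =
                static_cast<unsigned char>(traits_type::to_char_type(c));
            if (std::isprint(uc) || uc == '\n' || uc == '\t') {
                ch_ = static_cast<char>(uc);
                setg(&ch_, &ch_, &ch_ + 1);  // expose exactly one character
                return traits_type::to_int_type(ch_);
            }
            // Anything else is treated as junk and skipped.
        }
    }

private:
    std::streambuf *src_;
    char ch_ = 0;
};

// Convenience helper: push a whole string through the filter.
std::string strip_junk(const std::string &raw) {
    std::istringstream in(raw);
    printable_filter filter(in.rdbuf());
    std::istream clean(&filter);
    return std::string(std::istreambuf_iterator<char>(clean),
                       std::istreambuf_iterator<char>());
}
```

Because the filter is itself a streambuf, anything that reads from an istream (such as a paragraph collector) can sit on top of it without knowing that filtering is going on.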

The other reader stages process token streams produced by some source derived from tok_trans (any tok_src subclass is such a class, incidentally).

extract_emebed.cc splits out embedded formatting from unprocessed paragraphs. 99% of the formatting is stored separately, but the remaining 1% is easy enough to interpret.
start_end.cc inserts <DOCUMENT, START> before the first token and <DOCUMENT, END> after the last token.
old_table.cc is the old table processing code.
table_sz.cc determines the size of a table.
table_fill.cc fills in the entries, and is designed to feed from an instance of table_sz.
table_stuff.cc adds blank entries to make up whole rows and is a useful postprocessor for either old_table or a table_sz, table_fill pipeline.
eqn.cc handles old-style inlined maths and its abuse by Word users.
eqnarr.cc handles multiple consecutive paragraphs of maths and treats them as LaTeX eqnarray candidates.
list.cc detects and converts numbered, bulleted, etc. lists (defined by having the numbers, letters, or bullets, enough items and not too much separation between items).
sections.cc tries to catch section headings and treat them as headings.
dlopen.xcc is an attempt to support stages linked in at run time using the dlopen(2) functions, supported by many Unices.
null.cc is a stub for things that are not yet implemented.
tokwall.cc queues up tokens and forwards them in big chunks.
tap.cc prints the tokens going through it and does no processing.
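
As one small illustration of the stages listed above, the row padding described for table_stuff.cc might be sketched as follows. This is only a guess at the idea, not the real code: the function name pad_rows and the representation of a table as a flat list of strings are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: pads a flat list of table entries with blank
// strings until its length is a whole number of rows, where each row
// holds `columns` entries. A column count of zero leaves the list alone.
std::vector<std::string> pad_rows(std::vector<std::string> cells,
                                  std::size_t columns) {
    if (columns == 0)
        return cells;
    std::size_t leftover = cells.size() % columns;
    if (leftover != 0)
        cells.insert(cells.end(), columns - leftover, std::string());
    return cells;
}
```

With three columns, four entries are padded out to six (two blanks added), while a list that already holds three entries is returned unchanged.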

eqnarr.cc and sections.cc are not implemented yet and are currently replaced by a stub. I will fill them in eventually, but hope someone does so before I get around to this stuff.

Everything with tunable values allows people to twiddle them as options using the modular option handling.

The front end

The front end of word2x is implemented in word2x.cc, driver.cc and mainopts.cc.

word2x.cc is the front end of the program and gets thinner as more stuff is deemed worthy of a separate source file. --digest is handled specially in this file. The global options are delegated to mainopts.cc.

driver.cc constructs the driver using the specification given to --digest or "default". make_stages calls make_stage with each comma separated stage. If the stage is an alias then make_stage calls make_stages with the expansion of the alias. Currently circular alias expansion is not detected.
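
The mutual recursion between make_stages and make_stage can be sketched roughly like this. It is a simplified illustration, not the code in driver.cc: the functions here are called expand_stages and expand_stage and only collect stage names rather than building stage objects, the alias_map type is hypothetical, and the depth limit is an addition that the real code does not have (it is there only to show one way a circular alias could be caught).

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical alias table: alias name -> comma separated expansion.
typedef std::map<std::string, std::string> alias_map;

// Assumed limit; NOT part of word2x, which does not detect cycles.
const int max_alias_depth = 32;

void expand_stages(const std::string &spec, const alias_map &aliases,
                   std::vector<std::string> &out, int depth = 0);

// Handles one stage name: expand it if it is an alias, else record it.
void expand_stage(const std::string &name, const alias_map &aliases,
                  std::vector<std::string> &out, int depth) {
    alias_map::const_iterator it = aliases.find(name);
    if (it == aliases.end()) {
        out.push_back(name);  // a real stage; the driver would build it here
        return;
    }
    if (depth >= max_alias_depth)
        throw std::runtime_error("alias expansion too deep: " + name);
    expand_stages(it->second, aliases, out, depth + 1);
}

// Splits a comma separated specification and handles each piece in turn.
void expand_stages(const std::string &spec, const alias_map &aliases,
                   std::vector<std::string> &out, int depth) {
    std::string::size_type start = 0;
    while (start <= spec.size()) {
        std::string::size_type comma = spec.find(',', start);
        if (comma == std::string::npos)
            comma = spec.size();
        std::string name = spec.substr(start, comma - start);
        if (!name.empty())
            expand_stage(name, aliases, out, depth);
        start = comma + 1;
    }
}
```

Without some guard of this kind, an alias whose expansion mentions itself (directly or through another alias) would recurse until the stack runs out.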

mainopts.cc implements the global options using the fancy argument parsing features found in argument.cc. The --digest option is a dummy to stop getopt_long complaining about this option.

Infrastructure

Currently there are two major pieces of infrastructure, modular argument parsing and the modular reader stages.

argument.cc implements the fancy argument parsing. The only major code is the argparse class. The constructor collects all the arguments together. parse_args calls getopt_long and forwards the arguments to the appropriate modules. Finally the destructor frees all the allocated data, etc.
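
The collect, parse and forward pattern described above could look something like the sketch below. It does not reproduce the argparse class: option_registry, add and parse are hypothetical names, and handlers are plain callbacks rather than modules. The only real API used is getopt_long, the GNU extension declared in <getopt.h>.

```cpp
#include <getopt.h>

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical registry: each "module" contributes long options, and
// parse() hands every option it sees back to whoever registered it.
class option_registry {
public:
    // The handler receives the option's value, or a null pointer when
    // the option takes no argument.
    typedef std::function<void(const char *value)> handler;

    void add(const std::string &name, bool takes_value, handler h) {
        entries_.push_back(entry{name, takes_value, h});
    }

    // Returns false if getopt_long reports an unknown or malformed option.
    bool parse(int argc, char *const argv[]) {
        // Build the table getopt_long expects, ending with a zeroed entry.
        std::vector<struct option> longopts;
        for (const entry &e : entries_) {
            struct option o = {e.name.c_str(),
                               e.takes_value ? required_argument : no_argument,
                               nullptr, 0};
            longopts.push_back(o);
        }
        struct option end = {nullptr, 0, nullptr, 0};
        longopts.push_back(end);

        optind = 1;  // restart scanning at the first real argument
        int index = 0;
        int c;
        while ((c = getopt_long(argc, argv, "", longopts.data(), &index)) != -1) {
            if (c != 0)
                return false;  // '?' and friends
            // flag is null and val is 0, so a match returns 0 and sets index.
            entries_[static_cast<std::size_t>(index)].h(optarg);
        }
        return true;
    }

private:
    struct entry {
        std::string name;
        bool takes_value;
        handler h;
    };
    std::vector<entry> entries_;
};
```

Registering a dummy entry for an option that is handled elsewhere, as described for --digest, fits this pattern too: the entry exists only so that getopt_long recognises the name.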

The reader stages are classes derived from tok_trans. Token sources are derived from tok_src, which is derived from tok_trans. Almost everything is implemented as inline code in inc/reader.h.
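
A rough picture of that class hierarchy, with one stage plugged into a source, is sketched below. Only the class names tok_trans and tok_src come from the text above; the token struct, the next() member, vector_src, start_end_stage and drain are invented for this example, and the real interface in inc/reader.h may look quite different. Everything lives in a namespace called sketch to keep that distinction visible.

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical token: a (kind, text) pair such as <DOCUMENT, START>.
struct token {
    std::string kind;
    std::string text;
};

// Assumed base interface: every stage hands out tokens one at a time.
class tok_trans {
public:
    virtual ~tok_trans() {}
    // Stores the next token in `out`; returns false once the stream ends.
    virtual bool next(token &out) = 0;
};

// Sources sit at the head of a pipeline and have no upstream stage.
class tok_src : public tok_trans {};

// Hypothetical source that replays a fixed list of tokens.
class vector_src : public tok_src {
public:
    explicit vector_src(const std::vector<token> &toks) : toks_(toks), pos_(0) {}
    bool next(token &out) override {
        if (pos_ >= toks_.size())
            return false;
        out = toks_[pos_++];
        return true;
    }

private:
    std::vector<token> toks_;
    std::size_t pos_;
};

// Stage in the spirit of start_end.cc: brackets the upstream tokens
// with <DOCUMENT, START> and <DOCUMENT, END>.
class start_end_stage : public tok_trans {
public:
    explicit start_end_stage(tok_trans &in) : in_(in), state_(before) {}
    bool next(token &out) override {
        switch (state_) {
        case before:
            out = token{"DOCUMENT", "START"};
            state_ = middle;
            return true;
        case middle:
            if (in_.next(out))
                return true;
            out = token{"DOCUMENT", "END"};
            state_ = done;
            return true;
        case done:
            break;
        }
        return false;
    }

private:
    enum state { before, middle, done };
    tok_trans &in_;
    state state_;
};

// Pulls every token out of a stage; handy for experiments.
std::vector<token> drain(tok_trans &stage) {
    std::vector<token> all;
    token t;
    while (stage.next(t))
        all.push_back(t);
    return all;
}

}  // namespace sketch
```

Since a source is itself a tok_trans, a stage only needs a reference to a tok_trans and does not care whether it is reading from a source or from another stage.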

reader/reader.cc implements a few shared reader functions. argparse.cc treats arguments passed to modules as a list of long format options, with the -- prefix removed, separated by commas, with arguments separated by = signs. Modules are free to treat the argument as they see fit, but currently all modules with tunable parameters feed it to argparse.
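
The option string format just described can be illustrated with a small splitter. This is a sketch of the format only, not of argparse.cc: the names module_option and split_module_options are hypothetical, as are the option names used in the note that follows.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: (option name, value); the value is empty
// when the option was given without an = sign.
typedef std::pair<std::string, std::string> module_option;

// Splits "name=value,other" into its options, skipping empty items.
std::vector<module_option> split_module_options(const std::string &arg) {
    std::vector<module_option> opts;
    std::string::size_type start = 0;
    while (start <= arg.size()) {
        std::string::size_type comma = arg.find(',', start);
        if (comma == std::string::npos)
            comma = arg.size();
        std::string item = arg.substr(start, comma - start);
        if (!item.empty()) {
            std::string::size_type eq = item.find('=');
            if (eq == std::string::npos)
                opts.push_back(module_option(item, ""));
            else
                opts.push_back(
                    module_option(item.substr(0, eq), item.substr(eq + 1)));
        }
        start = comma + 1;
    }
    return opts;
}
```

Given the made-up string min_items=3,verbose, this yields two options: min_items with value 3, and verbose with no value. Each is what a command line would spell as --min_items=3 and --verbose.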


Duncan Simpson

Last modified: Tue Apr 28 20:05:41 BST 1998