Word2x FAQ

This FAQ is intended to answer questions that people have asked me at least semi-frequently, or that I feel need airing.

Question index

  1. What is word2x?
  2. How do I compile word2x?
  3. What hacking is required to compile word2x?
  4. word2x emits strange messages when processing the document.
  5. word2x does not handle tab characters
  6. Some of my document is in the wrong order or missing
  7. word2x generates no output.
  8. word2x output duplicates some of the text.
  9. word2x output includes some junk.
  10. word2x dumps core given <a given document>
  11. word2x misses all the structure
  12. Word2x output prints differently from the original Word document.
  13. What is currently happening?

Questions

What is word2x?

word2x is a program for extracting the text from Word .DOC files (RTF is not supported at present, see below) while, hopefully, retaining most of the structure of the document. Even when word2x misses the structure it is a much better solution than using the traditional UNIX tools designed for binaries, for example strings.
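For comparison, the kind of extraction a tool like strings performs can be sketched in a few lines. This is only an illustration of the baseline that word2x improves on, not code from word2x; the function name printable_runs and the minimum run length of 4 are assumptions chosen for the example.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII bytes (tab included).

    This mimics what a strings-style tool does: it knows nothing about
    paragraphs, headings or lists, so all document structure is lost.
    """
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


if __name__ == "__main__":
    sample = b"\x00\x01Hello world\x00ab\x00\x02Second paragraph\xff"
    for run in printable_runs(sample):
        print(run)
```

Everything comes back as a flat list of fragments, which is why guessing the structure, even imperfectly, is worth the effort.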

The present version's understanding of the binary formatting information is approximately nil, so there is a considerable body of code designed to guess the structure. Given my test documents word2x figures out most of the structure.

I am aware of some documents that lack all the hints word2x looks for and appear as plain text. If anyone has such a document and a fix that does not mess up other documents, I would like to know.

How do I compile word2x?

Compiling the development version is left as a challenge and may not be possible. It is a snapshot of my current development source tree, with a few things left out (for legal or ethical reasons). Complaints about things being incomplete will not be accepted.

The last complete version should produce a makefile and config.h file that will allow you to type "make" and let make do the rest. Microsoft users will need gmake due to the lack of features in nmake. There may be some fixes for things that have arisen in the source directory and these have been applied to the source here. The first fix to word2x-0.002 fixes some header file problems that managed not to affect most people.

What hacking is required to compile word2x?

Hopefully none :-) I am attempting to eliminate the platform-specific problems (compiler bugs, no ranlib on SGI, ANSI violations, etc). Various platforms still have problems that have not been fixed yet.

Word2x generates strange messages

This answer also explains why you get junk in the extracted text.

These probably indicate confusion about, or violations of, the program's assumptions about how the Word format works. Less debugged and less bulletproof bits of code generate more messages.

This problem arises because the reader strips out almost all of the binary formatting information and then attempts to figure out everything from the remaining information. If the code dumps core then it is a bug and should be fixed.

The extraction code sometimes picks up bits of junk that are in other parts of the document. The real fix is an OLE reading library, which would make it possible to eliminate almost all the problems. Hint: if you want this quickly then finish the coding off for me.

Setting the parameters to more aggressive values will cut out more junk at the risk of missing text you wanted to keep as well.
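The trade-off can be illustrated with a toy filter. This is a sketch under assumptions, not the filter word2x uses: the scoring rule (the fraction of characters that look like ordinary text), the names text_ratio and filter_junk, and the threshold values are all invented for the example.

```python
def text_ratio(chunk: str) -> float:
    """Fraction of characters that look like ordinary prose (hypothetical rule)."""
    if not chunk:
        return 0.0
    good = sum(ch.isalnum() or ch in " .,;:!?'\"()-" for ch in chunk)
    return good / len(chunk)


def filter_junk(chunks: list[str], min_ratio: float = 0.8) -> list[str]:
    """Keep chunks whose text ratio is at least min_ratio.

    A higher min_ratio is a more aggressive setting: more junk is removed,
    but borderline real text is removed too.
    """
    return [c for c in chunks if text_ratio(c) >= min_ratio]


if __name__ == "__main__":
    chunks = ["A normal sentence.", "x#@$%^&*~|", "Price: $5 (approx.)"]
    print(filter_junk(chunks, 0.8))   # moderate setting
    print(filter_junk(chunks, 0.95))  # aggressive setting
```

With the moderate setting the junk chunk is dropped and both sentences survive; with the aggressive setting the line containing a dollar sign is lost as well, which is exactly the risk described above.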

Word2x does not handle tabs

This is a known feature and will be fixed when I think of a good method of dealing with the uses and abuses of tabs. Handling tabs requires converting multiple tokens into some appropriate alternative tokens. If anyone has any brilliant ideas about how this should operate please tell me. Junking leading tabs at the beginning of a paragraph, since *TeX gets it right anyway, is likely to happen moderately soon.
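The simplest of these ideas, junking leading tabs, might look something like the sketch below. It assumes paragraphs are already available as separate strings, which is an assumption made for the example and says nothing about how word2x represents them internally; the name strip_leading_tabs is likewise hypothetical.

```python
def strip_leading_tabs(paragraphs: list[str]) -> list[str]:
    """Remove tab characters at the start of each paragraph only.

    Tabs inside a paragraph are left alone, because deciding what they
    mean (table column, alignment, indentation) is the hard part.
    """
    return [p.lstrip("\t") for p in paragraphs]


if __name__ == "__main__":
    paras = ["\tIndented first line", "name\tvalue", "\t\tTwo tabs"]
    print(strip_leading_tabs(paras))
```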

Some of my document is in the wrong order or missing

word2x is currently largely a text stripper on serious drugs. If you want the binary format understood well enough that things like headings, which get extracted from the binary information at the end of the document, end up in the right place, send me the code, licensed under something liberal enough for me to redistribute under the GPL.

If you want Word Basic or VB to work, then reverse-engineer the P-code and send me enough to find the macro references and run the P-code. Anti-reverse-engineering clauses are illegal in Europe (as is much of the rest of shrinkwrap everywhere; in particular the standard disclaimers almost certainly contravene the laws about products being fit for their purpose, etc).

The tests are "Can I distribute your stuff under the GPL?" and "Does it work?". If either answer is no then I cannot accept the code.

word2x output duplicates some of the text

The document was fast saved. A version of word2x that understands OLE (relatively close) and fast save information (not likely soon) is the real fix. In the meantime use your favorite text editor. word2x is a lot faster than using strings and emacs (and similar methods). Even with the format documents I will need plenty of test samples. (Hint: contributions that deal with this stuff will be accepted).
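If the duplicated text is an exact repeat, part of the text editor work can be scripted. The sketch below is a stopgap for cleaning up the output, not a fix for fast save itself: it assumes the output has been split into paragraphs, it treats paragraphs as duplicates when they match after whitespace is normalised, and it will also delete legitimately repeated paragraphs. The name drop_repeated_paragraphs is made up for the example.

```python
def drop_repeated_paragraphs(paragraphs: list[str]) -> list[str]:
    """Keep the first occurrence of each paragraph, ignoring whitespace differences.

    Empty paragraphs are always kept so that blank separators survive.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for para in paragraphs:
        key = " ".join(para.split())  # collapse runs of whitespace
        if key and key in seen:
            continue  # exact repeat of earlier text: skip it
        seen.add(key)
        kept.append(para)
    return kept


if __name__ == "__main__":
    paras = ["Intro", "Body text", "Intro", "", "Body  text", ""]
    print(drop_repeated_paragraphs(paras))
```

Check the result by eye afterwards; there is no way for a script like this to know which copy of a revised paragraph is the current one.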

word2x generates no output

First ensure that it really is Word binary format. word2x does not handle RTF, but there is no shortage of tools that do. If you send me a reader that converts RTF to the central format then word2x will support RTF. Such a beast is below the bottom of the wish list in priority order.

Assuming this is not the problem, note that Word documents can be big and contain no text. Assuming the document has enough text, it might contain too much junk at the beginning, which kills the start recognition code. The next major release will hopefully eliminate many problems by understanding OLE archives. (Rather a lot changed and not everything has been updated yet).

Tuning the junk filtration parameters to less aggressive values might fix the problem but is liable to reduce the signal-to-noise ratio (text to junk) of documents that do work. The new infrastructure, which makes it easy to set tuning parameters on the command line, will fix the problem that changes take a recompile.

Assuming it is not one of the known problems (RTF, not Word binary format, etc) it is presumably a limitation I would like to fix. Please make sure the document does not contain any information that should be kept from me and email it to word2x@duncan.telstar.net.
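Before sending anything, the first check (is the file really RTF?) is easy to automate by looking at the first few bytes. The sketch below is a rough aid and not part of word2x. RTF files start with the characters {\rtf, and OLE compound files, the container used by Word 6 and later, start with a fixed eight byte signature. An "ole" result only means the file is an OLE container, not that it is a Word document, and older Word files will be reported as "unknown". The function name sniff_format is hypothetical.

```python
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE compound file signature
RTF_MAGIC = b"{\\rtf"                          # every RTF file starts this way


def sniff_format(header: bytes) -> str:
    """Classify a file from its first bytes as 'rtf', 'ole' or 'unknown'."""
    if header.startswith(RTF_MAGIC):
        return "rtf"
    if header.startswith(OLE_MAGIC):
        return "ole"
    return "unknown"


if __name__ == "__main__":
    import sys

    # Usage: python sniff.py file1.doc file2.doc ...
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, sniff_format(handle.read(8)))
```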

word2x dumps core given <a given document>

Core dumps and mistranslation of characters in the output are definitely bugs. Please send bug-inducing documents to word2x@duncan.telstar.net (after making sure I am allowed to read them). Bug reports with fixes, ideally unified context diffs, are appreciated. When reporting mistranslation problems please tell me what the correct translation should be. I cannot determine this with non-existent language knowledge or PSI powers (which I lack, sadly).

word2x misses the structure of the document

As explained above, word2x strips out most of the formatting information. It then looks for clues to identify chapters, sections, etc. There are 3 broad categories of clues used.

These tests are subject to sanity tests so that lists do not get identified as headings and vice versa. The code seems to be quite good at guessing correctly. Note that this may "fix" some documents without structure that really should have been structured.

If the clues are not around then I am afraid you will have to read the document and do what a mere program could not manage. This is a lot quicker than extracting the document using primitive means (strings, a text editor that handles arbitrary files, for example emacs, and combinations of these tools; I speak from experience).

word2x output prints differently

word2x is not intended to produce the same output as Word. There is some code to "fix" various dire abuses typical of Word users (in particular when writing maths). If the structure of the document is the same, the program did what it was intended to do. Specific fonts and things like that are deliberately ignored.

What is currently happening?

Current activity is rather constrained by my copious free time and the need to spend some of it sleeping :-) The current focus is word2x 2 (aka. word2x ng) which is not very visible here. Activities include