This page collects publicly available information about the Word binary format. "The File Formats Handbook" (G. Born) has information on the Word for DOS and Write formats. (The Write format is a close relative of the Word format.) There are some major differences between the Word 6 and Word 7+ formats.
Word .doc files are OLE archives. A good source of information and Perl code for processing these archives is the LAOLA pages. The development code includes much of a C++ implementation of an OLE library (Microsoft's library is of no use in a Un*x environment, where it is not available). Some publicly available DCE code is included, for the distributed features. C source for converting Unix dates to Win32 dates, as found in OLE archives, and vice versa is included in the OLE library.
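As a rough illustration of the date conversion mentioned above, here is a minimal sketch in Python. It assumes the "Win32 dates" are the usual 64-bit FILETIME values (100-nanosecond ticks counted from 1 January 1601 UTC). The function names are made up for this example and are not taken from the OLE library itself.

```python
# Sketch only: assumes Win32 dates are FILETIME values, i.e. counts of
# 100-nanosecond ticks since 1601-01-01 00:00:00 UTC.
# All names here are hypothetical, not the OLE library's own.

# Seconds between the FILETIME epoch (1601) and the Unix epoch (1970).
EPOCH_DIFF_SECONDS = 11_644_473_600
# Number of 100-nanosecond ticks in one second.
TICKS_PER_SECOND = 10_000_000


def unix_to_filetime(unix_seconds: int) -> int:
    """Convert whole seconds since 1970 to a 64-bit FILETIME tick count."""
    return (unix_seconds + EPOCH_DIFF_SECONDS) * TICKS_PER_SECOND


def filetime_to_unix(filetime: int) -> int:
    """Convert a FILETIME tick count to whole seconds since 1970.

    Sub-second ticks are discarded (rounded down).
    """
    return filetime // TICKS_PER_SECOND - EPOCH_DIFF_SECONDS


def split_filetime(filetime: int) -> tuple:
    """Split a 64-bit tick count into (low, high) 32-bit halves."""
    return filetime & 0xFFFFFFFF, (filetime >> 32) & 0xFFFFFFFF


if __name__ == "__main__":
    ticks = unix_to_filetime(0)
    print(ticks, split_filetime(ticks), filetime_to_unix(ticks))
```

The split into two 32-bit halves is there because such values are commonly stored as a low and a high word; whether a particular field in an archive uses that layout should be checked against the file itself.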
The main file (the WordDocument stream in the archive) is divided into 128-byte blocks, each used entirely for a single purpose (this is easy to discover and no surprise given the Write and Word for DOS formats). Excess space in a block is apparently filled with zero bytes.
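The block structure described above can be explored with a small Python sketch. It assumes the WordDocument stream has already been extracted from the OLE archive into a bytes object. The function names are hypothetical, and counting trailing zeros is only a heuristic: it cannot tell padding apart from real data that happens to end in zero bytes.

```python
from typing import List

# Block size taken from the description above.
BLOCK_SIZE = 128


def split_blocks(stream: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut an already-extracted stream into consecutive fixed-size blocks.

    The last block may be shorter if the stream length is not a multiple
    of block_size.
    """
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


def trailing_zero_count(block: bytes) -> int:
    """Count the zero bytes at the end of a block (possible padding)."""
    return len(block) - len(block.rstrip(b"\x00"))


if __name__ == "__main__":
    # Synthetic data standing in for a real WordDocument stream.
    sample = b"A" * 100 + b"\x00" * 28 + b"B" * 128
    for index, block in enumerate(split_blocks(sample)):
        print(index, len(block), trailing_zero_count(block))
```

Printing the trailing zero count per block is a quick way to see where one purpose probably ends and padding begins when looking at a real file.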
If you have publicly available information that is not on this page, email it to Duncan for verification and inclusion here. I have the real information, but it is subject to a non-disclosure agreement. The format is baroque, which explains the size of the Word binary.