Word File Format

James Knowles

Hi,

I am trying to build a document indexing tool for a .NET application. I am
fine with the indexing part, but I cannot find any real information on
converting Word documents into text. I do not want to use the Word object and
application, as I will be indexing about 40,000 Word documents. So I
want to be able to read the file directly, extract the text, and index it.
Does anyone know where I can find out about the Word file format, or can anyone
point me in a direction where I can find more information on this?

Thanks for any help,

James
 
Jezebel

That's not a practical approach. The Word file format is not publicly
available, and is in any case a strange and complex beast. It's encrypted
for security reasons (separately from password encryption) to prevent an
outside app from manipulating it, e.g. to insert malicious macro code; it's
polymorphic (open a file, change it, undo the change, save it, and you may
end up with a *completely* different file); it contains a heap of stuff
apart from the text (macros, styles, properties, version data, bookmarks,
etc.); and the text may be fragmented across multiple story ranges which are
not necessarily contiguous within the file.

So even if you succeeded in decoding the file format, it's unlikely that
your code would be quicker than using the Word object anyway.
 
