Extracting words from ".doc" file

Maciej Bliziński

Hello!

Let's say I have some ".doc" files that I want to extract words from. I
don't need a doc2txt utility, as I don't need any formatting extracted.
The only thing I need is the words, separated in some way (spaces, other
characters, whatever). The other thing is that I need those words
encoded in ISO-8859-2, as they contain Polish letters (like ó).

Everything should be done on a Linux server, so I will have to parse them
with my own utility. When I open a doc file in a text editor, I can see lots
of rubbish and the text, but the letters are separated by a binary
byte (a NUL, shown as ^@). It looks like this:

^@w^@o^@r^@d^@s^@ ^@a^@r^@e^@ ^@h^@e^@r^@e^@

I will need those letters put together if I'm going to extract
words from the file:

"words are here"

It doesn't matter if there is some rubbish around the words, like this:

"#$^#@$%&^@$words@#$%#$are@#$#@$%here#$%^"
            ^^^^^      ^^^       ^^^^

because this kind of output is just fine for me.
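
To illustrate the kind of transformation I mean, here is a rough sketch, not a real ".doc" parser. It assumes Python 3 and that the text is stored as UTF-16LE (which is what the NUL byte after each letter suggests); the function name utf16_to_latin2 is made up for this example. Since I don't know whether the text starts at an even or odd offset, it tries both and keeps both results, rubbish included:

```python
def utf16_to_latin2(data: bytes) -> bytes:
    """Decode raw bytes as UTF-16LE at both byte alignments and
    re-encode the result as ISO-8859-2.

    Assumption: the text inside the file is UTF-16LE. Everything that
    is not valid text or has no ISO-8859-2 equivalent becomes '?'.
    """
    chunks = []
    for offset in (0, 1):
        piece = data[offset:]
        # UTF-16 code units are 2 bytes, so drop a trailing odd byte.
        piece = piece[: len(piece) - len(piece) % 2]
        text = piece.decode("utf-16-le", errors="replace")
        chunks.append(text.encode("iso-8859-2", errors="replace"))
    # One alignment yields readable words, the other mostly '?' rubbish.
    return b"\n".join(chunks)


if __name__ == "__main__":
    sample = "words are here".encode("utf-16-le")
    print(utf16_to_latin2(sample))
```

The output is still full of junk characters, which is acceptable here because the words get picked out by a regular expression afterwards.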

What I need to know is how to transform the binary doc file into a file
that contains the words in ISO-8859-2. The words will then be found with
the regular expression:

([a-zA-Z0-9ąćęłńóśżźĄĆĘŁŃÓŚŻŹ]{3,})

The characters between 9 and ] are the Polish letters.
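
A minimal sketch of how this pattern could be applied in Python 3, assuming the input is already a bytes object in ISO-8859-2. The names POLISH, WORD_RE and find_words are only illustrative:

```python
import re

# The Polish letters as ISO-8859-2 bytes, so the pattern can be matched
# directly against ISO-8859-2 encoded data.
POLISH = "ąćęłńóśżźĄĆĘŁŃÓŚŻŹ".encode("iso-8859-2")

# Runs of at least three letters or digits.
WORD_RE = re.compile(b"[a-zA-Z0-9" + POLISH + b"]{3,}")


def find_words(latin2_bytes: bytes) -> list:
    """Return every run of 3 or more word characters as a str."""
    return [w.decode("iso-8859-2") for w in WORD_RE.findall(latin2_bytes)]


if __name__ == "__main__":
    noisy = "#$^#@$%&^@$words@#$%#$are@#$#@$%here#$%^".encode("iso-8859-2")
    print(find_words(noisy))
```

Matching on bytes rather than on decoded text keeps the rubbish from causing decoding errors; only the matched words are decoded.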

The program will be written in Python on Linux. Any help will be greatly
appreciated.

Regards,
Maciej Bliziński
 
