Extracting words from ".doc" file

Maciej Bliziński · Nov 14, 2003

Hello!

Let's say I have some ".doc" files that I want to extract words from. I
don't need a doc2txt utility, because I don't need any formatting extracted.
The only thing I need is the words, separated in some way (spaces, other
characters, whatever). I also need those words
encoded in ISO-8859-2, as they contain Polish letters (like ó).

Everything has to be done on a Linux server, so I will have to parse the files
with my own utility. When I open a doc file in a text editor, I can see lots
of rubbish and the text, but the letters are separated by a binary
byte. It looks like this:

^@w^@o^@r^@d^@s^@ ^@a^@r^@e^@ ^@h^@e^@r^@e^@

I will need those letters put together if I'm going to extract
words from the file:

"words are here"

It doesn't matter if there is any rubbish around the words

"#$^#@$%&^@$words@#$%#$are@#$#@$%here#$%^"
            ^^^^^      ^^^       ^^^^

because this kind of output is just fine for me.
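
For illustration, a letter followed by a ^@ (NUL) byte is what UTF-16LE text looks like, so one possible approach is to decode the raw bytes as UTF-16LE and re-encode them as ISO-8859-2. This is only a sketch under that assumption: the function name is made up, and parts of a real doc file may be stored in an 8-bit encoding or may not start on an even byte offset, which this does not handle.

```python
def utf16_chunk_to_latin2(raw):
    """Decode bytes as UTF-16LE and re-encode them as ISO-8859-2.

    Hypothetical helper: assumes the text is UTF-16LE.
    Characters that ISO-8859-2 cannot represent become '?'.
    """
    # UTF-16 uses 2-byte units, so drop a trailing odd byte if present.
    if len(raw) % 2:
        raw = raw[:-1]
    text = raw.decode("utf-16-le", errors="replace")
    return text.encode("iso-8859-2", errors="replace")


# Build a sample that has the same letter/NUL layout as the dump above.
sample = "words are here".encode("utf-16-le")
print(utf16_chunk_to_latin2(sample))
```

Binary rubbish decoded this way mostly turns into '?' characters, which is the kind of noise around the words that is acceptable here.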

What I need to know is how to transform the binary doc file into a file
that contains the words in ISO-8859-2. The words will then be found with
the regular expression:

([a-zA-Z0-9ąćęłńóśżźĄĆĘŁŃÓŚŻŹ]{3,})

The characters between 9 and ] are Polish letters.
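
As a rough sketch of how this expression could be applied in Python to data that is already ISO-8859-2 encoded, the pattern can be compiled as a bytes pattern so that each Polish letter is matched as a single byte. The names POLISH, WORD_RE and find_words are made up for the example.

```python
import re

# The Polish letters, converted to their single-byte ISO-8859-2 codes.
POLISH = "ąćęłńóśżźĄĆĘŁŃÓŚŻŹ".encode("iso-8859-2")

# Bytes pattern: runs of at least 3 ASCII letters, digits or Polish letters.
WORD_RE = re.compile(b"[a-zA-Z0-9" + POLISH + b"]{3,}")


def find_words(data):
    """Return all word-like byte runs found in ISO-8859-2 encoded data."""
    return WORD_RE.findall(data)


noisy = b"#$^#@$%&^@$words@#$%#$are@#$#@$%here#$%^"
print(find_words(noisy))  # [b'words', b'are', b'here']
```

Matching on bytes rather than on decoded text keeps the result in ISO-8859-2 without a further conversion step.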

The program will be written in Python on Linux. Any help will be greatly
appreciated.

Regards,
Maciej Bliziński
