Word (.doc) -> Text (.txt) conversion

robster278

I'm running a large site which receives many .doc files daily. These
need to be converted into plain text to supply data for the site's
search engine, which then points users to the appropriate .doc files.
The raw text is also needed to populate HTML preview pages for the
.doc files.

My question is simple: what is the easiest (preferably server-side)
method for converting .doc files into raw text? I'm running a Windows
server, so I presume that .doc API and script commands would be fairly
easy to implement. If a server-side solution is impossible, then a
locally executed method could fit the bill too. It just needs to be
QUICK and AUTOMATED. I'm not going to copy and paste from MS Word into
Notepad for 10 hours every day.
Any suggestions kindly requested.

Rob Ponting
(e-mail address removed)
 
Helmut Weber

Hi Rob,
probably the only way, and at the same time the easiest,
is to open each doc and save it as txt.

There are some problems to take care of, though,
such as whether you can afford to lose the formatting.
And if the docs are rather complex,
the resulting txt file will be a mess anyway, and
some editing by hand will be unavoidable.
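
As a rough illustration of the open-and-save-as-text idea, here is a minimal sketch meant to be run from inside Word's VBA editor. The sub name is made up for the example, the paths are placeholders that must already exist on disk, and wdFormatText is only one of several text formats Word offers, so treat this as a starting point rather than a finished converter.

```vba
' Minimal sketch: convert one .doc to .txt from inside Word.
' The sub name and both paths are hypothetical placeholders.
Sub SaveOneDocAsText()
    Dim oDoc As Document

    ' Open read-only so the source file is never modified
    Set oDoc = Documents.Open(FileName:="C:\Test\test.doc", _
        ReadOnly:=True, AddToRecentFiles:=False)

    ' wdFormatText writes plain text; formatting is discarded
    oDoc.SaveAs FileName:="C:\Test\TextOnly\test.txt", _
        FileFormat:=wdFormatText

    oDoc.Close SaveChanges:=wdDoNotSaveChanges
    Set oDoc = Nothing
End Sub
```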

All together, when it comes to long-term automation,
I think you would need a programming language like VB
or some other, and should not try to use Word as an
automation server.

With VB, I scan directories at regular intervals,
check whether there are any docs, start Word, process the docs
(save them as txt to some other place), and remove all processed
files from that directory. In theory, this could run endlessly.
In fact, though, 4 weeks without a crash is the best I have got so far,
and my docs are all very simple and very short.

Greetings from Bavaria, Germany
Helmut Weber, MVP
"red.sys" & chr(64) & "t-online.de"
Word XP, Win 98
http://word.mvps.org/
 
Dave Lett

Hi Rob,

Using VBA, you could read all the files into an array as described in "How
to read the filenames of all the files in a directory into an array" at
http://word.mvps.org/faqs/macrosvba/ReadFilesIntoArray.htm and then use the
FileCopy statement to copy the .doc file as a .txt file (you can even change
the directory if you want). The FileCopy statement would take the form of

FileCopy source:="C:\Test\test.doc", _
    destination:="C:\Test\TextOnly\test.txt"

HTH,
Dave
 
Chuck

Unfortunately just copying files and renaming them with new extensions
doesn't actually convert the files from Word document to Text format. I've
tried the FileCopy code in Dave’s message using a short Word document
containing a footnote and on opening the resulting .txt file there's a lot of
Word code but not a lot of actual document text content. Results might vary
depending on the presence or absence of anything other than simple text in
the Word document (e.g. footnotes, paragraph numbering, shapes, etc.) but without
some sort of conversion process the results are in any case likely to be
unreadable.

Helmut's suggestion (looping through directories, opening documents --
probably as objects -- then saving them in .txt format) is probably going to
work best, bearing in mind that saving in .txt format will lose all
footnotes, shapes etc.
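
The loop-open-save approach can be sketched without FileSearch, which is no longer available from Word 2007 onwards. The sketch below uses the built-in Dir function instead and is meant to be run from inside Word. The sub name and folder paths are assumptions for illustration, both folders must already exist, and there is no error handling, so it is an outline rather than production code.

```vba
' Sketch: convert every .doc in a folder with Dir() and SaveAs.
' Sub name and folder paths are hypothetical; both folders must exist.
Sub ConvertFolderToText()
    Const strSourceDir As String = "C:\temp\"
    Const strOutputDir As String = "C:\temp\txt\"
    Dim strFile As String
    Dim strBase As String
    Dim oDoc As Document

    strFile = Dir(strSourceDir & "*.doc")
    Do While strFile <> ""
        ' Skip Word's temporary/owner files such as ~$report.doc
        If Left$(strFile, 1) <> "~" Then
            Set oDoc = Documents.Open(FileName:=strSourceDir & strFile, _
                ReadOnly:=True, AddToRecentFiles:=False)
            ' Strip the extension, whatever its length
            strBase = Left$(strFile, InStrRev(strFile, ".") - 1)
            oDoc.SaveAs FileName:=strOutputDir & strBase & ".txt", _
                FileFormat:=wdFormatText
            oDoc.Close SaveChanges:=wdDoNotSaveChanges
        End If
        strFile = Dir()   ' next match
    Loop
End Sub
```

Writing the .txt files to a separate folder keeps the output away from the files Dir is still enumerating, and opening the sources read-only means nothing is changed or deleted.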

Another method might be to loop through the documents, opening them, copying
the contents and then using paste unformatted to dump the text into new
documents (or text files) which might better preserve CrLfs as well as
paragraph numbering (paste unformatted removes auto paragraph numbering but
leaves the paragraph number itself as text).

Here’s a bit of code that might give you some ideas (the FileLocked function that
is called from the DocToTxt sub is from the MVP site). Please note that I
can't warrant this code free from bugs and since it contains a Kill command
it shouldn't be run on any live files unless you have safe and secure backup
copies. Also, I don't know how well Word would handle hundreds or thousands
of iterations without a restart.


Sub DocToTxt()

    Dim oWord As Object
    Dim oOldDoc As Document
    Dim oNewDoc As Document
    Dim i As Long
    Dim strFullName As String
    Dim strInputFileName As String
    Dim strOutputFileName As String
    Dim strSourceDir As String
    Dim strOutputDir As String

    On Error GoTo errorhandler

    Set oWord = CreateObject("Word.Application")

    'set source and output locations
    strSourceDir = "C:\temp\"
    strOutputDir = "C:\temp\"

    With oWord.Application.FileSearch
        .FileName = "*.doc"
        .LookIn = strSourceDir
        .Execute
        For i = 1 To .FoundFiles.Count
            'FoundFiles returns full paths, so test the file name part only
            strFullName = .FoundFiles(i)
            strInputFileName = Mid$(strFullName, _
                InStrRev(strFullName, "\") + 1)
            If Left$(strInputFileName, 1) = "~" Then
                'do nothing, it's a temp/hidden file
            Else
                If Not FileLocked(strFullName) Then
                    Set oOldDoc = oWord.Documents.Open(strFullName)
                    Set oNewDoc = oWord.Documents.Add
                    With oOldDoc
                        .Content.Copy
                        .Close savechanges:=wdDoNotSaveChanges
                    End With
                    With oNewDoc
                        .Range.PasteSpecial datatype:=wdPasteText
                        strOutputFileName = Left(strInputFileName, _
                            Len(strInputFileName) - 4) & ".txt"
                        .SaveAs FileName:=strOutputDir & _
                            strOutputFileName, _
                            fileformat:=wdFormatDOSTextLineBreaks
                        .Close savechanges:=wdDoNotSaveChanges
                    End With
                    'delete source file - rem out the Kill line if you
                    'don't want to delete it
                    Kill strFullName
                    Set oOldDoc = Nothing
                    Set oNewDoc = Nothing
                End If
            End If
        Next i
    End With

    'quit the hidden Word instance, otherwise it stays in memory
    oWord.Quit savechanges:=wdDoNotSaveChanges
    Set oWord = Nothing

    Exit Sub

errorhandler:

    Select Case Err.Number

        Case 4605
            Resume Next

        Case Else
            MsgBox Err.Number & " " & Err.Description
            'don't leave the hidden Word instance running
            On Error Resume Next
            If Not oWord Is Nothing Then
                oWord.Quit savechanges:=wdDoNotSaveChanges
            End If
            Exit Sub

    End Select

End Sub

Function FileLocked(strFileName As String) As Boolean

    Dim intFile As Integer

    On Error Resume Next

    ' If the file is already opened by another process,
    ' and the specified type of access is not allowed,
    ' the Open operation fails and an error occurs.
    ' FreeFile avoids clashing with a file number already in use.
    intFile = FreeFile
    Open strFileName For Binary Access Read Lock Read As #intFile
    Close #intFile

    ' If an error occurs, the document is currently open.
    If Err.Number <> 0 Then
        FileLocked = True
        Err.Clear
    End If

End Function
 
