Docs scanned then OCR into word - can't change format

Roceo

Version: 2008
Operating System: Mac OS X 10.5 (Leopard)

OK, it has to be something simple. I'm writing a book and need to get old text documents saved electronically so I can edit and rewrite. Copy shop here has done a great job, scanned all into Word docs and emailed to me. I download and save and start to edit. Oh no!!! Clicking into the text reveals a shaded box around groups or paragraphs of text. The box will move if I drag the edge. But I don't want it! I just want to copy the text onto a clean, editable page that I can then format for my manuscript. But the boxes go with no matter what I try. Yes, the OCR to Word is a re-editing nightmare because words run together, weird symbols, etc but I can handle that.....just not the boxes. Any clues anyone?????
 
John McGhie

Sadly, that's what OCR'd text looks like when you get it. The fun starts
now.

Typical OCR software is determined to preserve the POSITIONING of the text
on the page, and it does that by outputting the recognised character strings
to Text Boxes.

Everything lines up and looks pretty much like the original did, until you
come to edit it. Then you find the text is divided into these random
"boxes".

The bad news is that they are difficult to get rid of...

The REALLY bad news is that they are not necessarily implanted in the
document in the correct sequence. The OCR machine adds the text boxes in
the order that it scans the page: this may not be the same sequence or even
direction that the text flows.

OK: There are two things you can try. Make two copies of your file to begin
with. Both methods destroy the input file, and you will need the original to
refer to for each method, so you need two copies.

Quick Method:
1) File> Save As...
2) Change "Format" to Plain Text
3) OK, and close the document.
4) Re-open the text version.
5) File>Save As and change the Format to Word Document.

There: the job is done. However: you now need to compare the re-opened file
CAREFULLY with the original. Look for missed words, and phrases in the
wrong place.
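
If you would rather not do the whole comparison by eye, a small script can
at least flag words that went missing. This is only a rough sketch in Python,
not part of Word: it assumes you have pasted the original text and the
converted text into two strings yourself, and the function names
(`words`, `missing_words`) are made up for this example.

```python
import re
from collections import Counter


def words(text):
    """Split text into lower-case words, ignoring punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def missing_words(original_text, converted_text):
    """Return words that occur more often in the original than in the
    converted text, mapped to how many occurrences are missing."""
    # Counter subtraction keeps only the positive differences.
    missing = Counter(words(original_text)) - Counter(words(converted_text))
    return dict(missing)
```

It only counts words, so it will not notice a phrase that is complete but in
the wrong place. You still need to read the result against the original for
that.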

What this method does is save the file into a format that cannot contain
Text Boxes. So the text boxes are removed and only the text within them is
saved.
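
To see in advance what order the boxes are stored in, you can list them from
outside Word. The following is a minimal sketch in Python, under some
assumptions: that the file is a .docx (the thread only says "Word docs"),
and that the boxes are stored as `w:txbxContent` elements in
`word/document.xml`, which is how the .docx format holds text box content.
The function name `text_boxes_in_stored_order` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
MC = "{http://schemas.openxmlformats.org/markup-compatibility/2006}"


def text_boxes_in_stored_order(docx):
    """Return the text of each text box, in the order the file stores them.

    `docx` is a path to a .docx file, or a binary file object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    # Newer versions of Word may store a fallback copy of each box.
    # Remember those copies so the same text is not listed twice.
    fallback = {
        id(box)
        for fb in root.iter(MC + "Fallback")
        for box in fb.iter(W + "txbxContent")
    }

    boxes = []
    for box in root.iter(W + "txbxContent"):
        if id(box) in fallback:
            continue
        # One string per paragraph; tabs and manual line breaks are ignored.
        paragraphs = [
            "".join(t.text or "" for t in p.iter(W + "t"))
            for p in box.iter(W + "p")
        ]
        boxes.append("\n".join(paragraphs))
    return boxes
```

Calling `text_boxes_in_stored_order("scan.docx")` (a hypothetical file name)
returns one string per box. If they read in a sensible order, saving as plain
text should give you usable text. If they are badly jumbled, expect the saved
text to be jumbled the same way.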

If the text is in a useable order (not too many bits missed or in the wrong
place) this is the best method to use. It's far quicker than:

Slow Method:
1) Click in each text box
2) Choose Format>Text Box>Text Box>Convert to Frame
3) Ignore the warning and click OK
4) Choose Format>Frame>Remove Frame
5) Click OK

This method first converts the text box into a normal paragraph, surrounded
by a positioning frame. It then converts the Frame into a normal paragraph
without a frame.

A Text Box is a "Graphic Element" and resides in the Graphics Layer of the
document: it's not in the text at all. The second step converts the text
box to a paragraph. The fourth step removes the positioning information.

This method has the advantage that each text box is created as a paragraph
below the existing paragraphs. All you need to do is delete the paragraph
marks to join the text up.

The disadvantage is that you have to go through this sequence for EVERY text
box on EVERY page. Sorry: but if the OCR produced text that is badly out of
sequence, this is probably quicker than using the Quick method and then
trying to fix the mess.

Make sure you turn on your Show/Hide button on the Standard toolbar so you
can see paragraph marks and spaces, or you will go mad trying to do this.

Get back to us if there is anything that is not clear...

Cheers


--
Don't wait for your answer, click here: http://www.word.mvps.org/

Please reply in the group. Please do NOT email me unless I ask you to.

John McGhie, Microsoft MVP, Word and Word:Mac
Sydney, Australia.
 
