Advice on Find and Replace and save as text

R Jolly

Hi,

I'd like to create a macro that will change all instances of italic
and bold text in a document to <i>italic text</i> and <b>bold
text</b>, and then save the document as plain text. I'm using Word v.X.

I'm new to this, so any advice appreciated.

First, I've run into the find and replace bug(s) reported repeatedly in
this newsgroup. I had thought that a search for format italic and '*'
in the find box would work, but that finds just one character at a
time. Trying *{1,} has no effect. I can get a whole word with <*>, but
not strings of words.

Secondly, Word frequently quits unexpectedly while I'm trying out
these regexes.

Is there any solution to this other than upgrading to Word 2004? Would
it help if I learned enough VB to do it? If so, any advice on good
resources?

Finally, I'll give the big picture in case anyone has advice on that.
I've got a series of documents made from a single template. They are
to be converted to XML. The documents are fairly well structured,
consisting mostly of labelled text areas. My plan is to save as text and
use Perl or Python to parse the text and write the XML.

The only information I care about that will be lost in a save as
text is the bold and italic formatting. The rest can be inferred
from the text content.
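
A minimal sketch of the parsing step described above, in Python. It assumes (hypothetically) that each labelled area ends up as one "Label: text" line in the saved file and that bold and italic have already been marked as <b>...</b> and <i>...</i>. The element names and the "record" root are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Only these inline markers are restored; everything else stays escaped text.
INLINE_TAG = re.compile(r"&lt;(/?)(i|b)&gt;")


def line_to_element(line):
    """Turn one 'Label: text' line into an XML element (assumed layout)."""
    label, _, value = line.partition(":")
    # Derive an element name from the label, e.g. 'Short Title' -> 'short_title'.
    name = re.sub(r"\W+", "_", label.strip().lower())
    # Escape &, < and > first, then restore just the <i>/<b> markers.
    body = INLINE_TAG.sub(r"<\1\2>", escape(value.strip()))
    # Raises ET.ParseError if the markers are unbalanced or badly nested.
    return ET.fromstring("<{0}>{1}</{0}>".format(name, body))


def text_to_xml(text, root_name="record"):
    """Build one XML document from all 'Label: text' lines in the input."""
    root = ET.Element(root_name)
    for line in text.splitlines():
        if ":" in line:
            root.append(line_to_element(line))
    return ET.tostring(root, encoding="unicode")
```

Escaping everything first and restoring only the known markers means a stray & or < in the Word text cannot produce malformed output. A label that does not make a valid element name (one starting with a digit, say) raises a parse error rather than passing silently.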

I'd appreciate any comments or advice,

Richard
 
John McGhie

Hi Richard:

My first piece of advice would be to simply save the document out of Word as
a "Web Page (HTML)".

It's lying: the result is XML, not HTML.

That way, you will save yourself a hell of a lot of time.

Word 2004 will do a nicer job. But both products actually write XML, not
HTML.

It is a very rich XML, and you will indeed get a whole lot of entities you
don't need. But you will also get the complete document structure and your
bold and italic tags.

For a complete job, wait for Office 2004 Professional. That will contain
Virtual PC 7, which will allow you to load Word 2003. The Enterprise
Edition of Office System 2003 contains a full XML implementation.

It will end up cheaper than the time you will spend writing a parser :)

Cheers


--

Please reply to the newsgroup to maintain the thread. Please do not email
me unless I ask you to.

John McGhie <[email protected]>
Consultant Technical Writer
Sydney, Australia +61 4 1209 1410
 
Daiya Mitchell

In reply to someone who wanted to change italicized text to _italic_ on a
WinWord group, two MVPs posted the directions below. I believe you should be
able to adapt them to your purposes. They were tested on WinWord, but since
they don't involve wildcards, I think they should work in Mac Word as well.
The two sets of directions are the same, just presented a little differently.

DM

MVP Klaus Linke:

Just replace italic formatting with _^&_

To do that, open the "Find > Replace" dialog.

In "Find what:", hit Ctrl+i, or go to "More > Format > Font" and check
"italic".
Leave the "Find what" box empty.

In "Replace with", type an underscore, then click on "Special > Find what
text" to insert ^&, then type another underscore.
Or simply type in "_^&_".

You might remove the italic formatting while you're at it: Hit Ctrl+i twice
in "Replace with:".
The text below the "Replace with" box will change to "Font: Not italic".


MVP Beth Melton:

You should be able to use Find/Replace for this:

- Find: (leave text box blank)
  Format: Italic
- Replace: _^&_
  Format: Not Italic
- Turn on "Find whole words only"
- Click Replace All

To specify the formats click the "More" command at the bottom of the
dialog box and then click the Format/Font command. You should see
Italic and Not Italic in the Font Style list.

In the Replace text string ^& refers to the "Find What" text string
(found under the Special command).

Your replacement text will be the Find What string with an underscore
at the beginning and end, and the Italic format will be removed.
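
To make the effect of that replacement concrete, here is a small Python model of it. This is not Word's API. The document is represented, purely as an assumption for illustration, as a list of (text, is_italic) runs, and each contiguous italic stretch is wrapped the way "_^&_" with "Not Italic" would wrap it.

```python
from itertools import groupby


def wrap_italic(runs, before="_", after="_"):
    """Wrap each contiguous stretch of italic text in markers.

    runs: list of (text, is_italic) pairs, a stand-in for formatted text.
    Returns plain text, mirroring 'Replace: _^&_' with 'Format: Not Italic'.
    """
    out = []
    # Adjacent runs with the same italic flag are merged into one stretch.
    for is_italic, group in groupby(runs, key=lambda run: run[1]):
        text = "".join(t for t, _ in group)
        # 'text' plays the role of ^& (the found text).
        out.append(before + text + after if is_italic else text)
    return "".join(out)
```

For the original question, passing "<i>" and "</i>" as the markers gives HTML-style tags: wrap_italic([("See ", False), ("War and ", True), ("Peace", True), (".", False)], "<i>", "</i>") returns "See <i>War and Peace</i>.". In Word itself that corresponds to typing <i>^&</i> in "Replace with", and then repeating the replacement with Bold and <b>^&</b>.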
 
R Jolly

Daiya Mitchell said:
In reply to someone who wanted to change italicized text to _italic_ on a
WinWord group, two MVPs posted the directions below. I believe you should be
able to adapt them to your purposes. Tested on WinWord, but since they
don't involve wildcards, I think they should work. The directions are the
same, just presented a little differently.

[snip instructions]

Works perfectly. I had assumed that I needed wildcards to access the
'Find What Text', but of course that was not true.

Thanks, Richard
 
R Jolly

John McGhie said:
Hi Richard:

My first piece of advice would be to simply save the document out of Word as
a "Web Page (HTML)".

It's lying: the result is XML, not HTML.

[snip]

I've tried this route in the past, and had little luck. xmllint
complains that it's not well formed. I've tried passing the result
through htmltidy --xhtml, but it ends up dropping small but
significant characters (spaces before closing tags, in one example).
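
A well-formedness check of this kind can also be run from Python's standard library. This is only a minimal sketch, not a substitute for xmllint, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def well_formed_error(markup):
    """Return None if markup parses as XML, else the parser's message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        # The message includes the line and column of the first problem.
        return str(err)
    return None
```

Typical HTML habits such as an unclosed <br> or overlapping tags like <p><b>text</p></b> are enough to make the check fail.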

My favoured route for Word -> XML is to convert to OpenOffice and
use that as the starting file format. Then the XML is fairly
straightforward. I've also used Upcast, which works OK but no better
than using OpenOffice.
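
For reference, OpenOffice documents (.sxw, and the later .odt) are zip archives that keep the document body in a member named content.xml. A hedged sketch of getting at it from Python follows; the function name is an assumption, and what you do with the parsed tree depends on the document.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_content_xml(path_or_file):
    """Return the root element of content.xml from an OpenOffice file."""
    with zipfile.ZipFile(path_or_file) as archive:
        # content.xml holds the document body; styles live in styles.xml.
        with archive.open("content.xml") as member:
            return ET.parse(member).getroot()
```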

I've never seen Microsoft's own xml versions. Are they good?

In this particular case the end product has to go on a client machine,
and I want to limit the installation burden and keep the number of
conversion steps small. Hence the Word macro route.

The whole Word -> XML thing is a thankless task though.

Richard
 
Daiya Mitchell

R Jolly said:
Works perfectly. I had assumed that I needed wildcards to access the
'Find What Text', but of course that was not true.

Thanks for the confirmation.

DM
 
John McGhie

Hi Richard:

R Jolly said:
I've never seen Microsoft's own xml versions. Are they good?

Yes, if you want to make distributed applications using XML. Otherwise, they
may be overkill, depending on your application.

R Jolly said:
The whole word -> xml thing is a thankless task though.

Which is why I have little experience with it :)

Cheers

 
