Converting Word to GOOD HTML


Sander Voerman

Hi,

I have a website on which I have published some of my essays, which were
originally written in Word. This is how I used to edit them:

* Save the document as a web page from MS Word
* Manually edit the HTML file in Notepad or WordPad:
  * Edit the html, doctype and meta headers
  * Remove all Word XML and Office XML tags and all style tags from the
    document
  * Remove all additional Microsoft properties from italic tags and the like
  * Replace all non-XML special characters with the appropriate XML escape
    codes
  * Remove all conditionals such as <![if !supportEmptyParas]> and the like
  * Replace all style class names and ids with the class names and ids
    specified by the CSS sheets of my website (you can't do this from within
    Word because some built-in features have predefined class names like
    MsoNormal and MsoTitle; of course I used the replace function of WordPad
    wherever possible)
  * Remove all stuff related to footnotes and literature references, and
    replace it with JavaScript function calls to the neat scripts of my
    website
  * Add header and footer JavaScript function calls
* Done!
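
Several of the removal steps above can probably be scripted. The following is a minimal Python sketch, assuming the input is HTML saved by Word with markup of the kind listed; the function name `clean_word_html` is made up for illustration, and the regular expressions are a rough approximation rather than a real HTML parser, so the output would still need checking by hand.

```python
import re

def clean_word_html(html: str) -> str:
    """Strip common Word-specific markup; a regex sketch, not a full HTML parser."""
    # Conditional comment blocks (<!--[if gte mso 9]> ... <![endif]-->) hold
    # Office-only XML that browsers ignore, so drop them whole.
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html, flags=re.S)
    # Bare markers such as <![if !supportEmptyParas]> and <![endif]> wrap content
    # that browsers do show, so remove only the markers and keep what is inside.
    html = re.sub(r"<!\[(?:if[^\]]*|endif)\]>", "", html)
    # Embedded stylesheets.
    html = re.sub(r"<style\b[^>]*>.*?</style>", "", html, flags=re.S | re.I)
    # Namespaced Office tags such as <o:p> or <v:shape>; their text content stays.
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.I)
    # Inline style= and lang= attributes, quoted or unquoted.
    html = re.sub(r"""\s+(?:style|lang)=(?:"[^"]*"|'[^']*'|[^\s>]+)""", "", html, flags=re.I)
    return html
```

Footnotes, literature references and the header and footer calls are site-specific and are deliberately left out of this sketch.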

As you can imagine, this is a lot of work to do manually, especially when
you write large essays. Now I am going to start a second (PHP/MySQL-based)
website, which will feature a lot of articles written in Word by other
people. Is there any way I can automate some or all of the steps mentioned
above? It would already be much easier for me if I just had a tool which
converted Word documents to simple, plain HTML or XML files that contain
only basic document structure (emphasis, superscript and subscript,
paragraphs, blockquotes, headings and notes, but no fonts, margins and
the like) and converted all special characters to valid XML escapes.
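
A tool of the kind described could be approximated with a tag whitelist. The sketch below uses Python's standard `html.parser` module and assumes the input is Word's saved HTML; the names `StructureOnly`, `KEEP`, `RENAME`, `SKIP` and `to_plain_structure` are illustrative choices, and the set of kept tags is a guess at "basic document structure" that would need adjusting per site. Notes are not handled.

```python
from html import escape
from html.parser import HTMLParser

# Structural tags to keep (an assumed whitelist); every attribute is dropped.
KEEP = {"p", "em", "strong", "sup", "sub", "blockquote",
        "h1", "h2", "h3", "h4", "h5", "h6"}
# Presentational tags mapped onto structural equivalents.
RENAME = {"i": "em", "b": "strong"}
# Elements whose text content should not reach the output at all.
SKIP = {"head", "style", "script"}

def xml_escape(text: str) -> str:
    """Escape &, <, > and turn every non-ASCII character into a numeric reference."""
    text = escape(text, quote=False)
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

class StructureOnly(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # nesting depth inside SKIP elements

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip += 1
        elif not self.skip and RENAME.get(tag, tag) in KEEP:
            self.out.append(f"<{RENAME.get(tag, tag)}>")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip = max(0, self.skip - 1)
        elif not self.skip and RENAME.get(tag, tag) in KEEP:
            self.out.append(f"</{RENAME.get(tag, tag)}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(xml_escape(data))

def to_plain_structure(html: str) -> str:
    parser = StructureOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

Tags outside the whitelist (span, font, div and so on) are dropped while their text is kept, so fonts and margins disappear but the words survive.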

Sander
 

Sander Voerman

No, that would completely retain the original layout of the Word document.
Instead, what I need is to preserve document *structure* while replacing the
*style* of the original document with the house style of my web pages.

By now I have already found out about Microsoft HTML Filter 2.0 and HTML
Tidy, so people who were going to tell me about them can save their efforts
:) HTML Filter is useful for removing all Office XML, and HTML Tidy does a
great job of converting special characters to XML escape codes and
generating proper XHTML. I am still looking for more advanced tools though,
especially one that allows me to automate the style conversions.
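
For the style conversion itself, one possible approach is a lookup table from Word's class names to the site's own. This is only a sketch in Python: `MsoNormal` and `MsoTitle` are the Word names mentioned earlier in the thread, while `essay-title`, `HOUSE_CLASSES` and `apply_house_style` are hypothetical names invented for the example.

```python
import re

# Word class name -> house-style class name (None means "drop the attribute").
# The right-hand values are hypothetical; a real site would use its own CSS names.
HOUSE_CLASSES = {
    "MsoNormal": None,
    "MsoTitle": "essay-title",
}

# Matches class=Name, class="Name" and class='Name' (single class names only).
CLASS_ATTR = re.compile(r"""\s+class=(["']?)([\w-]+)\1""")

def apply_house_style(html: str, mapping=HOUSE_CLASSES) -> str:
    def swap(match):
        name = match.group(2)
        if name not in mapping:
            return match.group(0)  # unknown class: leave it for manual review
        new = mapping[name]
        return f' class="{new}"' if new else ""
    return CLASS_ATTR.sub(swap, html)
```

Leaving unknown classes untouched makes it easy to search the output for anything the table does not yet cover.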

Sander


Graham Mayor said:
Have you thought about posting them as PDF files?

--
<>>< ><<> ><<> <>>< ><<> <>>< <>>< ><<>
Graham Mayor - Word MVP
E-mail (e-mail address removed)
Web site www.gmayor.dsl.pipex.com
Word MVP web site www.mvps.org/word
<>>< ><<> ><<> <>>< ><<> <>>< <>>< ><<>



 

lostinspace

----- Original Message -----
From: Sander Voerman
Newsgroups: microsoft.public.word.docmanagement
Sent: Sunday, August 10, 2003 6:40 AM
Subject: Re: Converting Word to GOOD HTML




Sander,
I should have replied to this sooner. :-(
The only feasible way to remove all the bloat which Word adds when creating
HTML is by cutting and pasting the text into Notepad.
This presents two issues for you:
1) Loss of CSS styles
2) Loss of the possibility of some type of batch conversion

Creating ANY web pages with Word is very TABOO.
The recent versions of Word add VML and other junk. All the criticism which
is given to Microsoft FrontPage is a direct result of FP users cutting and
pasting Word document text directly into FP.

There is no easy way and no simple solution to accomplish what you desire
:-(
Roll up your sleeves and get to work.
 
