c# Word 2007 API InsertFile (html) adding strange characters in DO

Jay

Hey all,

We currently have a process that takes HTML from a rich text box and inserts
it into a DOCX file via the Range.InsertFile method. The problem is that the
resulting Word document ends up with strange characters after InsertFile is
executed.

Our process is to create an HTML file for each of the rich text boxes to be
displayed in the DOCX file. We then locate the range in the DOCX file to be
the target location of the HTML and then use the InsertFile method.

During this process if you stop to look at the generated HTML file, there
are none of these characters inside the HTML document that is inserted. Word
2007 must be doing something inside the InsertFile method, but I'm not sure
how we can even begin to correct the issue.

Sample characters include:

* "Â" - appears to be inserted for "some" line-breaks in the word document
* "€™" - appears to be inserted for "some" single quotes
* "€œ" - appears to be inserted for "some" double-quote (left)
* "€�" - appears to be inserted for "some" double-quote (right)
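
The sequences listed above are consistent with UTF-8 bytes being decoded as Windows-1252 (the Western "ANSI" code page), which could happen if Word does not recognize the inserted HTML file as UTF-8. That cause is not confirmed in this thread. The C# sketch below is illustrative only: `Utf8Bytes` and its methods are made-up names, not part of the original process. It prints the UTF-8 bytes of the intended characters so they can be compared with a Windows-1252 chart.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original project.
static class Utf8Bytes
{
    // Returns the UTF-8 encoding of a string as hyphen-separated hex bytes.
    public static string Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    public static void PrintSamples()
    {
        // Windows-1252 reads C2 as "Â" and A0 as a non-breaking space.
        Console.WriteLine(Hex("\u00A0")); // non-breaking space -> C2-A0
        // Windows-1252 reads E2 as "â", 80 as "€" and 99 as "™".
        Console.WriteLine(Hex("\u2019")); // right single quote -> E2-80-99
        // Windows-1252 reads 9C as "œ".
        Console.WriteLine(Hex("\u201C")); // left double quote  -> E2-80-9C
        // Windows-1252 has no character assigned to 9D.
        Console.WriteLine(Hex("\u201D")); // right double quote -> E2-80-9D
    }
}
```

If this is what is happening, the unidentified character in the fourth sample would correspond to byte 0x9D, which has no assignment in Windows-1252 and therefore would not appear on an ANSI chart.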

Due to this problem we have added a method to "Clean" the DOCX file with
these strange characters. After doing all of the required InsertFile calls,
this clean method is called. We were able to replace the "Â" character with
an empty string and this appears to be working.

Sample clean method:

string capitalAWithCarrot = Convert.ToString((char)194);   // "Â" (U+00C2)
string smallAWithCarrot = Convert.ToString((char)226);     // "â" (U+00E2)
string euroSymbol = Convert.ToString((char)8364);          // "€" (U+20AC)
string trademarkSymbol = Convert.ToString((char)8482);     // "™" (U+2122)
string ohEeSymbol = Convert.ToString((char)339);           // "œ" (U+0153)
string apostropheSymbol = Convert.ToString((char)39);      // "'" (U+0027)
string leftDoubleQuote = Convert.ToString((char)8220);     // "“" (U+201C)
string rightDoubleQuote = Convert.ToString((char)8221);    // "”" (U+201D)

findTag = capitalAWithCarrot;
replaceWithValue = string.Empty;
this.FindAndReplaceTextInDOCX(doc, findTag, replaceWithValue, queueItem);

Each one of the above is passed to a method (FindAndReplaceTextInDOCX())
that does the following:

object replaceAllAsObject = WdReplace.wdReplaceAll;

// loop through each StoryRange (section of Word doc)
foreach (Range tmpRange in doc.StoryRanges)
{
    // set text to find and replace
    tmpRange.Find.ClearFormatting();
    tmpRange.Find.Text = findMe;
    tmpRange.Find.Replacement.Text = replaceWithMe;

    // set to find continue so dialog to continue is not displayed
    tmpRange.Find.Wrap = WdFindWrap.wdFindContinue;

    // perform replacement...passing in find and replace All
    tmpRange.Find.Execute(
        ref missing, ref missing, ref missing, ref missing, ref missing,
        ref missing, ref missing, ref missing, ref missing, ref missing,
        ref replaceAllAsObject,
        ref missing, ref missing, ref missing, ref missing);
}
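
Rather than replacing each stray character separately, the garbled sequences could be mapped back to the characters they most likely stood for. The sketch below is an assumption-laden illustration, not the poster's code: `MojibakeRepair` is a hypothetical name, and the table assumes UTF-8 bytes were misread as Windows-1252. The U+009D entry in particular is a guess at the "box" character.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original project.
static class MojibakeRepair
{
    // Garbled sequence -> character it most likely stood for.
    // Longer sequences come first so their parts are not replaced early.
    public static readonly KeyValuePair<string, string>[] Pairs =
    {
        new KeyValuePair<string, string>("\u00E2\u20AC\u2122", "\u2019"), // right single quote
        new KeyValuePair<string, string>("\u00E2\u20AC\u0153", "\u201C"), // left double quote
        new KeyValuePair<string, string>("\u00E2\u20AC\u009D", "\u201D"), // right double quote (assumed)
        new KeyValuePair<string, string>("\u00C2", ""),                   // stray capital A with circumflex
    };

    // Applies every pair, in order, to an in-memory string.
    public static string Apply(string text)
    {
        foreach (KeyValuePair<string, string> pair in Pairs)
        {
            text = text.Replace(pair.Key, pair.Value);
        }
        return text;
    }
}
```

The same pairs could be passed one at a time to FindAndReplaceTextInDOCX() in place of the single-character replacements. Note that the last entry also removes any legitimate "Â" in the document text.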

This works fine in most cases. However, I cannot identify the ANSI character
value for one of the injected characters. The 4th example above is actually a
euro sign followed by a square box with a question mark inside it. I
searched an ANSI character set article
(http://www.alanwood.net/demos/ansi.html) but the square with question mark
character was not there. Any ideas as to how I would find out what character
that is?
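
One way to identify an unknown character is to print its numeric code point instead of relying on how it is rendered. A minimal sketch follows, assuming the offending text can be obtained as a .NET string (for example from Range.Text); `CodePointDump` is a hypothetical name.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original project.
static class CodePointDump
{
    // Formats each UTF-16 code unit of the input as U+XXXX, separated by spaces.
    public static string Describe(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }
}
```

Calling Describe on a string made of a euro sign followed by U+009D returns "U+20AC U+009D". The value reported for the box character can then be used with (char) in the clean method.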

Ideally, we would like to figure out why the InsertFile method is adding
these characters to begin with, but if not, does anyone know of a way we can
find and replace the characters that I listed above?

FYI - I posted this same issue in the VSTO forum but thought I'd have more
luck here.

Thanks,
Jay
 
Paul Shapiro

I don't remember now if it solved all the problems, but below is the code I
used to do the same thing. Maybe adding the <head> section helps, and I do
remember that the file had to be created in UTF-8 format, not ASCII or
Unicode.
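
For the C# side, a comparable step might look like the sketch below. It writes the HTML as UTF-8 with a byte order mark and declares a matching charset in the <head>. `HtmlTempFile` is a hypothetical name, and whether this is enough for Word 2007 to read the file correctly is not confirmed in this thread.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original project.
static class HtmlTempFile
{
    // Wraps the fragment in a minimal page and writes it as UTF-8 with a BOM.
    // The declared charset is utf-8 so that it matches the bytes on disk.
    public static string Write(string path, string bodyHtml)
    {
        string html =
            "<html><head>" +
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />" +
            "</head><body>" + bodyHtml + "</body></html>";

        // UTF8Encoding(true) makes File.WriteAllText emit the EF BB BF preamble.
        File.WriteAllText(path, html, new UTF8Encoding(true));
        return path;
    }
}
```

The returned path can then be passed to Range.InsertFile as before.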

I am now using an alternate approach to inserting HTML into a Word document.
It's a little kludgy, but it worked better for me than InsertFile. I put the
HTML onto the clipboard, specifying that it's in HTML format, and then paste
it into the document. I can send you a copy of the code if you email me.
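
The clipboard code itself was not posted, so the following is only a guess at what such an approach involves, written in C#. The Windows HTML clipboard format expects a short header with byte offsets (counted in UTF-8) in front of the markup. `CfHtml` is a hypothetical helper that builds that payload; the clipboard and paste calls in the comments are untested here.

```csharp
using System;
using System.Text;

// Hypothetical helper, not the code referred to in this post.
static class CfHtml
{
    // Offsets are zero-padded to a fixed width so the header length is
    // known before the real values are filled in.
    const string HeaderTemplate =
        "Version:0.9\r\n" +
        "StartHTML:{0:D10}\r\n" +
        "EndHTML:{1:D10}\r\n" +
        "StartFragment:{2:D10}\r\n" +
        "EndFragment:{3:D10}\r\n";

    // Builds the clipboard payload for an HTML fragment.
    public static string Build(string fragment)
    {
        const string prefix = "<html><body><!--StartFragment-->";
        const string suffix = "<!--EndFragment--></body></html>";

        // The header is pure ASCII, so its character count equals its byte count.
        int startHtml = string.Format(HeaderTemplate, 0, 0, 0, 0).Length;
        int startFragment = startHtml + Encoding.UTF8.GetByteCount(prefix);
        int endFragment = startFragment + Encoding.UTF8.GetByteCount(fragment);
        int endHtml = endFragment + Encoding.UTF8.GetByteCount(suffix);

        return string.Format(HeaderTemplate, startHtml, endHtml, startFragment, endFragment)
            + prefix + fragment + suffix;
    }

    // Possible usage (needs System.Windows.Forms and an STA thread):
    //   Clipboard.SetText(CfHtml.Build(html), TextDataFormat.Html);
    //   range.Paste();
}
```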

CODE USING Range.InsertFile:
Dim strFile As String      'Temporary file used for file-based insertion
Dim strFileText As String  'Text to be written to the temp file

strFileText = _
    "<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" " _
    & """http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">" _
    & "<HTML xmlns=""http://www.w3.org/1999/xhtml"">" _
    & "<HEAD><meta http-equiv=""Content-Type"" content=""text/html; charset=utf-16"" /></HEAD>" _
    & "<BODY>" _
    & strHTMLText _
    & "</BODY></HTML>"

'File will be created in the temp folder
strFile = pjsGetTempDir() & "wdInsertHTML.html"
'strFile = "C:\Temp\wdInsertHTML.html" 'For debugging
If pjsFileCreateUTF8( _
    strFilename:=strFile _
    , fOverwriteExisting:=True _
    , strText:=strFileText _
    , strCharacterSet:="unicode" _
    ) Then
    'http://word.mvps.org/faqs/macrosvba/GetRngToEndOfInsertFile.htm
    'suggests the workaround using oRangeEnd to solve this issue:
    'When you use MyRange.InsertFile, the range ends up collapsed
    'at the start of the inserted text rather than at the end!
    '.InsertFile seems to add a trailing Cr.
    'Using the workaround with oRangeEnd we can delete the added Cr.
    With oRange
        Set oRangeEnd = oRange.Duplicate
        oRangeEnd.InsertParagraph 'This range will now expand
        .InsertFile FileName:=strFile, ConfirmConversions:=False
        oRangeEnd.Characters.Last.Delete 'The Cr we added above
        oRangeEnd.Characters.Last.Delete 'The Cr added by .InsertFile
    End With
    Kill strFile
End If
 
