Euro symbol issues

Hi,
I'm having all kinds of problems with Word documents that contain euro
symbols.

I have an ASP.NET page (in VB.NET) that reads a Word document into a
variable and then uses a regular expression to find particular
sections of text that are delimited by custom delimiters, i.e.
<<Start-of-content>> ... content ... <<End-of-content>>. The idea
behind this approach was to be able to count the characters/lines in
the content blocks while ignoring the sections that are
not inside these delimiters.

I know that the format of a Word document is pretty strange when you
look at it in plain text, but my algorithm was working until the Word
document contained a euro symbol.

Here's the code:
'Singleline lets "." match line feeds as well; Multiline only changes ^ and $
Dim oFileRegex As New Regex("<<Start-of-content>>(.+?)<<End-of-content>>", _
    RegexOptions.Singleline Or RegexOptions.IgnoreCase)
'put all matches in the collection
Dim collMatches As MatchCollection = oFileRegex.Matches(sFileContent)
Dim oMatch As Match
Dim MatchText As String
Dim LineText As String
Dim i As Integer
Dim charCount As Integer = 0
Dim lineCount As Double = 0
Dim tempLineCount As Double = 0

'line counting section using split function
Dim arrMatchText() As String 'string array
Dim j As Integer = 0

'iterate over the collection
For i = 0 To collMatches.Count - 1
    'Trim only strips leading and trailing spaces
    MatchText = Trim(collMatches(i).ToString())
    'replace vbCrLf with vbCr - \r\n with \r
    MatchText = Replace(MatchText, vbCrLf, vbCr)

    If MatchText.Length > 0 Then
        arrMatchText = MatchText.Split(ControlChars.Cr) 'create the array of lines
        For j = 0 To arrMatchText.GetUpperBound(0)
            'As long as the text captured in the array index is NOT just a
            'tag then increment the counters
            If arrMatchText(j) <> "<<Start-of-content>>" _
            AndAlso arrMatchText(j) <> "<<End-of-content>>" Then
                'now do the character counting - remove the tags before
                'estimating line length
                Try
                    LineText = arrMatchText(j).Replace("<<Start-of-content>>", "")
                    LineText = LineText.Replace("<<End-of-content>>", "")
                    charCount += LineText.Length
                    tempLineCount = LineText.Length / CharactersPerLine()
                    lineCount += tempLineCount
                Catch
                    'any failure aborts the count; Return already exits the function
                    Return -1
                End Try

            End If
        Next

    End If
Next

As you can see, the MatchText String variable holds the content of
each match for the regular expression. However, when a euro
symbol is present in the document, MatchText only "appears" to hold
the characters up to the euro symbol. What I mean by this is that
when I view the value of MatchText in the debugger I can only see the
text up to (but not including) the euro symbol. This may mean nothing
more than that the IDE cannot display characters like the euro symbol
in the debug window, but it seemed too much of a coincidence for my
liking, so I decided to look at the Word file in a text editor. Using
TextPad or even Notepad I noticed that a) the euro symbol rendered as
75 null characters (00 in a hex editor) followed by a "¬" character
and b) all the content AFTER the euro symbol rendered with what
appeared to be "double spacing". On closer inspection, this "double
spacing" was because every second character was a null character. This
also had the side effect that the delimiter tags themselves were
interleaved with null characters, which the regular expression of
course did not catch. This greatly affected the counting algorithm and
has me totally baffled.
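
One possible reading of the pattern described above, offered as an assumption rather than anything confirmed in this post: the text after the euro symbol may be stored as 16-bit characters (UTF-16, little-endian) instead of one byte per character. In that encoding every ASCII character is followed by a 00 byte, and the euro symbol (U+20AC) becomes the two bytes AC 20, which an 8-bit editor would show as "¬" followed by a space. The small console sketch below only prints those byte patterns; the module name `EuroBytesDemo` and the helper `ToHex` are illustrative names, and it does not explain the run of 75 null characters.

```vbnet
Imports System
Imports System.Text

Module EuroBytesDemo
    ' Illustrative helper: the bytes of s in the given encoding as space-separated hex.
    Function ToHex(ByVal s As String, ByVal enc As Encoding) As String
        Return BitConverter.ToString(enc.GetBytes(s)).Replace("-", " ")
    End Function

    Sub Main()
        Dim euro As String = ChrW(&H20AC).ToString()
        ' Encoding.Unicode is UTF-16 little-endian, so the low byte comes first.
        Console.WriteLine(ToHex(euro, Encoding.Unicode))     ' prints AC 20
        ' ASCII text in UTF-16LE: every character is followed by a 00 byte.
        Console.WriteLine(ToHex("<<End", Encoding.Unicode))  ' prints 3C 00 3C 00 45 00 6E 00 64 00
    End Sub
End Module
```

If that assumption holds, the "double spacing" and the tags that no longer match would both be consequences of looking at 16-bit text through an 8-bit view.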

So, what is it about the euro symbol that its mere inclusion in a
Word document causes such dramatic changes to the format of the file
(in particular to all characters AFTER it)? If I remove the euro
symbol from the file, this formatting anomaly disappears and
consequently my counting algorithm works.

What kind of workarounds could anyone propose so that I can still
adequately capture the number of characters/lines in my content
sections? Do I need to impose some kind of encoding on the file
contents to remove this anomaly?
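
To make the encoding question concrete, here is a rough sketch of one direction, under stated assumptions and not a tested fix. It assumes the file is read as raw bytes, that the affected runs are UTF-16 little-endian, and that a block is stored entirely in one form or the other. `NullTolerant` and `ExtractBlocks` are hypothetical helpers that do not exist in the code above. The bytes are first decoded as ISO-8859-1 so that every byte becomes exactly one character and match offsets equal byte offsets; the tags are matched with an optional null after each character; any captured block that contains nulls is then decoded again as UTF-16LE.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports System.Text.RegularExpressions

Module ContentBlockSketch
    ' Hypothetical helper: pattern that matches the tag whether it is stored
    ' one byte per character or with a 00 byte after every character.
    Function NullTolerant(ByVal tag As String) As String
        Dim sb As New StringBuilder()
        For Each c As Char In tag
            sb.Append(Regex.Escape(c.ToString())).Append("\x00?")
        Next
        Return sb.ToString()
    End Function

    ' Hypothetical helper: returns the text of every delimited block in the raw bytes.
    Function ExtractBlocks(ByVal raw As Byte()) As List(Of String)
        ' ISO-8859-1 (code page 28591) maps each byte to exactly one Char.
        Dim latin1 As Encoding = Encoding.GetEncoding(28591)
        Dim text As String = latin1.GetString(raw)
        Dim pattern As String = NullTolerant("<<Start-of-content>>") & "(.+?)" & _
            NullTolerant("<<End-of-content>>")
        Dim rx As New Regex(pattern, RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        Dim blocks As New List(Of String)()
        For Each m As Match In rx.Matches(text)
            Dim g As Group = m.Groups(1)
            If g.Value.IndexOf(ChrW(0)) >= 0 Then
                ' Assumption: nulls mean this run is UTF-16LE, so decode the same bytes again.
                blocks.Add(Encoding.Unicode.GetString(raw, g.Index, g.Length))
            Else
                ' 8-bit run: one byte per character, so the length is already right.
                blocks.Add(g.Value)
            End If
        Next
        Return blocks
    End Function

    Sub Main()
        Dim sample As String = "<<Start-of-content>>Price: " & ChrW(&H20AC) & "5<<End-of-content>>"
        For Each block As String In ExtractBlocks(Encoding.Unicode.GetBytes(sample))
            Console.WriteLine(block.Length)   ' prints 9
        Next
    End Sub
End Module
```

The sample in `Main` is synthetic, not a real Word file. A block that begins as 8-bit text and switches to 16-bit text partway through would not be handled by this sketch, and nothing here accounts for the other binary structures in the file.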

All suggestions are very welcome. I'd prefer a "fix" to my current
approach just because of time constraints, but I'm also interested in
radically changing my approach if that would make the solution more
elegant, robust and manageable.

By the way, I didn't want to install Word on the server (and
automate it using COM) because of the numerous (documented) issues
that seem to surround this approach. This is really not an approach I
can go with unless someone can give me VERY STRONG evidence that it
can be made to work in a stable and scalable manner.

Thanks for your time and patience (sorry about the long post).
Regards,
Jeremy
 
