High Performance Programming in MS Word

Nick

Hello,

I am new to MS Word programming. I am currently planning a project
that aims to:

1. Read every word in a Word document, parse it, and analyze it using
multiple data mining algorithms (they are very CPU-intensive algorithms!)

2. Bold and highlight the analyzed words in the same document

I really have no idea where to start. My main concern is to choose
an efficient method to implement the system.

After some searching on Google, I found some suggestions:

1. Pure VBA implementation
2. C++/COM + VBA

Some people say C++/COM + VBA is even slower than a pure VBA
implementation. Is that true? I would like to hear more suggestions on
high performance programming in WinWord.

Thanks

Nick
 
DA

Hi Nick,

Excuse my ignorance, but I'm unclear as to what
you're trying to do here. When you say you're parsing
each word, where are you going with it? Are these
algorithms you talk about outside of the VBA environment?
If that's the case, your performance issues are unlikely
to be in Word.

Also, how intensive can you get with analyzing a word?
Have you tried anything yet and run into any performance
issues? Perhaps if you can add a few more details to
your original post, we may be able to help you a bit
better.

Regards,
Dennis
 
Word Heretic

G'day Nick,

<chuckles> You too huh. It's an interesting area. There are two main
methods for you to consider here.

Method 1 - Formatting is NOT important to your parse.

From VBA
Save the bloody file as text
Use DocStats as a rough guide to your word count.
Call your C# to go sicko speeds.

From C#
Serialize word structures as per MS Word (any non-alpha post alpha is
a new word start) into a bloody huge array which you can predetermine
using the docstats result as a parm.

Keep a 'done' list of serialised words worthy of marking. Re-enter the
Word document, obtain Document.Content.Words(offset) and mark
accordingly.
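The word-splitting rule in Method 1 could be sketched roughly as below. This is only an illustration in C++ (chosen because Nick already has a C/C++ implementation), not Steve's code. The names `WordSpan` and `splitWords` are made up for the example, and `expectedWords` stands in for whatever word count Word reports, used purely to pre-size the array.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One word found in the plain-text export (hypothetical helper type).
struct WordSpan {
    std::size_t start;  // character offset of the word in the text
    std::string text;   // the word itself
};

// Split text into runs of alphabetic characters: a non-alpha character
// that follows an alpha character ends the current word.
// expectedWords is only a hint for reserving memory up front.
std::vector<WordSpan> splitWords(const std::string& text,
                                 std::size_t expectedWords = 0) {
    std::vector<WordSpan> words;
    words.reserve(expectedWords);
    std::size_t i = 0;
    while (i < text.size()) {
        // Skip anything that cannot be part of a word.
        while (i < text.size() &&
               !std::isalpha(static_cast<unsigned char>(text[i]))) {
            ++i;
        }
        const std::size_t begin = i;
        // Consume the alphabetic run.
        while (i < text.size() &&
               std::isalpha(static_cast<unsigned char>(text[i]))) {
            ++i;
        }
        if (i > begin) {
            words.push_back({begin, text.substr(begin, i - begin)});
        }
    }
    return words;
}
```

Word's own Words collection does not follow quite the same rule (punctuation, for instance, comes back as separate items), so the array index here is best treated as a starting point to check against the document rather than a guaranteed match for Words(offset).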



Method 2 - Formatting is important

For extreme speed, I would probably use a variant of Method 1 that
uses an HTML output to parse.

OTHERWISE

Any C would only be using Word calls anyway, as who wants to rebuild
an RTF processor? YUCK! Avoid it and stick with VBA: since you won't
need interface wrappers for all your calls, it will probably run a bit
faster for you from VBA.

First up, all the collections are dynamic, so you really want to avoid
doing things like .Para(k) as when k gets to 100, Word has to quickly
serialise the first 100 paras in the defined range to get your answer.

If you move your range start ahead a para at a time and use para 1, it's
much quicker, and it automatically delivers doc end when myRange.Start is
at myRange.End.

You will need to know about Range objects, and then start looking at
.Paragraphs(1).Range.Words(n).Text.

There are obviously some tricks to getting this running really quickly in
VBA. I outline numerous performance enhancements in my Word VBA for
Beginner's book, available from my website for a small fee.


Steve Hudson - Word Heretic
Want a hyperlinked index? S/W R&D? See WordHeretic.com

steve from wordheretic.com (Email replies require payment)


Jonathan West

Nick said:
Hello,

I am new to MS Word programming, currently, I am planning to do a
project in which aims to

1. Read every words in a word document and parse it and analyze it using
multiple data mining algorithms (they are very CPU intensive algorithm!)

Like Steve said, the last thing you need is the formatting of the document
to slow you down. Read the Range.Text property of the document into a
string, and process that. Use whatever language you decide is best for
string handling. The StringBuilder class in VB.NET is good if you have
concatenation to do, or there is an equivalent VB class module produced by
Karl Peterson which also works very fast in VBA. Take a look at
www.mvps.org/vb/

Nick said:
2. Bold and highlight the analyzed words in the same document

If you know where in the original string your analysed word is, you can
probably get to the same character position in the original document. This
might get a bit hairy if the doc has tables & frames in it, so you'll have to
experiment. At worst, you can see how many times a particular word
occurs in the string and use the Find object to get to the equivalent
positions in the document.
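The bookkeeping for that could look something like the sketch below: given the text pulled out of the document and one analysed word, collect the character offset of every whole-word occurrence. It is an illustration in C++ only. `findWholeWord` is a made-up helper name, and treating "whole word" as "not touching another letter" is an assumption of the example. Whether these string offsets line up with positions in the live document is exactly the part that needs experimenting.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Return the offset of every occurrence of `word` in `text` that is not
// glued to another letter on either side (hypothetical helper).
std::vector<std::size_t> findWholeWord(const std::string& text,
                                       const std::string& word) {
    std::vector<std::size_t> hits;
    if (word.empty()) {
        return hits;  // nothing sensible to search for
    }
    auto isLetter = [](char c) {
        return std::isalpha(static_cast<unsigned char>(c)) != 0;
    };
    std::size_t pos = text.find(word);
    while (pos != std::string::npos) {
        const std::size_t end = pos + word.size();
        const bool leftOk = (pos == 0) || !isLetter(text[pos - 1]);
        const bool rightOk = (end == text.size()) || !isLetter(text[end]);
        if (leftOk && rightOk) {
            hits.push_back(pos);
        }
        pos = text.find(word, pos + 1);  // keep scanning after this match
    }
    return hits;
}
```

The number of hits also gives the occurrence count mentioned above, which is what you would need if you fall back to stepping through the document with the Find object.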
Nick said:
I have really no idea where to start with, the main concern is to choose
an efficient method to implement the system.

After some searching in google, there are some suggestions:

1. Pure VBA implementation
2. C++/COM + VBA

Some people said C++/COM + VBA is even slower than pure VBA
implementation. Is it true? I would like to hear more suggestions on
high performance programming in Win Word.

High performance programming is possible in Word VBA; you just have to
choose your tools right and get your algorithm right. I would concentrate
first on the algorithm and get that as well-designed as possible, and then
choose your language/implementation. If you choose to use Word VBA, then
there are all kinds of speedup tricks that can be used, but the first item
of business would be to identify the bottlenecks (e.g. inner nested loops) and
see where the time is being taken.
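One low-tech way to see where the time goes, sketched here in C++ as an illustration only (the helper name `secondsTaken` is made up), is to wrap each suspect stage in a timer and compare the numbers before rewriting anything:

```cpp
#include <chrono>
#include <utility>

// Run `work` once and return how long it took, in seconds.
// steady_clock is used because it never jumps backwards.
template <typename F>
double secondsTaken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Timing the analysis step and the mark-up step separately shows which of the two actually deserves the optimisation effort.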

Remember that it is no good getting the program to be fast if it produces
the wrong answer!
 
Nick

Hi Dennis,

Thanks for your reply.
When you say you're parsing each word.. where are you going with it?
Are these algorithms you talk about outside of the VBA environment?

The algorithm is to find some "feature words" in the original document
via some data mining algorithms. For example, given an article, the
algorithm is applied to the document and some keywords are highlighted.
To keep it simple, you can just think of them as very CPU-intensive
algorithms which analyze each word in a text file.

Currently, I have a pure C/C++ implementation of the algorithm (it needs
20 sec to parse this post, so imagine how intensive it is!), but I can
rewrite it in VBA or as a COM object using VC++.

What I want to know is the pros and cons of doing so (pure VBA vs. COM
object + VBA, or some other approach I don't know about), especially in
terms of speed.


Regards,
Nick
 
Nick

Hi,

First, thanks for your reply.

I think formatting is NOT important for me, so I choose Method 1:
From VBA
Save the bloody file as text
Use DocStats as a rough guide to your word count.
Call your C# to go sicko speeds.

From C#
Serialize word structures as per MS Word (any non-alpha post alpha is
a new word start) into a bloody huge array which you can predetermine
using the docstats result as a parm.

Keep a 'done' list of serialised words worthy of marking. Re-enter the
Word document, obtain Document.Content.Words(offset) and mark
accordingly.

1. Why C#? Wouldn't VC be much faster?

2. How do I call the C# code from VBA? Should I write it as a component?
Sorry, I am new to .NET. For VC, for example, should I use ATL instead?


Regards,
Nick
 
Word Heretic

G'day Nick,

I was purely talking about a compiled DLL vs interpreted host-based
scripting. Whichever language you choose, it makes little difference to
the end result :)

Steve Hudson - Word Heretic
 
