Threaded Word File Conversion

Chuck P

I have around 1000 Word files to convert from .doc to HTML.
It's very slow, so I threaded it. Once threaded, it frequently pukes with
various error messages. I also end up with lots of copies of WinWord in
Task Manager after the program crashes.

I'm not an expert on COM interop, and I read this:

http://msdn2.microsoft.com/en-us/li...services.marshal.releasecomobject(VS.71).aspx
The runtime callable wrapper has a reference count that is incremented every
time a COM interface pointer is mapped to it. The ReleaseComObject method
decrements the reference count of a runtime callable wrapper.

I'm wondering whether the runtime callable wrapper is counting, or trying
to re-use, COM objects that have already been disposed.

Is Word/COM somehow doing something on the main thread and causing the
pukes?


static object workerLocker = new object();
static int runningWorkers = 0;

private void ConvertFiles()
{
    if (!Directory.Exists(txtSourceDirectory.Text))
        txtStatus.Text = "Source Directory does not exist.";
    else if (!Directory.Exists(txtDestinationDirectory.Text))
        txtStatus.Text = "Destination Directory does not exist.";
    else
    {
        Directory.Delete(txtDestinationDirectory.Text, true);
        Directory.CreateDirectory(txtDestinationDirectory.Text);

        string[] SourceFiles =
            Directory.GetFiles(txtSourceDirectory.Text, "*.doc");
        foreach (string fileName in SourceFiles)
        {
            if (!Path.GetFileName(fileName).StartsWith("."))
            {
                lock (workerLocker)
                {
                    runningWorkers++; Monitor.Pulse(workerLocker);
                }
                ThreadPool.QueueUserWorkItem(
                    new WaitCallback(SaveToHTML), fileName);
            }
        }

        lock (workerLocker)
        {
            while (runningWorkers > 0) Monitor.Wait(workerLocker);
        }

        GC.Collect();
    }
}

private void SaveToHTML(Object sfileName)
{
    string fileName = (string)sfileName;

    object Missing = System.Reflection.Missing.Value;
    object readOnly = true;
    object isVisible = false;
    object fileToOpen = (object)fileName;
    object fileToSave =
        (object)Path.Combine(txtDestinationDirectory.Text,
            Path.GetFileName(fileName)).Replace(".doc", ".html");
    object FileFormat = Word.WdSaveFormat.wdFormatFilteredHTML;

    Word._Application word = null;
    Word._Document doc = null;

    word = new Word.Application();
    doc = new Word.Document();

    doc = word.Documents.Open(ref fileToOpen,
        ref Missing, ref readOnly, ref Missing, ref Missing,
        ref Missing, ref Missing, ref Missing, ref Missing,
        ref Missing, ref Missing, ref isVisible, ref Missing,
        ref Missing, ref Missing, ref Missing);

    doc.SaveAs(ref fileToSave,
        ref FileFormat, ref Missing, ref Missing, ref Missing,
        ref Missing, ref Missing, ref Missing, ref Missing,
        ref Missing, ref Missing, ref Missing, ref Missing,
        ref Missing, ref Missing, ref Missing);

    doc.Close(ref Missing, ref Missing, ref Missing);
    NAR(doc);
    doc = null;
    word.Quit(ref Missing, ref Missing, ref Missing); // closes all documents
    NAR(word);
    word = null;

    GC.Collect();
    GC.WaitForPendingFinalizers();

    GC.Collect();
    GC.WaitForPendingFinalizers();

    lock (workerLocker)
    {
        runningWorkers--; Monitor.Pulse(workerLocker);
        SetStatus(runningWorkers.ToString() + " " + fileName +
            Environment.NewLine);
    }

    Application.DoEvents();
}



delegate void SetStringDelegate(string parameter);
void SetStatus(string status)
{
    if (!InvokeRequired)
    {
        txtStatus.Text += status;
        this.Refresh();
    }
    else
        Invoke(new SetStringDelegate(SetStatus), new object[] { status });
}

private void NAR(object o)
{
    // http://support.microsoft.com/default.aspx?scid=kb;EN-US;317109
    try
    {
        System.Runtime.InteropServices.Marshal.ReleaseComObject(o);
    }
    catch { }
    finally
    {
        o = null;
    }
}
 
Jialiang Ge [MSFT]

Hello Chuck,

I have tested your code carefully. The function SaveToHtml is used to
convert ONE doc to ONE html. According to its code, every time we call the
function, we will
Step 1. create an instance of Word (word=new Word.Application())
Step 2. convert the document format.
Step 3. release/close the Word instance (NAR(word)).

Step 2 should be very fast in my experience. However, Steps 1 and 3
generally take a long time. Every call to new Word.Application() creates
a new winword.exe process. This explains why converting the 1000 Word
files is very slow in single-threaded mode, and it also explains why you
see a lot of Word processes in Task Manager when using multiple threads
to do the conversion.

The recommended approach is to create a single Word.Application instance
(or get the active Word.Application instance already running on your
computer), re-use it to convert all the files, and close/release it at
the end. This way, I can assure you the whole process will be much faster
even without multi-threading. (To be honest, I don't think multi-threading
will make it faster in this case.)

Here is an example for your reference:

private void ConvertFiles()
{
    if (!Directory.Exists(txtSourceDirectory.Text))
        txtStatus.Text = "Source Directory does not exist.";
    else if (!Directory.Exists(txtDestinationDirectory.Text))
        txtStatus.Text = "Destination Directory does not exist.";
    else
    {
        Directory.Delete(txtDestinationDirectory.Text, true);
        Directory.CreateDirectory(txtDestinationDirectory.Text);

        Word._Application word = null;
        object Missing = System.Reflection.Missing.Value;

        try
        {
            // try to get a running instance of winword.exe
            word = (Word._Application)Marshal.GetActiveObject("Word.Application");
        }
        catch
        {
            // if no winword is already running, start a new instance of Word
            word = new Word.Application();
        }

        Word._Document doc = null;

        object fileToOpen, fileToSave;
        object FileFormat = Word.WdSaveFormat.wdFormatFilteredHTML;
        object readOnly = true;
        object isVisible = false;

        string[] SourceFiles =
            Directory.GetFiles(txtSourceDirectory.Text, "*.doc");
        foreach (string fileName in SourceFiles)
        {
            if (!Path.GetFileName(fileName).StartsWith("."))
            {
                fileToOpen = (object)fileName;
                fileToSave =
                    (object)Path.Combine(txtDestinationDirectory.Text,
                        Path.GetFileName(fileName)).Replace(".doc", ".html");
                doc = word.Documents.Open(ref fileToOpen,
                    ref Missing, ref readOnly, ref Missing, ref Missing,
                    ref Missing, ref Missing, ref Missing, ref Missing,
                    ref Missing, ref Missing, ref isVisible, ref Missing,
                    ref Missing, ref Missing, ref Missing);

                doc.SaveAs(ref fileToSave,
                    ref FileFormat, ref Missing, ref Missing, ref Missing,
                    ref Missing, ref Missing, ref Missing, ref Missing,
                    ref Missing, ref Missing, ref Missing, ref Missing,
                    ref Missing, ref Missing, ref Missing);

                // close the document
                doc.Close(ref Missing, ref Missing, ref Missing);
                NAR(doc);
                doc = null;
            }
        }

        // quit the instance of Word at the end (closes all documents)
        word.Quit(ref Missing, ref Missing, ref Missing);
        NAR(word);
        word = null;
    }
}
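
One detail of the example above that may be worth tightening is how the
output file name is built. Replace(".doc", ".html") is applied to the whole
combined path, so a destination folder whose name happens to contain ".doc"
would be rewritten as well. A minimal sketch of an alternative, assuming a
hypothetical helper named GetOutputPath (it is not part of the code posted
in this thread), uses Path.ChangeExtension on the file name only:

```csharp
using System;
using System.IO;

static class OutputPaths
{
    // Hypothetical helper: returns "<destinationDirectory>/<name>.html"
    // for a given source document path.
    public static string GetOutputPath(string destinationDirectory, string sourceFile)
    {
        // Take only the file name, so directory names are never modified.
        string name = Path.GetFileName(sourceFile);
        // ChangeExtension swaps just the extension of that file name.
        string htmlName = Path.ChangeExtension(name, ".html");
        return Path.Combine(destinationDirectory, htmlName);
    }
}
```

In the loop, fileToSave could then be assigned from
OutputPaths.GetOutputPath(txtDestinationDirectory.Text, fileName).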

If you insist on using multi-threading, we need to initialize and close the
Word.Application instance in the ConvertFiles function, and do the
conversion job in the SaveToHtml function by re-using that Word instance.

If you have any other concerns or questions, please feel free to let me
know.

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

Delighting our customers is our #1 priority. We welcome your comments and
suggestions about how we can improve the support we provide to you. Please
feel free to let my manager know what you think of the level of service
provided. You can send feedback directly to my manager at:
(e-mail address removed).

==================================================
Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/subscriptions/managednewsgroups/default.aspx#notifications.

Note: The MSDN Managed Newsgroup support offering is for non-urgent issues
where an initial response from the community or a Microsoft Support
Engineer within 1 business day is acceptable. Please note that each follow
up response may take approximately 2 business days as the support
professional working with you may need further investigation to reach the
most efficient resolution. The offering is not appropriate for situations
that require urgent, real-time or phone-based interactions or complex
project analysis and dump analysis issues. Issues of this nature are best
handled working with a dedicated Microsoft Support Engineer by contacting
Microsoft Customer Support Services (CSS) at
http://msdn.microsoft.com/subscriptions/support/default.aspx.
==================================================
This posting is provided "AS IS" with no warranties, and confers no rights.
 
Chuck P

Thanks.
I tried that, and I also tried converting the threads to STA. I read this
weekend that using MTA with COM objects is a no-no.


 
Chuck P

I gave up on the threaded version. Even with STA it's just not reliable:
random crashes.
The single-threaded version with one application object takes seconds per
file. Too slow, but at least it works.


 
Jialiang Ge [MSFT]

Chuck P said:
Using the single threaded with one application object takes
seconds per file. Too slow but at least it works.

Hello Chuck, honestly, I am a little surprised to see it takes seconds per
file. I conducted a test on my side, and it turned out to be only 0.355
seconds on average.

I created a folder with 1000 doc files ranging in size from 120KB to
240KB, then created a new C# Console project in Visual Studio 2005 and
pasted in the code I demonstrated in my last reply. To get the total run
time, I wrote the Main function as:

static void Main(string[] args)
{
    DateTime start = DateTime.Now;
    ConvertFiles();
    DateTime end = DateTime.Now;
    TimeSpan span = end - start;
    Console.WriteLine(span.Minutes.ToString() + "mins " +
        span.Seconds.ToString());
}

The result is "5mins 55", namely, 0.355 seconds per file. I repeated the
test several times, and the result was almost the same each time.
My CPU is an Intel(R) Core(TM)2 [email protected], with 2GB RAM.
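
The arithmetic behind that figure: 5 minutes 55 seconds is 355 seconds, and
355 seconds / 1000 files = 0.355 seconds per file. As a sketch only, the
same measurement could be taken with Stopwatch and reported through
TimeSpan.TotalSeconds, which avoids adding up the Minutes and Seconds parts
by hand. The Timing class and its two methods are hypothetical names
introduced here for illustration:

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Average wall-clock seconds per file for a whole batch.
    public static double AverageSecondsPerFile(TimeSpan total, int fileCount)
    {
        if (fileCount <= 0)
            throw new ArgumentOutOfRangeException("fileCount");
        // TotalSeconds is the whole span in seconds, not just the seconds part.
        return total.TotalSeconds / fileCount;
    }

    // Runs the supplied batch action once and returns the elapsed time.
    public static TimeSpan Measure(Action batch)
    {
        Stopwatch watch = Stopwatch.StartNew();
        batch();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Assuming ConvertFiles is callable from Main as in the test above, usage
would be along the lines of TimeSpan total = Timing.Measure(ConvertFiles);
followed by Timing.AverageSecondsPerFile(total, 1000).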

So I suspect the slow performance on your side comes from other factors.
Are the files to be converted very large (e.g. 10MB)? Or is it limited by
network bandwidth, if the files are not on the local machine?

Regarding the threading issue: the problem is most likely that your COM
objects are being called from different threads. Office uses the STA
apartment model by default, so using its COM objects from multiple threads
may cause interop exceptions such as "object disconnected from its RCW".
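
One common way to respect that constraint is to create and use every Word
object on a single dedicated STA thread, and have other threads only hand
it work. The sketch below illustrates the queueing side of that idea; it is
not tested Word code. StaWorker is a hypothetical class, the Word calls are
indicated only as comments, and the apartment state has to be set before
the thread starts (STA is only available on Windows):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper: runs queued jobs one at a time on one background thread.
sealed class StaWorker
{
    private readonly Queue<Action> jobs = new Queue<Action>();
    private readonly object gate = new object();
    private readonly Thread thread;
    private bool closed;

    public StaWorker()
    {
        thread = new Thread(Run);
        // Must happen before Start(); returns false if STA could not be set.
        thread.TrySetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
    }

    // Called from any thread to hand a job to the worker thread.
    public void Enqueue(Action job)
    {
        lock (gate) { jobs.Enqueue(job); Monitor.Pulse(gate); }
    }

    // Signals that no more jobs are coming, then waits for the queue to drain.
    public void CloseAndWait()
    {
        lock (gate) { closed = true; Monitor.Pulse(gate); }
        thread.Join();
    }

    private void Run()
    {
        // Assumption: a real converter would create Word.Application here,
        // so that it is created and used on this one thread only.
        while (true)
        {
            Action job;
            lock (gate)
            {
                while (jobs.Count == 0 && !closed) Monitor.Wait(gate);
                if (jobs.Count == 0) break; // closed and fully drained
                job = jobs.Dequeue();
            }
            try { job(); } // e.g. open, SaveAs and close one document
            catch (Exception ex) { Console.Error.WriteLine(ex.Message); }
        }
        // Assumption: Quit and release of the Word instance would go here,
        // still on the same thread.
    }
}
```

Because the jobs run strictly one after another, this arrangement is not
expected to convert files faster than the single-threaded loop; its purpose
is to keep the calling thread free while the conversions run.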

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
Jialiang Ge [MSFT]

Hello Chuck,

I am writing to check if there is anything I can do for you in this thread.
If you need further assistance, feel free to let me know. I will be more
than happy to be of assistance.

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
Chuck P

Thanks for the help.
I am retrieving the files from a network share, but the network is a really
fast one.
Interestingly, once I process a couple hundred files the convert time
doubles.
I ended up going back to disposing the Word and doc objects after every
convert. If I don't, it blows up after 50 files or so.
I am just going to do the converts at night.
 
Jialiang Ge [MSFT]

Hello Chuck,

Chuck P said:
Interesting once I process a couple hundred files the convert time
doubles, I ended up going back to disposing the word and doc
objects after every convert.

It is necessary to dispose of the doc object after each conversion;
otherwise, as you have seen, the process will blow up after 50 files or so,
because one Word application instance cannot keep too many documents open
at the same time.
However, creating and disposing of the Word object on every conversion will
definitely slow down the whole process. This is why I suggested using only
one Word application instance in my previous replies.
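
A sketch of that per-document cleanup follows. ComCleanup.Release is a
hypothetical stand-in for the NAR method posted earlier in the thread. It
takes the variable by ref, so the caller's reference is actually cleared
(assigning null to a by-value parameter only clears the local copy), and it
checks Marshal.IsComObject first so that it can also be called with
ordinary objects:

```csharp
using System;
using System.Runtime.InteropServices;

static class ComCleanup
{
    // Releases a COM wrapper (if the object is one) and clears the caller's variable.
    public static void Release<T>(ref T comObject) where T : class
    {
        if (comObject == null) return;
        try
        {
            // Only runtime callable wrappers can be passed to ReleaseComObject.
            if (Marshal.IsComObject(comObject))
                Marshal.ReleaseComObject(comObject);
        }
        finally
        {
            // Because the parameter is 'ref', this clears the caller's variable.
            comObject = null;
        }
    }
}
```

Inside the conversion loop, the Open/SaveAs/Close calls could sit in a try
block with ComCleanup.Release(ref doc) in the matching finally, so the
document is released even if one file fails, while word.Quit(...) and
ComCleanup.Release(ref word) run once after the loop.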

Chuck P said:
I am just going to do the converts at night.

OK. If you encounter any other problems, please feel free to let me know.
We are always with you.

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support
