Extracting text from a Publisher file

deek

I have a large Publisher file which was created by someone else, and I need to
extract the text from it. The easiest way seems to be
File->SaveAs->filename.txt. However, when I do this the text comes out in a
strange order (e.g. pages out of order, headers/footers crammed together,
etc.). The same thing happens when I export as HTML or as a Word document. I
find this frustrating and odd, since the text looks right when I view the file
in MS Publisher 2007.

Question #1: Why does this happen? Is it related to the order in which the
text was added to the publisher file? Is there anything I can do about it?

Question #2: I am also looking at potentially manipulating this file from a
C# program using the interop assemblies and I was wondering if anyone can
recommend a good link (or a good book, or whatever...) to get me going
quickly with the Publisher interop assembly.

Thanks in advance!
-Deek
 
Ed Bennett

deek said:
Question #1: Why does this happen? Is it related to the order in which the
text was added to the publisher file? Is there anything I can do about it?

I have no idea why this happens, although it may be related to the
z-ordering of the page.
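
If the export order really does follow z-order (a guess, not something confirmed in this thread), one possible workaround when pulling the text out programmatically is to record each text box together with its page and position, then sort by position rather than by the order the shapes come back in. A minimal C# sketch of that idea follows. TextBlock and ReadingOrder are hypothetical names invented for this illustration, and it assumes you can read a page number and top/left coordinates for every shape from the object model.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical holder for one text box: where it sits and what it says.
public class TextBlock
{
    public int Page;
    public double Top;
    public double Left;
    public string Text;

    public TextBlock(int page, double top, double left, string text)
    {
        Page = page;
        Top = top;
        Left = left;
        Text = text;
    }
}

public static class ReadingOrder
{
    // Sorts in place: by page, then top-to-bottom, then left-to-right.
    public static void Sort(List<TextBlock> blocks)
    {
        blocks.Sort(delegate(TextBlock a, TextBlock b)
        {
            int c = a.Page.CompareTo(b.Page);
            if (c == 0) c = a.Top.CompareTo(b.Top);
            if (c == 0) c = a.Left.CompareTo(b.Left);
            return c;
        });
    }
}
```

This is only a rough approximation of reading order: a multi-column layout, for example, would need a smarter comparison than plain top-then-left.
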
Question #2: I am also looking at potentially manipulating this file from a
C# program using the interop assemblies and I was wondering if anyone can
recommend a good link (or a good book, or whatever...) to get me going
quickly with the Publisher interop assembly.

The only systematic resource is the Publisher Object Model/VBA help
files. In older versions you can open the file VBAPB10.CHM from the
Office\1033 folder; in 2007 you have to hit F1 from the VBA IDE, or use
the online version at http://msdn.microsoft.com/en-us/library/aa437535.aspx

There's also a series of technical articles written for Publisher 2003
available at
http://msdn.microsoft.com/en-us/library/bb191021(office.11).aspx
 
deek

Ed Bennett said:
The only systematic resource is the Publisher Object Model/VBA help
files. In older versions you can open the file VBAPB10.CHM from the
Office\1033 folder; in 2007 you have to hit F1 from the VBA IDE, or use
the online version at http://msdn.microsoft.com/en-us/library/aa437535.aspx

There's also a series of technical articles written for Publisher 2003
available at
http://msdn.microsoft.com/en-us/library/bb191021(office.11).aspx

Ed, Thanks! Those were helpful articles.

One of the things I'm trying to do programmatically is to get the text from
an entire document and preserve at least the simple formatting. To do this, I
have written a C# program using the interop assemblies that iterates through
the entire document and grabs the text from all the shapes. The code
basically looks like this:

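-----------------
// Needs references to the Publisher interop assembly and office.dll (MsoTriState).
string fulltext = "";
Microsoft.Office.Interop.Publisher.Application pub =
    new Microsoft.Office.Interop.Publisher.Application();
// Open read-only, without adding the file to the recent-files list.
Microsoft.Office.Interop.Publisher.Document doc = pub.Open(@"C:\mydoc.pub", true, false,
    PbSaveOptions.pbDoNotSaveChanges);
Microsoft.Office.Interop.Publisher.Pages pgs = doc.Pages;

foreach (Microsoft.Office.Interop.Publisher.Page pg in pgs)
{
    foreach (Microsoft.Office.Interop.Publisher.Shape shp in pg.Shapes)
    {
        // Text boxes: append the raw, unformatted text.
        if (shp.HasTextFrame == MsoTriState.msoTrue)
            fulltext += shp.TextFrame.TextRange.Text;
        else if (shp.HasTable == MsoTriState.msoTrue)
            fulltext += "TABLE!!!!"; // To do!
    }
}
-----------------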

This seems to work pretty well. I iterate through all the pages and then all
the shapes on each page and grab the text. But what I can't figure out is how
to tell whether there are regions of text that are bold or italic or whatever.
All I get is the raw text from the shape without any formatting markup. I'm
probably missing something pretty basic here, but I can't seem to find what
I'm looking for.

I've looked through the object model (which is quite onerous) and I'm still
stumped. Any pointers or help is welcome!

Thanks!
-Deek
 
Ed Bennett

deek said:
But what I can't figure out is how to tell whether there are regions of
text that are bold or italic or whatever. All I get is the raw text
from the shape without any formatting markup.

The TextRange.Font object has properties that tell you whether the text in
the given range is Bold, Italic, and so on. Obviously the text of an
entire text box will typically be a mixture of many types of formatting,
so for the whole range those properties will return a mixed tristate
(msoTriStateMixed). In this case you can use the TextRange.Characters(),
Words(), Lines() and Paragraphs() methods to return subranges and check
the Font of each one.
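
As a rough illustration of the subrange approach, here is a minimal C# sketch that turns per-character bold/italic answers into simple markup. SimpleMarkup and ToMarkup are hypothetical names, not part of the Publisher object model, and the `<b>`/`<i>` tags are just one possible output format. The formatting lookups are passed in as delegates so the logic compiles without the interop assemblies; the wiring to Publisher shown in the comment is an untested assumption, including the guess that Characters() takes a 1-based start position.

```csharp
using System;
using System.Text;

public static class SimpleMarkup
{
    // Wraps bold runs in <b>..</b> and italic runs in <i>..</i>.
    // isBold / isItalic answer: "is the character at this 0-based index bold / italic?"
    //
    // Hypothetical wiring to Publisher (assumption, untested):
    //   TextRange range = shp.TextFrame.TextRange;
    //   string markup = SimpleMarkup.ToMarkup(range.Text,
    //       i => range.Characters(i + 1, 1).Font.Bold == MsoTriState.msoTrue,
    //       i => range.Characters(i + 1, 1).Font.Italic == MsoTriState.msoTrue);
    public static string ToMarkup(string text, Func<int, bool> isBold, Func<int, bool> isItalic)
    {
        StringBuilder sb = new StringBuilder();
        bool bold = false, italic = false;

        for (int i = 0; i < text.Length; i++)
        {
            bool b = isBold(i), it = isItalic(i);

            // <i> is always nested inside <b>, so close it first whenever
            // italic ends or the enclosing bold state is about to change.
            if (italic && (!it || b != bold)) { sb.Append("</i>"); italic = false; }
            if (bold && !b) { sb.Append("</b>"); bold = false; }
            if (!bold && b) { sb.Append("<b>"); bold = true; }
            if (!italic && it) { sb.Append("<i>"); italic = true; }

            sb.Append(text[i]);
        }

        // Close whatever is still open, innermost tag first.
        if (italic) sb.Append("</i>");
        if (bold) sb.Append("</b>");
        return sb.ToString();
    }
}
```

Asking for the Font of every single character means a lot of COM calls, so on a large document it may be worth checking a whole paragraph or word first and only dropping down to characters when the result comes back mixed.
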
 
