Table Extraction from Word Documents

apondu

Hi,

I am working on extracting table content from Word documents. It
works fine when the table is simple and not nested.

But when the Word document contains a table that has merged cells, or
a table that is nested, it raises an exception.

I need some help extracting tables from a Word document, where a
table can be simple, nested, or contain merged cells.

I have pasted the code below; the argument "path" specifies the
Word document path.



public string wordTable(string path)
{
    object missing = System.Reflection.Missing.Value;
    bool errorFlag = false;
    object fileName = path;
    object readOnly = true;
    object isFalse = false;
    object isRepair = false;
    object isPassword = "Gova";
    object saveChanges = false;

    Word.ApplicationClass wordApp = null;
    Word.Document aDoc = null;
    Word.Table tableObj;
    string tableContent = "";
    int tableCount = 0;

    try
    {
        wordApp = new Word.ApplicationClass();
        wordApp.Visible = false;

        aDoc = wordApp.Documents.Open(ref fileName, ref missing, ref readOnly,
            ref missing, ref isPassword, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref isFalse, ref isRepair,
            ref missing, ref missing, ref missing);

        // Word.Tables table = aDoc.Application.ActiveDocument.Tables;

        aDoc.ActiveWindow.Selection.WholeStory();
        aDoc.ActiveWindow.Selection.Copy();

        tableCount = aDoc.Tables.Count;
        aDoc.Activate();

        // Word collections are 1-based.
        for (int i = 1; i <= tableCount; i++)
        {
            tableObj = aDoc.Tables[i];
            tableContent += "\r\n\r\n" + " Table " + i + "\r\n\r\n";

            // Enumerating Rows is where the exception is raised
            // when the table has merged cells.
            foreach (Word.Row row in tableObj.Rows)
            {
                foreach (Word.Cell col in row.Cells)
                {
                    // Cell text ends with an end-of-cell marker ("\r\a"); strip it.
                    tableContent += col.Range.Text.TrimEnd('\r', '\a') + "\t";
                }
                tableContent += "\r\n";
            }
        }
        richTextBox1.Text = tableContent;
    }
    catch (Exception error)
    {
        MessageBox.Show(error.Message);
        errorFlag = true;
    }
    finally
    {
        // Close Word even after an error, so no hidden instance is left running.
        if (aDoc != null)
        {
            aDoc.Close(ref saveChanges, ref missing, ref missing);
        }
        if (wordApp != null)
        {
            wordApp.Quit(ref missing, ref missing, ref missing);
        }
    }

    return errorFlag ? "" : tableContent;
}


Can someone let me know if I am making a mistake, and what
correction I need to make?

Thanks for the help.

Regards,
Govardhan
 
Cindy M.

Hi Apondu,
There is really no good solution for this situation. Your best bet is
one of the following:

1. For Word 2003, extract the information from the file's XML

2. For all versions, back through 2000: save the document as HTML (a web
page) then parse the HTML

In both cases you get information about colspan and rowspan so that you
can accurately reconstruct how the table is put together.
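
As a rough illustration of option 1, here is a minimal sketch of how the merge information could be expanded back into a rectangular grid. It assumes the document has been saved in the Word 2003 XML format, and that horizontal merges appear as w:gridSpan and vertical merges as w:vmerge inside each cell's w:tcPr; please check those element names against a file saved from your own copy of Word. The class and method names (WordXmlTables, ReadTable, ReadAllTables) are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class WordXmlTables
{
    // Namespace of the Word 2003 XML format (assumed; verify against a saved file).
    static readonly XNamespace W =
        "http://schemas.microsoft.com/office/word/2003/wordml";

    // Expands one w:tbl element into rows of cell text, one entry per grid column.
    public static List<List<string>> ReadTable(XElement tbl)
    {
        var rows = new List<List<string>>();
        foreach (XElement tr in tbl.Elements(W + "tr"))
        {
            var row = new List<string>();
            foreach (XElement tc in tr.Elements(W + "tc"))
            {
                XElement pr = tc.Element(W + "tcPr");
                XElement span = pr?.Element(W + "gridSpan");
                XElement vmerge = pr?.Element(W + "vmerge");

                // Text of the cell's own paragraphs; nested tables are not included.
                string text = string.Concat(
                    tc.Elements(W + "p").Descendants(W + "t").Select(t => t.Value));

                // A w:vmerge without w:val="restart" continues the cell above,
                // so copy the text from the same column of the previous row.
                bool continues = vmerge != null &&
                    (string)vmerge.Attribute(W + "val") != "restart";
                if (continues && rows.Count > 0 &&
                    row.Count < rows[rows.Count - 1].Count)
                {
                    text = rows[rows.Count - 1][row.Count];
                }

                // Repeat the text across every column the cell spans.
                int width = span == null ? 1 : (int)span.Attribute(W + "val");
                for (int i = 0; i < width; i++)
                {
                    row.Add(text);
                }
            }
            rows.Add(row);
        }
        return rows;
    }

    // Reads every table in the file; nested tables come back as separate entries.
    public static List<List<List<string>>> ReadAllTables(string xmlPath)
    {
        return XDocument.Load(xmlPath)
            .Descendants(W + "tbl")
            .Select(t => ReadTable(t))
            .ToList();
    }
}
```

Repeating the text of a merged cell in every position it covers is only one possible choice; you could just as well leave the extra positions empty. The same idea should carry over to the HTML route, reading colspan and rowspan instead.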

Cindy Meister
INTER-Solutions, Switzerland
http://homepage.swissonline.ch/cindymeister (last update Jun 17 2005)
http://www.word.mvps.org

This reply is posted in the Newsgroup; please post any follow-up question
or reply in the newsgroup and not by e-mail :)
 
apondu

Hi,

Thanks for the reply, Cindy. I had thought of converting the Word
document to an HTML file and extracting the table from that, but I
thought there might be an option in the Word object model that would
reduce my effort.

Thanks, Cindy

Regards,
Govardhan
 
