Suggestions for extracting masked selections and generating XML

  • Thread starter AutomationHelpinSF

AutomationHelpinSF

I have several hundred e-mail messages a day coming in, each in one of
roughly 17-20 different Word "format" files. My goal is to extract the same
data from each of these files so I can put them into a database for further
business processes.

For example, from each file I want to extract a Name, a numerical Score, and
a character Code. In each of the format files, the data is in a different
place in the document. Some have them in a sentence structure in the first
paragraph, some have a list format, etc. As it stands now, a human has to
read each one and extract the data.

What I would like is a way for a user to take a sample file that's in the
first format, make a selection and somehow "tag" it as if to say "this area
I've selected should be the Score". Do that for all the data items in the
file. Then, have a tool / macros which can then process any file in that
format, extract those selected areas, and generate some other file with
name/value pairs (XML, CSV, etc.)

I could instead write code to extract the data; however, if we get a
new file format, that means writing new code. We'd like to delegate that to a
data entry person who can create a "mask" for the new format (they change
frequently as we're processing data from many vendors and there are no
standards in this field).

I am not a hardcore Outlook / Word / Office automation person, so I'm
looking for suggestions on an approach, or on tools we could purchase that
would accomplish this.

If people have suggestions about more appropriate places to ask, I'd
appreciate that as well.

I appreciate the help, Steve.
 

macropod

Hi,

Word is not designed to work that way.

Your best approach would be to persuade the people sending the documents to you to use formfields to capture the data in each
document. Then it would be quite simple to extract the data, without even needing to know much about the layout.

Alternatively, you could take the documents to a document scanning company with the software to create masks to recognise the data
fields (and, if there's a considerable degree of consistency in the various layouts, the document formats that go with each mask).
This is much more expensive than the first approach, but it can be quite fast.
 

AutomationHelpinSF

Thanks for responding macropod.

While I would love to put some standard forms in place for the data, that's
just not an option. Data is coming from many companies, in many forms, and
we're not the only ones receiving it. Yes, it's ripe for standardization,
but we're just not in the political and economic position to drive it and make
it happen.

So, I'm left with working with the data as it is.

You mention document scanning companies which might have software to
accomplish what I'm looking for. If so, do you have any idea how they would
achieve that for a Word file?

Any other suggestions?

I realize that Word doesn't really work that way; however, I'm thinking that
if this were Excel I could ask the user to define several named ranges for the
various data fields and then I could use those in macros across all the Word
format files.

I was hoping there's something similar in Word which I wasn't aware of.

Thanks, Steve.
 

macropod

Hi,

The document scanning companies would most likely work from hard copy or PDFs - you'd need to discuss the options with the document
scanning company. Converting Word files to PDF, even in bulk, is fairly straightforward.

If there is always a particular text string, formfield or table associated with the data you're after from each company, you could
use a macro to find those and extract the associated data. This would be easiest if there is a ready way to identify the company
that generates a particular document (e.g. header text or one of the document's inbuilt properties).
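
As a rough illustration of that idea (not a Word macro), here is a minimal Python sketch. It assumes the document text has already been pulled out of Word as plain text, that each format can be recognised by a marker string such as header text, and that each data item follows a fixed label or phrase. The vendor names, marker strings, patterns and the MASKS table are all hypothetical; in practice the table could live in a file that a data entry person maintains, so a new format means a new entry rather than new code.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical "masks": one entry per vendor format. Each mask has a marker
# string that identifies the format and one regular expression per data item.
# The first capture group of each pattern is the value to extract.
MASKS = {
    "acme": {
        "marker": "ACME Testing Services",
        "fields": {
            "Name": r"Candidate:\s*(.+)",
            "Score": r"Score:\s*(\d+(?:\.\d+)?)",
            "Code": r"Code:\s*([A-Z])",
        },
    },
    "globex": {
        "marker": "Globex Report",
        "fields": {
            "Name": r"^(.+?) received a score of",
            "Score": r"received a score of (\d+(?:\.\d+)?)",
            "Code": r"\(code ([A-Z])\)",
        },
    },
}


def identify_format(text):
    """Return the id of the first mask whose marker appears in the text."""
    for format_id, mask in MASKS.items():
        if mask["marker"] in text:
            return format_id
    return None


def extract(text):
    """Apply the matching mask and return a dict of name/value pairs."""
    format_id = identify_format(text)
    if format_id is None:
        raise ValueError("no mask matches this document")
    values = {}
    for field, pattern in MASKS[format_id]["fields"].items():
        match = re.search(pattern, text, re.MULTILINE)
        # A field that cannot be found is recorded as None for manual review.
        values[field] = match.group(1).strip() if match else None
    return values


def to_xml(values):
    """Serialise the pairs as one XML record (field names must be valid tags)."""
    record = ET.Element("record")
    for field, value in values.items():
        ET.SubElement(record, field).text = value
    return ET.tostring(record, encoding="unicode")
```

For example, calling extract() on the text "Globex Report\nJane Doe received a score of 87.5 (code B).\n" and passing the result to to_xml() would give one record element with Name, Score and Code children. Documents that match no mask raise an error, so they can be routed to a person instead of being silently skipped.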

--
Cheers
macropod
[Microsoft MVP - Word]


 
