Parsing MIME mail headers

Mike

I have thousands of .EMLX files from a client. I need to parse the To,
From, Subject, Date, and CC fields from the headers. I've got my code
running, and it works. I just can't help but think it isn't as efficient
as it could be.

I'm reading the files, which are plain text, and line-by-line looking for
the field names. The To and CC fields are particularly tricky because they
are multi-line (or can be). When I hit one of these fields, I start a sub
loop that captures the lines until it finds another field, which seems to
work okay.

I wrote versions of this code, which are practically identical, in VBA and
VBS to store this data in an Access DB. Does anyone know a dll, COM object,
or .Net object that can do this more efficiently? I have no problem
re-writing my code if necessary. I'm going to be getting a hard drive full
of these soon, so I'd like to speed up my code as much as I can.

I also need to extract attachments. I have my Base64 decoder working using
MSXML2, but I'm having trouble identifying the beginning and end of the
attachment.
 
Mike

This is not XML, perhaps they're not MIME, but they are standard internet
mail files. Here are some of the headers.

Received: <Removed for privacy> 12 Feb 1998 13:05:31 -0000
Received: from unknown <Removed for privacy>
<Removed for privacy>
by <Removed for privacy> with SMTP
for <Removed for privacy> 12 Feb 1998 13:05:31 -0000
Received: <Removed for privacy>; 12 Feb 1998 13:05:31 -0000
Received: from <Removed for privacy>
(envelope-sender <Removed for privacy>)
by <Removed for privacy> with SMTP
for <Removed for privacy>; 12 Feb 1998 13:05:30 -0000
Received: from unknown [<Removed for privacy>] (<Removed for privacy>)
by <Removed for privacy> (mxl_mta-5.4.0-1)
with ESMTP id <Removed for privacy>(envelope-from <Removed for privacy>);
Tue, 12 Feb 1998 06:05:30 -0700 (MST)
Received: from unknown [<Removed for privacy>] (EHLO<Removed for privacy>)
by <Removed for privacy>(mxl_mta-5.4.0-1) over TLS secured channel
with ESMTP id <Removed for privacy>(envelope-from <Removed for privacy>);
Tue, 12 Feb 1998 06:05:28 -0700 (MST)
Received: from <Removed for privacy> ([<Removed for privacy>]) by <Removed
for privacy>
([<Removed for privacy>]) with mapi; Tue, 12 Feb 1998 08:00:12 -0500
From: <Removed for privacy>
To: <Removed for privacy>
CC: <Removed for privacy>, <Removed for privacy>,
<Removed for privacy>, <Removed for privacy>, <Removed for privacy>,
<Removed for privacy>
Date: Tue, 12 Feb 1998 08:00:11 -0500
Subject: Re:something, something
Thread-Topic: <Removed for privacy>
Thread-Index: <Removed for privacy>
Message-ID: <Removed for privacy>
Accept-Language: en-US
Content-Language: en-US
 
James Whitlow

Mike, you might want to consider using regular expressions. They tend to
work quite well for what you are wanting to do. See below for a small
example.

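Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oRegEx = CreateObject("VBScript.RegExp")

' Let ^ match at the start of every line, not only the start of the text
oRegEx.Multiline = True

' Read the whole message into one string (1 = ForReading)
sEmail = oFSO.OpenTextFile("Email.txt", 1).ReadAll

' Capture the text after "To:", stopping before the next line that
' contains a colon (usually the next header field)
oRegEx.Pattern = "^To:([\x00-\xff]*?[\n\r\f]*?)[\n\r\f]*?.*?:"
sTo = oRegEx.Execute(sEmail)(0).Submatches(0)

' Same for "CC:"; indexing (0) raises an error if there is no match
oRegEx.Pattern = "^CC:([\x00-\xff]*?[\n\r\f]*?)[\n\r\f]*?.*?:"
sCC = oRegEx.Execute(sEmail)(0).Submatches(0)

MsgBox "To: " & sTo & vbCr & "CC: " & sCC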
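
For comparison, here is a minimal sketch of the same extraction in Python
rather than VBScript, in case a rewrite is an option. It leans on the
standard library's email parser, which unfolds multi-line To and CC fields
and copes with fields that are absent. The sample message and the helper
name extract_fields are made up for illustration. It also assumes the input
starts directly with the headers; if the .EMLX files carry any extra
leading line, that would have to be stripped first.

```python
from email import policy
from email.parser import HeaderParser

# Hypothetical sample message; the folded CC line starts with a space.
SAMPLE = (
    "From: alice@example.com\r\n"
    "To: bob@example.com\r\n"
    "CC: carol@example.com,\r\n"
    " dave@example.com\r\n"
    "Date: Tue, 4 Mar 2008 17:25:43 -0600\r\n"
    "Subject: Re: something\r\n"
    "\r\n"
    "Body text\r\n"
)

WANTED = ("From", "To", "CC", "Date", "Subject")


def extract_fields(raw_text):
    """Return the wanted header fields as a dict; absent fields map to None."""
    # HeaderParser stops at the blank line, so the body is never scanned.
    msg = HeaderParser(policy=policy.default).parsestr(raw_text)
    result = {}
    for name in WANTED:
        value = msg[name]  # header lookup is case-insensitive
        result[name] = str(value) if value is not None else None
    return result


fields = extract_fields(SAMPLE)
for name in WANTED:
    print(name, "=", fields[name])
```

Because a missing field simply comes back as None, no separate Test step
is needed before reading it.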
 
Stefan Hoffmann

hi Mike,

Mike said:
This is not XML, perhaps they're not MIME, but they are standard internet
mail files. Here are some of the headers.

Ah, okay, then my information was wrong. I thought the Apple EMLX files
were XML files...


Kind regards
--> stefan <--
 
david

Looking in the wrong places - asking how not to use VBA or VBS
in VBA and VBS groups :~)

Actually, file access is about the same speed in any environment,
so it's not going to make any difference how you code it. Avoid
string concatenation, because the fully managed string class in VBA
and VBS does concatenation a lot slower than a C string or a TP
string. Also, VBS is unable to do string folding or optimise out
constant values. But that's unlikely to make any difference in a
file-to-file filter application.

Having said that, on my PC, VBA is faster than .NET, but I'm sure
that's all just overhead: .Net is probably faster for some complex thing
on some better computer.

For the Access part of the loop, use bound variables rather than
field collection members. Post your code for suggestions.

(david)
 
mayayana

Much of email format is set out with blank lines
and "boundary markers". There's usually a blank
line in between parts of the message. (When you look
at the raw code, the blank lines all serve a purpose.
They're not just for readability.) A boundary marker
can be any string, with certain limitations, but most
email programs go overboard and create them from
something like a GUID + computer name, so they're
very recognizable in the email body.
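
To make the blank-line and boundary layout concrete, here is a rough
sketch in Python (not VBScript) that builds a tiny two-part message and
lets the standard library find where the attachment begins and ends. The
boundary string BOUNDARY42, the file name note.bin, and the helper name
list_attachments are invented for the example; real messages will have
more parts and longer boundaries.

```python
import base64
from email import message_from_string, policy

ENCODED = base64.b64encode(b"hello attachment").decode("ascii")

# Hypothetical message: headers, blank line, then parts split by the boundary.
SAMPLE = "\r\n".join([
    "From: alice@example.com",
    "To: bob@example.com",
    "Subject: report",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="BOUNDARY42"',
    "",                       # blank line ends the main headers
    "--BOUNDARY42",           # "--" + boundary starts a part
    "Content-Type: text/plain",
    "",                       # blank line ends this part's headers
    "See attached.",
    "--BOUNDARY42",
    "Content-Type: application/octet-stream",
    "Content-Transfer-Encoding: base64",
    'Content-Disposition: attachment; filename="note.bin"',
    "",
    ENCODED,
    "--BOUNDARY42--",         # trailing "--" marks the end of the last part
    "",
])


def list_attachments(raw_text):
    """Return (filename, decoded bytes) for each attachment part."""
    msg = message_from_string(raw_text, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        # get_content() undoes the base64 transfer encoding for us.
        found.append((part.get_filename(), part.get_content()))
    return found


attachments = list_attachments(SAMPLE)
print(attachments)
```

The same structure is what a hand-written parser would have to follow:
read the boundary from the Content-Type header, then treat everything
between one boundary line and the next as a part with its own headers.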

The details of MIME format are available but they
exist in excessively official, absurdly abstruse, nearly
unreadable, technical documents. If you want to check
that out search for:

RFC 2045, RFC 2046, RFC 822

It's hard to find more readable documentation because
few people deal with MIME format directly. Usually when
programmers want to send email they're using a component
or automating an email program that does the formatting
internally.

This might be somewhat helpful:

www.jsware.net/jsware/vbcode.php5#mail

It's VB code for sending email with no dependencies.
I know that's not what you need, but since the code
has to do the whole job of composing the actual email
in this case, I needed to figure out the format of email
messages. After having done so, I then included an
explanatory file named MIME format.txt in the download.
That file outlines the basic MIME structure.

After days poring over those abominable RFC files I
figured that I should try to do what I could to save others
from the same horrible fate in the future. :)
 
krazymike

Thanks for the code!! The To: field extracts perfectly! My only issue
is on this line:

sCC = oRegEx.Execute(sEmail)(0).Submatches(0)

I changed it to:

If oRegEx.Test(sEmail) = True Then sCC = oRegEx.Execute(sEmail)(0).Submatches(0)

The CC: field is not a required field. Actually, an email needs only
one of the To:, CC:, or BCC: fields to be compliant.
 
krazymike

Ok, I'm REALLY new to RegEx. I have this Field in an email:
Date: Tue, 4 Mar 2008 17:25:43 -0600

I have this code:
oRegEx.Pattern = "^Date:(*\s*(Sun|Mon|Tue|Wed|Thu|Fri|Sat),\s*)?(0?[1-9]|[1-2][0-9]|3[01])\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(19[0-9]{2}|[2-9][0-9]{3}|[0-9]{2})\s+(2[0-3]|[0-1][0-9]):([0-5][0-9])(?::(60|[0-5][0-9]))?\s+([-\+][0-9]{2}[0-5][0-9]|(?:UT|GMT|(?:E|C|M|P)(?:ST|DT)|[A-IK-Z]))(\s*\((\\\(|\\\)|(?<=[^\\])\((?<C>)|(?<=[^\\])\)(?<-C>)|[^\(\)]*)*(?(C)(?!))\))*\s*$"
If oRegEx.Test(sEmail) = True Then sDT = oRegEx.Execute(sEmail)(0).Submatches(0)

I used the tester on http://regexlib.com/RETester.aspx?regexp_id=969
to test this pattern and it works. My code fails on the If test
portion.
"Application-defined or Object-defined error"

Where did I go wrong?
 
Paul Randall

Did you do the test with the dot net engine or VBScript engine?

Regular Expression Workbench can give a somewhat English interpretation of
how the dot net engine would see a regular expression. Here is what it says
about yours:

^ (anchor to start of string)Date:
Capture
* (zero or more times)
Any whitespace character
* (zero or more times)
Capture
Sun
or
Mon
or
Tue
or
Wed
or
Thu
or
Fri
or
Sat
End Capture
,
Any whitespace character
* (zero or more times)
End Capture
? (zero or one time)
Capture
0
? (zero or one time)
Any character in "1-9"
or
Any character in "1-2"
Any character in "0-9"
or
3
Any character in "01"
End Capture
Any whitespace character
+ (one or more times)
Capture
Jan
or
Feb
or
Mar
or
Apr
or
May
or
Jun
or
Jul
or
Aug
or
Sep
or
Oct
or
Nov
or
Dec
End Capture
Any whitespace character
+ (one or more times)
Capture
19
Any character in "0-9"
Exactly 2 times
or
Any character in "2-9"
Any character in "0-9"
Exactly 3 times
or
Any character in "0-9"
Exactly 2 times
End Capture
Any whitespace character
+ (one or more times)
Capture
2
Any character in "0-3"
or
Any character in "0-1"
Any character in "0-9"
End Capture
:
Capture
Any character in "0-5"
Any character in "0-9"
End Capture
Non-capturing Group
:
Capture
60
or
Any character in "0-5"
Any character in "0-9"
End Capture
End Capture
? (zero or one time)
Any whitespace character
+ (one or more times)
Capture
Any character in "-\+"
Any character in "0-9"
Exactly 2 times
Any character in "0-5"
Any character in "0-9"
or
Non-capturing Group
UT
or
GMT
or
Non-capturing Group
E
or
C
or
M
or
P
End Capture
Non-capturing Group
ST
or
DT
End Capture
or
Any character in "A-IK-Z"
End Capture
End Capture
Capture
Any whitespace character
* (zero or more times)
(
Capture
\(
or
\)
or
zero-width positive lookbehind
Any character not in "\\"
End Capture
(
Capture to <C>
End Capture
or
zero-width positive lookbehind
Any character not in "\\"
End Capture
)
Capture
? (zero or one time)
<-C>
End Capture
or
Any character not in "\(\)"
* (zero or more times)
End Capture
* (zero or more times)
Conditional Subexpression
if: C
match: zero-width negative lookahead
End Capture
End Capture
)

I'm no regex expert, but perhaps someone else can identify and comment on
anything that VBScript's regular expression engine can't handle.

-Paul Randall
 
Dr J R Stockton

In microsoft.public.scripting.vbscript message <5154613a-f4b0-4b9c-a94a-
(e-mail address removed)>, Wed, 17 Sep 2008 09:03:56,
krazymike said:
If oRegEx.Test(sEmail) = True Then sDT = oRegEx.Execute(sEmail)(0).Submatches(0)

You should not need "= True".

krazymike said:
Where did I go wrong?

IMHO, it is unwise to check numerical or alphabetic fields in detail
within such a RegExp - it makes the RegExp hard to read.

There's no need to test every line against that; it is simpler to test
the first word of each header and then to apply more specific tests to
its payload.

That date format matches something like
(\w\w\w), (\d\d?) (\w\w\w) (\d{4}) ([0-9:]*) ([+-]\d{4})
pause to test in JavaScript - yes, just that,
and one can see the individual fields found. Obviously the RegExp can
be tested by adding its "words" one at a time, so that if an error is
made it is easy to find.

Such string fields can be tested either with the obvious RegExp or by
finding their position in a string such as "Mon Tue Wed ..."; numeric
fields can be tested by conversion to numeric types. Dates can be
tested efficiently as in my vb-dates.htm ff., including rejection of
such as Feb 31.
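
As a rough illustration of that two-step idea, here is a sketch in Python
(not VBS): the loose pattern above only splits the header into fields,
and each field is then checked by converting it to a numeric or date
type. The helper name parse_date_header and the MONTHS list are
hypothetical, and the sketch assumes a numeric zone such as -0600 rather
than a named zone.

```python
import re
from datetime import datetime, timedelta, timezone

# Loose pattern: it only splits the fields; it does not validate them.
DATE_RE = re.compile(
    r"(\w\w\w), (\d\d?) (\w\w\w) (\d{4}) ([0-9:]*) ([+-]\d{4})"
)
MONTHS = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()


def parse_date_header(value):
    """Return an aware datetime, or None if the value is not a valid date."""
    m = DATE_RE.fullmatch(value.strip())
    if m is None:
        return None
    _day_name, day, month, year, clock, zone = m.groups()
    if month not in MONTHS:
        return None
    try:
        parts = [int(p) for p in clock.split(":")]
        if len(parts) == 2:          # seconds are optional
            parts.append(0)
        hh, mm, ss = parts
        offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[3:5]))
        if zone[0] == "-":
            offset = -offset
        # datetime() itself rejects impossible dates such as Feb 31.
        return datetime(int(year), MONTHS.index(month) + 1, int(day),
                        hh, mm, ss, tzinfo=timezone(offset))
    except ValueError:
        return None


parsed = parse_date_header("Tue, 4 Mar 2008 17:25:43 -0600")
print(parsed)
print(parse_date_header("Sun, 31 Feb 2008 10:00:00 +0000"))
```

Python's standard library also has email.utils.parsedate_to_datetime for
this job; the hand-rolled version is shown only to mirror the approach
described above.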

I have a JavaScript RegExp tester in my js-valid.htm - it could be
re-implemented in VBS.
 
Stefan Kanthak

Stefan Hoffmann said:
Ah, okay, then my information was wrong. I thought the Apple EMLX files
were XML files...

.EML files can be loaded and parsed with CDO (and there is ABSOLUTELY
no need to reinvent the wheel):

With WScript.CreateObject("CDO.Message")
    With .GetStream
        .LoadFromFile "@:\MESSAGE.EML"
        .Flush
    End With

    WScript.Echo .From & .To & .ReplyTo & .Subject & .GetText
End With

Stefan
 
