Text Parsing

T

Tyro

Agreed. I've been massaging computer data for over 40 years. In this case,
it's best to go back to square 1.
 
R

Ron Rosenfeld

It seems like this ought to be easy, but I'm stumped.

I received several thousand address records in the following format--all in
one column of an Excel worksheet:

Smith, John & Mary 123 Main St Chicago, IL 60601
Williams, Bill & Cindy 45 54th Drive Madison, WI 51375

I need to parse these into:

Last Name
First Name(s)
Street Address
City
State
Zip

I can get the last name using =LEFT(B2, FIND(",",B2)-1) and the Zip using
=RIGHT(B2,5) but I can't come up with a way to get the other fields. A quick
scan of the records shows all the street addresses seem to start with a
number, and there are no numbers in any of the first names, so it seems like
if I can find the first number reading from the left, I could grab the first
names, but neither FIND nor SEARCH seem to work. I'm using Excel 2007.

Could anyone point me to a possible approach. Seems as if this is something
others must have solved many times before.

Many thanks,
Tom K

If your layout is always in the same format, and in particular if there are
only single first names, then this can be done using Regular Expressions.

The main problem is to differentiate the first name(s) from the beginning of
the address. The rule that I used in the example below is this pattern:

<comma><space>
<word>
then, optionally:
<space>
ampersand (&)
<space>
<word>

But this will fail if you have a "double" first name. For example:

Smith, Mary Jane 47 Edgecomb rd. ...

If that is the case, we would need to come up with a different rule. But try
this, anyway, and see if it works.

If your addresses always start with a number, we could incorporate that as a
rule and get around the issue of double first names.

Also, you need to list your cities by name. In the Pattern I devised, I listed
Chicago and Madison. You can add more possibilities by extending the
pipe-delimited list in Pattern.

First, enter the UDF into your workbook:

<alt-F11> opens the VB Editor
Ensure your project is highlighted in the Project Explorer window, then
Insert/Module and paste the code below into the window that opens.

The following formulas should work:

Last Name: =remid(A1,"^[^,]*")
First Name(s): =remid(A1,Pattern,1)
Street Address: =remid(A1,Pattern,3)
City: =remid(A1,Pattern,4)
State: =remid(A1,Pattern,5)
Zip: =remid(A1,Pattern,6)


Pattern:
,\s+(\w+(\s+&\s+\w+)?)\s+(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)


=============================================
Option Explicit
Function reMid(str As String, spattern As String, Optional Index) As String
Dim re As Object
Dim mc As Object

Set re = CreateObject("vbscript.regexp")
re.Global = True
re.ignorecase = True
re.Pattern = spattern

If re.test(str) = True Then
Set mc = re.Execute(str)
If IsMissing(Index) Then
reMid = mc(0)
Else
reMid = mc(0).submatches(Index - 1)
End If
End If
End Function
=============================================

This should handle any entries that are in the pattern you presented. Examples
that don't work would be useful in trying to refine this routine.
--ron

Martin just pointed out to me that EVERY street address begins with a number.

That being the case, the following is more efficient for picking out the first
name(s), and will handle the issue of two first names: e.g. Mary Jane & Justin

Last Name: =remid(A1,"^[^,]*")
First Name(s): =remid(A1,Pattern,1)
Street Address: =remid(A1,Pattern,2)
City: =remid(A1,Pattern,3)
State: =remid(A1,Pattern,4)
Zip: =remid(A1,Pattern,5)

and

Pattern:

,\s+(\D+)\s+(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)

We still need to generate a pipe-delimited list of cities. How to do that
depends on the length of the list.
--ron

Just for consistency, we could use a single Pattern for all including the Last
Name.

Pattern:

(^[^,]+),\s+(\D+)\s+(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)

Then use the formula:

Last Name: =remid($A1,Pattern,COLUMNS($A:A))

Copy/drag to the right, and the last argument will adjust to the proper Index
for each section of the string.
--ron
 
T

Tom K

Guys:

Thank you all for the many excellent suggestions. I'm grateful that so many
people would try to help out. What a great community!

After carefully scanning thorugh the list, it looks like there are two
instances where the street address does not start with a number (darn) --
those address that start PO Box XXX, or RR# XXX. Most of the first names are
of the format "Spouse 1 & Spouse 2" or just "Spouse 1".

Let me work throught the suggestions from Ron, Martin, and Earl and see if I
have any success. Again many thanks.

Tom K

P.S. What does "OP" mean?
 
R

Ron Rosenfeld

Guys:

Thank you all for the many excellent suggestions. I'm grateful that so many
people would try to help out. What a great community!

After carefully scanning thorugh the list, it looks like there are two
instances where the street address does not start with a number (darn) --
those address that start PO Box XXX, or RR# XXX. Most of the first names are
of the format "Spouse 1 & Spouse 2" or just "Spouse 1".

Let me work throught the suggestions from Ron, Martin, and Earl and see if I
have any success. Again many thanks.

Tom K

If the issue is that some of the streets start with RR or PO, then, using my
last example of formulas, try this for Pattern:

(^[^,]+),\s+(\D+?)\s+(?=PO|RR|\d)(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)

Where you see the pipe-delimited list of "street starters" (e.g. PO|RR|\d) you
can expand that if needed (e.g. PO|P.O.|RR|\d)

Note that in that list, the "\d" stands for a digit, so don't leave that out.

You'll still need the pipe-delimited list of city names, but once you give some
more information, we can probably construct that easily.

The formulas in my last post:

Last Name: =remid($A1,Pattern,COLUMNS($A:A))

Copy/drag to the right, and the last argument will adjust to the proper Index
for each section of the string.

And the UDF itself is unchanged.
--ron
 
J

justinthered

Guys:

Thank you all for the many excellent suggestions. I'm grateful that so many
people would try to help out. What a great community!

After carefully scanning thorugh the list, it looks like there are two
instances where the street address does not start with a number (darn) --
those address that start PO Box XXX, or RR# XXX. Most of the first names are
of the format "Spouse 1 & Spouse 2" or just "Spouse 1".

Let me work throught the suggestions from Ron, Martin, and Earl and see if I
have any success. Again many thanks.

Tom K

P.S. What does "OP" mean?

did you try, like I am doing right now on some text... copy your
column to a worksheet, export that worksheet to a CSV file, then on a
new worksheet, import from that CSV and you can then use the wizard to
set comma as delimeter and make it so you get each item into new
column...

then you can work the information from that stand point.

If this is to be a a one time deal.

just in colorado
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top