Text Parsing

T

Tom K

It seems like this ought to be easy, but I'm stumped.

I received several thousand address records in the following format--all in
one column of an Excel worksheet:

Smith, John & Mary 123 Main St Chicago, IL 60601
Williams, Bill & Cindy 45 54th Drive Madison, WI 51375

I need to parse these into:

Last Name
First Name(s)
Street Address
City
State
Zip

I can get the last name using =LEFT(B2, FIND(",",B2)-1) and the Zip using
=RIGHT(B2,5) but I can't come up with a way to get the other fields. A quick
scan of the records shows all the street addresses seem to start with a
number, and there are no numbers in any of the first names, so it seems like
if I can find the first number reading from the left, I could grab the first
names, but neither FIND nor SEARCH seem to work. I'm using Excel 2007.

Could anyone point me to a possible approach. Seems as if this is something
others must have solved many times before.

Many thanks,
Tom K
 
L

LarryKO

I'm sure there is an easier way, but there is a mid() function that
will allow you to selectively pull from the middle of the field. If
the ampersands are consistent I'm sure this will work to get
everything out. Good luck!
 
T

Tyro

You have a real problem. You can easily get the last name and the zip and
the state, but the rest is problematic.. Are the cities Chicago or St
Chicago (as in St Louis) Madison or Drive Madison? What if there is no Mary
or Cindy? No matter how you do it, you're going to have a lot of manual
work. Even if you split the data into columns (space delimiter) the result
will be inconsistant. Can you get the original file, before it was put into
Excel?
 
T

Tom K

The cities (in this example) are Chicago and Madison. And there are cases
where there is no spouse--so no ampersand.

I believe the list came from QuickBooks. I'm not familiar with QuickBooks,
but maybe it has a CSV or fixed field format. Otherwise, I think I am stuck,
unless someone else on these boards is creative.

Thanks for your thoughts.
 
M

MartinW

Hi Tom,

This may help you get some of the way.
Do Data>Text to Columns using comma as the delimeter.

Then on your second column do an Edit>Replace for each digit.
Find what:- space1
Replace with:- ~1
Then repeat with
Find what:- space2
Replace with:- ~2
etc
............
............
Down to
Find what:- space9
Replace with:- ~9

You can then do a further Text to columns on the ~
or use it as a Find character.

Still a bit of manual work but hopefully that will
help you get part way.

HTH
Martin
 
E

Earl Kiosterud

Tom,

This is a tough one. Your data is quite ambiguous as to names, addresses, city, etc. As
Tyro said, you'll probably have some manual work. One workable, if manual, way is to put
the stuff in a Word table. Set up columns as you need (city, state, etc). Then you can
quickly select parts and drag them to the column where they belong. Manual, but fast.
Several thousand records??? Ick. Hire someone. If you use one any of the parsing
techniques to do the parsing automatically, you can still use Word for cleanup, as required.
You can easily put it in Word initially with copy/paste, or by opening the Excel file
directly in Word. DON'T SAVE IT IN WORD, or it'll clobber your original Excel file.

Rob Bovey had a pretty sophisticated Excel parser for this kind of stuff. It was many years
ago. See if it's available at his web site, www.appspro.com.
--
Regards from Virginia Beach,

Earl Kiosterud
www.smokeylake.com

Note: Top-posting has been the norm here.
Some folks prefer bottom-posting.
But if you bottom-post to a reply that's
already top-posted, the thread gets messy.
When in Rome...
 
T

Tyro

You see the problem with breaking down the cell?

Lets suppose the entry is Smith, John & Mary 123 Main St. Louis, MO 60601.
(St. Louis, Missouri)
How is that different from Smith, John & Mary 123 Main St Chicago, IL 60601
(St. Chicago, Illinois)
They both live at 123 Main Street

Do you see my point?
 
M

MartinW

I just realised that you may end up with two tilde's in some circumstances
which precludes using Text to Columns after the Find and Replace,
but doesn't stop you using FIND as it will work on the first tilde
only, if you use it like this.

This will return everything to the left of the tilde
=LEFT(B1,FIND("~",B1)-1)

Either of these will return everything to the right of the tilde
=MID(B1,FIND("~",B1)+1,100)
=RIGHT(B1,LEN(B1)-FIND("~",B1))

HTH
Martin
 
R

Ron Rosenfeld

It seems like this ought to be easy, but I'm stumped.

I received several thousand address records in the following format--all in
one column of an Excel worksheet:

Smith, John & Mary 123 Main St Chicago, IL 60601
Williams, Bill & Cindy 45 54th Drive Madison, WI 51375

I need to parse these into:

Last Name
First Name(s)
Street Address
City
State
Zip

I can get the last name using =LEFT(B2, FIND(",",B2)-1) and the Zip using
=RIGHT(B2,5) but I can't come up with a way to get the other fields. A quick
scan of the records shows all the street addresses seem to start with a
number, and there are no numbers in any of the first names, so it seems like
if I can find the first number reading from the left, I could grab the first
names, but neither FIND nor SEARCH seem to work. I'm using Excel 2007.

Could anyone point me to a possible approach. Seems as if this is something
others must have solved many times before.

Many thanks,
Tom K

If your layout is always in the same format, and in particular if there are
only single first names, then this can be done using Regular Expressions.

The main problem is to differentiate the first name(s) from the beginning of
the address. The rule that I used in the example below is this pattern:

<comma><space>
<word>
then, optionally:
<space>
ampersand (&)
<space>
<word>

But this will fail if you have a "double" first name. For example:

Smith, Mary Jane 47 Edgecomb rd. ...

If that is the case, we would need to come up with a different rule. But try
this, anyway, and see if it works.

If your addresses always start with a number, we could incorporate that as a
rule and get around the issue of double first names.

Also, you need to list your cities by name. In the Pattern I devised, I listed
Chicago and Madison. You can add more possibilities by extending the
pipe-delimited list in Pattern.

First, enter the UDF into your workbook:

<alt-F11> opens the VB Editor
Ensure your project is highlighted in the Project Explorer window, then
Insert/Module and paste the code below into the window that opens.

The following formulas should work:

Last Name: =remid(A1,"^[^,]*")
First Name(s): =remid(A1,Pattern,1)
Street Address: =remid(A1,Pattern,3)
City: =remid(A1,Pattern,4)
State: =remid(A1,Pattern,5)
Zip: =remid(A1,Pattern,6)


Pattern:
,\s+(\w+(\s+&\s+\w+)?)\s+(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)


=============================================
Option Explicit
Function reMid(str As String, spattern As String, Optional Index) As String
Dim re As Object
Dim mc As Object

Set re = CreateObject("vbscript.regexp")
re.Global = True
re.ignorecase = True
re.Pattern = spattern

If re.test(str) = True Then
Set mc = re.Execute(str)
If IsMissing(Index) Then
reMid = mc(0)
Else
reMid = mc(0).submatches(Index - 1)
End If
End If
End Function
=============================================

This should handle any entries that are in the pattern you presented. Examples
that don't work would be useful in trying to refine this routine.
--ron
 
M

MartinW

I may not have explained my first post very well.
The intention is to split the data at the first number of the
street address. After the Find and Replace procedure
your data should look like this.

John & Mary~123 Main St Chicago
Bill & Cindy~45~54th Drive Madison
Bob & Carol & Ted & Alice~623 River St Chicago
B.J~3/89~26th St Toronto


After splitting at the tilde with the aforementioned formulae

John & Mary | 123 Main St Chicago
Bill & Cindy | 45~54th Drive Madison
Bob & Carol & Ted & Alice | 623 River St Chicago
B.J | 3/89~26th St Toronto


Just use Find and Replace to clean up the leftover
tilde's in column C.

HTH
Martin
 
R

Ron Rosenfeld

That won't work. He's talking about tons of addresses, not just two.

It is not the number of addresses that is important. Rather it is the pattern
used to determine those addresses. I pointed out one potential issue with the
pattern having to do with multiple first names -- e.g. Mary Jane. I also
pointed out how he can extend the list of city names.

What other pattern related problems do you see?

You pointed out that there would be a potential issue if there are two cities
named, for example:

Louis
St Louis

But I doubt there are too many city pairs like that. So manual inspection
might be an option.

Certainly, adding "St Louis" to the city list enables proper parsing of an
entry such as:

Sylvia, Burt & Jinny Waburn St St Louis, MO 45678

using my algorithm.

And the pipe-delimited city list can be constructed in a variety of ways,
depending on how many cities there are in the list. It would sure be simpler
if each city were preceded by a <comma><space> rather than just a <space> as it
is now. But that may not be an option.

So far as dealing with "tons of addresses", in the formula I wrote, A1 can be
replaced with any cell reference. And, in Excel, if you fill down a formula,
the cell reference will adjust appropriately so as to be able to handle "tons
of addresses".

I'm not sure how long the string containing the City List can be. I just
constructed a list of about 250 cities which was almost 3,000 characters long
and it seemed to work OK. I suspect the limit may be 32,767 characters, but
I'm not certain of that.

Obviously, it cannot be entered as a text string within the formula, due to the
limitation of 1,024 characters in a formula (Excel 2003) but rather as a
concatenated function, or possibly a concatenated Name.


--ron
 
R

Ron Rosenfeld

I may not have explained my first post very well.
The intention is to split the data at the first number of the
street address.

That will present a problem if there are addresses that do not start with
numbers.

But if the all do, it would make it simpler for my method to parse out the
first names, whether there are one, two and whether or not the names had more
than one word.
--ron
 
M

MartinW

That will present a problem if there are addresses that do not start with

Tom's original post said that all the addresses start with a number
and there doesn't appear to be any numbers in the names, although
from the way he phrased it, that may not be entirely correct.

I think no matter how he achieves his result at some point he
will have to go through a painfully slow and careful checking
process.

Regards
Martin
 
R

Ron Rosenfeld

It seems like this ought to be easy, but I'm stumped.

I received several thousand address records in the following format--all in
one column of an Excel worksheet:

Smith, John & Mary 123 Main St Chicago, IL 60601
Williams, Bill & Cindy 45 54th Drive Madison, WI 51375

I need to parse these into:

Last Name
First Name(s)
Street Address
City
State
Zip

I can get the last name using =LEFT(B2, FIND(",",B2)-1) and the Zip using
=RIGHT(B2,5) but I can't come up with a way to get the other fields. A quick
scan of the records shows all the street addresses seem to start with a
number, and there are no numbers in any of the first names, so it seems like
if I can find the first number reading from the left, I could grab the first
names, but neither FIND nor SEARCH seem to work. I'm using Excel 2007.

Could anyone point me to a possible approach. Seems as if this is something
others must have solved many times before.

Many thanks,
Tom K

If your layout is always in the same format, and in particular if there are
only single first names, then this can be done using Regular Expressions.

The main problem is to differentiate the first name(s) from the beginning of
the address. The rule that I used in the example below is this pattern:

<comma><space>
<word>
then, optionally:
<space>
ampersand (&)
<space>
<word>

But this will fail if you have a "double" first name. For example:

Smith, Mary Jane 47 Edgecomb rd. ...

If that is the case, we would need to come up with a different rule. But try
this, anyway, and see if it works.

If your addresses always start with a number, we could incorporate that as a
rule and get around the issue of double first names.

Also, you need to list your cities by name. In the Pattern I devised, I listed
Chicago and Madison. You can add more possibilities by extending the
pipe-delimited list in Pattern.

First, enter the UDF into your workbook:

<alt-F11> opens the VB Editor
Ensure your project is highlighted in the Project Explorer window, then
Insert/Module and paste the code below into the window that opens.

The following formulas should work:

Last Name: =remid(A1,"^[^,]*")
First Name(s): =remid(A1,Pattern,1)
Street Address: =remid(A1,Pattern,3)
City: =remid(A1,Pattern,4)
State: =remid(A1,Pattern,5)
Zip: =remid(A1,Pattern,6)


Pattern:
,\s+(\w+(\s+&\s+\w+)?)\s+(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)


=============================================
Option Explicit
Function reMid(str As String, spattern As String, Optional Index) As String
Dim re As Object
Dim mc As Object

Set re = CreateObject("vbscript.regexp")
re.Global = True
re.ignorecase = True
re.Pattern = spattern

If re.test(str) = True Then
Set mc = re.Execute(str)
If IsMissing(Index) Then
reMid = mc(0)
Else
reMid = mc(0).submatches(Index - 1)
End If
End If
End Function
=============================================

This should handle any entries that are in the pattern you presented. Examples
that don't work would be useful in trying to refine this routine.
--ron

Martin just pointed out to me that EVERY street address begins with a number.

That being the case, the following is more efficient for picking out the first
name(s), and will handle the issue of two first names: e.g. Mary Jane & Justin

Last Name: =remid(A1,"^[^,]*")
First Name(s): =remid(A1,Pattern,1)
Street Address: =remid(A1,Pattern,2)
City: =remid(A1,Pattern,3)
State: =remid(A1,Pattern,4)
Zip: =remid(A1,Pattern,5)

and

Pattern:

,\s+(\D+)\s+(.*)\s+(Chicago|Madison),\s+([A-Z]{2})\s+(\d{5}(-\d{4})?)

We still need to generate a pipe-delimited list of cities. How to do that
depends on the length of the list.
--ron
 
R

Rick Rothstein \(MVP - VB\)

Tom's original post said that all the addresses start with a number
You're correct! I missed that. I think I will modify my suggestion.

Well, I wouldn't concede this point too quickly... the OP's original
statement was..

"...all the street addresses seem to start with a number..."

with the key word to note being "seem"; and, of course, the OP's observation
means nothing against possible future data.

Rick
 
R

Ron Rosenfeld

Well, I wouldn't concede this point too quickly... the OP's original
statement was..

"...all the street addresses seem to start with a number..."

with the key word to note being "seem"; and, of course, the OP's observation
means nothing against possible future data.

Rick

I agree.

But I've presented two methods to try to parse out the first name(s).

One relies on the first name(s) being just single words and, if there are two,
separated by an ampersand.

The second relies on the first name(s) being the words between the initial
comma, and the first digit.

So far as parsing out the city, the algorithm I've presented relies on a list
of all possible cities being available.

One could also, depending on the data, define the city as being between some
street-type abbreviation (e.g. Rd Av Ave St Dr Drive, etc) and the <comma>
which precedes the State abbreviation.

But I need more input from Tom K before going further.

100% success may not be possible, depending on the data.


--ron
 
R

Rick Rothstein \(MVP - VB\)

Ron Rosenfeld said:
I agree.

But I've presented two methods to try to parse out the first name(s).

One relies on the first name(s) being just single words and, if there are
two,
separated by an ampersand.

The second relies on the first name(s) being the words between the initial
comma, and the first digit.

So far as parsing out the city, the algorithm I've presented relies on a
list
of all possible cities being available.

One could also, depending on the data, define the city as being between
some
street-type abbreviation (e.g. Rd Av Ave St Dr Drive, etc) and the <comma>
which precedes the State abbreviation.

But I need more input from Tom K before going further.

100% success may not be possible, depending on the data.

You are being getting far more involved in this thread than I would ever
consider doing (kudos to you for that, by the way)... I look at the data and
simply see a horribly flawed structure which seems like it will take far
more time to develop a "no more than a partial solution" for than it would
probably take to parse the data lines one-by-one. I think the only thing the
OP can hope for is to be able to go back to the source and get the data in a
more structured form.

Rick
 
R

Ron Rosenfeld

I think the only thing the
OP can hope for is to be able to go back to the source and get the data in a
more structured form.

But then the task would become simple :))
--ron
 
P

Peo Sjoblom

Rick Rothstein (MVP - VB) said:
You are being getting far more involved in this thread than I would ever
consider doing (kudos to you for that, by the way)... I look at the data
and simply see a horribly flawed structure which seems like it will take
far more time to develop a "no more than a partial solution" for than it
would probably take to parse the data lines one-by-one. I think the only
thing the OP can hope for is to be able to go back to the source and get
the data in a more structured form.


I agree, yet posters seem to think this is easy, to quote the OP
"It seems like this ought to be easy, but I'm stumped"
I never understood why it would be easy to parse text strings that can
follow who knows how many different rules in a program made to crunch
numbers?
It seems that Excel more than any other programs is used (or rather people
want to use it that way) to do everything from word processing to scheduling
work shifts!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top