Identifying patterns in data

D

Darren

I have a data file of over 200,000 customers. Along with the basic name,
address information there are over 200 fields of demographic data.

What I am interested in doing is analyzing the data to identify the most
common traits of our customers.

Anyone have any ideas?

Darren
 
K

KARL DEWEY

That is a lot of demographic data or do you have a separate field for like
Brown Hair, Black Hair, Red Hair, Brown Eyes, Blue Eyes, Hazel Eyes, etc.
This is not how your data should be loaded. There should be a field for hair
color, eye color, height, weight, education, etc.
 
J

John Vinson

I have a data file of over 200,000 customers. Along with the basic name,
address information there are over 200 fields of demographic data.

What I am interested in doing is analyzing the data to identify the most
common traits of our customers.

Anyone have any ideas?

Darren

Well, NORMALIZE YOUR DATA for starters. 40 fields is a very wide
table; 200+ fields is not a table at all, it's a spreadsheet. I would
guess you have a many to many relationship between People and
DemographicCharacteristics (and I don't know what those might be); the
proper structure would have three tables:

People
PersonID
<basic biographical/contact data>

Characteristics
CharID
Description

Demography
PersonID <<< link to People
CharID <<< link to Characteristics
Value <<< might not even be needed, don't know what you record

Once you run a series of Append queries to migrate the data from your
spreadsheet into these tables, you can easily do Totals queries to
count occurances of particular Characteristics.

John W. Vinson[MVP]
 
R

Richard.Toren

Subject: Statistics question really

Although I'm familiar with statistics I would still advise that you consult someone with the proper credentials if you want to be really sure of what you have...

Having said that the technique for deciding how different independent values interact is by regression. Since more often than not some traits are not continuous (for example sex as opposed to income) you will probably have a non-linear regression. One of the first things you can do to speed the process is to do a cross-correlation, comparing one factor to another. Basically you need to find what traits are so closely linked to each other that you can eliminate one from the analysis. The fewer independent variables the more likely you can discover if there is a usable relationship between a factor the result.

None of this actually really relates to Access. It doesn't have the means to help except in the storage, retrieval or presentation (in spreadsheet form) of the data.

By the way one of the best things to do with data would be just looking at means and standard deviations of the variables using one of the other variables to look at it, for example divide income to 5 arbitrary levels and look how the other 199 group. If you choose wisely then you'll have some good immediate insight into what you have.

One final note: Don't normalize percentage data as it distorts the distribution.

Caveat: writer has a PhD in neurophysiology and not in statistics :)

Ilan
 
Top