Normalizing a table

K

Kevin Spencer

Hi Guys,

I am new to this Access and database stuff, but I am giving it a go anyway.
Here's the problem.... I have a large text file that needs to be in a data
base. I have imported it into one table, it is 65 columns wide and has
650,000 records. Each record has a persons SSN and just there first and last
names.

But one person can appear multiple times, so I thought that I would start to
normalize this into additional tables. My first target was three fields,
SSN, FirstName and LastName. I ran a make table query that selected those
three field distinctly on SSN and made a new table. The plan was to have the
SSN as the primary key, allowing me to leave the SSN in the main table and
have the names with SSN in the newly created table. Here is the query I
used:

SELECT DISTINCT tblClaimsData.SSN, tblClaimsData.FIRSTNAME,
tblClaimsData.LASTNAME INTO tblPatient
FROM tblClaimsData;

This worked fine until I went into design view of the newly created table
and tried to set the SSN as a primary key, I received an error message
saying that duplicates existed! On further investigation I discovered that
even though there were no duplicate SSN's there were people with exactly the
same first and last name, but obviously with different SSN's. I though that
as long as there was one unique number in a record, it was unique. Am I
completely missing the point? The plan is to have the main table relate to
the second table on the SSN, eliminating the first name last name field
having to be stored in the original table.

Is there another way I can acheive this goal or am I in the right track?

Thanks for your help

Kevin
 
A

Albert D. Kallal

Just make a table, and make the index on ssn not to allow duplicates. In
fact, you can make the ssn the primary key (not a good idea).

Now, just do a regular appending query like:
SELECT tblClaimsData.SSN, tblClaimsData.FIRSTNAME,
tblClaimsData.LASTNAME INTO tblPatient
FROM tblClaimsData;

Since the index does not allow duplciates...then no duplicae ssn will be
transfered....
 
K

Kevin Spencer

Hi Albert

I actually did that at one point, I made the new table and just created the
three colums and set the SSN field not to allow duplicates ,then ran the
append query. When running the make table query I had in my previous post it
created the table and removed all the records with duplicate SSN's. Which
would rightly have say SSN xxx-xx-xxxx being John Doe and another record
with SSN yyy-yy-yyy also belonging to John Doe. But if I tried tocreate a
relationship between this table and the first then I was informed by Access
that I could not do that because there were no unique records. When I did
the append query into a new table as you sugest it would only insert the
first SSN it came across and ignored the others. So comparing the two tables
there were many legitimate record discarded with append.

I tested the theory by only making a new table with just the SSN field, and
it allowed me to say no duplicates on the index, but as soon as I included
the first name/last name fields that would have legitimate name duplicates
it would not allow the name duplicates unless the index on the SSN field
allowed duplicates. I went a stage further by adding a fourth field that was
just an auto number field, and as soon as that was present, then Access
allowed me to have the index on the SSN field with no duplicates!

I am completly lost with this, I have pained over it all day, the SSN is
unique, but when there are two records with the same first and last names
BUT different SSN's Access complains that the record are not unique.... But
they are, each record has a completely unique SSN field. Is there a default
setting that needs to be changed? I just can't find it if there is.

I also manually created a table, and manually entered two of the records
that have the same first name and last name and different SSN's, then set
the SSN field as no duplicates and Access did NOT complain, but as soon as
the make or append is done by query, whether Access Design mode or just SQL
it fails.

Kevin
 
A

Albert D. Kallal

Kevin Spencer said:
Hi Albert

I actually did that at one point, I made the new table and just created the
three colums and set the SSN field not to allow duplicates ,then ran the
append query. When running the make table query I had in my previous post it
created the table and removed all the records with duplicate SSN's. Which
would rightly have say SSN xxx-xx-xxxx being John Doe and another record
with SSN yyy-yy-yyy also belonging to John Doe. But if I tried tocreate a
relationship between this table and the first then I was informed by Access
that I could not do that because there were no unique records.

I can't see, or understand why you were not allowed to create a relation
between the two tables? (I am a complete loss here????). Once you do the
merge using with that unique index, can not make ssn the Primary key in this
2nd table?

Further..what kind of relation are you trying to set? (left join, inner
join, right join?).

And, for this task...perhaps you don't need to enforce RI at all...do you?

I mean, really, for some testing, and searching, and matching in the two
tables...you don't need to set any relationships at all. You can just fire
up the query builder, and drop in the two tables and build your joins that
way?

It is not clear if you are building this 2nd table to clean up the first, or
in fact wish to move out the actual ssn, and name data to this 2nde table.
If you are just doing some cleaning out of duplicates..then you do not to
build a relationships here at all. Just build the query in the query
builder..and join between the two tables...
 
K

Kevin Spencer

Hi Albert,

You are right, the point of the exercise is to remove the duplicates, and
clean up the first table, but leave the SSN in there. The other goal is to
try and reduce the amount of disk space the database occupies as well, there
is so much repetative info in into i.e. one address 150,000 times.

Thanks for the advise, I was trying to establish a one to many relationship
with RI (told you I did not know what I was doing :) ) . The table is just
a text file that has been imported from a database dump from another
company, and is quite large, which is why I was trying to normalize it into
many tables. But (as you can tell) this is the first time I have done this.

I was going to leave the SSN in the first table and have the SSN, first and
last names in the second table eliminating over 1.2 million names in the
first table! But I thought that a relation ship between the two tables had
to exist to do this. I will experiment with the joins as you suggest ( which
will inevitably lead to more posts), I never considered that.

Thanks for all the advise, its given me a good starting point.

Kevin
 
R

Robin Chapple

Just make a table, and make the index on ssn not to allow duplicates. In
fact, you can make the ssn the primary key (not a good idea).
Can you explain to another beginner why using the SSN is not a good
idea please?

Thanks, Robin Chapple
 
M

MacDermott

As I understand it, SSN is not as unique an identifier of a person as we
might like to think.
Especially in immigrant communities, it's not unusual for several people to
share the same SSN.
And in jail communities, for example, it's not ususual to find the same
person using a variety of different SSNs.

There is also a school of thought which decrees that a Primary Key should
never be meaningful. Other indexes can be used to regulate individual (or
groups of) fields, but the PK should have no purpose other than to assure
that records are unique. I'm not sure I follow all of the logic behind
this, but it's a position I've heard proposed with some vehemence.

HTH
- Turtle
 
R

rkc

MacDermott said:
As I understand it, SSN is not as unique an identifier of a person as we
might like to think.
Especially in immigrant communities, it's not unusual for several people to
share the same SSN.
And in jail communities, for example, it's not ususual to find the same
person using a variety of different SSNs.

There is also a school of thought which decrees that a Primary Key should
never be meaningful. Other indexes can be used to regulate individual (or
groups of) fields, but the PK should have no purpose other than to assure
that records are unique. I'm not sure I follow all of the logic behind
this, but it's a position I've heard proposed with some vehemence.

The idea that a meaningless primary key makes a row unique is a completely
flawed concept. A meaningless primary key can uniquely identify a row, but
it can't possibly ensure that the row is unique. Unless it is the only
column in
the table. Which makes the table meaningless.
 
A

Albert D. Kallal

As MacDermott pointed out, often the ssn is NOT unique.

However, worse is that your application cannot run UNLESS you enter the ssn.

Perhaps you need to enter a new names, and they are temp workers, or simply
awaiting the ssn number. All of a sudden, your application cannot run.

The same goes for invoice numbers...I don't use them in my relations (I use
a autonumber...behind the scenes that the user never sees).

That way, I can again have my application function, and not necessary have
to enter/create a invoice number (for example, if the invoice is not
approved, or canceled by the user..then no invoice number needs to be
generated (or wasted). As a result, during normal operation of the program,
then there are no gaps in the invoice numbers issued.

When you use a program like Outlook, or even something like QuickBooks, you
NEVER see any of the actual numbers used to build the relationships (and,
really...do you care? All you really want is a relation between two
tables....do you care how the computer does this? no! Do you need, or care
if the users sees some autonumber when using Quicken? no? Users do not need
to see this stuff at all..
 
R

Randy Harris

rkc said:
people

The idea that a meaningless primary key makes a row unique is a completely
flawed concept. A meaningless primary key can uniquely identify a row, but
it can't possibly ensure that the row is unique.

That's illogical. The function of the PK is to uniquely identify the row,
but if the primary key is unique (it must be by definition), then the row
that contains it must also be unique.
 
R

rkc

Randy Harris said:
That's illogical. The function of the PK is to uniquely identify the row,
but if the primary key is unique (it must be by definition), then the row
that contains it must also be unique.

What is illogical is thinking that something that has no meaning can make
something that has meaning unique.
 
R

Rick Brandt

Randy Harris said:
That's illogical. The function of the PK is to uniquely identify the row,
but if the primary key is unique (it must be by definition), then the row
that contains it must also be unique.

I believe the distinction is in Row vs. Record. A meaningless PK Like
AutoNumber guarantees the uniqueness of the "row", but since it is not
considered to be part of the real data it does not guarantee uniqueness of the
"record".

Use of a surrogate key would normally be coincident with one or more unique
indexes on data fields and it is those indexes that would guarantee uniqueness
of the record.
 
R

Randy Harris

Rick Brandt said:
I believe the distinction is in Row vs. Record. A meaningless PK Like
AutoNumber guarantees the uniqueness of the "row", but since it is not
considered to be part of the real data it does not guarantee uniqueness of the
"record".

I understand the distinction you make here. I think that, because the PK is
permanently assigned to the record, I have become accustomed to thinking of
it as part of the "real data" of the record. As opposed to a transient
assignment, such as "Record Number". It wasn't clear to me that was the
point rkc was making.
 
K

Kevin Spencer

So, if I created an additional field into the table which is just an
autonumbered field then thta guarentees the uniqunes of the record (just not
necessarily the uniqueness of the data within it, which I do not care about
too much).

I have the original table that has 700,000 records in, and I am removing the
firstname/lastname data from the table and leaving the SSN in the first
table, but from the earlier discussions this may not be a good idea, should
I take the SSN into the second table? And if I do then how do I know which
record in the first table relates to the SSN, first name lastname info now?
Would the auto number values have to be present in the first table and if so
how would I insert them into the correct records.

Thanks to everyone for all their input, it has been invaluable.

Kevin
 
M

MacDermott

Have you looked at Tools - Analyze - Table?
This may help you normalize your data.

HTH
- Turtle
 
A

Albert D. Kallal

I think your case is somewhat different.

However, the process would be take our single table:

Add the autonumber field.

At this point, we now have a primary key field,a nd it is NOT the ssn.

You then split out the records to the other table, and use the autonumber
field, and not the ssn.

So, simply in place of using the ssn to built the relation, you now send out
the id in place of the ssn.

You don't want to confuse the issue of removing duplicates of the ssn with
the issue of creating and maintain a relation from a parent table to a child
table. These are two separate issues. You slicing, and dividing of the data,
and figure out what record belongs to what parent record will HAVE to use
the ssn to figure out what records belong to who.

So, really, once you finally get all your data correct, then you only want
ONE ssn number for each person. So, the ssn goes in the parent table, and
you use the autonumber to now maintain this relation (and, like all
relations, this mean that you need a standard long number field in the child
table that contains the autonumber id of the parent record). This means
that you can change the ssn in the parent record...and not have to worry
about changing the ssn in the child record (which, once we get the data
cleaned up....we do NOT need the ssn in the child table). And, of course,
once you do finally get things right...you will be able to change the ssn,
and not worry about the child records still being linked..as they are using
some invisible number tat you, or your users don't care about.

So, I shall repeat:
You don't want to confuse the issue of removing duplicates of the ssn with
the issue of creating and maintain a relation from a parent table to a child
table.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top