[]
Have you done any thinking about my idea that your proposals
would be helped if the db were in RAM instead of stored on disk?
In that case dynamic linking of data structures is much simpler,
and that removes many of the objections about fragmentation and
updating that are present in disk images. A pointer in RAM is
much more efficient than a pointer on disk, for instance, because
that's the way memory and our code are engineered to work.
This is already quite possible. You set up a RAM Drive and put
the database there.
But that is just a faster hard disk -- it does *not* take advantage
of the random access nature of RAM as opposed to disk storage in the
data storage structures, which in all the database engines I know
anything about are optimized for disk access. The database engine
and the data structures used would have to be altered to take
advantage of RAM.
I've done that, and as far as performance is concerned, there's
really nothing like it. Whether the database is written as
efficiently as possible or not is no longer an issue. Virtually
everything is instantaneous.
But, I would think that databases should be written to operate
well under more typical circumstances.
I think you're missing a major point -- many of the things you are
criticizing about our traditional methods of structuring
relationships are entirely a legacy of database engines designed
around reading and writing data to/from a disk drive. A database
engine designed (or, at least, tuned) to run in RAM would need to
have a different design, and, I think, your ideas would work *much*
better in a RAM-optimized db engine than in the disk-optimized ones
we have to work with today.
Consider the mere issue of your physical pointers. In a disk
environment, you have a major performance drag if the disk image
becomes fragmented in multiple locations on the disk volume. In RAM,
fragmentation is only a problem when it becomes so severe that there
are no blocks large enough to load data/code segments that can't be
fragmented. There is little speed penalty for fragmenting your table
over different locations in RAM. Thus, the background "cleanup"
becomes much less important (and would possibly need to be done only
for writing to persistent storage).
It does not mean that at all. It does mean that a conversion to
another database engine would have to ignore the pointers which I
propose to use in place of autonumber/identity surrogate keys.
However, based on only the natural key, it would be entirely
possible to port to other database engines.
I would like to point out that using autonumber/identity values in
a database is not portable. Having such a datatype is NOT a
requirement. So, using surrogate keys currently facilitates
portability because surrogate keys are so popular that nearly all
databases have implemented the required datatype. If all
databases supported what I propose, then that would be the
standard instead.
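To make the point concrete: the same autonumber column must be
declared differently in each engine (a sketch; the table and column
names are just examples, and exact spellings vary by version):
CREATE TABLE Widget (WidgetID AUTOINCREMENT PRIMARY KEY)       -- Jet/Access
CREATE TABLE Widget (WidgetID INT IDENTITY(1,1) PRIMARY KEY)   -- SQL Server
CREATE TABLE Widget (WidgetID INT AUTO_INCREMENT PRIMARY KEY)  -- MySQL
CREATE TABLE Widget (WidgetID SERIAL PRIMARY KEY)              -- PostgreSQL
Four dialects for the same logical design.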
No, it doesn't mean that at all, unless your pointer is to a lookup
table instead of to the actual storage location. For the direct
pointer to work, all database engines would have to use exactly the
same file format.
Obviously, portability of the actual data to another db engine would
be easily handled through a view that presents the "human-friendly"
version of the data.
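A sketch of what I mean (the view, table, and pointer column names
here are hypothetical): the engine keeps its pointer columns internal,
and the view carries the natural-key presentation:
CREATE VIEW CustomerOrders AS
SELECT c.CustomerName, o.OrderDate, o.Amount
FROM Orders o
INNER JOIN Customers c ON o.CustomerPtr = c.RowPtr
Exporting from the view moves the human-readable data to another
engine without ever exposing the pointers.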
But your app would not work the same way on a different db engine
unless you only ever worked with the "view" level of data in your
application. If that's the case, I don't see how you get any benefit
from your approach except (putatively) at the performance level.
What you say may have some practical meaning, but that is
arbitrarily based on the current culture in building databases and
database engines. I am proposing a solution that would be a large
improvement in database engines, based on a different paradigm.
But it couldn't possibly be portable without an abstraction layer
running on top of the db engine's low-level data storage system.
That's exactly what we have today, so I expect that a lot of the
performance benefits you are seeking would disappear in the
re-translation back to the presentation that we need to understand
what data we're actually working with.
[]
I have no problem with this statement. There ARE potential
advantages in using an autonumber/identity key. You may not see
it, but there are also advantages in using the natural key. I
propose to implement all the advantages of both, and to give
additional advantages as well.
The way I see it, you're just creating a hidden surrogate key that
is tied to the data storage format. It gains performance because of
running so close to the metal, but is also not really transparent to
the DBA, as you have to depend on the db engine to keep it
up-to-date. I'd prefer to have that at a lower level so that if it
becomes corrupted, it's a matter of recovering dumb data structures
that store data that has its own logical structure. With your
approach, there aren't two independent logical structures, so if the
pointers get corrupted, there's no redundant information to recover
the data. I find that problematic to say the least.
[]
I believe you are saying that none of your tables have a single
column natural key. Is that the case?
Except for simple lookup tables with static values, yes.
The databases I build all have a deliberate structure in which
compound keys are built up from simpler ones. This structure
always begins with tables with a single-column natural key and
builds up from there when necessary. Between 30% and 80% of the
tables have a single natural key column. A large proportion of my
designs have compound keys only 2 or 3 columns deep. I have
rarely reached the level of 5 tables compounding the key.
Well, it certainly depends on the type of data you're working with,
but when storing data about people, there are simply no natural keys
that are usable in tables like Person and Address.
I point out that, using only a single surrogate identity column to
relate tables at a depth of 5 levels, you must create all 4 JOINs
in a query in order to retrieve any column from the lowest level
table in the hierarchy.
That almost never happens. I hardly ever have hierarchies of more
than 3 levels. I *do* often have joins to multiple tables at a
single level, but that's a different issue entirely.
If you store the natural key in the 5th level table, then you
can JOIN directly from the highest level to the lowest. By
subsuming all the keys into one, you lose this capability with
surrogate keys. In addition, it is not uncommon to need only the
single natural key column from the lowest level table in this
hierarchy. This value would be stored in the 5th level table
using the scheme I find useful. You would need to create 4 JOINs
in your query to obtain that. And if a different column in the
lowest level table is required, using compound natural keys allows
you to retrieve that in a single JOIN.
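To make the contrast concrete, here is a sketch (the table and
column names are hypothetical) with Level1 holding a single-column
natural key, K1, and Level5 carrying the full compound key K1-K5:
-- Surrogate keys: 4 JOINs to reach any Level1 column from Level5
SELECT L1.K1
FROM Level5 L5
INNER JOIN Level4 L4 ON L5.Level4ID = L4.Level4ID
INNER JOIN Level3 L3 ON L4.Level3ID = L3.Level3ID
INNER JOIN Level2 L2 ON L3.Level2ID = L2.Level2ID
INNER JOIN Level1 L1 ON L2.Level1ID = L1.Level1ID
-- Compound natural keys: K1 is already stored in Level5
SELECT L5.K1
FROM Level5 L5
-- ...and any other Level1 column is a single JOIN away
SELECT L1.Description
FROM Level5 L5
INNER JOIN Level1 L1 ON L5.K1 = L1.K1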
This debate is old, and I would not be participating in it except
that I now have worked out a methodology to permit the best of
both sides to be realized, and more besides.
I just don't see too much benefit and fear the problems it brings in
terms of mixing the data level and the storage level. Codd and all
the SQL theorists have always said that the storage should not
intrude into the logical data level, and I think that's a good
thing. It's why data is largely quite portable between all SQL
databases, because that logical abstraction is there, which makes
what the db engine is doing internally completely irrelevant. Your
proposal moves an element of the storage level into the logical
relationships of your actual data. That's a theoretical problem, but
there are good reasons theory advises us to avoid that, and I don't
see that you've addressed those problems at all.
Codd requires stability in a natural key, and I expect we are in
complete agreement on that.
Yes, and that's *precisely* the reason why very few entities that
I've ever worked with have any viable natural keys -- because the
only combinations of fields that would guarantee uniqueness are quite
subject to change.
[]
What problems have you experienced with cascading updates? I
expect I have used them many more times than you have, and I
haven't experienced any problems with them.
It's a huge performance drain, as well as being problematic in a
replicated scenario, where it can hose your data structure if you
don't manage your FK edits and your synchs well. Sure, that's an
implementation problem within Jet replication, but it's a problem to
which there is probably no solution in any replication scenario (you
have to have *some* ordering of the synchronization of
edits/adds/deletes and without extremely complicated logic, it's
probably impossible for any db engine to always get it right).
As to the overhead, I have tracked the frequency of cascading
updates in a large, rather busy multi-user database that is
typical of my designs. Cascade updates commonly occur 1-2 times
per month (which confirms that my selection of natural keys is
stable as required) and typically take 1-3 seconds. From my
perspective, a total performance hit of 2-6 seconds per month is
not much of an issue.
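For reference, the kind of relationship in question would be declared
something like this (a sketch; the table and column names are
hypothetical):
CREATE TABLE Category (
    CategoryName VARCHAR(50) PRIMARY KEY    -- natural key
)
CREATE TABLE Product (
    ProductName  VARCHAR(50) PRIMARY KEY,
    CategoryName VARCHAR(50)
        REFERENCES Category (CategoryName)
        ON UPDATE CASCADE    -- a rename propagates to every child row
)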
But you end up updating *two* indexes because of it, whereas the
surrogate key on the lookup table requires an update to one index
(and only one value). It really doesn't matter to me that it is fast
-- the issue is that you are updating a lot more data pages with the
cascade update than would be required with a surrogate key.
Well, not implementing schema changes properly sounds like a very,
very bad thing, whether you use surrogate or natural keys.
Replication invalidates some practices that cause no problems in a
non-replicated scenario. It's one of the things that makes
implementing replication complex, in that you can't always just take
a working schema and replicate it. This is not a problem with Jet
replication, but an aspect of the whole problem space for any db
engine's replication implementation, because replication imposes
requirements on order of transactions that are irrelevant in
non-replicated apps.
I'm not aware of exactly
what kind of improperly implemented schema change you are
suggesting, and I cannot comment further because of this
vagueness.
If you were experienced with replication you'd probably know what
I'm talking about.
For instance, a client of mine with a replicated database changed
the value of a natural key in a lookup table and then moved all the
child records to a new parent record. They then deleted the original
parent record. The result was that when they synched all these
changes at one time, the child records got deleted because the order
of transactions is deletes/updates/adds. Validation rules and RI
alterations in your schema have to be handled very carefully -- for
instance, if you're adding a validation rule on a field, you have to
make sure all the data *globally* in your entire replica set is
valid *before* you implement the validation rule. That means: edit
the data to make it valid for the new validation rule, then synch
around the whole replica set, and only *then* apply the new
validation rule. The reason is that schema changes are applied
*before* edits, so when you synch, you may have valid data in the
replica you're synching from, but if the replica you're synching
with has invalid data under the new validation rule, you'll get an
error, because the schema change is propagated *before* the edits
that made the data valid.
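Spelled out as steps (generic SQL here; Jet validation rules are
actually set through the interface or DAO, so the CHECK constraint
and the table below are hypothetical stand-ins):
-- Step 1: make all existing data valid, in every replica
UPDATE Orders SET Quantity = 1 WHERE Quantity < 1
-- Step 2: synchronize the entire replica set, so every replica
-- holds the corrected data
-- Step 3: only then apply the new rule
ALTER TABLE Orders ADD CONSTRAINT QuantityPositive CHECK (Quantity >= 1)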
Again, this is not an implementation problem in Jet replication --
it's a logical issue with *all* replication regimes.
Natural keys with cascading updates make this more problematic.
Of course, I also avoid cascading deletes, as well, because of this
very problem.
[]
Well, that's what I said. However, my 5 level scenario (above) is
very realistic, and illustrates the full case of what does happen.
Well, I would disagree -- it may be realistic as a *possibility*,
but it is not necessarily terribly common. Again, I'd say you're
going in for "premature optimization," adducing a value for edge
cases that won't be much of an assist for the core uses of the database.
And what you've gotten in return for it is substantial complexity in
handling the joins that you *do* need (any time you need any
attributes that are not part of the PK).
[]
Not at all. Again, I have extended my discussion to a 5 level, 5
key instance.
Five tables means 5 single-column PKs, which is what I said.
I agree somewhat. I have timed myself putting in the components
of a 3 level compound key JOIN, and it can take up to 30 seconds
to type. As I have proposed elsewhere, a shortcut for this should
be implemented in SQL. Like this:
SELECT *
FROM TableA A
INNER JOIN TableB B
ON A.Column1 = B.Column1 AND A.Column2 = B.Column2
AND A.Column3 = B.Column3
could be replaced with:
SELECT *
FROM TableA A
INNER JOIN TableB B
ON A.IndexX(3) = B.IndexY(3)
Now, the IndexX and IndexY will have been defined anyway as:
IndexX   Column1
         Column2
         Column3
IndexY   Column1
         Column2
         Column3
         Column4
Well, that's kind of like the way a SEEK works, in that you specify
the index instead of the fields, but that's db-engine specific, and
one of the reasons why SQL doesn't support it.
I think your problem here is less with db engine implementations
than it is with the philosophy behind SQL.
These two indexes are for TableA and TableB, respectively; I am
referring to the way an index looks in the dialog where you define it.
So, the new syntax I propose above has exactly the same meaning and
function as the current standard method.
The way I'd see it is that your db engine would provide to your SQL
engine a "fake" field that would be your "surrogate" for the natural
keys. You'd get the advantage of the ease of use of the surrogate
key while being able to use your natural keys.
I recognize this as a potential advantage, but only for those using
natural keys. It would certainly negate the ease-of-use objections
I've made against natural keys.
But it would only really help when natural keys are viable, and, as
I've said, I don't encounter many scenarios where natural keys are
usable, and where they are, they aren't multi-column keys, anyway.
What I am trying to convey here is that it is the TOOLS we use
that make using natural keys a bit more difficult than is
necessary, and that this is not a necessary attribute of working
with natural keys. In terms of difficulty, I believe the above is
very nearly equivalent to using surrogate keys.
Again, you're complaining about SQL's logical abstraction,
independent of implementation. When you propose using a particular
index in a SQL statement, you've already gone outside the realm of
SQL.
[]
I expect, then, that you consider your usage of surrogate/identity
keys to be part of the logical layer, and that you WANT to be
dealing with the additional complexity it adds.
It's a complexity that comes very naturally from our real-world
experience. We've all got "identity" keys that we have to remember,
such as one's SSN, one's bank account numbers, etc.
My proposal is to use a superior method of
generating such keys automatically and transparently, but to use a
value in this key that avoids the need to generate and use a
separate index to accomplish what such keys perform.
You're basically trying to build Access's table-level lookups into
the db engine. I understand that. I just don't think it's going to
be as good an idea as it sounds. Had I no experience with Access's
table-level lookups, I'd say that was a good idea, too, but it turns
out, it doesn't belong at the table level.
I ask you to stop and think. What do you gain by having to add
identity columns to the database that you would lose if it were
done for you automatically? Why do you feel you need to be aware
of them and to be able to access their values?
Because I don't like giving up control over how they are handled. I
don't like losing the redundancy that comes from *not* mixing my
app's schema logic in with the organization of the actual data
store.