The natural key value would be in the foreign key table in any
case. The cost to compare is that of adding the autonumber column
to both tables, plus adding an index on the autonumber in the
foreign key table. The break-even point is at about a 20-character
natural key length, which happens to be a very common length in my
designs. I figure on adding 4 bytes to each table, plus at least 8
bytes of indexing per row, plus some overhead, since indexes are
never filled to 100%.
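A back-of-the-envelope sketch of that accounting. The figures
(4-byte autonumber, ~8-byte index entries, indexes roughly 75%
full) are the assumptions above, not measured Jet storage costs,
and the function name is mine:

    def surrogate_net_bytes(natural_len, parent_rows, child_rows,
                            index_fill=0.75):
        """Net bytes the surrogate-key design adds (+) or saves (-)."""
        autonumber_in_parent = 4 * parent_rows
        autonumber_index = (8 / index_fill) * parent_rows  # extra PK index
        child_key_change = (4 - natural_len) * child_rows  # 4-byte FK replaces text key
        return autonumber_in_parent + autonumber_index + child_key_change

    # Break-even sits near a 20-character natural key when the two
    # tables are roughly the same size.
    for length in (10, 20, 40):
        print(length, surrogate_net_bytes(length, 100_000, 100_000))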
I don't understand how a 20-character text key can be as efficient
in terms of storage and index maintenance as a long integer, which
my Access help file tells me is 4 bytes. For the data storage alone,
you're talking 5 times as much space per record. Secondly, my
understanding (which could be wrong) is that Jet (and other db
engines) have optimized retrieval of numeric values as compared to
text values, even when indexed. So there'd be a performance hit
because you're no longer benefiting from that optimization. Then
there's the index update hit because you've got to maintain more
data in the index tree.
David W. Fenton said:
[]
You lost me. Assuming we're talking about a single-column PK, the
difference in storage space between a long integer and a text
field is going to be proportional to the difference between the
storage space required for the one vs. the other. Since both
require storage in two tables in order to map the relationship,
and because both have to be indexed, it seems to me that all you
need to do is figure out the difference between the size of the
two data types as stored. Only very small text fields are going
to take up less space than a long integer (4 bytes? or is it 8?).
Not actually proportional. There are 3 entities involved.
1a. The foreign key table, which must grow by 4 bytes for each
row it contains, to hold the autonumber column
2a. The dependent table, which will shrink by the difference
between the natural key and the autonumber key
3a. The index on the foreign key table for the autonumber column,
which is entirely avoided by the natural key method
Your terminology confused me for a while, and that caused me to not
understand your point. By "foreign key table" you mean the parent
table in the relationship. Yes, you are adding 4 bytes per record
*if* there is an actual candidate natural key.
So, yes, you are right that it is not directly proportional.
But only in cases where there really is a proper candidate natural
key, which I find very rare except for simple lookup tables.
There are also differences in performance.
1b. A lookup will be modestly faster for large tables using the
autonumber. The primary factor here is the number of levels in the
b-tree index. A 20-byte natural text key makes for a 24-byte
entry in each node of the index, meaning there will be a maximum of
about 160 rows per node, while the 4-byte autonumber has an 8-byte
entry and about 500 entries per node. For, say, 1 million rows,
the natural key index will require 3 levels of index, meaning 3
disk accesses for every joined row of data (160 cubed is about 4
million). For the autonumber, it will still require 3 levels of
b-tree, and will perform at the same speed. However, 500 cubed is
125 million, so between 4 million rows and 125 million rows the
autonumber index is faster by a 3:4 ratio, because the natural key
index needs a fourth level while the autonumber index still needs
only three. This is a very modest gain in performance during
lookups.
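A quick check of that arithmetic (the ~160 and ~500 entries-per-node
figures are the assumptions above, not measured Jet page layouts):

    import math

    def levels(rows, fanout):
        """Index levels needed so that fanout ** levels >= rows."""
        return max(1, math.ceil(math.log(rows, fanout)))

    for rows in (1_000_000, 10_000_000, 200_000_000):
        print(f"{rows:>12,} rows: "
              f"natural key (~160/node) -> {levels(rows, 160)} levels, "
              f"autonumber (~500/node) -> {levels(rows, 500)} levels")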
That is, if the only speed benefit is from the number of nodes that
need to be traversed. My understanding is that there are other
optimizations for numeric values in most db engines that speed the
processing of joins beyond just the b-tree traversal. But I'm just
going on third-hand information there, so I could be wrong on that.
Certainly the need to handle double-byte data must introduce some
kind of additional overhead in the text-based indexes.
2b. When adding rows, the natural key method will be faster. The
foreign key row is smaller, and has one fewer index to maintain:
there is no index on the autonumber column to be maintained at all.
I sure wish you'd use "parent row" instead of "foreign key row."
I don't think this is a huge issue, as you're on the 1 side of the
join. The many side is where most of the records will be added. I
can see where that would be less of an issue if most of your N's
are 1 or 2, but once you have an average of 2 or more, you've
really multiplied the maintenance hit well beyond the small
difference in updating the surrogate key index.
3b. When executing a query, it is common to require only
the natural key value. See the Categories table in Northwind,
where the natural key value is the ONLY column in the foreign key
table. If you put this value in the dependent table, then you
avoid the need to JOIN to this table altogether.
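A minimal sqlite3 sketch of the two designs that example describes.
Table and column names loosely follow Northwind but are illustrative
only, not actual Jet DDL:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- surrogate-key design: the name must be fetched through a join
        CREATE TABLE Categories (
            CategoryID   INTEGER PRIMARY KEY,
            CategoryName TEXT UNIQUE NOT NULL);
        CREATE TABLE Products_surrogate (
            ProductID    INTEGER PRIMARY KEY,
            ProductName  TEXT,
            CategoryID   INTEGER REFERENCES Categories);

        -- natural-key design: the name itself is the foreign key
        CREATE TABLE Products_natural (
            ProductID    INTEGER PRIMARY KEY,
            ProductName  TEXT,
            CategoryName TEXT REFERENCES Categories (CategoryName));
    """)

    # The lookup join disappears in the natural-key design.
    with_join = ("SELECT p.ProductName, c.CategoryName "
                 "FROM Products_surrogate p "
                 "JOIN Categories c USING (CategoryID)")
    without_join = "SELECT ProductName, CategoryName FROM Products_natural"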
But that's *precisely* the kind of table I'm *agreeing* works very
well with a natural key, because it's a one-column lookup table.
However, I don't always use a natural key, because the values
sometimes have to be updated (in some apps more often than others).
I propose to gain the advantages of using both methods, and to
gain additional advantages as well. When my work is finished, I
will be able to show these things quite well.
Well, I'm beginning to see where you're going, but I just don't see
the advantages except in the type of case I already agreed is just
fine for natural keys.
Once you get to a 2-column or more PK, then I think your whole
theory breaks down. It certainly doesn't change the data storage
issues, but it does magnify the storage and index updating issues in
the child table (i.e., the one with the N records), and the join
issue only works when the data you need to filter on is only one
join away (i.e., it only helps with direct relationships). Once you
need data two joins away, you probably have greatly *increased* the
problems with join performance.
As I have been saying, both approaches have some potential
advantages, especially in very large databases, probably those in
excess of the 2 GB limit on Jet. But the whole debate is somewhat
frustrating since the best solution would be a database engine
that automatically provides all the best advantages of both
methods, and can indeed exceed either approach taken separately.
There exists a value like the autonumber: the value recorded in
the lowest level of the b-tree, which points to the exact location
of each row in the foreign key table. Suppose this value were
recorded in the dependent table in place of the autonumber value in
2a above, removing the need for the index in 3a above. This value
would not be stored in the foreign key table (1a above), but would
be external to that table.
But my understanding is that the value that is recorded there points
to a data page with an offset for the start of the record. Maybe the
offset is stored in the data page, instead. But when you compact,
the indexes have to be updated, and if you stored that value in each
child record instead, you'd have to update it in many more places
than you do with the current situation. I'm speaking of Jet here,
but surely every database engine has some similar kind of method
(MySQL would have a file name and a row number).
In short, it's exactly the same problem I have with depending on
CASCADE UPDATES for natural keys -- when the parent value changes
you have to do a bunch of updates to a lot of records and the
indexes for those records.
There are two challenges to this. One is to maintain integrity in
the database when the natural key of a foreign key table entry is
updated. This is currently handled by cascading updates. I
propose this be deferred, and instead keep a bitmap of all the
rows in the foreign key table, showing whether they have been
updated but not yet cascaded. When this happens, the database
engine would force a lookup to get the current value of the
natural key, and not use the natural key value in the dependent
table.
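A rough sketch of how that deferred cascade might look. Everything
here (names, structures, the use of a simple set as the "bitmap") is
invented for illustration, not an existing engine feature:

    class DeferredCascade:
        def __init__(self):
            self.parent = {}      # parent row id -> current natural key value
            self.children = []    # child rows carrying a denormalized copy
            self.dirty = set()    # "bitmap": updated but not yet cascaded

        def update_parent_key(self, row_id, new_value):
            self.parent[row_id] = new_value
            self.dirty.add(row_id)        # defer the cascade, just flag it

        def read_child_key(self, child):
            row_id = child["parent_id"]
            if row_id in self.dirty:      # stale copy: force a parent lookup
                return self.parent[row_id]
            return child["natural_key"]   # otherwise trust the stored value

        def idle_cleanup(self):
            # Background cascade: push current values out, clear the flags.
            for child in self.children:
                if child["parent_id"] in self.dirty:
                    child["natural_key"] = self.parent[child["parent_id"]]
            self.dirty.clear()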
That requires branching logic to decide which to do, and the time it
takes to run that test could add significantly to data retrieval
time. This would greatly increase your need to perform data
maintenance to get the data defragmented.
The other situation occurs especially with a table that is
"clustered" (SQL Server terminology) or compacted (Jet). This
rewrites the rows of the table in primary key order. In
the case of surrogate autonumber keys, this places the rows in an
order that is not particularly useful.
On the contrary, it can be *very* useful in reducing concurrency
conflicts. A random Autonumber means that the data ends up randomly
distributed, so that updates are not as likely to collide on the
same data page.
Certainly something similar is going to be the case in all data
stores, at some level or the other.
Indeed,
it is more efficient to use the natural key as the primary index,
and create a separate unique index for the surrogate keys. If,
for example, you are reporting the contents of the table, having
the rows in the same natural key order that the report requires
will almost certainly be more efficient than having the rows in
the autonumber order, which would generally be the order in which
they were entered.
This would be what I would call a "premature optimization," in that
you're trying to get application-level performance enhancements out
of operations at the lowest level of the database. I see this as a
mistaken approach, in that you're building a bias into the data
store that is not always going to be useful. And it violates the
principles behind Codd's rules and SQL in that you're worrying
about the physical data store rather than the logical model.
No, a data engine that makes it easy to load into memory and
pre-optimize for a particular purpose might very well show a real
performance boost without degrading performance elsewhere.
It certainly is true that most database engines are optimized for
reading/writing to disk, even though these days there's enough RAM
to run production apps directly from the image in memory. But there
would
be reliability issues with that and I'm not sure we're prepared for
that yet. But I do know that a lot of large databases are, in fact,
completely loaded into RAM to improve performance. I don't know how
many databases out there have added features to exploit running from
RAM.
In this case, the natural key values in the dependent table will
be correct, but the surrogate "row pointer" values will be
invalidated.
In both cases, I propose that the database engine should spend its
idle time cleaning up these situations, returning the database to
a state where these operations can proceed at the optimal speed.
That's exactly where the RAM vs. disk storage advantage could be
exploited. An engine designed around the capabilities of RAM would
fix this.
But, in order to do it, you would have to re-introduce a hash table
that translates the original locations into the current locations of
the data, and you're basically back to the current index structure,
but with fragmentation inherent in your translation structure.
Well, now I've spilled the beans. Does anyone follow this mess?
Yes, I'm following you, but I'm not sure the performance
improvements would follow from what you're suggesting. But, I'm used
to thinking in terms of storage on disk, read serially, rather than
storage in RAM, read randomly (all locations are, theoretically,
just as close as all others, though they aren't, really, but much
more so than on a hard drive). Perhaps you should add the whole RAM
vs. disk issue into your discussion of this, as I'm pretty sure it's
central to the current design of database engines.
Because of the way b-tree indexes are created, there is little,
and often no difference in performance, as I covered in my
discussion of the number of disk accesses needed to perform this
operation.
But you've tended to restrict your discussion of that issue to the
parent table, and rather ignored the way the problem can multiply in
the child table. In an app where every parent record has 100
children, the issue can become pretty great, both in storage and
performance, seems to me.
The time taken to access an index is not at all proportional to
its size. It is proportional to the number of levels required,
which grows only logarithmically with the number of rows.
Again, I refer to my "impression" that there were index
optimizations that favored numeric values over text.
But index updating performance *does* degrade with the size of the
field being updated.
But this does not have the effect I believe you suppose it has.
It has some effect, but the geometric nature of the growth
surprisingly limits this effect.
Again, I think you're ignoring the asymmetric nature of the
performance issues. You concentrated mostly on the parent table.
I've concentrated on the results of duplicating the parent record's
data in multiple records in the child table. When it's 1:1, then
there's little issue. When it's 1:10 or 1:100, it becomes a whole
different ballgame.
But you lose the advantage of having the natural key value stored
in the dependent table.
Joins are so easy that I just don't see that as much of an
advantage. I'm not an end user, after all -- I understand how to
write SQL! And most db engines are optimized for joins, precisely
because of the issues involved.
And, of course, if you *do* need to join on your natural-key
indexes, you've probably reduced the efficiency of the join because
it's a text-based index.
[]
Hard disk accesses are the biggest enemy of performance, and any
method that increases their number should be avoided. Avoiding a
JOIN remains one of the biggest ways to keep query performance
high.
Depends on what your app is doing most of -- retrieving data or
editing it.
And throw in the database-in-RAM issue and see if the results don't
change.
The requirement that there BE a unique natural key cannot be
camouflaged by avoiding HAVING a unique natural index. Failing to
have one prevents the database designer from ensuring that there be a
one-to-one correspondence between the entities in the real world
and the rows in his table. If you depend only on a surrogate key,
then you have done NOTHING to prevent the duplication of those
natural entities within your database. I do not agree that this
argument enters into consideration here. The need to have a
unique natural key is absolute, and no autonumber/identity key
helps remove that problem in any fashion.
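A minimal sketch of that arrangement, in sqlite3 syntax rather than
Jet DDL (the table is invented): the surrogate key exists for joins,
but a separate UNIQUE constraint still enforces the natural key.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE Customers (
            CustomerID  INTEGER PRIMARY KEY AUTOINCREMENT, -- surrogate, for joins
            CompanyName TEXT NOT NULL,
            City        TEXT NOT NULL,
            UNIQUE (CompanyName, City)  -- the natural key still guards uniqueness
        )
    """)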
I am not persuaded of the theoretical importance of your PK ensuring
uniqueness. There's a difference between uniqueness of the record
and uniqueness of the entity being represented by that record. Given
the imperfections of the latter, I don't see that it makes sense to
work awfully hard attempting to make the two correspond.
Data is imperfect precisely because we never have complete
information. If you force the PK function onto the natural key
fields (and the PK function is a meta-function of the database
engine, not of your entities themselves), then you put requirements
on the data you can store in those fields, and thus have to make up
some values to make sure you have no Nulls. Then you have to
suppress those fake values in certain situations, all because of a
choice you've made to overload your data fields with both user-level
functions (fully representing the data known about the entity) and
meta-functions at the database level (relating the record to records
in other tables).
I suggest that human ingenuity will commonly find a way to
uniquely describe natural entities in ordinary speech, and that we
can tap into that to design unique natural keys. To propose that
there is no way to specify a unique person when we talk about John
Smith in conversation is to say that we cannot speak of him or
conceive of him uniquely. The fact that we consistently find a way
to do so uniquely in natural speech tells me we can have the
computer do the same. Otherwise, there would be no way to
communicate at all.
If natural speech were so easy to represent in digital form, then I
think we'd have much more accurate voice recognition systems than we
do.
I disagree that there is such a lack of candidate natural keys.
As I have just described, such a lack is not something that occurs
in normal speech and writing.
But that is precisely because we don't put our identifiers into neat
little columns. We have a Gestalt representation in our heads that
we use to identify the entity, one that often includes information
that is useless in a business application. Do you really want to
have your contact management app asking people for their hair and
eye color, just so we can use that information to distinguish the
John Smith with red hair and green eyes from the one with black hair
and brown eyes?
In a real-world application, when you ask for too much data, you end
up with fake data, or no data. So there's always a tension between
getting the most complete data possible and the practical realities
of incomplete information and what actual users will have the
patience to put in. If the attribute doesn't have any use in your
app other than to help establish uniqueness, then your users aren't
going to want to be bothered, and you'll end up with a column of
UNKNOWN.
We DO find a way to uniquely identify a person or entity when such
uniqueness is necessary. If we can convey that uniqueness to one
another, then we can convey it to the computer as well.
But the data storage systems in our heads are several orders of
magnitude more complex than anything even conceived of for the
computer.
[]
Depending on the design of the database engine, there is certain
to be a librarian keeping track of the pages of information stored
in the database file. However, for efficiency, I am convinced
this index is always kept in memory. The whole point is to
eliminate hard drive accesses. That's where 99% of the
performance issue lies.
Bingo!
But you'll still end up maintaining a table mapping the RAM image to
the current disk image. That will be much faster than doing it on
disk, but will still require time/CPU cycles, so efficiency will still be
important.
It may be that current RAM prices make it possible to do this at a
level that makes the difference no longer relevant (just as in the
90s graphics speed became fast enough to make a GUI completely
viable without being too sluggish in comparison to character-based
UIs).
But the real question would be whether or not re-engineering a
database engine is going to give enough performance/ease-of-use
benefit to be worth the development investment.
I'm not convinced there's going to be enough of a difference to
justify it, nor that it's a good idea to prematurely optimize the
data storage structures for any particular application (see above).
A server that has no idle time isn't keeping up with user
requests. The possibility that the server always has just enough
power to perform exactly what is asked of it is too unlikely to
consider. A server that falls behind just 10 minutes each day
will soon be taking hours to answer each user request. This is
not what is happening in the real world. A server always has
some additional time available. Excess capacity is PLANNED. And,
what I propose will free up considerable additional server
processing time, making it possible to perform what I propose.
I'm always scared of delayed writes, which is basically what you're
proposing, even in a transactional system.
Let me point out that you've proposed exactly that in your
discussion of moving the data page pointers from the indexes to the
child records.
It is not at all the case that the implementation of relationships
is application specific. This is a general approach to a general
problem.
I think in retrospect what you're actually proposing amounts to
using hidden surrogate keys for the meta function of relating
records in different tables.
No, my proposal uses the existing internal pointers found in the
lowest-level nodes of any index.
The differences are: the natural values are found in the
dependent rows, and the surrogate key is completely hidden from
the database designer, and operates without the need for any
index.
Perhaps not an index created by the DBA, but there's an index in
there somewhere, behind the scenes.