The natural key value would be in the foreign key table in any
case. The cost to compare is that of adding the autonumber column
to both tables, plus adding an index on the autonumber in the
foreign key table. The break-even point is at about a 20-character
natural key length, which happens to be a very common length in my
designs. I figure on adding 4 bytes to each table, plus at least 8
bytes of indexing per row, plus some overhead, since indexes are
never filled to 100%.
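A back-of-the-envelope sketch of that accounting. The figures
(4-byte autonumber, ~8-byte index entries, indexes roughly 75%
full) are the assumptions above, not measured Jet storage costs,
and the function name is mine:

    def surrogate_net_bytes(natural_len, parent_rows, child_rows,
                            index_fill=0.75):
        """Net bytes the surrogate-key design adds (+) or saves (-)."""
        autonumber_in_parent = 4 * parent_rows
        autonumber_index = (8 / index_fill) * parent_rows  # extra PK index
        child_key_change = (4 - natural_len) * child_rows  # 4-byte FK replaces text key
        return autonumber_in_parent + autonumber_index + child_key_change

    # Break-even sits near a 20-character natural key when the two
    # tables are roughly the same size.
    for length in (10, 20, 40):
        print(length, surrogate_net_bytes(length, 100_000, 100_000))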
I don't understand how a 20-character text key can be as efficient
in terms of storage and index maintenance as a long integer, which
my Access help file tells me is 4 bytes. For the data storage alone,
you're talking 5 times as much space per record. Secondly, my
understanding (which could be wrong) is that Jet (and other db
engines) have optimized retrieval of numeric values as compared to
text values, even when indexed. So there'd be a performance hit
because you're no longer benefiting from that optimization. Then
there's the index update hit because you've got to maintain more
data in the index tree.
David W. Fenton said:
[]
You lost me. Assuming we're talking about a single-column PK, the
difference in storage space between a long integer and a text
field is going to be proportional to the difference between the
storage space required for the one vs. the other. Since both
require storage in two tables in order to map the relationship,
and because both have to be indexed, it seems to me that all you
need to do is figure out the difference between the size of the
two data types as stored. Only very small text fields are going
to take up less space than a long integer (4 bytes? or is it 8?).
Not actually proportional. There are 3 entities involved.
1a. The foreign key table, which must grow by 4 bytes for each
row it contains, to hold the autonumber column
2a. The dependent table, which will shrink by the difference
between the natural key and the autonumber key
3a. The index on the foreign key table for the autonumber column,
which is entirely avoided by the natural key method
Your terminology confused me for a while, and that caused me to not
understand your point. By "foreign key table" you mean the parent
table in the relationship. Yes, you are adding 4 bytes per record
*if* there is an actual candidate natural key.
So, yes, you are right that it is not directly proportional.
But only in cases where there really is a proper candidate natural
key, which I find very rare except for simple lookup tables.
There are also differences in performance.
1b. A lookup will be modestly faster for large tables using the
autonumber. The primary factor here is the number of levels in the
b-tree index. A 20-byte natural text key makes for a 24-byte
entry in each node of the index, meaning there will be a maximum of
about 160 rows per node, while the 4-byte autonumber has an 8-byte
entry and about 500 entries per node. For, say, 1 million rows,
the natural key index will require 3 levels of index, meaning 3
disk accesses for every joined row of data (160 cubed is about 4
million). For the autonumber, it will still require 3 levels of
b-tree, and will perform at the same speed. However, 500 cubed is
125 million, so between 4 million rows and 125 million rows the
autonumber index is faster by a 3:4 ratio, because the natural key
index needs a fourth level while the autonumber index still needs
only three. This is a very modest gain in performance during
lookups.
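A quick check of that arithmetic (the ~160 and ~500 entries-per-node
figures are the assumptions above, not measured Jet page layouts):

    import math

    def levels(rows, fanout):
        """Index levels needed so that fanout ** levels >= rows."""
        return max(1, math.ceil(math.log(rows, fanout)))

    for rows in (1_000_000, 10_000_000, 200_000_000):
        print(f"{rows:>12,} rows: "
              f"natural key (~160/node) -> {levels(rows, 160)} levels, "
              f"autonumber (~500/node) -> {levels(rows, 500)} levels")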
That is, if the only speed benefit is from the number of nodes that
need to be traversed. My understanding is that there are other
optimizations for numeric values in most db engines that speed the
processing of joins beyond just the b-tree traversal. But I'm just
going on third-hand information there, so I could be wrong on that.
Certainly the need to handle double-byte data must introduce some
kind of additional overhead in the text-based indexes.
2b. When adding rows, the natural key method will be faster. The
foreign key row is smaller, and has one fewer index to maintain:
there is no index on the autonumber column to be maintained at all.
I sure wish you'd use "parent row" instead of "foreign key row."
I don't think this is a huge issue, as you're on the 1 side of the
join. The many side is where most of the records will be added. I
can see where that would be less of an issue if most of your N's
are 1 or 2, but once you have an average of 2 or more, you've
really multiplied the maintenance hit well beyond the small
difference in updating the surrogate key index.
3b. When executing a query, it is common to require only
the natural key value. See the Categories table in Northwind,
where the natural key value is the ONLY column in the foreign key
table. If you put this value in the dependent table, then you
avoid the need to JOIN to this table altogether.
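A minimal sqlite3 sketch of the two designs that example describes.
Table and column names loosely follow Northwind but are illustrative
only, not actual Jet DDL:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- surrogate-key design: the name must be fetched through a join
        CREATE TABLE Categories (
            CategoryID   INTEGER PRIMARY KEY,
            CategoryName TEXT UNIQUE NOT NULL);
        CREATE TABLE Products_surrogate (
            ProductID    INTEGER PRIMARY KEY,
            ProductName  TEXT,
            CategoryID   INTEGER REFERENCES Categories);

        -- natural-key design: the name itself is the foreign key
        CREATE TABLE Products_natural (
            ProductID    INTEGER PRIMARY KEY,
            ProductName  TEXT,
            CategoryName TEXT REFERENCES Categories (CategoryName));
    """)

    # The lookup join disappears in the natural-key design.
    with_join = ("SELECT p.ProductName, c.CategoryName "
                 "FROM Products_surrogate p "
                 "JOIN Categories c USING (CategoryID)")
    without_join = "SELECT ProductName, CategoryName FROM Products_natural"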
But that's *precisely* the kind of table I'm *agreeing* works very
well with a natural key, because it's a one-column lookup table.
However, I don't always use a natural key, because the values
sometimes have to be updated (in some apps more often than others).
I propose to gain the advantages of using both methods, and to
gain additional advantages as well. When my work is finished, I
will be able to show these things quite well.
Well, I'm beginning to see where you're going, but I just don't see
the advantages except in the type of case I already agreed is just
fine for natural keys.
Once you get to a 2-column or more PK, then I think your whole
theory breaks down. It certainly doesn't change the data storage
issues, but it does magnify the storage and index updating issues in
the child table (i.e., the one with the N records), and the join
issue only works when the data you need to filter on is only one
join away (i.e., it only helps with direct relationships). Once you
need data two joins away, you probably have greatly *increased* the
problems with join performance.
As I have been saying, both approaches have some potential
advantages, especially in very large databases, probably those in
excess of the 2 GB limit on Jet. But the whole debate is somewhat
frustrating since the best solution would be a database engine
that automatically provides all the best advantages of both
methods, and can indeed exceed either approach taken separately.
There exists a value like the autonumber: the value recorded in
the lowest level of the b-tree, which points to the exact location
of each row in the foreign key table. Suppose this value were
recorded in the dependent table in place of the autonumber value in
2a above, removing the need for the index in 3a above. This value
would not be stored in the foreign key table (1a above), but would
be external to that table.
But my understanding is that the value that is recorded there points
to a data page with an offset for the start of the record. Maybe the
offset is stored in the data page, instead. But when you compact,
the indexes have to be updated, and if you stored that value in each
child record instead, you'd have to update it in many more places
than you do with the current situation. I'm speaking of Jet here,
but surely every database engine has some similar kind of method
(MySQL would have a file name and a row number).
In short, it's exactly the same problem I have with depending on
CASCADE UPDATES for natural keys -- when the parent value changes
you have to do a bunch of updates to a lot of records and the
indexes for those records.
There are two challenges to this. One is to maintain integrity in
the database when the natural key of a foreign key table entry is
updated. This is currently handled by cascading updates. I
propose this be deferred, and instead keep a bitmap of all the
rows in the foreign key table, showing whether they have been
updated but not yet cascaded. When this happens, the database
engine would force a lookup to get the current value of the
natural key, and not use the natural key value in the dependent
table.
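A rough sketch of how that deferred cascade might look. Everything
here (names, structures, the use of a simple set as the "bitmap") is
invented for illustration, not an existing engine feature:

    class DeferredCascade:
        def __init__(self):
            self.parent = {}      # parent row id -> current natural key value
            self.children = []    # child rows carrying a denormalized copy
            self.dirty = set()    # "bitmap": updated but not yet cascaded

        def update_parent_key(self, row_id, new_value):
            self.parent[row_id] = new_value
            self.dirty.add(row_id)        # defer the cascade, just flag it

        def read_child_key(self, child):
            row_id = child["parent_id"]
            if row_id in self.dirty:      # stale copy: force a parent lookup
                return self.parent[row_id]
            return child["natural_key"]   # otherwise trust the stored value

        def idle_cleanup(self):
            # Background cascade: push current values out, clear the flags.
            for child in self.children:
                if child["parent_id"] in self.dirty:
                    child["natural_key"] = self.parent[child["parent_id"]]
            self.dirty.clear()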
That requires branching logic to decide which to do, and the time it
takes to run that test could add significantly to data retrieval
time. This would greatly increase your need to perform data
maintenance to get the data defragmented.
The other situation occurs especially with a table that is
"clustered" (SQL Server terminology) or compacted (Jet). This
rewrites the rows of the table in primary key order. In
the case of surrogate autonumber keys, this places the rows in an
order that is not particularly useful.
On the contrary, it can be *very* useful in reducing concurrency
conflicts. A random Autonumber means that the data ends up randomly
distributed, so that updates are not as likely to collide on the
same data page.
Certainly something similar is going to be the case in all data
stores, at some level or the other.
Indeed,
it is more efficient to use the natural key as the primary index,
and create a separate unique index for the surrogate keys. If,
for example, you are reporting the contents of the table, having
the rows in the same natural key order that the report requires
will almost certainly be more efficient than having the rows in
the autonumber order, which would generally be the order in which
they were entered.
This would be what I would call a "premature optimization," in that
you're trying to get application-level performance enhancements out
of operations at the lowest level of the database. I see this as a
mistaken approach, in that you're building a bias into the data
store that is not always going to be useful. And it violates the
principles behind Codd's rules and SQL in that you're worrying
about the physical data store rather than the logical model.
No, a data engine that makes it easy to load into memory and
pre-optimize for a particular purpose might very well show a real
performance boost without degrading performance elsewhere.
It certainly is true that most database engines are optimized for
reading/writing to disk, even though these days there's enough RAM
to run production apps directly from the image in memory. But there
would
be reliability issues with that and I'm not sure we're prepared for
that yet. But I do know that a lot of large databases are, in fact,
completely loaded into RAM to improve performance. I don't know how
many databases out there have added features to exploit running from
RAM.
In this case, the natural key values in the dependent table will
be correct, but the surrogate "row pointer" values will be
invalidated.
In both cases, I propose that the database engine should spend its
idle time cleaning up these situations, returning the database to
a state where these operations can proceed at the optimal speed.
That's exactly where the RAM vs. disk storage advantage could be
exploited. An engine designed around the capabilities of RAM would
fix this.
But, in order to do it, you would have to re-introduce a hash table
that translates the original locations into the current locations of
the data, and you're basically back to the current index structure,
but with fragmentation inherent in your translation structure.
Well, now I've spilled the beans. Does anyone follow this mess?
Yes, I'm following you, but I'm not sure the performance
improvements would follow from what you're suggesting. But, I'm used
to thinking in terms of storage on disk, read serially, rather than
storage in RAM, read randomly (all locations are, theoretically,
just as close as all others, though they aren't, really, but much
more so than on a hard drive). Perhaps you should add the whole RAM
vs. disk issue into your discussion of this, as I'm pretty sure it's
central to the current design of database engines.
Because of the way b-tree indexes are created, there is little,
and often no difference in performance, as I covered in my
discussion of the number of disk accesses needed to perform this
operation.
But you've tended to restrict your discussion of that issue to the
parent table, and rather ignored the way the problem can multiply in
the child table. In an app where every parent record has 100
children, the issue can become pretty great, both in storage and
performance, seems to me.
The time taken to access an index is not at all proportional to
its size. It is proportional to the number of levels required,
which grows only logarithmically with the number of rows.
Again, I refer to my "impression" that there were index
optimizations that favored numeric values over text.
But index updating performance *does* degrade with the size of the
field being updated.
But this does not have the effect I believe you suppose it has.
It has some effect, but the geometric nature of the growth
surprisingly limits this effect.
Again, I think you're ignoring the asymmetric nature of the
performance issues. You concentrated mostly on the parent table.
I've concentrated on the results of duplicating the parent record's
data in multiple records in the child table. When it's 1:1, then
there's little issue. When it's 1:10 or 1:100, it becomes a whole
different ballgame.
But you lose the advantage of having the natural key value stored
in the dependent table.
Joins are so easy that I just don't see that as much of an
advantage. I'm not an end user, after all -- I understand how to
write SQL! And most db engines are optimized for joins, precisely
because of the issues involved.
And, of course, if you *do* need to join on your natural-key
indexes, you've probably reduced the efficiency of the join because
it's a text-based index.
[]
Hard disk accesses are the biggest enemy of performance, and any
method that increases their number should be avoided. Avoiding a
JOIN remains one of the biggest ways to keep query performance
high.
Depends on what your app is doing most of -- retrieving data or
editing it.
And throw in the database-in-RAM issue and see if the results don't
change.
The requirement that there BE a unique natural key cannot be
camouflaged by avoiding HAVING a unique natural index. Failing to
have one prevents the database designer from ensuring that there be a
one-to-one correspondence between the entities in the real world
and the rows in his table. If you depend only on a surrogate key,
then you have done NOTHING to prevent the duplication of those
natural entities within your database. I do not agree that this
argument enters into consideration here. The need to have a
unique natural key is absolute, and no autonumber/identity key
helps remove that problem in any fashion.
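A minimal sketch of that arrangement, in sqlite3 syntax rather than
Jet DDL (the table is invented): the surrogate key exists for joins,
but a separate UNIQUE constraint still enforces the natural key.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE Customers (
            CustomerID  INTEGER PRIMARY KEY AUTOINCREMENT, -- surrogate, for joins
            CompanyName TEXT NOT NULL,
            City        TEXT NOT NULL,
            UNIQUE (CompanyName, City)  -- the natural key still guards uniqueness
        )
    """)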
I am not persuaded of the theoretical importance of your PK ensuring
uniqueness. There's a difference between uniqueness of the record
and uniqueness of the entity being represented by that record. Given
the imperfections of the latter, I don't see that it makes sense to
work awfully hard attempting to make the two correspond.
Data is imperfect precisely because we never have complete
information. If you force the PK function onto the natural key
fields (and the PK function is a meta-function of the database
engine, not of your entities themselves), then you put requirements
on the data you can store in those fields, and thus have to make up
some values to make sure you have no Nulls. Then you have to
suppress those fake values in certain situations, all because of a
choice you've made to overload your data fields with both user-level
functions (fully representing the data known about the entity) and
meta-functions at the database level (relating the record to records
in other tables).
I suggest that human ingenuity will commonly find a way to
uniquely describe natural entities in ordinary speech, and that we
can tap into that to design unique natural keys. To propose that
there is no way to specify a unique person when we talk about John
Smith in conversation is to say that we cannot speak of him or
conceive of him uniquely. The fact that we consistently find a way
to do so uniquely in natural speech tells me we can have the
computer do the same. Otherwise, there would be no way to
communicate at all.
If natural speech were so easy to represent in digital form, then I
think we'd have much more accurate voice recognition systems than we
do.
I disagree that there is such a lack of candidate natural keys.
As I have just described, such a lack is not something that occurs
in normal speech and writing.
But that is precisely because we don't put our identifiers into neat
little columns. We have a Gestalt representation in our heads that
we use to identify the entity, one that often includes information
that is useless in a business application. Do you really want to
have your contact management app asking people for their hair and
eye color, just so we can use that information to distinguish the
John Smith with red hair and green eyes from the one with black hair
and brown eyes?
In a real-world application, when you ask for too much data, you end
up with fake data, or no data. So there's always a tension between
getting the most complete data possible and the practical realities
of incomplete information and what actual users will have the
patience to put in. If the attribute doesn't have any use in your
app other than to help establish uniqueness, then your users aren't
going to want to be bothered, and you'll end up with a column of
UNKNOWN.
We DO find a way to uniquely identify a person or entity when such
uniqueness is necessary. If we can convey that uniqueness to one
another, then we can convey it to the computer as well.
But the data storage systems in our heads are several orders of
magnitude more complex than anything even conceived of for the
computer.
[]
Depending on the design of the database engine, there is certain
to be a librarian keeping track of the pages of information stored
in the database file. However, for efficiency, I am convinced
this index is always kept in memory. The whole point is to
eliminate hard drive accesses. That's where 99% of the
performance issue lies.
Bingo!
But you'll still end up maintaining a table mapping the RAM image to
the current disk image. That will be much faster than doing it on
disk, but will still require time/CPU cycles, so efficiency will still be
important.
It may be that current RAM prices make it possible to do this at a
level that makes the difference no longer relevant (just as in the
90s graphics speed became fast enough to make a GUI completely
viable without being too sluggish in comparison to character-based
UIs).
But the real question would be whether or not re-engineering a
database engine is going to give enough performance/ease-of-use
benefit to be worth the development investment.
I'm not convinced there's going to be enough of a difference to
justify it, nor that it's a good idea to prematurely optimize the
data storage structures for any particular application (see above).
A server that has no idle time isn't keeping up with user
requests. The possibility that the server always has just enough
power to perform exactly what is asked of it is too unlikely to
consider. A server that falls behind just 10 minutes each day
will soon be taking hours to answer each user request. This is
not what is happening in the real world. A server always has
some additional time available. Excess capacity is PLANNED. And,
what I propose will free up considerable additional server
processing time, making it possible to perform what I propose.
I'm always scared of delayed writes, which is basically what you're
proposing, even in a transactional system.
Let me point out that you've proposed exactly that in your
discussion of moving the data page pointers from the indexes to the
child records.
It is not at all the case that the implementation of relationships
is application specific. This is a general approach to a general
problem.
I think in retrospect what you're actually proposing amounts to
using hidden surrogate keys for the meta function of relating
records in different tables.
No, my proposal uses the existing internal pointers found in the
lowest-level nodes of any index.
The differences are: the natural values are found in the
dependent rows, and the surrogate key is completely hidden from
the database designer, and operates without the need for any
index.
Perhaps not an index created by the DBA, but there's an index in
there somewhere, behind the scenes.