[]
Have you done any thinking about my idea that your proposals
would be helped if the db were in RAM instead of stored on disk?
In that case dynamic linking of data structures is much simpler,
and that removes many of the objections about fragmentation and
updating that are present in disk images. A pointer in RAM is
much more efficient than a pointer on disk, for instance, because
that's the way memory and our code are engineered to work.
This is already quite possible. You set up a RAM Drive and put
the database there.
But that is just a faster hard disk -- it does *not* take advantage
of the random access nature of RAM as opposed to disk storage in the
data storage structures, which in all the database engines I know
anything about are optimized for disk access. The database engine
and the data structures used would have to be altered to take
advantage of RAM.
I've done that, and as far as performance is concerned, there's
really nothing like it. Whether the database is written as
efficiently as possible or not is no longer an issue. Virtually
everything is instantaneous.
But, I would think that databases should be written to operate
well under more typical circumstances.
I think you're missing a major point -- many of the things you are
criticizing about our traditional methods of structuring
relationships are entirely a legacy of database engines designed
around reading and writing data to/from a disk drive. A database
engine designed (or, at least, tuned) to run in RAM would need to
have a different design, and, I think, your ideas would work *much*
better in a RAM-optimized db engine than in the disk-optimized ones
we have to work with today.
Consider the mere issue of your physical pointers. In a disk
environment, you have a major performance drag if the disk image
becomes fragmented in multiple locations on the disk volume. In RAM,
fragmentation is only a problem when it becomes so severe that there
are no blocks large enough to load data/code segments that can't be
fragmented. There is little speed penalty for fragmenting your table
over different locations in RAM. Thus, the background "cleanup"
becomes much less important (and would possibly need to be done only
for writing to persistent storage).
It does not mean that at all. It does mean that a conversion to
another database engine would have to ignore the pointers which I
propose to use in place of autonumber/identity surrogate keys.
However, based on only the natural key, it would be entirely
possible to port to other database engines.
I would like to point out that using autonumber/identity values in
a database is not portable. Having such a datatype is NOT a
requirement. So, using surrogate keys currently facilitates
portability because surrogate keys are so popular that nearly all
databases have implemented the required datatype. If all
databases supported what I propose, then that would be the
standard instead.
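To make the point concrete: the same autonumber column must be
declared differently in each engine (a sketch; the table and column
names are just examples, and exact spellings vary by version):
CREATE TABLE Widget (WidgetID AUTOINCREMENT PRIMARY KEY)       -- Jet/Access
CREATE TABLE Widget (WidgetID INT IDENTITY(1,1) PRIMARY KEY)   -- SQL Server
CREATE TABLE Widget (WidgetID INT AUTO_INCREMENT PRIMARY KEY)  -- MySQL
CREATE TABLE Widget (WidgetID SERIAL PRIMARY KEY)              -- PostgreSQL
Four dialects for the same logical design.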
No, it doesn't mean that at all, unless your pointer is to a lookup
table instead of to the actual storage location. For the direct
pointer to work, all database engines would have to use exactly the
same file format.
Obviously, portability of the actual data to another db engine would
be easily handled through a view that presents the "human-friendly"
version of the data.
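A sketch of what I mean (the view, table, and pointer column names
here are hypothetical): the engine keeps its pointer columns internal,
and the view carries the natural-key presentation:
CREATE VIEW CustomerOrders AS
SELECT c.CustomerName, o.OrderDate, o.Amount
FROM Orders o
INNER JOIN Customers c ON o.CustomerPtr = c.RowPtr
Exporting from the view moves the human-readable data to another
engine without ever exposing the pointers.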
But your app would not work the same way on a different db engine
unless you only ever worked with the "view" level of data in your
application. If that's the case, I don't see how you get any benefit
from your approach except (putatively) at the performance level.
What you say may have some practical meaning, but that is
arbitrarily based on the current culture in building databases and
database engines. I am proposing a solution that would be a large
improvement in database engines, based on a different paradigm.
But it couldn't possibly be portable without an abstraction layer
running on top of the db engine's low-level data storage system.
That's exactly what we have today, so I expect that a lot of the
performance benefits you are seeking would disappear in the
re-translation back to the presentation that we need to understand
what data we're actually working with.
[]
I have no problem with this statement. There ARE potential
advantages in using an autonumber/identity key. You may not see
it, but there are also advantages in using the natural key. I
propose to implement all the advantages of both, and to give
additional advantages as well.
The way I see it, you're just creating a hidden surrogate key that
is tied to the data storage format. It gains performance because of
running so close to the metal, but is also not really transparent to
the DBA, as you have to depend on the db engine to keep it
up-to-date. I'd prefer to have that at a lower level so that if it
becomes corrupted, it's a matter of recovering dumb data structures
that store data that has its own logical structure. With your
approach, there aren't two independent logical structures, so if the
pointers get corrupted, there's no redundant information to recover
the data. I find that problematic to say the least.
[]
I believe you are saying that none of your tables have a single
column natural key. Is that the case?
Except for simple lookup tables with static values, yes.
The databases I build all have a deliberate structure in which
compound keys are built up from simpler ones. This structure
always begins with tables with a single-column natural key and
builds up from there when necessary. Between 30% and 80% of the
tables have a single natural key column. A large proportion of my
designs have compound keys only 2 or 3 columns deep. I have
rarely reached the level of 5 tables compounding the key.
Well, it certainly depends on the type of data you're working with,
but when storing data about people, there are simply no natural keys
that are usable in tables like Person and Address.
I point out that, using only a single surrogate identity column to
relate tables at a depth of 5 levels, you must create all 4 JOINs
in a query in order to retrieve any column from the lowest level
table in the hierarchy.
That almost never happens. I hardly ever have hierarchies of more
than 3 levels. I *do* often have joins to multiple tables at a
single level, but that's a different issue entirely.
If you store the natural key in the 5th level table, then you
can JOIN directly from the highest level to the lowest. By
subsuming all the keys into one, you lose this capability with
surrogate keys. In addition, it is not uncommon to need only the
single natural key column from the lowest level table in this
hierarchy. This value would be stored in the 5th level table
using the scheme I find useful. You would need to create 4 JOINs
in your query to obtain that. And if a different column in the
lowest level table is required, using compound natural keys allows
you to retrieve that in a single JOIN.
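To make the contrast concrete, here is a sketch (the table and
column names are hypothetical) with Level1 holding a single-column
natural key, K1, and Level5 carrying the full compound key K1-K5:
-- Surrogate keys: 4 JOINs to reach any Level1 column from Level5
SELECT L1.K1
FROM Level5 L5
INNER JOIN Level4 L4 ON L5.Level4ID = L4.Level4ID
INNER JOIN Level3 L3 ON L4.Level3ID = L3.Level3ID
INNER JOIN Level2 L2 ON L3.Level2ID = L2.Level2ID
INNER JOIN Level1 L1 ON L2.Level1ID = L1.Level1ID
-- Compound natural keys: K1 is already stored in Level5
SELECT L5.K1
FROM Level5 L5
-- ...and any other Level1 column is a single JOIN away
SELECT L1.Description
FROM Level5 L5
INNER JOIN Level1 L1 ON L5.K1 = L1.K1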
This debate is old, and I would not be participating in it except
that I now have worked out a methodology to permit the best of
both sides to be realized, and more besides.
I just don't see too much benefit and fear the problems it brings in
terms of mixing the data level and the storage level. Codd and all
the SQL theorists have always said that the storage should not
intrude into the logical data level, and I think that's a good
thing. It's why data is largely quite portable between all SQL
databases, because that logical abstraction is there, which makes
what the db engine is doing internally completely irrelevant. Your
proposal moves an element of the storage level into the logical
relationships of your actual data. That's a theoretical problem, but
there are good reasons theory advises us to avoid that, and I don't
see that you've addressed those problems at all.
Codd requires stability in a natural key, and I expect we are in
complete agreement on that.
Yes, and that's *precisely* the reason why very few entities that
I've ever worked with have any viable natural keys -- because the
only combinations of fields that would guarantee uniqueness are quite
subject to change.
[]
What problems have you experienced with cascading updates? I
expect I have used them many more times than you have, and I
haven't experienced any problems with them.
It's a huge performance drain, as well as being problematic in a
replicated scenario, where it can hose your data structure if you
don't manage your FK edits and your synchs well. Sure, that's an
implementation problem within Jet replication, but it's a problem to
which there is probably no solution in any replication scenario (you
have to have *some* ordering of the synchronization of
edits/adds/deletes and without extremely complicated logic, it's
probably impossible for any db engine to always get it right).
As to the overhead, I have tracked the frequency of cascading
updates in a large, rather busy multi-user database that is
typical of my designs. Cascade updates commonly occur 1-2 times
per month (which confirms that my selection of natural keys is
stable as required) and typically take 1-3 seconds. From my
perspective, a total performance hit of 2-6 seconds per month is
not much of an issue.
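For reference, the kind of relationship in question would be declared
something like this (a sketch; the table and column names are
hypothetical):
CREATE TABLE Category (
    CategoryName VARCHAR(50) PRIMARY KEY    -- natural key
)
CREATE TABLE Product (
    ProductName  VARCHAR(50) PRIMARY KEY,
    CategoryName VARCHAR(50)
        REFERENCES Category (CategoryName)
        ON UPDATE CASCADE    -- a rename propagates to every child row
)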
But you end up updating *two* indexes because of it, whereas the
surrogate key on the lookup table requires an update to one index
(and only one value). It really doesn't matter to me that it is fast
-- the issue is that you are updating a lot more data pages with the
cascade update than would be required with a surrogate key.
Well, not implementing schema changes properly sounds like a very,
very bad thing, whether you use surrogate or natural keys.
Replication invalidates some practices that cause no problems in a
non-replicated scenario. It's one of the things that makes
implementing replication complex, in that you can't always just take
a working schema and replicate it. This is not a problem with Jet
replication, but an aspect of the whole problem space for any db
engine's replication implementation, because replication imposes
requirements on order of transactions that are irrelevant in
non-replicated apps.
I'm not aware of exactly
what kind of improperly implemented schema change you are
suggesting, and I cannot comment further because of this
vagueness.
If you were experienced with replication you'd probably know what
I'm talking about.
For instance, a client of mine with a replicated database changed
the value of a natural key in a lookup table and then moved all the
child records to a new parent record. They then deleted the original
parent record. The result was that when they synched all these
changes at one time, the child records got deleted because the order
of transactions is deletes/updates/adds. Validation rules and RI
alterations in your schema have to be handled very carefully -- for
instance, if you're adding a validation rule on a field, you have to
make sure all the data *globally* in your entire replica set is
valid *before* you implement the validation rule. That means: edit
the data to make it valid for the new validation rule, then synch
around the whole replica set, and only *then* apply the new
validation rule. The reason is that schema changes are applied
*before* edits, so when you synch, you may have valid data in the
replica you're synching from, but if the replica you're synching
with has invalid data under the new validation rule, you'll get an
error, because the schema change is propagated *before* the edits
that made the data valid.
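Spelled out as steps (generic SQL here; Jet validation rules are
actually set through the interface or DAO, so the CHECK constraint
and the table below are hypothetical stand-ins):
-- Step 1: make all existing data valid, in every replica
UPDATE Orders SET Quantity = 1 WHERE Quantity < 1
-- Step 2: synchronize the entire replica set, so every replica
-- holds the corrected data
-- Step 3: only then apply the new rule
ALTER TABLE Orders ADD CONSTRAINT QuantityPositive CHECK (Quantity >= 1)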
Again, this is not an implementation problem in Jet replication --
it's a logical issue with *all* replication regimes.
Natural keys with cascading updates make this more problematic.
Of course, I also avoid cascading deletes, as well, because of this
very problem.
[]
Well, that's what I said. However, my 5 level scenario (above) is
very realistic, and illustrates the full case of what does happen.
Well, I would disagree -- it may be realistic as a *possibility*,
but it is not necessarily terribly common. Again, I'd say you're
going in for "premature optimization," adducing a value for edge
cases that won't be much of an assist for the core uses of the database.
And what you've gotten in return for it is substantial complexity in
handling the joins that you *do* need (any time you need any
attributes that are not part of the PK).
[]
Not at all. Again, I have extended my discussion to a 5 level, 5
key instance.
Five tables means 5 single-column PKs, which is what I said.
I agree somewhat. I have timed myself putting in the components
of a 3 level compound key JOIN, and it can take up to 30 seconds
to type. As I have proposed elsewhere, a shortcut for this should
be implemented in SQL. Like this:
SELECT *
FROM TableA A
INNER JOIN TableB B
ON A.Column1 = B.Column1 AND A.Column2 = B.Column2
AND A.Column3 = B.Column3
could be replaced with:
SELECT *
FROM TableA A
INNER JOIN TableB B
ON A.IndexX(3) = B.IndexY(3)
Now, the IndexX and IndexY will have been defined anyway as:
IndexX   Column1
         Column2
         Column3
IndexY   Column1
         Column2
         Column3
         Column4
Well, that's kind of like the way a SEEK works, in that you specify
the index instead of the fields, but that's db-engine specific, and
one of the reasons why SQL doesn't support it.
I think your problem here is less with db engine implementations
than it is with the philosophy behind SQL.
These two indexes are for TableA and TableB, respectively; I am
referring to the way an index looks in the dialog where you define it.
So, the new syntax I propose above has exactly the same meaning and
function as the current standard method.
The way I'd see it is that your db engine would provide to your SQL
engine a "fake" field that would be your "surrogate" for the natural
keys. You'd get the advantage of the ease of use of the surrogate
key while being able to use your natural keys.
I recognize this as a potential advantage, but only for those using
natural keys. It would certainly negate the ease-of-use objections
I've made against natural keys.
But it would only really help when natural keys are viable, and, as
I've said, I don't encounter many scenarios where natural keys are
usable, and where they are, they aren't multi-column keys, anyway.
What I am trying to convey here is that it is the TOOLS we use
that make using natural keys a bit more difficult than is
necessary, and that this is not a necessary attribute of working
with natural keys. In terms of difficulty, I believe the above is
very nearly equivalent to using surrogate keys.
Again, you're complaining about SQL's logical abstraction,
independent of implementation. When you propose using a particular
index in a SQL statement, you've already gone outside the realm of
SQL.
[]
I expect, then, that you consider your usage of surrogate/identity
keys to be part of the logical layer, and that you WANT to be
dealing with the additional complexity it adds.
It's a complexity that comes very naturally from our real-world
experience. We've all got "identity" keys that we have to remember,
such as one's SSN, one's bank account numbers, etc.
My proposal is to use a superior method of
generating such keys automatically and transparently, but to use a
value in this key that avoids the need to generate and use a
separate index to accomplish what such keys perform.
You're basically trying to build Access's table-level lookups into
the db engine. I understand that. I just don't think it's going to
be as good an idea as it sounds. Had I no experience with Access's
table-level lookups, I'd say that was a good idea, too, but it turns
out, it doesn't belong at the table level.
I ask you to stop and think. What do you gain by having to add
identity columns to the database that you would lose if it were
done for you automatically? Why do you feel you need to be aware
of them and to be able to access their values?
Because I don't like giving up control over how they are handled. I
don't like losing the redundancy that comes from *not* mixing my
app's schema logic in with the organization of the actual data
store.