I ran the test you suggested. Perhaps I missed something (details
below), but I wasn't impressed with the results I got. Please let me
know if I did miss something here.
Vincent said:
How does knowing the "physical order of the disk" (assuming you can
determine what that is) help you do ordinary database operations?
The physical order of a table on disk (a.k.a. the clustered index) is
determined by its PRIMARY KEY designation.
Here's a quick demo of why for Jet you must choose a PRIMARY KEY
carefully.
I assume you have a 'sequence' table, i.e. a table of integers, e.g.
CREATE TABLE [Sequence] (seq INTEGER)
;
INSERT INTO [Sequence] VALUES (1)
;
INSERT INTO [Sequence] VALUES (2)
;
INSERT INTO [Sequence] VALUES (3)
;
etc.
In reality, my data was created in Excel and imported <g>.
You didn't say how many records you created. What I did was to put
100,000 records into [Sequence], with values of 1 to 100,000.
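As an aside, a sequence table like this can also be generated without a
spreadsheet import. Here's a minimal sketch using Python's sqlite3 module
(SQLite, not Jet, and the statement is my own illustration); a recursive
CTE builds the integers 1 to 100,000 in one step:

```python
import sqlite3

# Sketch only: SQLite via Python, not Jet/Access.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sequence (seq INTEGER)")

# A recursive CTE generates 1..100000 in a single statement,
# replacing the Excel-and-import step.
con.execute("""
    WITH RECURSIVE n(seq) AS (
        SELECT 1
        UNION ALL
        SELECT seq + 1 FROM n WHERE seq < 100000
    )
    INSERT INTO Sequence SELECT seq FROM n
""")

count, lo, hi = con.execute(
    "SELECT COUNT(*), MIN(seq), MAX(seq) FROM Sequence").fetchone()
print(count, lo, hi)  # 100000 1 100000
```

Jet itself has no recursive CTEs, so in Access you would still use an
import or a short VBA loop.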
Now create two test tables:
CREATE TABLE Test1 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (key_col)
)
;
CREATE TABLE Test2 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (data_col, key_col)
)
;
I set [Test1].[key_col] and [Test2].[key_col] to be incremental
Autonumber (long integer) fields, and [Test1].[data_col] and
[Test2].[data_col] to be Number (long integer) fields. For [Test1], I
set the primary key field to be [key_col], and for [Test2] I set the
primary key to include both fields.
Note the difference in PK designation between them.
Now load some data using the 'sequence table':
INSERT INTO Test1 (data_col)
SELECT DT1.data_col FROM (
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
) AS DT1
;
INSERT INTO Test1 (data_col)
SELECT DT1... (as before)
;
For the second version, I assume you really meant
INSERT INTO Test2 (data_col)
... Your SQL didn't work in my Access 2000, so what I actually did was to
define the following Queries:
[Q_ShowSeq] SQL:
SELECT S1.seq MOD 100 AS data_col
FROM Sequence AS S1
WHERE S1.seq<=50000;
[Q_UnionAll] SQL:
SELECT * FROM Q_ShowSeq
UNION ALL
SELECT * FROM Q_ShowSeq
UNION ALL
SELECT * FROM Q_ShowSeq
UNION ALL
SELECT * FROM Q_ShowSeq
UNION ALL
SELECT * FROM Q_ShowSeq;
[Q_PopulateTest1] SQL:
INSERT INTO Test1 ( data_col )
SELECT DT1.data_col
FROM [SELECT * FROM Q_UnionAll]. AS DT1;
and similarly for [Q_PopulateTest2].
This put 250,000 records into each of [Test1] and [Test2].
At this point, compact the file. This has the effect of physically
rebuilding the tables based on their PRIMARY KEY designations.
To test the relative performance of the PRIMARY KEY choices, run a
query which uses a BETWEEN construct because this favours the physical
order:
SELECT key_col, data_col
FROM Test1
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95
;
SELECT key_col, data_col
FROM Test2
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95
;
In my test timings, the table with PRIMARY KEY (key_col) is about 10 times
slower than the table with PRIMARY KEY (data_col, key_col).
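For what it's worth, the effect described above can be reproduced in
miniature outside Jet. The following is a sketch using Python's sqlite3
module, with my own scaled-down row counts; SQLite is not Access, but a
WITHOUT ROWID table is SQLite's analogue of a table clustered on its
primary key, so the query plans show the same contrast: a full scan for
the key_col-only PK, an index range search for the (data_col, key_col) PK.

```python
import sqlite3

# SQLite sketch (not Jet): WITHOUT ROWID tables are clustered on their PK,
# roughly what a compacted Jet table looks like after a rebuild.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Test1 (
        key_col  INTEGER PRIMARY KEY,       -- PK on the counter only
        data_col INTEGER NOT NULL
    );
    CREATE TABLE Test2 (
        key_col  INTEGER NOT NULL,
        data_col INTEGER NOT NULL,
        PRIMARY KEY (data_col, key_col)     -- PK leads with the filter field
    ) WITHOUT ROWID;
""")
con.execute("""
    WITH RECURSIVE n(i) AS (
        SELECT 1 UNION ALL SELECT i + 1 FROM n WHERE i < 1000
    )
    INSERT INTO Test1 (key_col, data_col) SELECT i, i % 100 FROM n
""")
con.execute("INSERT INTO Test2 SELECT key_col, data_col FROM Test1")

q = "SELECT key_col, data_col FROM {} WHERE data_col BETWEEN 10 AND 15"
plan1 = con.execute("EXPLAIN QUERY PLAN " + q.format("Test1")).fetchone()[-1]
plan2 = con.execute("EXPLAIN QUERY PLAN " + q.format("Test2")).fetchone()[-1]
print(plan1)   # a full-table scan: data_col has no index in Test1
print(plan2)   # a range search on Test2's primary key
```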
My version of your Queries (essentially identical to yours) was
[Q_SelectTest1] SQL:
SELECT [key_col], [data_col]
FROM Test1
WHERE data_col Between 10 And 15
Or data_col Between 30 And 35
Or data_col Between 50 And 55
Or data_col Between 70 And 75
Or data_col Between 90 And 95
;
and similarly for [Q_SelectTest2].
In my test timings, after compacting the database, both versions
completed essentially instantly (for example, 1.23E-08 days) as measured
by the Time function (system time, expressed as a fraction of a day since
midnight), so both apparently completed in a few milliseconds or less.
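To make the units concrete: since the Time function returns a fraction of
a day, a measured difference of 1.23E-08 days works out to roughly one
millisecond. A quick check:

```python
# Convert a Time-function difference (a fraction of a day) to milliseconds.
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400
elapsed_days = 1.23e-8                # the figure quoted above
elapsed_ms = elapsed_days * SECONDS_PER_DAY * 1000
print(round(elapsed_ms, 3))  # 1.063
```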
But looking at your description, even assuming I had included enough
records to notice a difference, it appears that you forgot to INDEX the
[Test1].[data_col] field! Declaring the primary key on [Test1].[key_col]
alone, and omitting the field on which you were filtering the records,
also leaves that field without any index at all. I imagine that's why the
Query took longer when you ran it.
This is exactly what I was referring to earlier, specifically the
problem of FORGETTING to index a field that one needs to use for sorting
or filtering a recordset.
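The fix is simply to index the filter field. Here's a before-and-after
sketch, again in Python's sqlite3 rather than Jet (the index name
idx_data and the row counts are my own), showing the query plan flip from
a full scan to an index search once the index exists:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Test1 "
            "(key_col INTEGER PRIMARY KEY, data_col INTEGER NOT NULL)")
con.executemany("INSERT INTO Test1 (data_col) VALUES (?)",
                [(i % 100,) for i in range(1000)])

q = ("EXPLAIN QUERY PLAN "
     "SELECT key_col, data_col FROM Test1 WHERE data_col BETWEEN 10 AND 15")
before = con.execute(q).fetchone()[-1]      # no index on data_col: full scan
con.execute("CREATE INDEX idx_data ON Test1 (data_col)")
after = con.execute(q).fetchone()[-1]       # now a range search on idx_data
print(before)
print(after)
```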
The penalty for choosing an index that fails to match the physical record
order on disk pales in comparison to the penalty for this type of omission.
Bear in mind that putting lots of indices onto a Table may not improve
performance. Indices speed up searching and sorting, but they slow down
updates and inserts (since each index must be maintained whenever a
record changes).
In summary, although it's perhaps useful to know that a primary key gets
slightly special treatment from Access, the improvement in performance
that you might experience is likely to be pretty marginal. Far more
important, IMHO, is organizing the information (via primary keys;
indices; helpfully named Tables, fields, and Queries; properly
human-engineered Forms; etc.) in your database so that it will do the
job that you need done and will be easy to maintain. Worrying too much
about internal implementation or "efficiency" can be counterproductive,
if it distracts you from the important parts of the design. (Worse than
any of this, of course, is failing to make the database model the
real-world system it's supposed to model, but that's another story.)
-- Vincent Johns <[email protected]>
Please feel free to quote anything I say here.