Lets look at a very simple example:
You have a table which is basically a list of addresses. These could be
customers, suppliers, whatever. In this table you include a City column and
a State column. In one row someone enters Birmingham and Alabama, in another
San Francisco and California. So far so good. I know that in reality you'd
be more likely to use the abbreviated forms of the state names, but for this
example lets assume form the moment that the names are entered in full
Now someone enters a row with San Francisco and Callifornia. A simple typo,
but easily done (I once found three variations of my name in a database of
authors of technical papers!). If you then try to return all rows for
California this row is not going to be returned. So you create a States
table with one row per State and enforce referential integrity between this
and the table of addresses. Now it becomes impossible for a user to enter a
non-valid state name in the table of addresses.
Integrity is still at risk, however, as there is nothing to stop someone
entering an address in Birmingham, California. For all I know there may be a
Birmingham in California as well as the one in Alabama and the original one
30 miles down the road from me here, but if so lets assume that there isn't
and this is a mistake. If you then look for rows in Alabama this one won't
be returned of course. So anomalies are still possible. Imagine what either
of the above would do for a companies sales figures grouped by state or
regionally.
To ensure integrity we need a Cities table with a unique CityID primary key
column and a City text column (city names, like people's, can be duplicated
so are not suitable as a key). This table will also have a State column and
referential integrity will be enforced between the States and Cities tables.
The main table of addresses, however, will NOT have a State column but only a
CityID column referencing the primary key of Cities. As the state is known
from the CityID via the enforced relationships to have a State column in the
main table introduces redundancy and once again leaves the door open to
anomalies. In the jargon its what's known as a transitive functional
dependency, which indicates that a table is not properly normalized.
Really, of course, the process of normalization should be extended further
as you might have the same street entered in two rows in a table but,
incorrectly, with different CityID values, so integrity is still under
threat. In reality most people would not go so far as to cater for this by
decomposing the addresses table further, though here in the UK its now common
for addresses to be recorded by just the post code plus a house number or
similar, as the post code tells us everything else. My post code for
instance covers all properties on one side of my street, so the post code
identifies which street I live in and that this is in Stafford,
Staffordshire, England. All that's needed in addition is my house number to
pin me down precisely. Add-ins are available whereby entering a post code
automatically enters the other details, and drops down a combo box's list of
all house numbers or names in that post code, so the user is forced to select
a valid one. This need not necessarily be the correct one, of course; I
could still select one of my neighbours' house numbers! This illustrates the
difference between integrity and accuracy, the former can be catered for in
the design of the database, but its all but impossible to stop a user
entering an inaccurate yet valid value.
For an exhaustive examination of the theoretical basis of database integrity
issues Chris Date's 'Introduction to Database Systems' has a chapter of some
30 pages on the subject. It is quite abstract, however.
Ken Sheridan
Stafford, England