Data Normalization - Dispute

S. Jackson · Feb 14, 2004

I have designed a database to manage administrative law cases. An issue has
come up:

Cases are assigned a unique cause no. Example: 04-000-K. I have a field
called DHSNo that holds this data. This field is my primary key in my main
table. Someone pointed out that this was a bad way to enter this data -
leading zeros - something about truncating it in Excel -blah, blah, blah,
something about Y2k. I have no formal training and am self-taught using
Access so although I understand what they are saying, I don't understand how
to properly convey what this individual was trying to tell me. This
individual has suggested we use this format to enter the cause no.:
20040000K. This format is NOT user friendly and extremely difficult to
read and does not represent the true cause number assigned to the case and
has the potential to cause confusion and error. And, on a daily basis our
users refer to the case number in the first format: 04-000-K. It is the
format that is used on correspondence, pleadings, EVERYTHING.

Is there a compromise here? I think this individual has a VERY valid point
here because of the leading zero thangy, but i do NOT want it appearing in
that format (20040000K) on any of my forms or reports. This person
mentioned something about concatenating it somehow. How would we do that?

Any thoughts?

TIA
S. Jackson

MacDermott · Feb 14, 2004

When a person has just learned about the power of data normalization, it's
easy to want to apply it everywhere.
If you (or your users) are having trouble generating the Cause No. or typing
it in correctly, you will probably find it advantageous to change to a
multi-field Primary Key which can reliably generate the next available
number.
You would still be able to *display* (but not edit) the Cause No in the
format you want.

If the concern has to do with transfers to Excel (are you planning to
transfer this data to Excel, anyhow?), I would think the fact that this is
clearly a text field, not a numeric value, would cover you on that point.

Failing the above two concerns, it looks as if you've built something that's
working for you, at your personal level of expertise, so I don't really see
any good reason to change it.

HTH
- Turtle

Rolls · Feb 14, 2004

My preference in this instance would be to assign an autonumber field as the
PK (because it is unique) and have the case number as a separate field in
the exact text format in which you intend to use it. The reason for not
making the case number the PK is that this field HAS MEANING! The instant
that a key field has meaning becomes the instant that it should NOT be a
primary key.

If you're automatically assigning this case number it's possible to
concatenate several variables in an expression by separating the variables
with ampersands.

Example --> CaseNo: [FirstPart]&"-"&[SecondPart]

Suggest that you store the last used sequence number to a one record table.
Each time you need a new number call the last number with a function that
adds one to it and rewrites, updating the last used number, while including
the updated sequence number as a component of your case number expression.

To import or export the case number formatted text field Access <-> Excel,
the Excel field will have to be "Text" and/or the case number should be
preceded with an apostrophe; e.g., '12-3456.

Representation of this variable as a formatted text field really isn't a
"normalization" issue UNLESS you have multiple parts to the field that
exist in a one:many parent:child relationship AND you intend to do something
else with component parts such as summarize man-hours or billed dollars
based on this heirarchy. If that's the situation you could have a "Type"
table such as:

1 - Personal injury.
2 - Divorce
3 - Real estate
4 - Wills

etc., and additional tables for whatever other independent criteria you
would want to use in a future crosstab report for performance by these
criteria.

As usual, there is a choice of several ways to accomplish your intended
result. The general rule of thumb for normalization is, if there is not a
1:1 relationship between all the fields in a specific table, you need
additional tables.

Jeff Boyce · Feb 14, 2004

One more "chime in" on your post...

Your "04-000-K" appears to be the combination of multiple attributes. I'll
guess, from your reference to Y2K, that the "04" portion refers to the year
the Case was assigned a cause number (court, right?). I'll guess that the
"K" refers to a specific department, case type, or ...? And that the "000"
is supposed to be a sequential number (but I can't tell if it's supposed to
be sequential for EVERY case, or just the "04" & "K"-type cases).

Putting multiple attributes in a single field is a "no-no" in relational
(i.e., normalized) database design. If you need those three pieces of
information, use three fields to store them. If you need to SEE the three
combined together (i.e., "concatenation"), use a query to combine them, for
display in a form (you are using a form rather than the table, right?!).

Also, embedding "meaning" in a key (termed an "intelligent key", a
derogatory term) is frowned on ... hence one of the other responder's
suggestion of using an autonumber field as a primary key instead.

If your user is entering a case during 2004, I'll assume you have a date
field (which you could, in your data entry form, default to a value of
Date(), to save keystrokes -- the user could always type a different date if
they're behind in data entry). You already have the last two digits of the
year!

If you build a custom numbering scheme (if you want to start over each year,
you'll have to), you can have that automatically determine the
next-in-sequence, and automatically enter that value (more data entry
savings).

So, about the only thing your user would actually HAVE to enter (?by
selecting from a combo box?) is a case type (if I've guessed correctly).

If you build a query that retrieves the case information, and include an
additional field to concatenate the pieces for display, you can use your
"04-000-K" version in forms, in reports, etc.

Ken Snell · Feb 14, 2004

Also, embedding "meaning" in a key (termed an "intelligent key", a
derogatory term) is frowned on ...

Hey Jeff...let's rename this to a "know-it-all" key...that implies
derogation!

< g >

Cheryl Fischer · Feb 14, 2004

Hi Shelly (this is Shelly, right?)

I think that the person giving you advice about how to enter or display this
value is ill informed. Now that I have gotten that out of my system ...

If the Cause Number we are talking about is the Cause Number that is
assigned by the clerk of whatever court the action was filed in, it should
not be changed. It is REAL data, and you are correct to be concerned that
changing it will have a disastrous effect on communications (both official
and unofficial) with all parties. Attempting to reproduce the "thinking"
behind how or why a given Cause Number was assigned would be a waste of time
and, more important money, to your organization. Can the person suggesting
this change answer the following: "How would changing the true Cause
Number to something which is not used by the court or any of the parties add
value to the database?"

Now, the court or courts in your jurisdiction may *currently* be using a
system to assign cause numbers that uses the last two digits of the year +
three digits + some letter; but that numbering system could change at any
time and I doubt that the court will feel any responsibility to inform your
organization of any change in their numbering system methods. Also, if
there is any occasion where your department may become involved in or take
responsibility for a case from some other jurisdiction, your database may be
in trouble if the same numbering system is not used. I assure you that
courts in Harris County and some of the other counties in South Texas do not
use the same system to assign Cause Numbers.

Someone pointed out that this was a bad way to enter this data - leading

zeros

This would be an issue if you were dealing with a true number field - a
field in which the values would be used to do some sort of mathematics. Is
anybody planning to do math with a Cause Number? In that instance,
attempting to enter leading zeros in a number type field would not be
allowed - Access would simply dispense with the leading zeros. Making a
field which *should be* a number type into a text type so that you could
enter leading zeros is a bad practice; however, I do not see that as
applying here.

- something about truncating it in Excel -blah, blah, blah

Since the true Cause Number contains non-numeric characters, I see no way
that Excel can interpret the Cause Number as numeric data and strip off the
leading zero. (Just tested this by exporting the following values to Excel:
04-001, 04-002, 04-003-K) - Excel imported all of them as Text.)

something about Y2k

Well, it seems to me that the Y2K issue has been satisfactorily resolved for
some time now. Here is how my computer (Windows XP) handles the entry of
2-digit years: if a 2-digit year is entered, it will be interpreted as a
year between 1941 and 2040. You can check to see how your version of
Windows handles this by clicking Start, then Control Panel. Select
"Regional and Language Options".
Click the "Customize" button, then click the "Date" tab. Where real dates
(emphasize *real*) must be entered, particularly Birth Dates and/or
Retirement Dates to give two examples, it is always good practice to enter a
full 4-digit year.

Again, the assigned Cause Number *may* use a year as one of its components;
however, once assigned, *how it was assigned* becomes a non-issue, as all of
the parties are expecting to see and refer to the Cause Number itself - not
its components.

Shelly, I think you have already provided the answer to the question:

This format is NOT user friendly and extremely difficult to
read and does not represent the true cause number assigned to the case and
has the potential to cause confusion and error. And, on a daily basis our
users refer to the case number in the first format: 04-000-K. It is the
format that is used on correspondence, pleadings, EVERYTHING.

I would add the following to the above: "All of the aforementioned will add
significant dollars to the cost of ownership of the database, while
providing no return on the investment."

If you are not in a position to ignore these comments (and I would ignore
them until ordered by someone in authority to make the change), offer to
display the desired format (in some inconspicuous location in a very small)
in a form or report but do not dispense with the true Cause Number.

hth,

Rolls · Feb 14, 2004

A "smart-code" or "intelligent key" isn't either, because as soon as it
acquires meaning, somebody doesn't like the meaning and decides to change
it, destroying referential integrity! This is why a key field should be a
"dumb" field, and should ordinarily not be visible to the user. The choice
of an Autonumber datatype merely assures that each record is unique. If the
remaining fields in the record are concatenated, no duplicate records would
exist in the table. Nor should there be any significance to the order in
which the autonumbers appear. An autonumber, denoting "unique" can be
random as well as sequential.

No inference should be drawn from the sequence of a sequential autonumber
field, other than that's the order in which records were entered.

If the entry order of records in a table is meaningful, use a separate
=now() date/time-stamp field to derive sequence.

Jeff Boyce · Feb 15, 2004

Oooooh! Ken, you have a true evil streak! (Yes!)

Jeff

John Marshall, MVP · Feb 15, 2004

As a mainframer, there has always been a need for intelligent keys.
Rebuilding the index was (and still is) a common event. The information in
the key field becomes unreliable and the keys have to be rebuilt, unless you
store redundant information, the keys can not be rebuilt and any file
relationships based on the keys are lost.

John... Visio MVP

Need stencils or ideas? http://www.mvps.org/visio/3rdparty.htm
Need VBA examples? http://www.mvps.org/visio/VBA.htm
Common Visio Questions http://www.mvps.org/visio/common_questions.htm

Douglas J. Steele · Feb 15, 2004

??? I don't recall ever having any such problems in all of the years I've
worked with DB2.

Rolls · Feb 15, 2004

John - Do you mean a maniframe flat file database or a mainframe SQL
database? I don't agree with your opinion about either the use of
"meaningful" key values or the practice of storing redundant fields in a SQL
database. Appears to me that this "knowledge" should be unlearned.

In Access the key fields are used to join entity and relationship tables,
consistent with the E-R model. A key field must a) be unique and b) have NO
meaning. Use of the Autonumber datatype assures that all key values are
unique within a table and is also used to maintain referential integrity,
allowing cascading updates and (careful!) deletes. When key fields are
LongInt, datatype operation is relatively fast. Any meaning should be
assigned to non-key fields, which can be updated, sorted, resorted, indexed
and reindexed to one's heart's content without disturbing key values.
Proper table architecture is the first step in building a reliable database.
Of course there are valid reasons for denormalization, the first being the
ability to preprocess certain queries to obtain greater speed.

It's also possible to assign a unique key value to all RECORDS in an Access
database across all tables, via use of a function call that returns the next
unused sequential number. This way no PK value is ever used in more than
one table. This procedure makes replication more reliable and allows more
straightforward migration of Access to SQL Server tables. Further, if the
unique key (Long Integer) is replaced by a base-36 (10 digits + 26 letters)
key field the 2 billion limitation is broken. The 1GB or 2GB filesize
limitation ordinarily becomes a constraint well before the 2B record
limitation, unless there are many deletes.

John Marshall, MVP · Feb 15, 2004

Try working with COBOL. ;-) Sometines it is hardware (rare now) and
sometimes the code is just not quiite right.

--
John... Visio MVP

Need stencils or ideas? http://www.mvps.org/visio/3rdparty.htm
Need VBA examples? http://www.mvps.org/visio/VBA.htm
Common Visio Questions http://www.mvps.org/visio/common_questions.htm

John Marshall, MVP · Feb 15, 2004

I'm referring to mainframe "relational" databases made of flat files that
predated the concept of SQL. (Any true relational database is still a
collection of flat files with special fields called keys to "related" the
different files or "tables".) Yes inspite of Y2K these dinosaurs still run
and perform the tasks they were orignally created to do.

These systems are antiquated, but the reasoning behind using intelligent
keys was valid. This knowledge should not be unlearned, but understood. At
the beginning, databases were unreliable and mechanisms had to be put in
place to negate the effect of the keys being corrupted. Databases are now
more reliant, but that does not make the use of intelligent keys wrong. Yes
there are newer and more efficient ways of doing things, but the reasoning
behind why intelligent keys were used needs to be understood rather than
being totally dismissed as wrong.

John... Visio MVP

Randy Harris · Feb 15, 2004

John Marshall said:
I'm referring to mainframe "relational" databases made of flat files that
predated the concept of SQL. (Any true relational database is still a
collection of flat files with special fields called keys to "related" the
different files or "tables".)

I can't agree with your definition. By the text books, the relational
database model does not dictate the physical storage of data. It is
irrelevant whether relations are contained within flat files, or clustered
together in other files or stored in raw media. Keys alone, do not make a
database architecture relational.

I am certainly no expert, however, I suspect that the systems you describe
have need of intelligent keys and redundant storage of data because they do
NOT conform to the relational model (perhaps they pre date it).

My 2 cents worth

Rolls · Feb 15, 2004

Can't see an earlier post. Did I not "reply to group"?

John Marshall, MVP · Feb 15, 2004

It's not a definition, it is an explanation that even though it is a
collection of flat files that reside on a mainframe and does not use a
standard relational database that it is not a "relational" database.

John... Visio MVP

Albert D. Kallal · Feb 16, 2004

You got some good comments here. I almost hate to jump in when you got so
many cooks!

However:

This
individual has suggested we use this format to enter the cause no.:
20040000K.

The y2k issue is not so much of a problem, but there is certainly an issue
of sorting the data correctly. If you have a case from 1998, then you get:

98-000-K

The problem with the above, is that you can't sort correctly. So, there is
some valid point here. As you can see, the Y2K issue is that 98 comes after
04, and thus sorts don't work. (1998 and 2004). Further, the other y2k issue
is that you cannot subtract the year part and get a difference in years. So,
some arguments do exist that using only 2 digits to represent the date is a
bad idea. You likely would be better off to use a separate field for the
year. In fact, even better would be to use a real date field. You can
certainly use a date field, a case number field and whatever the "k" means
as 3 separate fields. ms-access allows you to create a multiple index on
more then one field.

However, these issues are COMPLETE separate from the issue of a primary key.
There was comments that using a autonumber for your relations is likely the
best approach. You might not have a case number or date yet. Regardless,
issues of relational tables should NOT have to require you to enter some
data in to a fields. You really don't care what memory cell that words gets
loaded into. You also really should not care what number or how ms-access
maintains a relationship. So, the issues of your number format, and the
issues of the primary key used for relationships is a SEPARATE issue.

Further, without a doubt, having 3 fields can often make data entry easy.
This is especially so if date is automatic or often the date is "today".
Further, this means that you can FAR more easily have a case number
increment automatically to the next case number for a given client. So, the
program code and logic is MUCH more easy if you split the data into 3
separate fields.

So, really, I am in full agreement with you that using a HORRIBLE format of
20040000K is not good at all. However, for all corresponds, and all reports
and all printouts etc it is a TRIVIAL matter to product output of

04-000-K

In fact, you could make a global public function that takes the 3 fields and
produces the results you want.

Assuming our 3 fields are

CaseDate: 02/15/2004
CaseNum : 23
CaseType: K

For a report to produce 04-023-K from the above, you can use a custom
function, and call it from a text box like:

=MyCaseFmt(CaseDate,CaseNum,CaseType)

The function that you can use in reports and even forms to display your
format could be:

public function MyCaseFormat(CaseDate as date, _
CaseNum as long, _
CaseType as string) as string

MyCaseFmt = format(CaseDate,"YY") & "-" & _
format(CaseNum,"000") & "-" &
CaseType

end function.

So, anytime you need your format, you can have it without any problems.
However, by using 3 fields, you have better options for setting up the case
number, you have better options for displaying the date (you can pluck out
any part of the date..not just the year). You also have better options for
sorting, and NONE of the Y2K options exist.

So, really, while some arguments can be made for adding all those zeros...I
would consider the suggestion VERY POOR as far more elegant and far more
flexible designs can be built by you! Those better designs solve the issues
of sorting and allow many other things to be implemented. So, while those
zeros might solve some problems you can one up the solution to a much higher
level and make the person look dumb!

Hum, so why is 20040000K bad? compared to our 3 field solution?

* Well, you can't write and make a report easily group by the "K" value.
With a separate field, grouping on reports is much easer by the "type"
fields.

* You are limited to using one value for "K". If we decide to come up with a
few new case types, then the above format is not gong to sort correctly
either. Using a separate field again allows more flexibility in design.

* It is VERY hard to do math on the case number field if it is just string.

* Storing a date field means we can do days between dates calculations. The
above format makes this very hard to do.

* As you mentioned, it sucks from a data entry point of view!

* It is hard to code in "auto" defaults with such a format. 3 separate
fields is far better.

S Jackson · Feb 17, 2004

Well, holy smokes! What a response I got and quite a discussion. I will
first humbly admit that most of your replies went well and truly over my
head!

Cheryl:

Once again, your reply was on a level that I could understand and you have a
true understanding of what I am trying to develop here based on the fact
that you are familiar with database case management systems for law firms.

First, I apolgize to everyone else for not providing a better explanation
regarding the data source for the DHSNo field and the purpose of my
database. The data is obtained from a separate source - the Hearings
Department for the government agency I work for. I am designing a database
to be used by the law offices within the same agency as the Hearings
Department. In the Hearings Department, the DHS cause number would be
sequential. In the Hearings Department's system (an extremely poorly
designed system I might add), they have one field for this information:
yyyyxxxxK. Lumping the year, case number and letter (case type) designation
together. However, when the case is assigned to the law office, it is
assigned by the Hearings Department as yy-kkkk-K, which is the format used
universally when referencing the case within the agency and outside the
agency. Since there are several different law offices within the agency
that are assigned cases from the Hearings Department (this is based on a
different critieria: Region - a different field within the database), each
office will not have cases assigned in a sequential DHS cause number order,
but in a random order. I am designing a database to be used by the law
offices, not the Hearings Department. However, since we do work for the
same Agency, it could be considered foreseeable that we may want to merge
our information and use the same database at some point in the future,
therefore making it necessary to utilize some of the excellent suggestions
given in the responses I have received here (example: Albert's suggestion).
Therefore, Cheryl when you ask:

Can the person suggesting this change answer the following: "How would changing the true Cause
Number to something which is not used by the court or any of the parties add
value to the database?"

Possibly the answer is "yes" it can add value to the database if at some
point in the future, the Hearings Department and the law offices decide to
use one database together (cringe). However, the issues following still
remain as you pointed out:

If the Cause Number we are talking about is the Cause Number that is
assigned by the clerk of whatever court the action was filed in, it should
not be changed. It is REAL data, and you are correct to be concerned that
changing it will have a disastrous effect on communications (both official
and unofficial) with all parties.

AND:

Also, if

there is any occasion where your department may become involved in or take
responsibility for a case from some other jurisdiction, your database may be
in trouble if the same numbering system is not used. I assure you that
courts in Harris County and some of the other counties in South Texas do not
use the same system to assign Cause Numbers.

There most definitely is a possiblity that the law offices within the agency
will take on a case that will not be assigned by the Hearings Department,
but rather a local district court, so creating a number system to
incorporate the Hearings Department system in that instance would NOT be
beneficial.

As for the issue regarding sorting the cases by DHSCause Number, I currently
have resolved this problem by sorting the cases on a separate date field
within the database which contains the date the case has been assigned to
the law offices. In my opinion, this will sort the cases in chronological
order. If the Hearings Department and the law offices merge into using the
same system, we can add a new date field if necessary: Date the formal
appeal was received by the Hearings Department, if the Department wishes to
have a more precise chronological order to the cases (this is assuming the
cause numbers are assigned by order in which the cases are received - sigh.
.. .).

ONE MORE THING (and probably the most important issue):

As many pointed out, using the DHSNo (a text field) as the primary key in
the database is NOT a good idea (this comes from relying on someone like me
without formal training to design this darn thing in the first place - your
local government tax dollars at work). However, how do I change this now?
Can I make a change at this point? AFTER I have designed the entire db and
created relationships? (Note that no "real" data has been entered into the
system yet as it is still in its testing phase to this is not an issue.)

IN CONCLUSION:

Thank you to all that provided me with excellent responses. As usual, the
answer is never simple. There are so many other extenuating circumstances
that make it a complicated decision regarding what to do. It is my opinion
that we should leave the field alone (with the exception of the PK issue).
If at some point the Hearings Department and the law offices want to utilize
the same database, then creating a SEPARATE autonumber field may be the
simplest solution here (not to mention this would resolve the PK issue
also).

Cheryl, your response is extremely beneficial to me. With your premission,
I would like to share portions of your response to the individual who is
requesting the change to my system.

S. Jackson

Cheryl Fischer · Feb 17, 2004

Hi Shelly,

Thanks for your kind comments. Feel free to use any portion of my comments
in your discussions about the design of your database.

Just one additional point, with regard to Primary Key:

I strongly believe that you should use an AutoNumber as the Primary Key in
your main table, as opposed to Cause Number. AutoNumber fields essentially
guarantee uniqueness and are pretty much fool-proof in defining
relationships between parent and child tables (and in linking forms and
subforms). While you may be able to count on cases with Cause Numbers
assigned by your Hearings Dept to have unique Cause Numbers, it is very
possible that you could get a case from another court or jurisdiction with
the same Cause Number (if they are using the same sort of numbering system).
It may be somewhat painful to make this change now, but it will be awful to
have to make this change after your database is deployed and has real data
in it. Should you have problems or questions while you are making this
change, the newsgroup is here to help you through it.

If you do not already have them, I would suggest that you add Court and
County fields to your table.

hth,

Time format hh:nn drops leading zero	0	Aug 6, 2014
Data Sort Question	5	Feb 17, 2004
Missing data when using TREND function	0	Jan 9, 2013
Project Online Timesheets Actuals Mismatches - data not publishing to projects	0	Sep 12, 2017
4 out of 35 data connections running high cpu/memory	0	Oct 5, 2011
Leading Zeros are lost when importing spreadsheet data	7	Mar 20, 2007
issues copying data from one file to another	2	May 16, 2007
data from internet not showing up	6	Feb 18, 2010

Data Normalization - Dispute

S. Jackson

MacDermott

Rolls

Jeff Boyce

Ken Snell

Cheryl Fischer

Rolls

Jeff Boyce

John Marshall, MVP

Douglas J. Steele

Rolls

John Marshall, MVP

John Marshall, MVP

Randy Harris

Rolls

John Marshall, MVP

Albert D. Kallal

S Jackson

Cheryl Fischer

Ask a Question

Similar Threads