Normalizing for statistical analysis


Fred Thomas

I'm making a database with one form being used for inputting
inspection data. Among other things, this form needs 40
fields to input up to 40 length measurements and 40 more
fields to input up to 40 diameter measurements. I need
statistical analysis for these measurements, such as
Standard Deviation, Average and Range. I believe Access
has built-in statistical functions to do this. What is
the best way to normalize this data to allow working with
these built-in functions? I really need some input, as I
seem to always think flat rather than relationally.
 

Lynn Trapp

Actually, I think you will find that the Standard Deviation and Range
functions are Excel functions, not Access functions. Access does have an
Avg() function to calculate averages. I also suspect that you really don't
need so many fields, but can't say for sure without knowing more about your
data.
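
(As an aside, a minimal sketch of Avg() in an Access query; the
Inspections table and Length1 field below are made-up names, not
something taken from Fred's database:)

SELECT Avg([Length1]) AS AvgLength
FROM Inspections;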
 

John Vinson


I'd suggest the following table structure. I'll say you're measuring
Parts, and that each measurement on a given part will be repeated
ReplicateNo times (leave this out if you measure each part only once).

Parts
PartID
Description
<any other info about what is being measured>

Measures
MeasureID
MeasureType <"diameter" or "length" perhaps>
Description <what's being measured, e.g. "Handle to shank">
<maybe other fields, e.g. acceptable low and high value>

Measurements
PartID
MeasureID
ReplicateNo
Measurement
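
(A rough DDL sketch of these three tables in Access SQL, run one
statement at a time; the data types, sizes and constraint names are
assumptions on my part, not part of the post:)

CREATE TABLE Parts (
    PartID COUNTER CONSTRAINT PK_Parts PRIMARY KEY,
    Description TEXT(255)
);

CREATE TABLE Measures (
    MeasureID COUNTER CONSTRAINT PK_Measures PRIMARY KEY,
    MeasureType TEXT(20),
    Description TEXT(255),
    LowLimit DOUBLE,
    HighLimit DOUBLE
);

CREATE TABLE Measurements (
    PartID LONG CONSTRAINT FK_MeasurementsParts REFERENCES Parts (PartID),
    MeasureID LONG CONSTRAINT FK_MeasurementsMeasures REFERENCES Measures (MeasureID),
    ReplicateNo INTEGER,
    Measurement DOUBLE,
    CONSTRAINT PK_Measurements PRIMARY KEY (PartID, MeasureID, ReplicateNo)
);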

This gives you a "tall thin" Measurements table with one row per
measurement; a Totals query such as

SELECT Parts.Description AS PartDesc, Measures.Description AS MeasureDesc,
    Min([Measurement]) AS MinMeasure, Max([Measurement]) AS MaxMeasure,
    Avg([Measurement]) AS Average, StDev([Measurement]) AS Std
FROM (Measurements INNER JOIN Parts ON Measurements.PartID = Parts.PartID)
    INNER JOIN Measures ON Measurements.MeasureID = Measures.MeasureID
GROUP BY Parts.Description, Measures.Description;

will give you all of these statistics, one row per part and measure.
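
(Access SQL has no built-in Range aggregate; as a hedged sketch
against the same Measurements table, the range can be derived from
Max and Min in a query:)

SELECT PartID, MeasureID,
    Max([Measurement]) - Min([Measurement]) AS MeasureRange
FROM Measurements
GROUP BY PartID, MeasureID;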
 

TC

(snip)
> But let me tell you, normalization is not always the best choice,
> because it slows performance by needing to mix data from different
> tables.

Sorry, but that is absolute nonsense. As long as the relevant tables are
primary-keyed correctly, you will not have any problems joining data from
various tables. There are *very few cases* where significant denormalization
is justified for performance reasons. Your statement is a bit like saying,
"Do not put a foundation under your house. They are big & expensive, & they
will make the house slide down the hill."
> The simplest way to decide whether or not you need to normalize in
> this case is to look at how many queries you expect the data to
> receive (in times per second or per minute), and whether the data can
> be empty or not, because a lot of space will be wasted if only 2 of
> these fields are filled and you have de-normalized data. The other
> important thing is this: Access is really not a database engine (many
> people here will probably be scared by this point of view and yell
> many ugly things at me), because it doesn't work at the record level.
> What this means is that Access retrieves all the data from the
> database and passes it over the network as a file block, and you then
> filter the table locally to view the rows you want. For example,
> suppose you have a table of 20000 records and you query for one
> record (e.g. where id=3): Access will pass all 20000 records to your
> computer, taking a lot of network time (always precious) and, worse,
> your time, and only when the whole table arrives at your computer
> does your local copy of Access run the query and show you record
> number 3.

Nonsense. If you select on a field that is indexed, Access will only pull
the relevant index blocks down the network, then it will only pull the
specific data blocks required. So in your example, if the id field is
properly indexed, this *will not* pull 20000 records over the network. You
probably just do not know how to index properly!
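
(A minimal sketch of adding such an index with Access DDL; the table
and index names here are only illustrative:)

CREATE INDEX idxID
ON MyBigTable (ID);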

> In conclusion, normalization is a matter of balancing performance
> against size and redundancy (because normalization has many rules,
> and if you apply them all, in most cases you will end up with a
> "never redundant data, better size" turtle), but this is really only
> appreciated when you use a real engine for it, e.g. MySQL, Postgres,
> Oracle, SQL Server, MSDE, etc.

Few if any professional database designers would be thinking about
performance *at all* when they designed their table structures.

I'm sorry to be so blunt, but the advice that you have given is really very
misleading. The biggest & most important challenge for people new to
databases is to learn about normalization & how to do it properly. IMO,
newbies should be told to *always* normalize - because without a thorough
understanding of normalization, you cannot make a proper decision on when &
how to denormalize.

TC
 

Giorgio Gentili

Fred:
The best way to do this is to make a table with one length measurement and
one diameter per row (if they are related; if not, just one table for
measurements and another for diameters), a register ID (to say whether the
row belongs to register 1 or 2, etc.) and an order number (if the order
matters).

Apart from this, you must make a main table with the rest of the information
you have, with an ID (used as the register ID in the other tables).
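
(If I read this correctly, Giorgio is describing something like the
two tables below; every name and type here is a guess on my part:)

CREATE TABLE Registers (
    RegisterID COUNTER CONSTRAINT PK_Registers PRIMARY KEY,
    PartInfo TEXT(255)
);

CREATE TABLE RegisterMeasurements (
    RegisterID LONG CONSTRAINT FK_RegMeas REFERENCES Registers (RegisterID),
    OrderNo INTEGER,
    LengthValue DOUBLE,
    DiameterValue DOUBLE
);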

But let me tell you, normalization is not always the best choice, because
it slows performance by needing to mix data from different tables. The
simplest way to decide whether or not you need to normalize in this case is
to look at how many queries you expect the data to receive (in times per
second or per minute), and whether the data can be empty or not, because a
lot of space will be wasted if only 2 of these fields are filled and you
have de-normalized data. The other important thing is this: Access is really
not a database engine (many people here will probably be scared by this
point of view and yell many ugly things at me), because it doesn't work at
the record level. What this means is that Access retrieves all the data from
the database and passes it over the network as a file block, and you then
filter the table locally to view the rows you want. For example, suppose you
have a table of 20000 records and you query for one record (e.g. where
id=3): Access will pass all 20000 records to your computer, taking a lot of
network time (always precious) and, worse, your time, and only when the
whole table arrives at your computer does your local copy of Access run the
query and show you record number 3.

In conclusion, normalization is a matter of balancing performance against
size and redundancy (because normalization has many rules, and if you apply
them all, in most cases you will end up with a "never redundant data, better
size" turtle), but this is really only appreciated when you use a real
engine for it, e.g. MySQL, Postgres, Oracle, SQL Server, MSDE, etc.

Hope this will be useful for you.

Questions?

Good Luck
Giorgio Gentili
Chile
 

Giorgio Gentili

I forgot the calculated data: in a normalized design this must always be
done in a query (the values must be recalculated every time you call it).
But remember the performance.
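
(For example, a sketch of computing an average at query time rather
than storing it, reusing the Measurements layout from John Vinson's
reply as an assumption:)

SELECT PartID, Avg([Measurement]) AS AvgMeasurement
FROM Measurements
GROUP BY PartID;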
 

Giorgio Gentili

I'm sorry about what I said, then; apparently I am very uninformed.

Giorgio
 

Tim Ferguson

> I'm sorry about what I said, then; apparently I am very uninformed.

Sorry to jump in, but that does seem to be the case. You really appear to
have got hold of the wrong end of the stick when it comes to the reasons
for and background to relational ("R") theory.

Normalisation is not about saving space -- in many cases a fully normalised
database will take more space than a big flat file. It is all to do with
guaranteeing integrity. Think about that word guarantee. R theory is the
first (and, afaik, the only) part of computing that is solidly based on
mathematics, so that it is possible to prove that a given model is correct.
Now, there are some situations where consistency and accuracy in an
application take second place to looking pretty, or response speed, or some
other objective. Think of the QuickSR function[1], which is not always
accurate but scores very highly on tests of speed.

So, where it is vital that a data value once entered remains intact and
uncorrupted, correct implementation of R theory is required and there is
simply no substitute. To suggest otherwise is to try to overrun the last
forty years of research with a "well, I just thought of this and so it must
be better". For sure: read, learn, and argue with the details, because
there are plenty of holes, but do it from a standpoint of knowing and
understanding, not just ignorant prejudice.

PS [1]-

Public Function QuickSR( _
ANumber As Double _
) As Double
' Quick Square Root
' Not entirely accurate, but very fast sqr estimation

' Please change this number according to the
' needs of the application

QuickSR = 2.09294378

End Function

Best wishes


Tim F
 
