Math question

µ

I've got this range that contains values from a lot of samples.
They are all about the same, but some are much too big because of
faulty measurement.

Let's say the range contains values like: 30,35,28,123,31,35,29
Everybody can see that the 123 is not a valid sample.
But how can I tell Excel that?
I need the average of this range without the wrong values.
Excel should look at the list and only use: 30,35,28,31,35,29

This is a large list of values, and I need this solution for more than
one type of measurement.
Is there a way to filter out the values that don't fit in the range?

Thanks
 
Gary''s Student

Use the STDEV() function on your data to calculate the standard deviation.
Then for each data point, calculate the absolute value of the difference
between the data point and the average.

If the difference is really big compared to the standard deviation,
consider removing the value or telling Excel to exclude it.
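
For example, a minimal sketch assuming your data are in A1:A7 and using
2 standard deviations as the cutoff (adjust to taste):

=IF(ABS(A1-AVERAGE($A$1:$A$7))>2*STDEV($A$1:$A$7),"exclude","keep")

Copy the formula down alongside the data.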
 
Niek Otten

I don't know if this is acceptable from a statistical point of view, but consider this:

Let's say your figures are in A1:A7
In A9:
=AVERAGE(A1:A7)*1.5
Or maybe * 2 or 3
In A11, an array formula (do not confirm with just ENTER, but with CTRL+SHIFT+ENTER):
=AVERAGE(IF(A1:A7<A9,A1:A7))
It should appear in the formula bar between curly brackets
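
Alternatively, a sketch that avoids the array entry, using SUMIF and
COUNTIF with the same cutoff in A9:
=SUMIF(A1:A7,"<"&A9)/COUNTIF(A1:A7,"<"&A9)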

--
Kind regards,

Niek Otten
Microsoft MVP - Excel


 
µ

This looks really difficult :-(

So I need to take the STDEV() of the whole range (let's say A1 to
A500)?
Then I need to calculate the difference between every data point and
the value of STDEV()?
So A1-STDEV() [of the whole range], and make it a positive number?
And after that I need to look at whether A1 is (let's say) more than 10%
larger than STDEV(), and if so, exclude it?
 
Gary''s Student

More like:

=ABS(A1-AVERAGE($A$1:$A$500)) and copy down the column.

Compare the results of this formula with

=STDEV($A$1:$A$500)
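
If you want the filtered average in a single cell, you could combine the
two into one array formula (a sketch using 2 standard deviations as the
cutoff; confirm with CTRL+SHIFT+ENTER, not just ENTER):

=AVERAGE(IF(ABS(A1:A500-AVERAGE(A1:A500))<2*STDEV(A1:A500),A1:A500))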
--
Gary's Student


 
joeu2004

µ said:
This looks really difficult :-(

Yes, and it is intended to be. It would be prudent to manually and
physically exclude "outliers", because the decision to exclude an
outlier involves more than simply identifying a number that is "out of
whack". It should be "out of whack" due to data entry error or faulty
measurement, as you noted.

The reason for physically removing outliers is that typically you will
perform numerous statistical analyses of the data. It is painful (and
error-prone) to identify and exclude outliers each time.

| So A1-STDEV() [of the whole range], and make it a positive number?
| And after that I need to look at whether A1 is (let's say) more than 10%
| larger than STDEV(), and if so, exclude it?

Usually not "10% larger", but typically beyond +/-3 or more standard
deviations. (Although there is no agreement on "3". Some people go as
low as 2.)
 
Shirish

I agree with the ideas of joeu2004 and Gary's Student. They are simple and
effective. If you want, you could take them further with conditional
statements (IF) and conditional formatting (to make things visually
stand out). Something like:

=IF(ABS(A1-AVERAGE($A$1:$A$7))>=2*STDEV($A$1:$A$7),"OUTLIER",ABS(A1-AVERAGE($A$1:$A$7)))

1. The formula first computes the difference between the data entry
and the average of the range.
2. The IF clause inserts the word OUTLIER where this difference
exceeds 2 (or 3) times the STDEV(), depending upon what you put into
the formula.
3. You could then use conditional formatting (or AutoFilter) to
highlight, colour-fill, or filter out only the cells containing
"OUTLIER".

shirish
 
joeu2004

µ said:
I've got this range that contains values from a lot of samples.
They are all about the same, but some are much too big because of
faulty measurement.

Let's say the range contains values like: 30,35,28,123,31,35,29
Everybody can see that the 123 is not a valid sample.
But how can I tell Excel that?

Google "outlier". Hmm, I am not finding as good references today as I
know I have in the distant path :-(.

The method that I prefer relies on the interquartile range (IRQ). In
Excel terms, the IRQ is QUARTILE(range,3) - QUARTILE(range,1). The IRQ
is the center 50% of the data. Most commonly, potential outliers are
data whose value is less than QUARTILE(range,1) - 1.5*IRQ or greater
than QUARTILE(range,3) + 1.5*IRQ. Although "1.5" is most commonly
used, some people use other factors. (I prefer 1.7 because it
corresponds to +/-3 standard deviations in a normal distribution --
another common method for identifying potential outliers.)

If your data are in A1:A500, you might set up the following formulas:

C1: =QUARTILE(A1:A500, 1)
C2: =QUARTILE(A1:A500, 3)
C3: =C1 - 1.5 * (C2 - C1)
C4: =C2 + 1.5 * (C2 - C1)
B1: =IF(OR(A1 < $C$3, A1 > $C$4), "outlier?", "")

Copy B1 down through B500.
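
And if you then want the average without the flagged values, an array
formula could do it in one cell (a sketch building on the cells above;
confirm with CTRL+SHIFT+ENTER):

=AVERAGE(IF((A1:A500>=$C$3)*(A1:A500<=$C$4),A1:A500))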
 
joeu2004

Errata....

| Most commonly, potential outliers are
| data whose value is less than QUARTILE(range,1) - 1.5*IQR or greater
| than QUARTILE(range,3) + 1.5*IQR. Although "1.5" is most commonly
| used, some people use other factors. (I prefer 1.7 because it
| corresponds to +/-3 standard deviations in a normal distribution --
| another common method for identifying potential outliers.)

I intended to write: because 1.7 IQR corresponds to +/-3 standard
deviations -- another common method for identifying potential outliers
in a normal distribution. (Conversely, 1.5 IQR corresponds to roughly
+/-2.7 standard deviations.)
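
(For reference: in a normal distribution the IQR is about 1.35 standard
deviations, so QUARTILE(range,3) + 1.7*IQR sits at roughly 0.67 + 1.7*1.35,
i.e. about 3.0 standard deviations above the mean, while QUARTILE(range,3)
+ 1.5*IQR sits at about 2.7.)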

The point is: whereas the 1.5-IQR method is applicable to all
distributions, I would use the 3-sd method only with normal
distributions. However, many people use the 3-sd method without taking
the distribution into consideration :-(.
 
David F Cox

"Removing outliers" makes me nervous. One of my boyhood scientific heros was
Millikan who measured the relationship between the mass and the charge of an
electron. Compared to normal experience this is about as close to comparing
nothing to nothing as one can get. A beautiful experiment.

I have read that years later other scientists proposed new particled with
masses of 1/3 and 2/3 of an electron, and one of the first objections was
that Millikans experiment should have found them. It turned out that it did,
but Millikan had removed those few mischevious "outliers". "Not rock" may be
a diamond.

 
Jerry W. Lewis

Excel does not do this automatically, because the identification and
treatment of outliers is a huge subject, and the best approach depends upon
what you are willing to assume about your data (Excel cannot read your mind).
A couple of good places to start would be
www.itl.nist.gov/div898/handbook/toolaids/pff/7-prc.pdf
http://www.amazon.com/dp/0471930946

As a general procedure, routinely eliminating outliers can be dangerous.
The outliers may be the most important observations in your data set, since
they may be telling you that your assumptions are wrong. In addition to
David F. Cox's example, here are a couple of others:

British scientists found the hole in the ozone layer using US satellite
data, because the US scientists were looking at "cleaned" data where the
important (for this purpose) observations had been removed because they were
deemed to be outliers.

In graduate school, a friend who was trying to develop a more sensitive
spectroscopic method realized, on reflection, that the "outlier" from his
photon counter (when he ran blanks) was the only valid observation in the
data set, because all of the other observations had overflowed the display!
He was trying to read a laser-induced fluorescence in a flame using the same
wavelength as the flame itself. Once he adjusted to a more appropriate
wavelength, he was able to develop the method and earn his PhD.

For automatic procedures, it is frequently recommended to use estimation
methods that are robust to outliers, instead of testing for outliers. Such
approaches may give better answers if the analysis assumptions are wrong. A
robust measure of the middle of the distribution that is natively available
in Excel is the median.
=MEDIAN(30,35,28,123,31,35,29)
is 31, compared to
=MEDIAN(30,35,28,31,35,29)
which is 30.5, and
=AVERAGE(30,35,28,31,35,29)
which is 31.33333.
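
A robust measure of spread is similarly available via an array formula
(a sketch, assuming the data sit in A1:A500; confirm with
CTRL+SHIFT+ENTER): the median absolute deviation,
=MEDIAN(ABS(A1:A500-MEDIAN(A1:A500)))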

Jerry
 
Dana DeLouis

µ said:
I need to have the average of this range without the wrong values.

I know it's not exactly what you wanted, but just to mention: if you would
like to remove 1 high value and 1 low value, there is the function
TRIMMEAN.

=TRIMMEAN(A1:A7,2/7)

This would remove 2 out of the 7 points in the Average calculation. It's
designed to exclude outlying data from the calculation.
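
For a longer list, the second argument is just (points to drop)/(total
count). For example (a sketch, assuming the data sit in A1:A500), this
drops the 10 highest and 10 lowest values:
=TRIMMEAN(A1:A500,20/500)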
 
jpdphd

µ (Mike Rowe?)

I would take a graphical approach.
Make a simple scatter plot with just y-values, e.g. with a series formula of
=SERIES(,,Sheet1!$B$2:$B$1000,1)
This will give you a quick overview of the data and you might be able
to decide where you would want to make the cutoff.
Maybe you could use this in conjunction with the more rigorous
statistical approaches.

jpdphd
 
jpdphd

David,

Although I agree 1000% with your concerns about ignoring data, your
story must be apocryphal!
I think you must mean that the charge on the proposed particles
(quarks) was 1/3 and 2/3 of the charge on an electron. But to split a
proton or neutron into its constituent quarks (electrons are not made
of quarks) takes a huge amount of energy. If a quark happened to
escape its mates, they would get back together long before Millikan
could detect them on an oil droplet!

Sorry if this spoils your fun :)
jpdphd
"Removing outliers" makes me nervous. One of my boyhood scientific heros was
Millikan who measured the relationship between the mass and the charge ofan
electron. Compared to normal experience this is about as close to comparing
nothing to nothing as one can get. A beautiful experiment.

I have read that years later other scientists proposed new particled with
masses of 1/3 and 2/3 of an electron, and one of the first objections was
that Millikans experiment should have found them. It turned out that it did,
but Millikan had removed those few mischevious "outliers". "Not rock" maybe
a diamond.

µ said:
This looks really difficult :-(

Yes; and it is intended to be. It would be prudent to manually and
physically exclude "outliers" because the decision to exclude an
outlier involves more than simply identifying a number that is "out of
whack". It should be "out of whack" due data entry error or faulty
measurement, as you noted.

The reason for physically removing outliers is that typically you will
perform numerous statistical analyses of the data. It is painful (and
error-prone) to identify and exclude outliers each time.
So A1-STDEV() [of the whole range] and make it a positive number?
And after that i need to look if A1 is (let's say) more that 10%
larger than STDEV() and if this is so, exclude it?

Usually not "10% larger", but typically beyond +/-3 or more standard
deviations. (Although there is no agreement on "3". Some people go as
low as 2.)
 
joeu2004

Dana said:
=TRIMMEAN(A1:A7,2/7)
This would remove 2 out of the 7 points in the Average calculation. It's
designed to exclude outlying data from the calculation.

I was wondering if someone would bring this up (sigh).

I recognize that some people use the trimmed mean with the intention of
excluding outliers. But of course, it has nothing to do with
identifying outlier data. And even if it did, as I and others have
pointed out, automatically excluding outliers without taking the reason
into account will lead to faulty statistics.

Excel's TRIMMEAN excludes 2x% of the data: the upper x% and the lower
x%. Obviously that fails to take into account whether or not the
excluded data are outliers, by any definition.

TRIMMEAN should only be used in procedures that call for excluding the
upper and lower x% of the data. For example, some sports scoring
methods call for excluding the highest and lowest score before
computing the combined score. (And sometimes it is only the lowest 1
or more scores, in which case TRIMMEAN is not useful.) The purpose is
not to provide a better statistic in the mathematical sense, but to
"level the playing field".
 
µ

Thanks all for your contributions.
But the data I mentioned isn't that crucial.
I've gathered a lot of data, and I know that many of the samples are
useless.
Nevertheless, I need to make averages from the data that prove to be
useful.
And I know that not everything is relevant, so I can throw away the
outliers and keep the other part.

I just have to have a relevant sample of data; the rest can be
ignored.

I read a lot of stuff that I'm going to try over the next couple of days.
Thanks to each and every one of you!
 
David F Cox

Always open to the possibility of being wrong (much experience helps with
this :)-<).
I did not uncover direct substantiation for the story I read, but various
sources indicated that it might be correct, including:

http://www.science-frontiers.com/sf015/sf015p07.htm

http://www.oufusion.org.uk/newswinter01/fusionnewswinter01.htm

Whatever, thanks for sending me on a voyage of discovery; I learned a lot,
including more cases where experiments were adapted to fit the theory. A
very satisfying meal for a cynic.

David F. Cox



 
jpdphd

David,

Thanks for the references. I've been trying to find follow-ups to the
experiments which seemed to give evidence for free quarks. Here's the
most recent I could find, and it's negative.

http://prola.aps.org/pdf/PRL/v84/i12/p2576_1

No word on why the previous experiments may be suspect. I will ask some of
my former physics profs.

A fun adventure for a Sunday eve.

Cheers,
jpdphd
 
bplumhoff

Hello,

A "naive" approach which I did in a financial environment recently:

Option Explicit

Function orb(rY As Range, rX As Range, _
        Optional dSigmaFactor As Double = 3#, _
        Optional dMaxOutlierPercentage As Double = 0.5) As Variant
'orb() = outlier-resistant beta: returns a beta and
'an alpha where y = beta * x + alpha is most accurate
'for (almost) all given x in rX and y in rY.
'"Almost" means that we successively (one by one) throw out outliers
'which have a distance of > dSigmaFactor * STDEV_of_all_distances
'from the least-squares (LS) proxy.
'PB V0.80 27-Sep-2006
    Dim vLinEst As Variant   'LinEst() result of the recent LS proxy during iterations
    Dim dm2 As Double        'slope orthogonal to the recent LS proxy
    Dim dc As Double         'constant c in y2 = m2 * x2 + c, the orthogonal to the LS proxy through a given point
    Dim dx2 As Double        'x2 in y2 = m2 * x2 + c (foot of the perpendicular)
    Dim dy2 As Double        'y2 in y2 = m2 * x2 + c (foot of the perpendicular)
    Dim i As Long
    Dim lcount As Long       'current number of live points
    Dim lcount_orig As Long  'original (starting) number of points
    Dim lcount_old As Long   'number of live points of the previous iteration
    Dim daverage As Double   'average of live points' distances to the LS proxy
    Dim dstdev As Double     'standard deviation of live points' distances to the LS proxy
    Dim dDistMax As Double
    Dim lDistMaxIdx As Long

    lcount = rX.Rows.Count
    If rX.Columns.Count > lcount Then
        lcount = rX.Columns.Count
    End If
    lcount_orig = lcount
    lcount_old = lcount + 1

    ReDim dDist(1 To lcount) As Double 'distances of live points to the recent LS proxy (line)
    ReDim dX(1 To lcount) As Double
    ReDim dY(1 To lcount) As Double    'coordinates of "live" points during iterations

    'read data row-wise or column-wise
    If rX.Rows.Count > rX.Columns.Count Then
        For i = 1 To lcount
            dX(i) = rX.Cells(i, 1)
            dY(i) = rY.Cells(i, 1)
        Next i
    Else
        For i = 1 To lcount
            dX(i) = rX.Cells(1, i)
            dY(i) = rY.Cells(1, i)
        Next i
    End If

    Do While lcount_old > lcount And lcount / lcount_orig >= 1# - dMaxOutlierPercentage
        lcount_old = lcount
        ReDim Preserve dDist(1 To lcount) As Double
        ReDim Preserve dX(1 To lcount) As Double
        ReDim Preserve dY(1 To lcount) As Double
        vLinEst = Application.WorksheetFunction.LinEst(dY, dX, True, True)
        dDistMax = 0#
        lDistMaxIdx = 1
        For i = 1 To lcount
            'calculate the distance of each live point to the recent LS proxy
            dm2 = -1# / vLinEst(1, 1)
            dc = dY(i) - dX(i) * dm2
            dx2 = (dc - vLinEst(1, 2)) / (vLinEst(1, 1) - dm2)
            dy2 = dm2 * dx2 + dc
            dDist(i) = Sqr((dX(i) - dx2) * (dX(i) - dx2) + (dY(i) - dy2) * (dY(i) - dy2))
            'remember the largest distance and its index
            If dDist(i) > dDistMax Then
                dDistMax = dDist(i)
                lDistMaxIdx = i
            End If
        Next i
        'calculate average and standard deviation of the live points' distances
        daverage = Application.WorksheetFunction.Average(dDist)
        dstdev = Application.WorksheetFunction.StDev(dDist)
        'kill the point with the largest distance if it exceeds dSigmaFactor * dstdev
        If dDist(lDistMaxIdx) >= dstdev * dSigmaFactor Then
            Debug.Print "Lcount: " & lcount & ". Throwing out (" & _
                dX(lDistMaxIdx) & ";" & dY(lDistMaxIdx) & ")"
            dX(lDistMaxIdx) = dX(lcount)
            dY(lDistMaxIdx) = dY(lcount)
            lcount = lcount - 1
        End If
    Loop

    orb = vLinEst

End Function
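
To call it from a sheet (a sketch, assuming x-values in A1:A100 and
y-values in B1:B100): since LinEst is called with the stats flag, the
function returns a 5-row by 2-column array, so select e.g. a 1x2 range,
enter
=orb(B1:B100,A1:A100)
and confirm with CTRL+SHIFT+ENTER; the first row holds beta and alpha.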

HTH,
Bernd
 
