RK_Excel said:

I used the following formula to find the median for a range of numbers.
I assume the variable p in the formula below to be 0.5.

=PERCENTILE(A1:A7,PERCENTRANK(PROB(A1:A7,B1:B7/SUM(B1:B7),,A1:A7),p,20)) [....]

Download: http://www.excelbanter.com/attachment.php?attachmentid=663

The short answer: this is a problem in PROB. It is arguable whether or not
to call it a defect. (Although I think it could have been avoided in PROB's
implementation.)

And I do not think __we__ can do anything to avoid the problem reliably,
because it arises from a sum that is computed internally.

Ergo, this paradigm is not a reliable way to compute the median of grouped
data.

There are steps that we can take that might mitigate the problem with
specific sets of data. See the details below.

But first....

-----

You have some mistakes. But ironically, they help to identify the root
cause of the problem.

First, the formula in G3 is not array-entered (you pressed just Enter
instead of ctrl+shift+Enter). It should be, just as the formula in G8 is.

Ironically, that is unrelated to the #NUM error in G3. But if PROB had not
found the conditions that triggered the #NUM error, PERCENTRANK would have
returned a #N/A error.

Second, the debug formulas in D3:D198 are incorrect. That is, they are a
misinterpretation of the array-entered formula in G3.

The correct interpretation is to normally-enter the following formula into
E3 (just press Enter), then copy down through E197:

=PROB($A$3:$A$197,$B$3:$B$197/SUM($B$3:$B$197),,A3)

Note: I changed the range from rows 3:198 to rows 3:197 to avoid the PROB
defect, just for discussion purposes.

Alternatively, you can select E3:E197 and array-enter the following formula
(press ctrl+shift+Enter instead of just Enter):

=PROB(A3:A197,B3:B197/SUM(B3:B197),,A3:A197)

This has the same effect as the normally-entered formulas. But it is easier
to create directly from the original formula in G3.

I could explain the difference in interpretation, if you wish. But it is
not germane to the central issue, namely: why does PROB return #NUM errors?

-----

Now select E3:E198 and array-enter the same formula (press ctrl+shift+Enter
instead of just Enter), extending the ranges to row 198, to wit:

=PROB(A3:A198,B3:B198/SUM(B3:B198),,A3:A198)

Note that PROB returns an array of #NUM errors.

Why? Because PROB has determined that SUMPRODUCT(B3:B198/SUM(B3:B198)) is
not 1.

And indeed it is not. Note that =SUMPRODUCT(B3:B198/SUM(B3:B198))-1-0
returns about -1.22E-15, indicating that SUMPRODUCT(B3:B198/SUM(B3:B198))
is infinitesimally less than 1.

(Note: The "redundant" -0 is needed to ensure that Excel does not

arbitrarily "correct" the true arithmetic result to be exactly zero, based

the dubious heuristic poorly described under the misleading title "Example

When a Value Reaches Zero" at

http://support.microsoft.com/kb/78113.)

The inaccuracy is a normal by-product of the native binary computer
arithmetic, which Excel relies on. Such infinitesimal differences are very
common. Usually they simply go unnoticed.
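
If it helps to see the effect outside of Excel, here is a minimal Python
sketch. The weights are made up; only the effect matters. Like Excel,
Python uses IEEE 754 double-precision (64-bit binary) arithmetic:

weights = [7, 11, 13, 17, 19, 23, 29] * 28    # 196 hypothetical line items
total = sum(weights)
probs = [w / total for w in weights]          # like B3:B198/SUM(B3:B198)

# Each quotient is rounded to 53 bits of binary precision, so the sum of
# the rounded quotients need not be exactly 1.
print(sum(probs) - 1)   # typically a tiny nonzero value on the order of
                        # 1E-16 to 1E-15 (exactly 0 only for lucky data)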

Given that they are "very common", the real mystery is: why does PROB raise
a #NUM error for some infinitesimal differences, but not for others?

I believe your incorrect formulas in D3:D198 provide a hint. Note that some
of the formulas in D3:D198 return a #NUM error.

Copying the text of the second PROB parameter from D3: if we normally-enter
the formula =SUMPRODUCT($B$3:B3/SUM($B$3:B3))-1-0 into F3 (just press
Enter) and copy down through F198, some analysis reveals the following
observations:

1. Only 32 differences are exactly zero.

2. Of the remaining 166 differences, the min difference is about -2.33E-15,
and the max difference is about 2.66E-15.

3. But there are only 46 #NUM errors in D3:D198.

4. Among the rows with #NUM errors, the min difference is about -2.33E-15,
and the max difference is about -5.55E-16 (closer to zero).

Ergo, PROB does tolerate some infinitesimal differences from 1 for the sum
of the probabilities, as it really should, IMHO.

But it draws the line at "large" infinitesimal differences.

What is "large"?

I speculate that it is any infinitesimal difference that causes the result
to be something other than 1 when rounded to 15 significant digits.
Specifically: -5*2^-53 (about -5.55E-16) or smaller (further from zero);
and about 23*2^-52 (about 5.11E-15) or larger.
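
That speculation is easy to poke at outside of Excel. The following Python
sketch encodes my conjectured acceptance test (purely a guess at PROB's
internals, not documented behavior) and checks it against differences on
either side of the observed boundaries:

def prob_would_accept(total_probability):
    # Conjecture: PROB accepts the probabilities only if their sum,
    # rounded to 15 significant digits, is exactly 1.
    return float(f"{total_probability:.15g}") == 1.0

for diff in (-6e-16, -4e-16, 0.0, 3e-15, 6e-15):
    print(diff, prob_would_accept(1.0 + diff))

This prints False, True, True, True, False, which is consistent with the
observed cutoffs near -5.55E-16 and +5.11E-15.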

-----

I believe that explains the anomalous #NUM errors from PROB. The question
is: can we avoid them?

In general, I do not believe there is any reliable way to avoid them, since
the sum is computed internally.

But the following steps, together or separately, might remedy #NUM errors
for __specific__ sets of data.

1. Reduce the number of line items (rows). For example, in RK-Excel's data,
several rows could be combined because the prices are identical. This is
even more true if the prices were rounded to the cent, as I believe they
should be.

The fewer the number of items summed, the smaller the accumulated
infinitesimal differences.
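
A rough Python illustration of that point, using contrived equal
probabilities as a stand-in for real data:

# Sum n copies of the rounded probability 1/n; ideally the total is 1.
for n in (10, 100, 1000, 10000):
    print(n, sum([1.0 / n] * n) - 1)

The residues are tiny, but they tend to grow as more rounded terms are
accumulated, which is why combining rows can help.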

2. Instead of entering the array expression B3:B198/SUM(B3:B198) as a
parameter of the PROB function, normally-enter the following formula (just
press Enter) into J4 and copy down through J198:

=B4/SUM($B$3:$B$198)

Then normally-enter the following formula into J3 (just press Enter):

=1-SUMPRODUCT(B4:B198/SUM($B$3:$B$198))-0

Then array-enter the following formula in G3 (press ctrl+shift+Enter, not
just Enter):

=PERCENTILE(A3:A198,PERCENTRANK(PROB(A3:A198,J3:J198,,A3:A198),0.5,20))

Caveat: Although I suspect this will work 99.9999% of the time, it might
not always work because of the specific way that Excel might compute the
sum within the PROB function.
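
Here is a Python sketch of both the idea and the caveat (made-up weights
again; only the technique matters):

weights = [3, 1, 4, 1, 5, 9, 2, 6]
total = sum(weights)
probs = [w / total for w in weights]

tail = sum(probs[1:])            # one fixed summation order
probs[0] = 1.0 - tail            # like J3 = 1-SUMPRODUCT(...): the exact
                                 # complement under that summation order

print(probs[0] + tail == 1.0)    # True by construction for THIS order
print(sum(probs) == 1.0)         # not guaranteed: a different accumulation
                                 # order (like PROB's internal sum) can
                                 # land infinitesimally off 1 again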

3. I might add that the following "common sense" alternative probably will
__not__ remedy the problem reliably. In fact, it might exacerbate it. (And
it does not remedy the problem with RK-Excel's data.)

That would be: replace the array expression B3:B198/SUM(B3:B198) with the
array expression --TEXT(B3:B198/SUM(B3:B198),"0.00000000000000E+0"), which
rounds each probability to 15 significant digits.

Although it might seem to remedy the problem with some data sets, that
would be a coincidence.

There still might be too large a cumulative infinitesimal difference in the
sum, due to the fact that Intel-compatible CPUs actually do arithmetic with
larger precision.
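
A final Python sketch of why per-item rounding is no cure (made-up weights;
and note that Python rounds every intermediate sum to 64 bits, so this only
approximates what an extended-precision internal sum would do):

weights = [1, 2, 3, 4, 5, 6, 7] * 28
total = sum(weights)

# Round each probability to 15 significant digits, like the TEXT() trick.
probs = [float(f"{w / total:.15g}") for w in weights]

print(sum(probs) - 1)   # usually still a tiny nonzero residue; whether it
                        # happens to fall inside PROB's apparent tolerance
                        # is luck, not a guarantee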