Multiple Criteria

M

Michael Dobony

I am using Excel 2010

I have this formula right now
=countifs(A:A,g1,B:B,h1,C:C,I1)

I need to count for A:A =G1 OR G2 with the rest of the calculation. Is
there an easy way to do this other than adding two full countifs together?

Mike D
 
J

joeu2004

Michael Dobony said:
I have this formula right now
=countifs(A:A,g1,B:B,h1,C:C,I1)
I need to count for A:A =G1 OR G2 with the rest of the
calculation. Is there an easy way to do this other than
adding two full countifs together?

First, using full-column ranges like A:A is a bad idea. It forces Excel to
process all 1+ million rows. In your example, that's as many as 3+ million
comparisons.

We rarely have that much data. Even A1:A100000 is better; but you might be
able to reduce it even more.

As to your question, COUNTIFS cannot handle "or" conditions.

If G1=G2 is never true, you might write:

=COUNTIFS(A1:A10000,G1,B1:B10000,H1,C1:C10000,I1)
+ COUNTIFS(A1:A10000,G2,B1:B10000,H1,C1:C10000,I1)

But generally, I think the following is more efficient (compare columns B
and C only once):

=SUMPRODUCT(((A1:A10000=G1)+(A1:A10000=G2)>0)*(B1:B10000=H1)*(C1:C10000=I1))

which can also be written:

=SUMPRODUCT(--((A1:A10000=G1)+(A1:A10000=G2)>0),--(B1:B10000=H1),--(C1:C10000=I1))

If G1=G2 is never true, we do not need ">0". But it's a good habit to get
into for those times when it is needed.
 
J

joeu2004

Ron Rosenfeld said:
This formula must be **array-entered**:
=SUM(COUNTIFS(A:A,INDIRECT("g1:g2"),B:B,H1,C:C,I1))

INDIRECT is a "volatile" function. So that formula (and any dependent
cells) will be recalculated whenever any cell in any worksheet in the
workbook is modified.

That makes the 3+ million comparisons due to the use of A:A, B:B and C:C
even more ill-advised.
 
M

Michael Dobony

First, using full-column ranges like A:A is a bad idea. It forces Excel to
process all 1+ million rows. In your example, that's as many as 3+ million
comparisons.

We rarely have that much data. Even A1:A100000 is better; but you might be
able to reduce it even more.

As to your question, COUNTIFS cannot handle "or" conditions.

If G1=G2 is never true, you might write:

=COUNTIFS(A1:A10000,G1,B1:B10000,H1,C1:C10000,I1)
+ COUNTIFS(A1:A10000,G2,B1:B10000,H1,C1:C10000,I1)

But generally, I think the following is more efficient (compare columns B
and C only once):

=SUMPRODUCT(((A1:A10000=G1)+(A1:A10000=G2)>0)*(B1:B10000=H1)*(C1:C10000=I1))

which can also be written:

=SUMPRODUCT(--((A1:A10000=G1)+(A1:A10000=G2)>0),--(B1:B10000=H1),--(C1:C10000=I1))

If G1=G2 is never true, we do not need ">0". But it's a good habit to get
into for those times when it is needed.

That works on my test table, thanks! But I forgot, I also have a similar
issue with a Sumifs, a sum of weights of two codes. How do I do the same
thing with a sum of the weights?
 
J

joeu2004

Michael Dobony said:
But generally, I think the following is more efficient
(compare columns B and C only once):
=SUMPRODUCT(((A1:A10000=G1)+(A1:A10000=G2)>0)*(B1:B10000=H1)*(C1:C10000=I1))
which can also be written:
=SUMPRODUCT(--((A1:A10000=G1)+(A1:A10000=G2)>0),--(B1:B10000=H1),--(C1:C10000=I1))
[....]
That works on my test table, thanks! But I forgot, I also
have a similar issue with a Sumifs, a sum of weights of
two codes. How do I do the same thing with a sum of the weights?

Suppose what you want to sum is in D1:D10000. Just another parameter, to
wit:

=SUMPRODUCT(((A1:A10000=G1)+(A1:A10000=G2)>0)*(B1:B10000=H1)*(C1:C10000=I1),D1:D10000)
or
=SUMPRODUCT(--((A1:A10000=G1)+(A1:A10000=G2)>0),--(B1:B10000=H1),--(C1:C10000=I1),D1:D1000)

If you opted for the COUNTIFS+COUNTIFS approach (not recommended), use
SUMIFS similarly, to wit:

=SUMIFS(D1:D10000,A1:A10000,G1,B1:B10000,H1,C1:C10000,I1)
+ SUMIFS(D1:D10000,A1:A10000,G2,B1:B10000,H1,C1:C10000,I1)

PS: I am not sure what you mean by a sum of weights "of two codes". If
simply adding D1:D10000 as another parameter does not work for you, please
explain what you want to sum in more detail.
 
R

Ron Rosenfeld

INDIRECT is a "volatile" function. So that formula (and any dependent
cells) will be recalculated whenever any cell in any worksheet in the
workbook is modified.

That makes the 3+ million comparisons due to the use of A:A, B:B and C:C
even more ill-advised.

I don't disagree with you at all about the performance hit on using whole column references. But I have found that COUNTIF has less of a penalty than do array type formulas.
 
J

joeu2004

Ron Rosenfeld said:
I don't disagree with you at all about the performance hit on
using whole column references. But I have found that COUNTIF
has less of a penalty than do array type formulas.

Generally, performance generalizations are incorrect. ;-)

I am not quite sure what you classify as "array type" v. not "array type".
Are you referring to the fact the SUMPRODUCT allows parameters to be arrays,
whereas COUNTIF(S) is limited to ranges?

That would be an interesting theory, one that requires careful study to
prove or disprove.

I must say: it seems plausible, especially if COUNTIFS is implemented
efficiently. By "efficiently", I mean: for COUNTIFS(cond1, cond2,
cond3,...), it does not evaluate "cond2", "cond3" et al if "cond1" is false
(on a row-by-row basis, of course).

Certainly, SUMPRODUCT((cond1)*(cond2)*(cond3)) cannot take advantage of such
efficiency.

And I would certainly opt for a __simple__ COUNTIF(S) instead of SUMPRODUCT
myself, if they are interchangeable one-to-one, if for no other reason than
the easy of writing and understanding.

On the other hand, in this particular situation (array-entered SUM), we are
comparing a single iteration over "cond2" and "cond3" (and some additional
arithmetic overhead) with multiple iterations (and less additional
arithmetic overhead).

I would not want to predict the winner. In my experience with comparing
such alternatives, either there is no statistical difference, or the
relative difference is too small to really make a difference. But that is a
generalization; and as I said, such generalizations are unreliable.

I avoid array-entered formulas for practical reasons, not for performance
reasons.

It is easy to inadvertently press just Enter instead of ctrl+shift+Enter,
even when we know better. And all too often, that does not result in an
Excel error. The consequence is overlooked bogus results.

That risk far outweighs any potential performance benefit, IMHO.
 
R

Ron Rosenfeld

Generally, performance generalizations are incorrect. ;-)

I am not quite sure what you classify as "array type" v. not "array type".
Are you referring to the fact the SUMPRODUCT allows parameters to be arrays,
whereas COUNTIF(S) is limited to ranges?

That would be an interesting theory, one that requires careful study to
prove or disprove.

I must say: it seems plausible, especially if COUNTIFS is implemented
efficiently. By "efficiently", I mean: for COUNTIFS(cond1, cond2,
cond3,...), it does not evaluate "cond2", "cond3" et al if "cond1" is false
(on a row-by-row basis, of course).

Having noted the difference purely by observation some time ago, I undertook some simple comparisons. In VBA, using Evaluate and a HiRes timer, countif seems to generally be about 7 times faster for whole column ranges in Excel 2007. I did not set up to try countifs vs sumproduct on multiple ranges, nor did I try smaller ranges.

Whether these differences are enough to "make a difference" likely depends on the particular worksheet.
 
J

joeu2004

Ron Rosenfeld said:
In VBA, using Evaluate and a HiRes timer, countif seems
to generally be about 7 times faster for whole column
ranges in Excel 2007. I did not set up to try countifs
vs sumproduct on multiple ranges

Although I do not agree with your methods, my measurements do agree with
your conclusion.

I measured the simple COUNTIF v. SUMPRODUCT scenario that you had compared
as well as the array-entered SUM(COUNTIFS) v. SUMPRODUCT scenario that we
are discussing, with both whole-column ranges and limited ranges of 100,000
rows.

The difference of a factor of 6 to 7 times is consistent.

And I do agree that the difference is significant, expecially with
whole-column ranges.
 
R

Ron Rosenfeld

The difference of a factor of 6 to 7 times is consistent.

And I do agree that the difference is significant, expecially with
whole-column ranges.

Thank you for confirming that, and also for doing the more relevant comparison.
 
J

joeu2004

Ron Rosenfeld said:
joeu2004 said:
The difference of a factor of 6 to 7 times is consistent.
[....]
Thank you for confirming that, and also for doing the more
relevant comparison.

FYI, I erred in setting up the measurement experiments. The difference
between array-entered SUM(COUNTIFS) and SUMPRODUCT is even greater in
practical scenarios.

Previously, I had measured the worst-case scenario of 1,048,576 rows of
data. I used the same data when measuring limited ranges of 100,000 rows.

Arguably, a more reasonable scenario has 10,000 rows of data.

Then with whole-column ranges of the form A:A, SUMPRODUCT takes more than 50
times longer than array-entered SUM(COUNTIFS). On my computer, SUMPRODUCT
takes about 1.6 seconds, which is significant.

With limited ranges of the form A1:A100000 (100,000 rows), SUMPRODUCT takes
about 10 times longer. On my computer, SUMPRODUCT takes about 0.15 seconds,
which might or might not be significant to the user.

And with 1000 rows of data and limited ranges of the form A1:A10000 (10,000
rows), SUMPRODUCT again takes about 10 times longer, about 0.025 seconds on
my computer. Again, that might or mightnot be significant to the user.
 
R

Ron Rosenfeld

Ron Rosenfeld said:
joeu2004 said:
The difference of a factor of 6 to 7 times is consistent.
[....]
Thank you for confirming that, and also for doing the more
relevant comparison.

FYI, I erred in setting up the measurement experiments. The difference
between array-entered SUM(COUNTIFS) and SUMPRODUCT is even greater in
practical scenarios.

Previously, I had measured the worst-case scenario of 1,048,576 rows of
data. I used the same data when measuring limited ranges of 100,000 rows.

Arguably, a more reasonable scenario has 10,000 rows of data.

Then with whole-column ranges of the form A:A, SUMPRODUCT takes more than 50
times longer than array-entered SUM(COUNTIFS). On my computer, SUMPRODUCT
takes about 1.6 seconds, which is significant.

With limited ranges of the form A1:A100000 (100,000 rows), SUMPRODUCT takes
about 10 times longer. On my computer, SUMPRODUCT takes about 0.15 seconds,
which might or might not be significant to the user.

And with 1000 rows of data and limited ranges of the form A1:A10000 (10,000
rows), SUMPRODUCT again takes about 10 times longer, about 0.025 seconds on
my computer. Again, that might or mightnot be significant to the user.

Thank you.
By the way, I am trying to set up the sum(countifs(... indirect(... formula and cannot seem to get it to work with evaluate. It performs OK IF I replace the INDIRECT("G1:G2") part with an array constant where the elements are the contents of those cells; but if I use INDIRECT(range reference) , the formula returns an incorrect (zero) result.

How did you implement that?
 
J

joeu2004

Ron Rosenfeld said:
I am trying to set up the sum(countifs(... indirect(... formula
and cannot seem to get it to work with evaluate. It performs OK
IF I replace the INDIRECT("G1:G2") part with an array constant
where the elements are the contents of those cells; but if I use
INDIRECT(range reference) , the formula returns an incorrect
(zero) result. How did you implement that?

I confirm your observation.

I did not use INDIRECT("G1:G2"). In this case, it has dubious value in an
Excel formula. And it is especially superfluous in an Evaluate parameter,
other than for the possible purpose of comparing performance of
INDIRECT("G1:G2") with G1:G2. (But there is a better way.)

Apparently, Evaluate does not like INDIRECT. Note that
x=Evaluate("sum(indirect(""g1:g2""))") returns Error 2023 when x is a
Variant variable. Error 2023 is a #REF error. Yet x=Evaluate("sum(g1:g2)")
works just fine, of course.

Be that as it may, we should not use Evaluate to measure formula performance
in the first place, unless our intent is measure compare performance with
and without Evaluate per se. First, our measurement would include formula
translation time, which we should not be measuring in this case. Second,
our measurement would include Evaluate overhead, which is unknown.

On my computer, Evalute performance is about 2 times the performance of the
same formula in an Excel cell, sans INDIRECT in both cases of course.

(I am surprised that it is "x" times, not some fix amount of overhead. I
have measured both long and short operations. That makes me even more
suspicious of using Evaluate for measuring performace of a formula, albeit a
no-brainer not to.)
 
R

Ron Rosenfeld

Apparently, Evaluate does not like INDIRECT. Note that
x=Evaluate("sum(indirect(""g1:g2""))") returns Error 2023 when x is a
Variant variable. Error 2023 is a #REF error. Yet x=Evaluate("sum(g1:g2)")
works just fine, of course.

Another formula that works properly is:

evaluate("=index(indirect(""j1:j2""),1)")
or
evaluate("=index(indirect(""j1:j2""),2)")
Be that as it may, we should not use Evaluate to measure formula performance
in the first place, unless our intent is measure compare performance with
and without Evaluate per se. First, our measurement would include formula
translation time, which we should not be measuring in this case. Second,
our measurement would include Evaluate overhead, which is unknown.

I was concerned about that, but could not come up with another method to measure the performance of the formula I had proposed. I was hoping that the INDIRECT overhead would be fixed (but did not measure it directly as you have) but realized there might be more overhead depending on the complexity of the formula to be translated.
(I am surprised that it is "x" times, not some fix amount of overhead. I
have measured both long and short operations. That makes me even more
suspicious of using Evaluate for measuring performace of a formula, albeit a
no-brainer not to.)

So how did you do this, when dealing with these formulas? The constructs I tried using worksheetfunction returned errors.
 
J

joeu2004

PS.... I said:
On my computer, Evalute performance is about 2 times the performance of
the same formula in an Excel cell, sans INDIRECT in both cases of course.
(I am surprised that it is "x" times, not some fix amount of overhead.

We might think that perhaps Evaluate takes 2 times longer in this case
because with SUM(COUNTIFS(...,G1:G2,...)), we are summing an array of 2
elements, as if to theorize that Evaluate is evaluating each array element
separately.
 
J

joeu2004

Ignore this. Pressed Send by mistake.

-----

joeu2004 said:
We might think that perhaps Evaluate takes 2 times longer in this case
because with SUM(COUNTIFS(...,G1:G2,...)), we are summing an array of 2
elements, as if to theorize that Evaluate is evaluating each array element
separately.
 
J

joeu2004

Ron Rosenfeld said:
Another formula that works properly is:
evaluate("=index(indirect(""j1:j2""),1)")
or
evaluate("=index(indirect(""j1:j2""),2)")

I confirm your observation. So Evaluate does not have a problem with
INDIRECT per se.

Hmm.... Biting my tongue to avoid wild speculation about the failures of
Evaluate("...INDIRECT...").


Ron Rosenfeld said:
So how did you do this, when dealing with these formulas?
The constructs I tried using worksheetfunction returned errors.

We certainly do not want to measure WorksheetFunction time. The WF
implementation is not Excel.

Normally, I use the following paradigm. (See below for myTimer and
myElapsedTime.)

Sub test1()
Dim st As Currency, et As Currency, i As Long
Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
Range("e1").Clear
Range("e1").FormulaArray = _
"=sum(countifs(a1:a10000,d1:d2,b1:b10000,d3,c1:c10000,d4))"
For i = 1 To 5
st = myTimer
Range("e1").Calculate
et = myTimer
Debug.Print "time1: " & _
Format(myElapsedTime(et - st), "0.000000") & " sec"
Next
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True
End Sub

Replace d1:d2 with indirect(""d1:d2"") to measure the difference that
INDIRECT makes. Not much, as I would expect.

Sometimes I use the following paradigm for a more real-life metric.

Sub test2()
Dim st As Currency, et As Currency, i As Long
Application.ScreenUpdating = True
Application.Calculation = xlCalculationAutomatic
Range("e1").Clear
Range("e1").FormulaArray = _
"=sum(countifs(a1:a10000,d1:d2,b1:b10000,d3,c1:c10000,d4))"
For i = 1 To 5
st = myTimer
Range("e1").Dirty
et = myTimer
Debug.Print "test2: " & _
Format(myElapsedTime(et - st), "0.000000") & " sec"
Next
End Sub

In either case, usually I bootstrap the number of iterations based on the
relative standard error of the 95% confidence interval around the average,
excluding "obvious" outliers. For example, for 1% rel std err, the
approximate number of iterations is n = 1.96*sd/(avg*1%) [1].

(I might throttle n back due to practical considerations.)

"Obvious" outliers are values that are many sd away from the avg. Typical
causes: (1) the first execution of code after editing; and (2) system
interrupt processing.

Caveat: Single-cell Range.Calculate or Range.Dirty might incur too much
inter-thread overhead; that is, the communication between VBA and Excel
might outweigh the execution time of the formula. Often, it is better to
set up the same formula in multiple cells, then divide the total execution
time by the number of cells. However, in doing so, there is a potential
"observer effect"; that is, increasing the number of formulas increases the
complexity of the worksheet, and that might alter Excel's behavior. The
same can be said for increasing the amount of data processed by formulas
like those above. This is the point where the scientific methodology of
performance measurement becomes more of an art.


-----
[1] 1.96 = NORMSINV(1-(1-95%)/2)


-----
Public Declare Sub Sleep Lib "kernel32" (ByVal msec As Long)
Public Declare Function QueryPerformanceFrequency Lib "kernel32" _
(ByRef freq As Currency) As Long
Public Declare Function QueryPerformanceCounter Lib "kernel32" _
(ByRef cnt As Currency) As Long

Private freq As Currency, df As Double

Function myTimer() As Currency
QueryPerformanceCounter myTimer
End Function

Function myElapsedTime(ByVal dt As Currency) As Double
If freq = 0 Then QueryPerformanceFrequency freq: df = freq
myElapsedTime = dt / df
End Function
 
M

Michael Dobony

Michael Dobony said:
But generally, I think the following is more efficient
(compare columns B and C only once):
=SUMPRODUCT(((A1:A10000=G1)+(A1:A10000=G2)>0)*(B1:B10000=H1)*(C1:C10000=I1))
which can also be written:
=SUMPRODUCT(--((A1:A10000=G1)+(A1:A10000=G2)>0),--(B1:B10000=H1),--(C1:C10000=I1))
[....]
That works on my test table, thanks! But I forgot, I also
have a similar issue with a Sumifs, a sum of weights of
two codes. How do I do the same thing with a sum of the weights?

Suppose what you want to sum is in D1:D10000. Just another parameter, to
wit:

=SUMPRODUCT(((A1:A10000=G1)+(A1:A10000=G2)>0)*(B1:B10000=H1)*(C1:C10000=I1),D1:D10000)
or
=SUMPRODUCT(--((A1:A10000=G1)+(A1:A10000=G2)>0),--(B1:B10000=H1),--(C1:C10000=I1),D1:D1000)

If you opted for the COUNTIFS+COUNTIFS approach (not recommended), use
SUMIFS similarly, to wit:

=SUMIFS(D1:D10000,A1:A10000,G1,B1:B10000,H1,C1:C10000,I1)
+ SUMIFS(D1:D10000,A1:A10000,G2,B1:B10000,H1,C1:C10000,I1)

PS: I am not sure what you mean by a sum of weights "of two codes". If
simply adding D1:D10000 as another parameter does not work for you, please
explain what you want to sum in more detail.

Example:
Sum of the weights when the weight is above 499 and less than 1000 and the
corresponding code of that line of data is either TPGR or TPGC (A1 and
B1)and the zone is 3 (A2).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top