GROUP BY vs DISTINCT

M

Michel Walsh

To illustrate what GROUP BY is and is NOT, so things can be clear in mind.



SELECT f1, f2, f3, MAX(f4), MIN(f5), SUM(f6)
FROM somewhere
GROUP BY f1, f2, f3


can be 'described' as: do all different (distinct) groups of possible
triples {f1, f2, f3} and once these groups are made, for each of them,
compute the aggregate.

If you imagine that it is what is effectively done, ie. first, define 'm'
buckets, one per triple {f1, f2, f3}, fill the each bucket with the records
that belong to them, THEN, once it is done, look again on each bucket and
compute the aggregate,... well, that is a possible 'plan', but a very
not-efficient one since you 'touch' each record twice.

We can easily solve the query by touching each record only ONCE. Indeed,
assume there is already a "group" created for f1='a', f2='b', f3='c', then
once a new record has the same values for that triple, do we have to store
it? Well, surely we don't need to store f1, f2, and f3 again, since they are
the same values than those of the 'group' identifier. What about f4 of the
new record? again, the existing group can already has the MAXupToNowOf(f4),
MINupToNowOf(f5) and SUMupToNowOf(f6). So, why not using f4 of the new
record and 'aggregate it one term at a time' with MaxUpToNowOf(f4), and same
thing with f5, and with f6. Do we need to 'touch' that new record? no! So
not only you don't need to touch each record more than once, but each
'bucket' is at most, one row deep. That is a 'line', or a 'row', if you
wish. Then, what look can take the whole set of 'rows'? you got it, a list,
a SORTED list, no dup on the sorted fields: f1, f2, f3.


So, what DISTINCT is?




Vanderghast, Access MVP
 
R

Rick Brandt

Michel said:
To illustrate what GROUP BY is and is NOT, so things can be clear in
mind.
[snip]

I don't see where you are going here. The OP was about what was quickest;
Group By (without aggregating) or DISTINCT. You are now going on about the
fact that Jet might not use the best plan for Group By queries. That is
beside the point isn't it? If I need to aggregate on groups then I must use
Group By and whether Jet does that in the absolute most efficient manner is
irrelevant. And as I stated before if I don't need to aggregate but only
want distinct values then I use the DISTINCT clause.

I see no point in creating off the wall query solutions that are not going
to self document in order to save a few fractions of a second. If it will
save more than that then either the database is poorly designed or has been
built with the wrong database engine.
 
M

Michel Walsh

The OP did asked what was the quickest, yes, but some answers, to stay
polite, let call it that way, he got where about there was a whole world of
difference between a GROUP BY and DISTINCT. I continue to claim both syntax
are logically the same, and bring back-up about my claim (as MS SQL Server
using the same plan of executions, and, recently, explanations about this
plan, since some people seem to continue to think those are two different
things).

Few milliseconds, for small tables, but as someone else pointed it out, and
I found the same order of magnitude with using one indexed field, with JET,
the ratio of execution time is by an order of magnitude, so that while
DISTINCT took, for me, on average, 1 sec, the GROUP BY took 11 sec. It is
not ONLY by a few milliseconds.

And since the message seems to have problem to pass, I am glad to repeat it:
SQL is not about what are your intentions are, neither how to solve a
problem but ABOUT A DESCRIPTION of what we want. DISTINCT and GROUP BY
(without aggregate) are logically equivalent, and even more, GROUP BY WITH
AGGREGATE is free lunch by comparison with the same statement WITHOUT
AGGREGATE, since computing the aggregate is negligible in comparison with
the sorting.


If you use DISTINCT when you don't want aggregate, great, but you still use
inefficient GROUP BY when you need aggregate. If you are glad and happy
camper with that, I am glad for you, should I not?



Vanderghast, Access MVP


Rick Brandt said:
Michel said:
To illustrate what GROUP BY is and is NOT, so things can be clear in
mind.
[snip]

I don't see where you are going here. The OP was about what was quickest;
Group By (without aggregating) or DISTINCT. You are now going on about the
fact that Jet might not use the best plan for Group By queries. That is
beside the point isn't it? If I need to aggregate on groups then I must
use Group By and whether Jet does that in the absolute most efficient
manner is irrelevant. And as I stated before if I don't need to aggregate
but only want distinct values then I use the DISTINCT clause.

I see no point in creating off the wall query solutions that are not going
to self document in order to save a few fractions of a second. If it will
save more than that then either the database is poorly designed or has
been built with the wrong database engine.
 
D

David Cox

If the user is not expert and not familiar with editing SQL, have only used
the GUI for simple queries, then a click on the GROUP BY button, however
amatuerish, will probably get them to the desired result quicker.
If they are even less expert than that, learning what the GROUP BY button
does will probably be beneficial sooner that learning what DISTINCT does,
and how to apply it.


Michel Walsh said:
The OP did asked what was the quickest, yes, but some answers, to stay
polite, let call it that way, he got where about there was a whole world
of difference between a GROUP BY and DISTINCT. I continue to claim both
syntax are logically the same, and bring back-up about my claim (as MS SQL
Server using the same plan of executions, and, recently, explanations
about this plan, since some people seem to continue to think those are two
different things).

Few milliseconds, for small tables, but as someone else pointed it out,
and I found the same order of magnitude with using one indexed field, with
JET, the ratio of execution time is by an order of magnitude, so that
while DISTINCT took, for me, on average, 1 sec, the GROUP BY took 11 sec.
It is not ONLY by a few milliseconds.

And since the message seems to have problem to pass, I am glad to repeat
it: SQL is not about what are your intentions are, neither how to solve a
problem but ABOUT A DESCRIPTION of what we want. DISTINCT and GROUP BY
(without aggregate) are logically equivalent, and even more, GROUP BY WITH
AGGREGATE is free lunch by comparison with the same statement WITHOUT
AGGREGATE, since computing the aggregate is negligible in comparison with
the sorting.


If you use DISTINCT when you don't want aggregate, great, but you still
use inefficient GROUP BY when you need aggregate. If you are glad and
happy camper with that, I am glad for you, should I not?



Vanderghast, Access MVP


Rick Brandt said:
Michel said:
To illustrate what GROUP BY is and is NOT, so things can be clear in
mind.
[snip]

I don't see where you are going here. The OP was about what was
quickest; Group By (without aggregating) or DISTINCT. You are now going
on about the fact that Jet might not use the best plan for Group By
queries. That is beside the point isn't it? If I need to aggregate on
groups then I must use Group By and whether Jet does that in the absolute
most efficient manner is irrelevant. And as I stated before if I don't
need to aggregate but only want distinct values then I use the DISTINCT
clause.

I see no point in creating off the wall query solutions that are not
going to self document in order to save a few fractions of a second. If
it will save more than that then either the database is poorly designed
or has been built with the wrong database engine.
 
W

Warrio

The original question was about to know which one was the quickest indeed...
since they give the same result.

some people were saying that each thing has its own purpose, I agree with
you, but I have the right to do not follow this rule, if it's more efficient
for what I want to do.



it's like leaving a room using the window instead of the door. most people
would take the door, but, for me if I can get out of the room quicker and
more efficiently, I'll take the window, no matter what others would say : ).
Maybe it's politically, usually incorrect, but both enables me to leave the
room. it's not because the English word given to the function says DISTINCT,
that its only function should be eliminating the double records.



the good thing in this discussion is that I know that the GROUP BY should be
quicker except if it's used with the jet engine. and I got the answer to my
question. so in the future, when I'll have the choice, I'll use the DISTINCT
function with a jet until it gets repaired.

And maybe one day I'll have to find more details about how it really works
and why, for now I don't have really the time for it. Even though it still
very interesting!



Michel Walsh said:
The OP did asked what was the quickest, yes, but some answers, to stay
polite, let call it that way, he got where about there was a whole world
of difference between a GROUP BY and DISTINCT. I continue to claim both
syntax are logically the same, and bring back-up about my claim (as MS SQL
Server using the same plan of executions, and, recently, explanations
about this plan, since some people seem to continue to think those are two
different things).

Few milliseconds, for small tables, but as someone else pointed it out,
and I found the same order of magnitude with using one indexed field, with
JET, the ratio of execution time is by an order of magnitude, so that
while DISTINCT took, for me, on average, 1 sec, the GROUP BY took 11 sec.
It is not ONLY by a few milliseconds.

And since the message seems to have problem to pass, I am glad to repeat
it: SQL is not about what are your intentions are, neither how to solve a
problem but ABOUT A DESCRIPTION of what we want. DISTINCT and GROUP BY
(without aggregate) are logically equivalent, and even more, GROUP BY WITH
AGGREGATE is free lunch by comparison with the same statement WITHOUT
AGGREGATE, since computing the aggregate is negligible in comparison with
the sorting.


If you use DISTINCT when you don't want aggregate, great, but you still
use inefficient GROUP BY when you need aggregate. If you are glad and
happy camper with that, I am glad for you, should I not?



Vanderghast, Access MVP


Rick Brandt said:
Michel said:
To illustrate what GROUP BY is and is NOT, so things can be clear in
mind.
[snip]

I don't see where you are going here. The OP was about what was
quickest; Group By (without aggregating) or DISTINCT. You are now going
on about the fact that Jet might not use the best plan for Group By
queries. That is beside the point isn't it? If I need to aggregate on
groups then I must use Group By and whether Jet does that in the absolute
most efficient manner is irrelevant. And as I stated before if I don't
need to aggregate but only want distinct values then I use the DISTINCT
clause.

I see no point in creating off the wall query solutions that are not
going to self document in order to save a few fractions of a second. If
it will save more than that then either the database is poorly designed
or has been built with the wrong database engine.
 
W

Warrio

one question then :

Does the jet engine sort the selected data before grouping them? If I had to
write the function my self, I would create two queries one for a small
amount of data and the other for considerable amount of data.

Because sorting the data would be useful to avoid make a useless loop on the
list of data to display to check if it's displayed already or not, because
it's the same as the previous record of the sorted data...






Michel Walsh said:
The OP did asked what was the quickest, yes, but some answers, to stay
polite, let call it that way, he got where about there was a whole world
of difference between a GROUP BY and DISTINCT. I continue to claim both
syntax are logically the same, and bring back-up about my claim (as MS SQL
Server using the same plan of executions, and, recently, explanations
about this plan, since some people seem to continue to think those are two
different things).

Few milliseconds, for small tables, but as someone else pointed it out,
and I found the same order of magnitude with using one indexed field, with
JET, the ratio of execution time is by an order of magnitude, so that
while DISTINCT took, for me, on average, 1 sec, the GROUP BY took 11 sec.
It is not ONLY by a few milliseconds.

And since the message seems to have problem to pass, I am glad to repeat
it: SQL is not about what are your intentions are, neither how to solve a
problem but ABOUT A DESCRIPTION of what we want. DISTINCT and GROUP BY
(without aggregate) are logically equivalent, and even more, GROUP BY WITH
AGGREGATE is free lunch by comparison with the same statement WITHOUT
AGGREGATE, since computing the aggregate is negligible in comparison with
the sorting.


If you use DISTINCT when you don't want aggregate, great, but you still
use inefficient GROUP BY when you need aggregate. If you are glad and
happy camper with that, I am glad for you, should I not?



Vanderghast, Access MVP


Rick Brandt said:
Michel said:
To illustrate what GROUP BY is and is NOT, so things can be clear in
mind.
[snip]

I don't see where you are going here. The OP was about what was
quickest; Group By (without aggregating) or DISTINCT. You are now going
on about the fact that Jet might not use the best plan for Group By
queries. That is beside the point isn't it? If I need to aggregate on
groups then I must use Group By and whether Jet does that in the absolute
most efficient manner is irrelevant. And as I stated before if I don't
need to aggregate but only want distinct values then I use the DISTINCT
clause.

I see no point in creating off the wall query solutions that are not
going to self document in order to save a few fractions of a second. If
it will save more than that then either the database is poorly designed
or has been built with the wrong database engine.
 
D

David W. Fenton

In Jet, the second query uses the same plan than MS SQL
Server, so say the second query, in Jet, also takes 1 UET, but the
first query, in Jet, easily takes up to 10 times that.

Where do you get that information? I didn't see anyone post any such
results.
 
D

David W. Fenton

Michel said:
To illustrate what GROUP BY is and is NOT, so things can be clear
in mind.
[snip]

I don't see where you are going here. The OP was about what was
quickest; Group By (without aggregating) or DISTINCT. You are now
going on about the fact that Jet might not use the best plan for
Group By queries. That is beside the point isn't it? If I need to
aggregate on groups then I must use Group By and whether Jet does
that in the absolute most efficient manner is irrelevant.

He does have one point which he could prove by showing that a in Jet
GROUP BY on a SELECT DISTINCT subquery is faster than a GROUP BY on
the raw SELECT subquery (this would introduce joins, so that might
muck things up, of course).
And as I stated before if I don't need to aggregate but only
want distinct values then I use the DISTINCT clause.

He is arguing that the fact that the results are identical means
that they should always be optimized the same. This is definitely
false as a generalized case -- it's too easy to come up with
counterfactual examples. I'm not sure if it's true in this case. I
would think that a GROUP BY would need to evaluate other factures
(what kind of aggregation is being used, for instance -- in the case
of FIRST() or LAST() it might be counterproductive to do a DISTINCT
first).
I see no point in creating off the wall query solutions that are
not going to self document in order to save a few fractions of a
second. If it will save more than that then either the database
is poorly designed or has been built with the wrong database
engine.

I don't understand why someone would want to write a new line of SQL
when they could get the same results by adding a single SQL command.
 
D

David W. Fenton

If the user is not expert and not familiar with editing SQL, have
only used the GUI for simple queries, then a click on the GROUP BY
button, however amatuerish, will probably get them to the desired
result quicker.
If they are even less expert than that, learning what the GROUP
BY button
does will probably be beneficial sooner that learning what
DISTINCT does, and how to apply it.

And it will only be an efficient choice if you're using a database
engine that makes the same assumptions the naive user is making,
that this is equivalent to a DISTINCT.
 
D

David Cox

David W. Fenton said:
And it will only be an efficient choice if you're using a database
engine that makes the same assumptions the naive user is making,
that this is equivalent to a DISTINCT.

I am not clear what point you are trying to make. I have no quarrel in that
DISTINCT is the "proper" way to do the task. We are talking about ACCESS. My
test produced 39,000 DISTINCT records from a 400,000 data set in about a
quarte of the time it took to doa GROUP BY to produce the same data set.

Starting from the moment I had entered the field into the query grid until
the solution appeared running the GROUP BY query by clicking the two
buttons was finished before I had started to type into the SQL window. It
dependshow you measure efficient..
If it is a question of explaining how to do it I suspect "just click that
button first' would be factors of ten times faster for getting a trainee to
do the task.

David F. Cox
 
D

David Cox

Rick Brandt said:
To me, "Group By" implies aggregating across the groups created. If you
have no intention to aggregate anything then why create groupings?

I mean I *can* turn off the light by unscrewing the bulb, but why do that
when there is a switch on the wall? If I want distinct values I use the
DISTINCT clause. Some other mechanism that might provide the same result
is of no interest to me as long as the mechanism specifically designed for
what I want to accomplish is there and works.

Much as I sympathise with your intent I have to point out that in Access it
is Group By that has the switch on the wall, and DISTINCT that requires you
screw something in. :-<
 
D

David W. Fenton

I am not clear what point you are trying to make. I have no
quarrel in that DISTINCT is the "proper" way to do the task. We
are talking about ACCESS. My test produced 39,000 DISTINCT records
from a 400,000 data set in about a quarte of the time it took to
doa GROUP BY to produce the same data set.

And if you were running it against SQL Server, it would have taken
the same amount of time for both, according to Michel (and the query
plan he posted).
Starting from the moment I had entered the field into the query
grid until the solution appeared running the GROUP BY query by
clicking the two buttons was finished before I had started to type
into the SQL window. It dependshow you measure efficient..
If it is a question of explaining how to do it I suspect "just
click that button first' would be factors of ten times faster for
getting a trainee to do the task.

No one can get by with GROUP BY as a replacement for DISTINCT in all
cases. Consider if you have joins to other tables. My bet is that It
will have an effect on the Jet query plan.

GROUP BY is for aggregating data.

DISTINCT is for eliminating duplicates.

That GROUP BY happens to eliminate duplicates in the process of
aggregating data is not a sufficient reason to use GROUP BY,
espeiclaly with a database engine wherein the GROUP BY is slower
than DISTINCT.
 
D

David Cox

David W. Fenton said:
And if you were running it against SQL Server, it would have taken
the same amount of time for both, according to Michel (and the query
plan he posted).


No one can get by with GROUP BY as a replacement for DISTINCT in all
cases. Consider if you have joins to other tables. My bet is that It
will have an effect on the Jet query plan.

GROUP BY is for aggregating data.

DISTINCT is for eliminating duplicates.

That GROUP BY happens to eliminate duplicates in the process of
aggregating data is not a sufficient reason to use GROUP BY,
espeiclaly with a database engine wherein the GROUP BY is slower
than DISTINCT.

Most peoples cars could run faster if they opened the hood and made a few
adjustments. . There are a lot of Access users out there that do not want to
"open the hood", or know anything about what SQL is or does. For those less
skilled users learning how to use Group By on the QBE grid will be far more
productive in the short to medium term than learning how to edit SQL to add
"DISTINCT".
Using my test query on an old slow PC I would need to run the query over 20
times before it saved me time.
The other, more worrying, thing about distinct is that I can see no
indication that it is in the query when viewed from the grid design window.
Stranger still when I added a field x:1 to the design grid the "Group By"
version ran faster than Distinct. by a factor around 8:7

DISTINCT shows 1 record for N duplicates.
GROUP BY shows 1 record for N duplicates

I think I have discovered, at last, why Microsoft did not feel the need for
a "DISTINCT" button.
 
M

Michel Walsh

I mention should, you added the always. Should, as in a GOAL. These queries
are basic and elementary ones.


You are also missing that GROUP BY WITH AGGREGATE can use the similar query
plan than the one used by DISTINCT and GROUP BY without aggregate (as you
can check my 'claim' using MS SLQ Server), where the aggregates are done for
a small fraction (less than half of one percent of the whole query
execution). Jet is inefficient in both cases, I mean, with or without
aggregate.

I NEVER said that result that are identical implies the execution plan
should be the same. I said LOGICALLY EQIVALENT description SHOULD be
optimized the same way. And DISTINCT and GROUP BY without aggregate ARE
logically equivalent descriptions.





Vanderghast, Access MVP


David W. Fenton said:
Michel said:
To illustrate what GROUP BY is and is NOT, so things can be clear
in mind.
[snip]

I don't see where you are going here. The OP was about what was
quickest; Group By (without aggregating) or DISTINCT. You are now
going on about the fact that Jet might not use the best plan for
Group By queries. That is beside the point isn't it? If I need to
aggregate on groups then I must use Group By and whether Jet does
that in the absolute most efficient manner is irrelevant.

He does have one point which he could prove by showing that a in Jet
GROUP BY on a SELECT DISTINCT subquery is faster than a GROUP BY on
the raw SELECT subquery (this would introduce joins, so that might
muck things up, of course).
And as I stated before if I don't need to aggregate but only
want distinct values then I use the DISTINCT clause.

He is arguing that the fact that the results are identical means
that they should always be optimized the same. This is definitely
false as a generalized case -- it's too easy to come up with
counterfactual examples. I'm not sure if it's true in this case. I
would think that a GROUP BY would need to evaluate other factures
(what kind of aggregation is being used, for instance -- in the case
of FIRST() or LAST() it might be counterproductive to do a DISTINCT
first).
I see no point in creating off the wall query solutions that are
not going to self document in order to save a few fractions of a
second. If it will save more than that then either the database
is poorly designed or has been built with the wrong database
engine.

I don't understand why someone would want to write a new line of SQL
when they could get the same results by adding a single SQL command.
 
M

Michel Walsh

Where? Here. You don't see it? Well, re-initialize your newsreader. Baz
reports it, and, since I was not believing its experimentation was right, I
decided to do it myself and got a similar result that I also reported.


Vanderghast, Access MVP.
 
M

Michel Walsh

I really don't know about what Jet does in the step "group" of its query
plan. MS SQL Server is more detailed and uses the index (if it exists, and
if the number of records is high enough) of the field to be DISTINCT/GROUP
BY.

If you implant that strategy yourself, in this case, and if the index
exists, since the reading of the records (note the query is covered by the
index) is done already sorted, so you can dump the result list onto a stack
rather than on a sorted list. Indeed, to find if a record is already in the
"selected/grouped" list, you only have to compare it with the top value on
the stack. If the index does not exist, you have to relay on stats (if they
are available): does the result will have much less rows than the initial
data, or not. If the number of row is much smaller in the result, may be
preferable to use a sorted list (faster to sort, a little bit longer to
search) than to sort the whole initial set first, then use a stack (longer
to sort the whole initial set, faster to 'search' for existence). Again, if
the index exists, the query being covered by the index, you don't need to
touch the data at all, just the index, so, you have less IO (since index
would be, generally, more compact, than the 'table', it will resides on less
'pages', so less requirements to hit the hard disk). You see, that is what
SQL is: it knows different way of doing the described set, and decide which
strategy it will take. If you want to do the same, clearly, you should
weight the similar alternatives, and 'hard code' multiple solutions, not
just one (in general).



Vanderghast, Access MVP
 
D

David W. Fenton

Most peoples cars could run faster if they opened the hood and
made a few adjustments. . There are a lot of Access users out
there that do not want to "open the hood", or know anything about
what SQL is or does. For those less skilled users learning how to
use Group By on the QBE grid will be far more productive in the
short to medium term than learning how to edit SQL to add
"DISTINCT".

Er, you don't have to edit SQL to do that.
Using my test query on an old slow PC I would need to run the
query over 20 times before it saved me time.

I agree that it's not really a significant difference for easy test
cases.
The other, more worrying, thing about distinct is that I can see
no indication that it is in the query when viewed from the grid
design window. Stranger still when I added a field x:1 to the
design grid the "Group By" version ran faster than Distinct. by a
factor around 8:7

DISTINCT shows 1 record for N duplicates.
GROUP BY shows 1 record for N duplicates

I think I have discovered, at last, why Microsoft did not feel the
need for a "DISTINCT" button.

I think you're imagining things.
 
D

David W. Fenton

Vanderghast, Access MVP



Because it is far from being optimal.

Well, when used WRONG, perhaps.

But have you tested it for its actual performance in aggregating
data? If Jet would aggregate data more slowly if it optimized GROUP
BY to DISTINCT, then I'm glad that it's slower when used WRONGLY as
the equivalent of DISTINCT.

But nobody has tested that.

I see no reason to, since I have nothing to prove here.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top