GROUP BY vs DISTINCT

Warrio · Jun 1, 2007

it's not that I want to waste my time saying that you are right or write
saying the only way is testing...

but the good thing in computing is that everything has a reason, somewhere
hidden or invisible, but it does exist! and saying that testing will give
the right answer is like gambling! how many parameters can influence your
testing????

even if you make it on a new machine, without any connection and only access
running! you can't have the same result.

come on, computing is one of the rare exact science, don't make something
esle!

Baz · Jun 2, 2007

If you are so convinced that Michel is right, on the basis of no evidence
whatsoever, I suggest that YOU better try it.

If someone here told you that the moon is made of green cheese I guess you
would believe that too.

Baz · Jun 2, 2007

My God what a pair of idiots (I mean Walsh and Warrio, not you David).

Probably better to let these two fools massage each other's egos in French
while we go and do something more worthwhile.

David W. Fenton said:
and here is the query plan, again, the same, for both statements
(excuse the HTML format, I suspect that without it, it turns out
un-readable):

|--Sort(DISTINCT ORDER BY[pubs].[dbo].[authors].[au_fname]
|ASC))

|--Index Scan(OBJECT[pubs].[dbo].[authors].[aunmind]))

Click to expand...

Er, that's a SQL Server query plan, not a Jet one.

Who said we were using SQL Server as the back end?

Baz · Jun 2, 2007

If you really believe that garbage then you have no chance of getting an
answer to your question. Oh, but I'm forgetting, you have already accepted
Michel's answer, even though he hasn't a clue what he's talking about.

Warrio said:
it's not that I want to waste my time saying that you are right or write
saying the only way is testing...

but the good thing in computing is that everything has a reason, somewhere
hidden or invisible, but it does exist! and saying that testing will give
the right answer is like gambling! how many parameters can influence your
testing????

even if you make it on a new machine, without any connection and only access
running! you can't have the same result.

come on, computing is one of the rare exact science, don't make something
esle!

"David W. Fenton" <[email protected]> a écrit dans le message de

Baz · Jun 2, 2007

It's a good thing he doesn't because speculation and misinformation are no
use to anyone.

Michel Walsh · Jun 4, 2007

Indeed, I assumed that Jet would solve these simple basic queries the same
way MS SQL Server would do, but it seems Jet optimizer fails to optimize the
second query.

In MS SQL Server, there is no difference between the queries plan, even if
there are multiple fields (indexed, or not) between the DISTINCT and the
logical equivalent GROUP BY.

Vanderghast, Access MVP

David W. Fenton said:
and here is the query plan, again, the same, for both statements
(excuse the HTML format, I suspect that without it, it turns out
un-readable):

|--Sort(DISTINCT ORDER BY[pubs].[dbo].[authors].[au_fname]
|ASC))

|--Index Scan(OBJECT[pubs].[dbo].[authors].[aunmind]))

Click to expand...

And if you run it in Jet, the SHOWPLAN gives this (Jet 3.5 or 4.0,
run both wiht LastName indexed and indexed (non-unique)):

--- Query1 ---
[SELECT DISTINCT tblPerson.LastName
FROM tblPerson;]

- Inputs to Query -
Table 'tblPerson'
- End inputs to Query -

store result in temporary table

--- Query2 ---
[SELECT tblPerson.LastName
FROM tblPerson
GROUP BY tblPerson.LastName;]

- Inputs to Query -
Table 'tblPerson'
- End inputs to Query -

01) Group table 'tblPerson'

This shows that the answer to the question depends on the database
engine being used, and, as expected, Jet handles the two completely
differently.

Michel Walsh · Jun 4, 2007

Half of the pair will humbly tell you that if it just happen that Jet is ...
crippled... in the particular case, that is surely not 'by design', since
clearly, other SQL optimizer achieve to find the SAME plan of execution for
a DISTINCT and the logically equivalent GROUP BY.

There is nothing 'built-in' the concepts of DISTINCT or of GROUP BY that may
leads you to think they logically 'differ', exception made for some syntax
particularities like DISTINCT which can be used with *, while GROUP BY
cannot (due to extra validations carried over when GROUP BY is explicitly
used).

Vanderghast, Access MVP.

Baz said:
My God what a pair of idiots (I mean Walsh and Warrio, not you David).

Probably better to let these two fools massage each other's egos in French
while we go and do something more worthwhile.

David W. Fenton said:

and here is the query plan, again, the same, for both statements
(excuse the HTML format, I suspect that without it, it turns out
un-readable):

|--Sort(DISTINCT ORDER BY[pubs].[dbo].[authors].[au_fname]
|ASC))

|--Index Scan(OBJECT[pubs].[dbo].[authors].[aunmind]))

Click to expand...

Er, that's a SQL Server query plan, not a Jet one.

Who said we were using SQL Server as the back end?

Click to expand...

Keith Wilby · Jun 4, 2007

Baz said:
If someone here told you that the moon is made of green cheese I guess you
would believe that too.

Ridiculous. Everyone knows it's cheddar, right Gromit?

Michel Walsh · Jun 4, 2007

The test he does did NOT prove the second part of your (or is it his)
affirmation (DISTINCT and GROUP BY being for different 'purposes') since
other alternatives exist to explain the difference, such as that Jet not
doing as a good job as it should, in a specific case. Generalization from a
single case is unwise.

Vanderghast, Access MVP

Michel Walsh · Jun 4, 2007

And to further clarify the spirit of my last message, I will like to recall
to your attention that SQL is not an imperative language, but a descriptive
one, a language where you specify what you want, not HOW TO get it. In that
'spirit' of the SQL language, DISTINCT and GROUP BY are NOT for different
purposes.

*If* someone, unaware, just realized that a DISTINCT can be replace by a
GROUP BY to get the same 'description' of what is wanted, well, that is
great. Newsgroup are to educate, too.

Vanderghast, Access MVP.

Michel Walsh said:
The test he does did NOT prove the second part of your (or is it his)
affirmation (DISTINCT and GROUP BY being for different 'purposes') since
other alternatives exist to explain the difference, such as that Jet not
doing as a good job as it should, in a specific case. Generalization from
a single case is unwise.

Vanderghast, Access MVP

Michel Walsh · Jun 4, 2007

Can you repeat that with my name explicitly written in it?

Vanderghast, Access MVP

David W. Fenton · Jun 5, 2007

Half of the pair will humbly tell you that if it just happen that
Jet is ... crippled...

Jet is not in any way crippled.

in the particular case, that is surely not 'by design', since
clearly, other SQL optimizer achieve to find the SAME plan of
execution for a DISTINCT and the logically equivalent GROUP BY.

There is nothing 'built-in' the concepts of DISTINCT or of GROUP
BY that may leads you to think they logically 'differ', exception
made for some syntax particularities like DISTINCT which can be
used with *, while GROUP BY cannot (due to extra validations
carried over when GROUP BY is explicitly used).

They seem obviously different to me, as one is eliminating
duplicates, while the other is going group on like values for the
purpose of summarizing other field(s). It makes perfect sense to me
for them to not be optmized the same, though I guess a query
optimizer that recognizes there's nothing to summarize could then
use the same methods as with DISTINCT.

But that looks to me more like a fall-back to fix user error. I see
no logical reason why someone should expect them to be treated
identically. There are any number of methods in SQL that can produce
results that appear identical, but that doesn't mean that all
methods for doing so should be optimized exactly the same way.

David W. Fenton · Jun 5, 2007

And to further clarify the spirit of my last message, I will like
to recall to your attention that SQL is not an imperative
language, but a descriptive one, a language where you specify what
you want, not HOW TO get it. In that 'spirit' of the SQL language,
DISTINCT and GROUP BY are NOT for different purposes.

Yes, different database engines treat them differently.

*If* someone, unaware, just realized that a DISTINCT can be
replace by a GROUP BY to get the same 'description' of what is
wanted, well, that is great. Newsgroup are to educate, too.

They aren't the same just because they can be used to produce the
same results. That's a major logical fallacy.

The SQL Server optimization is a fix for USER ERROR.

DAVID · Jun 5, 2007

Warrio said:
Hello!

Which of GROUP BY and DISTINCT is the quickest? and why?

Thanks for any relevant answer!

PS: Please don't tell me that there is no difference because it's a matter
of one millionth second.

"When a SELECT statement is executed, the JET engine always performs an
implicit sort if a DISTINCT modifier is present; an ORDER BY clause
causes another sort to be done, which occurs after all other sorts (if
any)."

Both "group by" and "distinct" are normally limited by the speed of the
sort operation.

(david)

Michel Walsh · Jun 5, 2007

The purpose of an optimizer is to find the best method. Jet optimizer fails
to find the best method in that specific case, so Jet optimizer didn't
optimize as it should. I hardly see how can someone can say that this is
fine and acceptable state. So, indeed, Jet is crippled, in that specific
case, as it is in the NOT IN() construction.

If a better method is called a 'fix-up' in your dictionary, then, no problem
with me, as long as NO ONE think there is no need to bring these 'fix-ups'
into Jet.

Finally, again, SQL is descriptive, and logically equivalent results should
be optimized the same way, indecently of the exact syntax... in particular
if the query is a basic elementary one.

Vanderghast, Access MVP

Michel Walsh · Jun 5, 2007

I should have said NOT EXISTS rather than NOT IN.

Vanderghast, Access MVP

David W. Fenton · Jun 5, 2007

The purpose of an optimizer is to find the best method.

Yes. Given the commands given.

Jet optimizer fails
to find the best method in that specific case, so Jet optimizer
didn't optimize as it should.

Because it was given the wrong commands.

I hardly see how can someone can say that this is
fine and acceptable state. So, indeed, Jet is crippled, in that
specific case, as it is in the NOT IN() construction.

All SQL variants have certain things that aren't well-implemented.

If a better method is called a 'fix-up' in your dictionary, then,
no problem with me, as long as NO ONE think there is no need to
bring these 'fix-ups' into Jet.

It would be nice if Jet could fix the user error, as long as it
didn't break performance in other cases.

Finally, again, SQL is descriptive, and logically equivalent
results should be optimized the same way, indecently of the exact
syntax... in particular if the query is a basic elementary one.

But they aren't logically equivalent -- they are only equivalent in
the *results* displayed.

David W. Fenton · Jun 5, 2007

I should have said NOT EXISTS rather than NOT IN.

There are also problems with Jet's implementation of IN clauses, as
it doesn't seem to always use the indexes on both sides, one case
where it often doesn't is NOT IN(), in fact.

Michel Walsh · Jun 5, 2007

They are NOT logically equivalent? Well, I totally disagree.

To summarize:

SELECT DISTINCT f1 FROM foobar

is logically the same as

SELECT f1 FROM foobar GROUP BY f1

You may claim the reverse with all your might, it would be better to supply
a counter example where these two statements differ, else, your might won't
matter much: SQL is math (set theory) based, not "intention" based, not
"claim" based.

The results described by each of the above statement ***are*** logically the
same.

Vanderghast, Access MVP

Michel Walsh · Jun 6, 2007

Ok, let us do it the other way.

Consider:

SELECT f1
FROM table
GROUP BY f1

and

SELECT DISTINCT f1
FROM table

In MS SQL Server, both queries takes the same execution time, say 1 Unit of
Execution Time. In Jet, the second query uses the same plan than MS SQL
Server, so say the second query, in Jet, also takes 1 UET, but the first
query, in Jet, easily takes up to 10 times that. So, up to now, someone can
tell, ok, Jet is not as performing for the GROUP BY than for the DISTINCT,
so, as you said, let use DISTINCT when possible, and GROUP BY otherwise.
(same as not using NOT EXISTS when an outer join can do).

BUT now, that *is* the problem, consider:

SELECT f1, MAX(f2)
FROM table
GROUP BY f1

Again, MS SQL Server uses 1 UET ! because computing the MAX once the groups
are 'made' is almost negligible in comparison (less that half one percent of
the UET). And that last query plan uses the SAME plan that for the very
first query, technique that Jet knows and used for DISTINCT, remember that
point. This time, though, YOU cannot write the query, explicitly, using
DISTINCT, so, Jet, uses the same procedure than for the first query, which
is up to10 times slower than it should, and all that while Jet has the
capabilities, as MS SQL Server does, to do it efficiently. So,while Jet has
already the knowledge about how to do it 10 times faster (it did, proof
being that it does it with DISTINCT), it is a BUG if it does not uses it.
And this time, the user cannot do anything for it. I was not aware of that
crippling feature of Jet until recently, since the query IS basic and
elementary, I was on the impression that Jet was indeed doing the same as MS
SQL Server. But now that I know that ALL Jet queries implying GROUP BY are
probably up to 10 times slower than they should, I should say everything is
well, no problemo, business as usual?

Now, about "intention". Again, no, no, and no, SQL is not about "if I want".
In theory, two logically equivalent statements should be evaluated the same
way. Example:

SELECT *
FROM f1 INNER JOIN f2 ON f1.g1=f2.g2

is equivalent to

SELECT *
FROM f1, f2
WHERE g1=g2

and NO, it is NOT a valid reason to say that since the SYNTAX is not the
same, it is then acceptable that the query plan would not be the same!

In SQL, since you don't say HOW TO solve the problem, just what is the
result you want, you delegate the responsibility to find the optimum way to
solve the problem to the database engine, and the database should then be
able to do an acceptable work in finding the best solution (based on what is
described by the SQL statement). In fact, the db can use an approach you
would have never ever imagine! That is not a problem, your work is to
describe the result, and the optimizer job is to find the best solution
given the circumstances (which depends on statistics too, such as the size
of the table, if there are indexes or not, etc).

And in the same way, it is not relevant if you want to aggregate or not,
DISTINCT and GROUP BY are of the same family of descriptors. Your
"intentions" are not part of the SQL language, neither as tag, comment,
neither in the case of DISTINCT/GROUP BY, if you want or not aggregate.
(Aggregating is almost free, anyhow, in comparison with the execution time
for the sorting).

And in ANY CASES, even * if * it was 'intention based'' as in an imperative
programmation language, and it is not, unless you can re-write:

SELECT f1, MAX(f2)
FROM table
GROUP BY f1

using DISTINCT, then JET ***is crippled***, as it is now, since there is an
execution plan that Jet could run up to 10 times faster, and while Jet has
all the knowledge to do it, it is just too ... crippled... to do it. And
that, for ALMOST ALL queries with a GROUP BY clause, ALMOST ALL of them! up
to 10 times!

Now, back to your question: if I don't want to use group, as with your
'intention' in using DISTINCT, why creating them? simply because BY
DEFINITION, they are the same thing. With DISTINCT, to know if f1="a",
f2="b" has already been selected, or not, what will you do? With GROUP BY,
to know if f1="a" and f2="b" has already been created or not, what will you
do? In both cases, you create some LIST of what is already
"selected/created", and find a way to easily answer to the question: is
(f1, f2) already in this LIST or not. You see, in terms of bits and bytes,
"selected" (DISTINCT) or "created" (GROUP) is irrelevant, same thing in math
and set theory ... a bit and a byte don't really see what is your
'intention'.

Vanderghast, Access MVP

React JS Outlook Add-in - VS Code + WebPack - Unable to call external Web Service - CORS Issue	0	Aug 25, 2022
Outlook Add-in - Javascript based using React JS - VS Code + WebPack - Unable to call external Web Service	0	Aug 23, 2022
Work Budget Salary vs Real Time Salary	0	Nov 20, 2021
Trying to create a report that will list unmatched records by Ticket Primary Key	0	Mar 11, 2022
Create a distinct Join	3	Jan 7, 2010
distinct DCount - so close! or so close?	3	Aug 5, 2009
Automation of scanned file storage based on content	0	Dec 9, 2021
DISTINCT vs GROUPBY ?	5	Sep 12, 2005

GROUP BY vs DISTINCT

Warrio

Baz

Baz

Baz

Baz

Michel Walsh

Michel Walsh

Keith Wilby

Michel Walsh

Michel Walsh

Michel Walsh

David W. Fenton

David W. Fenton

DAVID

Michel Walsh

Michel Walsh

David W. Fenton

David W. Fenton

Michel Walsh

Michel Walsh

Ask a Question

Similar Threads