Any suggestions for optimising this very slow (45 minute) query?


dochsm

This query processes 200,000 records each week. It calculates two
cumulative attendance totals by week number for students, who are
identified by their admission number (Adno). The trouble is, it takes
around 45 minutes to run. Is there a better way to write it?

Howard



INSERT
INTO Tbl_CumulativeAttendance
(
Adno ,
ExportWeekNumber ,
Used_Surname ,
CumulativePresentNumber ,
CumulativePossibleNumber,
CumulativePercent
)
SELECT DISTINCT
tmp_attends.Adno ,
tmp_attends.ExportWeekNumber,
tmp_attends.Used_Surname ,
(SELECT SUM(presentMark)
FROM tbl_TEMP_all_attendances_with_week_number
WHERE AdNo = tmp_attends.Adno
AND val( ExportWeekNumber) <=
val(tmp_attends.ExportWeekNumber)
) AS CumulativePresentNumber,

(SELECT SUM(PossibleMark)
FROM tbl_TEMP_all_attendances_with_week_number
WHERE AdNo = tmp_attends.Adno
AND val(ExportWeekNumber) <=
val(tmp_attends.ExportWeekNumber)
) AS CumulativePossibleNumber,

IIf([CumulativePossibleNumber]>0,
([CumulativePresentNumber]/[CumulativePossibleNumber])*100,
0) AS CumulativePercent

FROM tbl_TEMP_all_attendances_with_week_number AS tmp_attends
ORDER BY
tmp_attends.Adno,
tmp_attends.ExportWeekNumber;
 

Marshall Barton

dochsm said:
This query processes 200,000 records each week. It calculates two
cumulative attendance totals by weeknumber for students who are
identified by their admission number (Adno). Trouble is it takes
around 45 minutes to run. Is there a better way to write it?

INSERT
INTO Tbl_CumulativeAttendance
(
Adno ,
ExportWeekNumber ,
Used_Surname ,
CumulativePresentNumber ,
CumulativePossibleNumber,
CumulativePercent
)
SELECT DISTINCT
tmp_attends.Adno ,
tmp_attends.ExportWeekNumber,
tmp_attends.Used_Surname ,
(SELECT SUM(presentMark)
FROM tbl_TEMP_all_attendances_with_week_number
WHERE AdNo = tmp_attends.Adno
AND val( ExportWeekNumber) <=
val(tmp_attends.ExportWeekNumber)
) AS CumulativePresentNumber,

(SELECT SUM(PossibleMark)
FROM tbl_TEMP_all_attendances_with_week_number
WHERE AdNo = tmp_attends.Adno
AND val(ExportWeekNumber) <=
val(tmp_attends.ExportWeekNumber)
) AS CumulativePossibleNumber,

IIf([CumulativePossibleNumber]>0,
([CumulativePresentNumber]/[CumulativePossibleNumber])*100,
0) AS CumulativePercent

FROM tbl_TEMP_all_attendances_with_week_number AS tmp_attends
ORDER BY
tmp_attends.Adno,
tmp_attends.ExportWeekNumber;


Too bad the ExportWeekNumber field is Text. If you could
arrange for it to be numeric, then you could index it and
make the comparisons a lot faster. I think it would be
worth using a temp table to convert the value and index the
field.
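
For example (the table, field sizes and index name here are only
illustrative), you could create the indexed temp table once with a
couple of DDL queries:

CREATE TABLE tbl_TEMP_attendances_numeric
( AdNo TEXT(10),
Used_Surname TEXT(50),
ExportWeekNumber INTEGER,
PresentMark INTEGER,
PossibleMark INTEGER );

CREATE INDEX idx_Adno_Week
ON tbl_TEMP_attendances_numeric (AdNo, ExportWeekNumber);

and then just empty it and refill it each week, converting the week
number with Val() as you load it.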

Whether you can do that or not, I think you can still get a
significant improvement by using a non-equi join (only
available in SQL view) instead of subqueries:

SELECT
tmp_attends.Adno,
tmp_attends.ExportWeekNumber,
tmp_attends.Used_Surname,
SUM(Q.presentMark) AS CumulativePresentNumber,
SUM(Q.PossibleMark) AS CumulativePossibleNumber,

IIf(SUM(Q.PossibleMark)>0,
(SUM(Q.presentMark)/SUM(Q.PossibleMark))*100,
0) AS CumulativePercent

FROM tbl_TEMP_all_attendances_with_week_number AS tmp_attends
INNER JOIN tbl_TEMP_all_attendances_with_week_number AS Q
ON Q.AdNo = tmp_attends.Adno
AND Val(Q.ExportWeekNumber) <=
Val(tmp_attends.ExportWeekNumber)

GROUP BY
tmp_attends.Adno,
tmp_attends.ExportWeekNumber,
tmp_attends.Used_Surname

ORDER BY
tmp_attends.Adno,
tmp_attends.ExportWeekNumber

Remember that you must stay in SQL view. Switching to
design view will destroy part of the query and don't ask me
why we are even allowed to switch.

I suspect that if you can make both changes, the speed
improvement would be seriously dramatic.
 

dochsm

That was indeed faster. I made a temp table first using Val() so I
could index by week number and got rid of the Val() in the query. At
first I thought nothing was happening, as the status bar showed almost
no progress for about 5 minutes. Then it shot to the end in about two
seconds. Interestingly, I did consider a self join first of all but
was put off by comments on some groups that subqueries are faster.
Obviously not in this case.
Thanks for the help
Howard
 

Marshall Barton

dochsm said:
That was indeed faster. I made a temp table first using the val() so I
could index by week number and got rid of the Val() in the query. At
first I thought nothing was happening as the status bar showed almost
no progress for about 5 minutes. Then it shot to the end in about two
seconds. Interestingly, I did consider a self join first of all but
was put off by comments on some groups that subqueries are faster.
Obviously not in this case.


I strongly suspect that most of the 5 minutes is spent copying the
records and indexing the temp table. Is there any way you could
change the original table to make the week number field a Number
data type?
 

dochsm

Yes I did. I made a temp table with the same structure as my original
data (a linked csv file), except that I changed the weeknum type to
integer and indexed it. Then I used a delete query to clear out all
the records, followed by an append with a Val() to convert my weeknum
to integer and refill it each week. Then I used that temp table as the
source of the query you helped me with. I'd prefer to use a make table
instead of a delete followed by an append, as I suspect that bit would
be faster, but I don't know how to do a make table query with data
while at the same time defining a data type and an index (as opposed
to using DDL to just make an empty table).
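
Roughly, the weekly refresh looks like this (only the relevant columns
are shown, and the name of the linked CSV table is just a stand-in),
run as two separate queries:

DELETE * FROM tbl_TEMP_all_attendances_with_week_number;

INSERT INTO tbl_TEMP_all_attendances_with_week_number
( AdNo, Used_Surname, ExportWeekNumber, PresentMark, PossibleMark )
SELECT
AdNo,
Used_Surname,
Val(ExportWeekNumber),
PresentMark,
PossibleMark
FROM tbl_Raw_Attendance_CSV;

Howard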
 

dochsm

A small question though.
Although your new code generates exactly the same percentage figures
as the original code, the cumulative present and possible figures are
very different.
I would expect the possible figure to be 10 more than the previous
week each time, as it is in the original query, but I cannot see
where the new figures are coming from.
Is there a way to retain the original sums?

eg
original result

WeekNumber CumulativePresentNumber CumulativePossibleNumber CumulativePercent
1 3 4 75
2 13 14 92.8
3 23 24 95.8
4 33 34 97.0
5 42 44 95.4
6 48 52 92.3
7 54 62 87.0
8 64 72 88.8

result of faster query

ExportWeekNumber CumulativePresentNumber CumulativePossibleNumber CumulativePercent
1 36 48 75
2 182 196 92.8
3 322 336 95.8
4 462 476 97.0
5 588 616 95.4
6 672 728 92.3
7 756 868 87.0
8 1792 2016 88.8


Howard


 

Marshall Barton

dochsm said:
Yes I did. I made a temp table with the same structure as my original
data (a linked csv file) except that I changed the weeknum type to
integer and indexed it. Then I used a delete query to clear out all
the records followed by an append with a Val() to convert my weeknum
to integer and refill it each week. Then I used that temp table as the
source of the query you helped me with. I'd prefer to use a make table
instead of delete followed by an append as I suspect that would be
faster to do that bit but don't know how to do a make table query with
data while at the same time defining a data type and an index (as
opposed to using DDL to just make an empty table). Howard


I don't agree that a make table query is faster than a delete all and
append query. I do know that creating a temp table contributes to a
higher degree of bloat over time, requiring more frequent compact
operations. Try it and see if one or the other is faster or not.

You can use the Val function in the make table query to
convert the data, but that will not create an index.
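
If you did want to go the make table route, a sketch might look like
this (the linked CSV name and the index name are just placeholders,
and the previous copy of the temp table would need to be deleted
first). The CREATE INDEX has to be run afterwards as a separate DDL
query, and note that Val returns a Double, so the week number field
in the new table will be a Double rather than an Integer, which is
still fine to index and compare:

SELECT
src.Adno,
src.Used_Surname,
Val(src.ExportWeekNumber) AS ExportWeekNumber,
src.PresentMark,
src.PossibleMark
INTO tbl_TEMP_all_attendances_with_week_number
FROM tbl_Raw_Attendance_CSV AS src;

CREATE INDEX idx_Adno_Week
ON tbl_TEMP_all_attendances_with_week_number (AdNo, ExportWeekNumber);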
 

Marshall Barton

dochsm said:
A small question though.
Although your new code generates exactly the same percentage figures
as the original code, the cumulative present and possible figures are
very different.
I would expect the possible figure to be 10 more than the previous
week each time, as it is in the original query, but I cannot see
where the new figures are coming from.
Is there a way to retain the original sums?

eg
original result

WeekNumber CumulativePresentNumber CumulativePossibleNumber CumulativePercent
1 3 4 75
2 13 14 92.8
3 23 24 95.8
4 33 34 97.0
5 42 44 95.4
6 48 52 92.3
7 54 62 87.0
8 64 72 88.8

result of faster query

ExportWeekNumber CumulativePresentNumber CumulativePossibleNumber CumulativePercent
1 36 48 75
2 182 196 92.8
3 322 336 95.8
4 462 476 97.0
5 588 616 95.4
6 672 728 92.3
7 756 868 87.0
8 1792 2016 88.8


I don't see how that can happen. For the percent to come
out correctly, the other values have to be multiplied by the
same number.

WeekNum multiplier
1 12
2 14
3 14
4 14
5 14
6 14
7 14
8 28
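
For example, in week 1 the original sums are 3 and 4, and 3 * 12 = 36
while 4 * 12 = 48, which is exactly what the faster query shows; since
both sums are scaled by the same factor, the percentage still comes
out right.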

The only way I can see that happening is if there are
multiple duplicate records in the table for each week number,
or if the query has something odd. Post back with the query
that produced those results so I can double check it.
 

dochsm

Here it is.
I altered your code slightly to do without the extra alias, but it
produced the same result as your code did.
Thanks for your continued help with this.
Howard

INSERT INTO Tbl_CumulativeAttendance
( Adno,
ExportWeekNumber,
Used_Surname,
CumulativePresentNumber,
CumulativePossibleNumber,
CumulativePercent )
SELECT
tmp_attends.Adno,
tmp_attends.ExportWeekNumber,
tmp_attends.Used_Surname,
SUM(Q.presentMark) AS CumulativePresentNumber,
SUM(Q.PossibleMark) AS CumulativePossibleNumber,
IIf(SUM(Q.PossibleMark)>0,(SUM(Q.presentMark)/
SUM(Q.PossibleMark))*100,0) AS CumulativePercent
FROM
tbl_TEMP_all_attendances_with_week_number AS tmp_attends
INNER JOIN
tbl_TEMP_all_attendances_with_week_number AS Q ON
(Q.AdNo=tmp_attends.Adno) AND
(Q.ExportWeekNumber<=tmp_attends.ExportWeekNumber)
GROUP BY
tmp_attends.Adno,
tmp_attends.ExportWeekNumber,
tmp_attends.Used_Surname
ORDER BY tmp_attends.Adno, tmp_attends.ExportWeekNumber;


The first few rows of the table it is processing look like this. It
represents the attendance marks for lots of students.
Each row is one mark, with 2 marks each day - these are often, but not
always, the same (eg 3rd September).
There are usually 10 marks a week unless there are short weeks at the
beginning of terms.
I pre-processed the raw data to get this table by translating the
various raw mark codes entered by the staff into a 1 or a 0 for both
PresentMark and PossibleMark so that it was easier to add up: 1 = it
was a present mark or a possible mark, 0 means it wasn't.
The week number is the third column from the right.

(BTW I realise this table is not normalised. The derivable columns are
only there while I debug it and check the maths.
The finished thing will only have the Adno and the rightmost 4
columns.)


Adno UPN Used_Surname Used_Forename Legal_Surname Legal_Forename Gender Reg Year Mark markdate ExportWeekNumber PresentMark PossibleMark
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 31-Aug-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 31-Aug-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 01-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 01-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 \ 02-Sep-10 1 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 / 02-Sep-10 1 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 M 03-Sep-10 1 0 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 \ 03-Sep-10 1 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 04-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 04-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 05-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 05-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 / 06-Sep-10 2 1 1 <---- this is where week 2 starts
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 \ 06-Sep-10 2 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 / 07-Sep-10 2 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 \ 07-Sep-10 2 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 \ 08-Sep-10 2 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 / 08-Sep-10 2 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 / 09-Sep-10 2 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 \ 09-Sep-10 2 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 \ 10-Sep-10 2 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 / 10-Sep-10 2 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 11-Sep-10 2 0 0
 

Tony Toews

Yes I did. I made a temp table with the same structure as my original
data (a linked csv file) except that I changed the weeknum type to
integer and indexed it. Then I used a delete query to clear out all
the records followed by an append with a Val() to convert my weeknum
to integer and refill it each week. Then I used that temp table as the
source of the query you helped me with.

I'd prefer to use a make table
instead of delete followed by an append as I suspect that would be
faster to do that bit

There is very little difference in performance between a make table
vs. a delete all followed by an append.
while at the same time defining a data type and an index (as
opposed to using DDL to just make an empty table)

When doing an insert of a lot of records, and it isn't a multi-user
table, it is much faster to remove all indexes, insert the records and
then create the indexes in code.
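
For the temp table being discussed here, that could be as simple as
executing something like this from code, one statement at a time,
with the reload happening in between (the index name is just an
example):

DROP INDEX idx_Adno_Week
ON tbl_TEMP_all_attendances_with_week_number;

(run the usual delete and append queries to reload the table)

CREATE INDEX idx_Adno_Week
ON tbl_TEMP_all_attendances_with_week_number (AdNo, ExportWeekNumber);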

Also to avoid the bloating mentioned by Marshall see the
TempTables.MDB page at my website which illustrates how to use a
temporary MDB in your app.
http://www.granite.ab.ca/access/temptables.htm

Tony

--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/
 

Marshall Barton

dochsm said:
I altered your code slightly to do without the extra alias, but it
produced the same result as your code did

INSERT INTO Tbl_CumulativeAttendance
( Adno,
ExportWeekNumber,
Used_Surname,
CumulativePresentNumber,
CumulativePossibleNumber,
CumulativePercent )
SELECT
tmp_attends.Adno,
tmp_attends.ExportWeekNumber,
tmp_attends.Used_Surname,
SUM(Q.presentMark) AS CumulativePresentNumber,
SUM(Q.PossibleMark) AS CumulativePossibleNumber,
IIf(SUM(Q.PossibleMark)>0,(SUM(Q.presentMark)/
SUM(Q.PossibleMark))*100,0) AS CumulativePercent
FROM
tbl_TEMP_all_attendances_with_week_number AS tmp_attends
INNER JOIN
tbl_TEMP_all_attendances_with_week_number AS Q ON
(Q.AdNo=tmp_attends.Adno) AND
(Q.ExportWeekNumber<=tmp_attends.ExportWeekNumber)
GROUP BY
tmp_attends.Adno,
tmp_attends.ExportWeekNumber,
tmp_attends.Used_Surname
ORDER BY tmp_attends.Adno, tmp_attends.ExportWeekNumber;


The first few rows of the table it is processing look like this. It
represents the attendance marks for lots of students.
Each row is one mark, with 2 marks each day - these are often, but not
always, the same (eg 3rd September).
There are usually 10 marks a week unless there are short weeks at the
beginning of terms.
I pre-processed the raw data to get this table by translating the
various raw mark codes entered by the staff into a 1 or a 0 for both
PresentMark and PossibleMark so that it was easier to add up: 1 = it
was a present mark or a possible mark, 0 means it wasn't.
The week number is the third column from the right.

(BTW I realise this table is not normalised. The derivable columns are
only there while I debug it and check the maths.
The finished thing will only have the Adno and the rightmost 4
columns.)


Adno UPN Used_Surname Used_Forename Legal_Surname Legal_Forename Gender Reg Year Mark markdate ExportWeekNumber PresentMark PossibleMark
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 31-Aug-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 31-Aug-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 01-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 01-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 \ 02-Sep-10 1 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 / 02-Sep-10 1 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 M 03-Sep-10 1 0 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 \ 03-Sep-10 1 1 1
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 04-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 04-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 05-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 # 05-Sep-10 1 0 0
000001 E573222599033 Bloggs Sarah Bloggs Sarah F 11IC Year 11 / 06-


Sorry, somehow I had a brain fault. In spite of your
detailed explanation, I missed the point about there being
multiple records for each week number. Because of that I
was calculating the sum across all records for every record,
and that ends up being (sum of the records per week)
multiplied by (number of records per week). In your
original query, you calculated the (sum of the records per
week) (number of records per week) times, but your DISTINCT
removed the duplicate results.

We need to make sure there is only one record in the driving
table so there is only one set of records to sum:

SELECT
T.Adno,
T.ExportWeekNumber,
T.Used_Surname,
SUM(Q.presentMark) AS CumulativePresentNumber,
SUM(Q.PossibleMark) AS CumulativePossibleNumber,
IIf(SUM(Q.PossibleMark)>0,(SUM(Q.presentMark)/
SUM(Q.PossibleMark))*100,0) AS CumulativePercent
FROM
(SELECT DISTINCT
tmp_attends.Adno,
tmp_attends.ExportWeekNumber,
tmp_attends.Used_Surname
FROM
tbl_TEMP_all_attendances_with_week_number AS tmp_attends
) AS T
INNER JOIN
tbl_TEMP_all_attendances_with_week_number AS Q
ON Q.AdNo=T.Adno
AND Q.ExportWeekNumber<=T.ExportWeekNumber
GROUP BY
T.Adno,
T.ExportWeekNumber,
T.Used_Surname
ORDER BY T.Adno, T.ExportWeekNumber

Sheesh, I hate to think that might ruin the performance gain
and make this entire exercise a waste of time.
 

dochsm

Yes, that mod completely changed the performance.
It's now down to just 20 seconds!! (and that's returning all the
derivable data and also doing a sort, neither of which is really
necessary at this stage)

Brilliant. I had been running my query overnight so that the data was
ready for use by later queries initiated by the user.
At this new speed though I could run it in real time and save myself a
lot of bother.

I must study the code, and that of Tony, so that I learn from it and
don't make the same mistake again.

Thanks for all your help.

Howard
 

dochsm

There is very little difference in performance between a make table
vs. a delete all followed by an append.


When doing an insert of a lot of records, and it isn't a multi-user
table, it is much faster to remove all indexes, insert the records and
then create the indexes in code.

Also to avoid the bloating mentioned by Marshall see the
TempTables.MDB page at my website which illustrates how to use a
temporary MDB in your app. http://www.granite.ab.ca/access/temptables.htm

Tony

--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/

Thanks for the advice Tony, I'll give that a go.
Howard
 

Marshall Barton

dochsm said:
Yes, that mod completely changed the performance.
It's now down to just 20 seconds!! (and that's returning all the
derivable data and also doing a sort, neither of which is really
necessary at this stage)

Brilliant. I had been running my query overnight so that the data was
ready for use by later queries initiated by the user.
At this new speed though I could run it in real time and save myself a
lot of bother.

I must study the code, and that of Tony, so that I learn from it and
don't make the same mistake again.

That saved nearly 5 minutes? Great, I was afraid that using
a subquery, even in the From clause, might slow it down.
I'm surprised that a DISTINCT subquery seriously improved
performance.

If you must use make table queries, then using a temp
database as Tony explains really is the right way to go.
But that's a generalization: if you have other code that's
deleting/creating objects and causing bloat, you may be
covering up the temp table bloat by already using frequent
Compact operations.

OTOH, a more careful design should be able to avoid bloat
altogether in the front end db. In the back end too, if
users are not deleting records on a regular basis.

I'm also surprised by your claim that a make table query was
dramatically faster than a delete all and append query,
especially if you restarted your system before each timing
run.

Speaking of restarting before timing something that is only
used once a week, it is mandatory in order to keep system caching
from badly skewing the test compared to your real usage
scenario.
 

Tony Toews

it is mandatory to avoid system caching
from badly skewing the test compared to your real usage
scenario,

I noticed many years ago that running a complex report took 40 seconds
the first time. But after flipping to design view, making some
changes and then rerunning, it only took 20 seconds. Until the report
was closed. Whereupon opening it again took 40 seconds.

Tony
--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/
 