SQL TOP 50,000 Help

  • Thread starter Brandon Johnson

Brandon Johnson

Basically what I'm trying to do is take ALL the information from a
130,000-record table and send it in chunks of 50,000 records, each chunk
to its own individual Excel document. The problem is that the record
count in Excel exceeds the 50,000 mark after I send the information over.
I don't understand what's going on. If ANYONE could help, that would be
EXCELLENT! Thank you in advance.

Here's an example of my SQL statements:

For i = 1 To totalDocsNeeded
    If i = 1 Then
        ' If i = 1, use a simple query to grab the top 50,000
        Set tempT = CurrentDb.OpenRecordset("SELECT TOP 50000 * FROM " & _
            "Results ORDER BY Branch ASC, ClientCode ASC;", dbOpenDynaset)
    Else
        If per * i >= 100 Then
            ' If at the end of the table, just send the rest of the remaining
            ' records to the document.  per is the percentage of:
            ' 50000 / total records in the table
            Set tempT = CurrentDb.OpenRecordset("SELECT TOP 50000 * " & _
                "FROM Results WHERE Branch IN (SELECT TOP 100 PERCENT Branch FROM " & _
                "Results ORDER BY Branch ASC) ORDER BY Branch DESC;", dbOpenDynaset)
        Else
            ' Take the first 50000 * i records and keep the last 50,000 of them
            Set tempT = CurrentDb.OpenRecordset("SELECT TOP 50000 * " & _
                "FROM Results WHERE Branch IN (SELECT TOP " & 50000 * i & " Branch FROM " & _
                "Results ORDER BY Branch ASC) ORDER BY Branch DESC;", dbOpenDynaset)
        End If
    End If
Next
 

Klatuu

The TOP predicate is misleading to some degree. It returns the top number of
values, not the number of rows. For example, if you say TOP 3 and there are
two records with the same value at the cutoff, you will get 4 rows, not 3.
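
For instance, a query like this against the OP's Results table (the tie is an
assumed example, since we can't see the actual data):

    SELECT TOP 3 *
    FROM Results
    ORDER BY Branch ASC;

will return 4 rows if the 3rd and 4th records happen to hold the same Branch
value, because Jet includes every record that ties on the ORDER BY columns at
the cutoff. That is exactly why the 50,000-row chunks can come back with more
than 50,000 records.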
 

Brandon Johnson

The TOP predicate is misleading to some degree. It returns the top number of
values, not the number of rows. For example, if you say TOP 3 and there are
two records with the same value at the cutoff, you will get 4 rows, not 3.

hmm okay. well would you have any suggestions for my situation?
 

Klatuu

Sorry, Brandon, I really don't know of a solution using a query alone to get
exactly 50,000. The only way I know of would be to use recordset processing:
create a temporary table and use a For...Next loop to control the number of
records written to it.
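
One variant of that idea, sketched below, skips the temporary table and writes
straight to the workbooks by leaning on CopyFromRecordset's optional MaxRows
argument instead of a row-by-row For...Next loop. It is untested; the output
folder is an assumption, and the table/field names are taken from the OP's query.

    Const CHUNK_SIZE As Long = 50000

    Dim rs As DAO.Recordset
    Dim xl As Object, wb As Object
    Dim chunk As Long

    Set rs = CurrentDb.OpenRecordset( _
        "SELECT * FROM Results ORDER BY Branch ASC, ClientCode ASC;", dbOpenSnapshot)

    Set xl = CreateObject("Excel.Application")

    Do While Not rs.EOF
        chunk = chunk + 1
        Set wb = xl.Workbooks.Add
        ' CopyFromRecordset stops after CHUNK_SIZE rows and leaves the
        ' recordset positioned on the next unread row, so each pass picks
        ' up where the previous workbook left off.
        wb.Worksheets(1).Range("A1").CopyFromRecordset rs, CHUNK_SIZE
        wb.SaveAs "C:\Export\Results_" & chunk & ".xls"   ' assumed path
        wb.Close False
    Loop

    rs.Close
    xl.Quit
    Set xl = Nothing

Because the recordset is read once in order and the row count is enforced in
code, no chunk can exceed 50,000 records, regardless of duplicate Branch values.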
 

Bob Hairgrove

Basically what I'm trying to do is take ALL the information from a
130,000-record table and send it in chunks of 50,000 records, each chunk
to its own individual Excel document. The problem is that the record
count in Excel exceeds the 50,000 mark after I send the information over.
I don't understand what's going on. If ANYONE could help, that would be
EXCELLENT! Thank you in advance.

Is there a documented limit to the number of rows Excel can handle, or
is this perhaps a limitation of hard disk or temp directory space, or
maybe virtual memory?

With this many rows, I would try to export everything first to a
tab-delimited text file. You can change the extension of such a text file
from *.txt to *.xls and Excel will open it without a problem
(assuming, of course, that there aren't other problems such as too
many rows).
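
In Access, that export might look something like this (just a sketch;
"ResultsTabExport" is an assumed saved export specification that defines the
tab delimiter, and the path is made up):

    ' Export the whole Results table to a tab-delimited text file, then
    ' rename it so Excel opens it directly.
    DoCmd.TransferText acExportDelim, "ResultsTabExport", "Results", _
        "C:\Export\Results.txt", True
    Name "C:\Export\Results.txt" As "C:\Export\Results.xls"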

As someone else pointed out, the "TOP n" predicate in SQL doesn't
always do what you need it to. I would think that it is probably very
important that you don't have any duplicate rows in the resulting
Excel file(s), and with "TOP n" you might.

Does this table have a primary key or unique index? If so, and you
still run into problems when doing the export via a text file, I would
write a little hash function which generates a number, or "bucket",
which can be used to filter the records. If you pass the primary key
value as an argument, it should perform fairly quickly ... 100,000
rows or so isn't really that many. If the primary key field type is
AutoNumber or long integer, you might simply use the MOD function for
this purpose. Besides, you avoid the "ORDER BY" clause doing it that
way -- that can be a potential performance killer (unless you have an
index on that column or group of columns).

For example, say you have 130,000 rows or so and want to split it up
into 5 buckets of approx. 26,000 rows each. Assuming that the PK
values are more or less sequential and/or randomly distributed, you
could do five SELECT statements using "WHERE ([primary key] Mod 5)=n"
as criteria, where n is an integer between 0 and 4.
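
In the OP's terms that might look like the sketch below, where ResultID is an
assumed AutoNumber primary key (the thread never says what the real key is):

    ' Five buckets of roughly 26,000 rows each, no ORDER BY needed.
    Dim n As Integer
    Dim tempT As DAO.Recordset

    For n = 0 To 4
        Set tempT = CurrentDb.OpenRecordset( _
            "SELECT * FROM Results WHERE ([ResultID] Mod 5) = " & n & ";", dbOpenSnapshot)
        ' ... send tempT to its own Excel workbook here ...
        tempT.Close
    Next n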
 

Klatuu

--
Dave Hargis, Microsoft Access MVP


Bob Hairgrove said:
Is there a documented limit to the number of rows Excel can handle, or
is this perhaps a limitation of hard disk or temp directory space, or
maybe virtual memory?

Excel versions up through 2003 are limited to 65K (65,536) rows.
None of the other items you listed should be any problem.
The problem is how the TOP predicate works in SQL.
With this many rows, I would try to export everything first to a
tab-delimited text file. You can change the extension of such a text file
from *.txt to *.xls and Excel will open it without a problem
(assuming, of course, that there aren't other problems such as too
many rows).

An exhausting waste of time and energy.
As someone else pointed out, the "TOP n" predicate in SQL doesn't
always do what you need it to. I would think that it is probably very
important that you don't have any duplicate rows in the resulting
Excel file(s), and with "TOP n" you might.

Does this table have a primary key or unique index? If so, and you
still run into problems when doing the export via a text file, I would
write a little hash function which generates a number, or "bucket",
which can be used to filter the records. If you pass the primary key
value as an argument, it should perform fairly quickly ... 100,000
rows or so isn't really that many. If the primary key field type is
AutoNumber or long integer, you might simply use the MOD function for
this purpose. Besides, you avoid the "ORDER BY" clause doing it that
way -- that can be a potential performance killer (unless you have an
index on that column or group of columns).

For example, say you have 130,000 rows or so and want to split it up
into 5 buckets of approx. 26,000 rows each. Assuming that the PK
values are more or less sequential and/or randomly distributed, you
could do five SELECT statements using "WHERE ([primary key] Mod 5)=n"
as criteria, where n is an integer between 0 and 4.

Sorry, Bob, I don't see the point of all this. The OP is trying to return
the TOP 50,000 records based on some criteria, which requires the ORDER BY
to work.
 

Ken Snell \(MVP\)

Perhaps you could add a "ranking" field (based on a subquery) to the query and
then filter on its value being 50,000 or less...
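
For example, something along these lines (a sketch of that idea; ResultID
stands in for whatever unique key the Results table actually has):

    SELECT R.*
    FROM Results AS R
    WHERE (SELECT COUNT(*)
           FROM Results AS R2
           WHERE R2.ResultID <= R.ResultID) BETWEEN 1 AND 50000;

Changing the range to BETWEEN 50001 AND 100000, and then BETWEEN 100001 AND
130000, gives the other chunks. Because the rank is computed over a unique key
there are no ties, so each chunk holds exactly the requested number of rows,
although the correlated subquery can be slow on 130,000 records.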
 

Bob Hairgrove

Sorry, Bob, I don't see the point of all this. The OP is trying to return
the TOP 50,000 records based on some criteria, which requires the ORDER BY
to work.

That is the whole beauty of a hash function ... no ORDER BY is
necessary! When I worked on the Swiss census in 2000, we had
partitioned tables with > 12 million rows (in an Oracle database, not
Access <g>) and our nightly batch programs had to select exactly one
of eight possible partitions because they ran in parallel and couldn't
be allowed to process the same data. It was also important WRT load
balancing. The Oracle built-in hash function works VERY fast
(actually, I believe it resides in the package DBMS_UTILITY, but I could
be mistaken). Unfortunately, Access has no such built-in function, but
using MOD etc. on the primary key, you can achieve much the same
results.

As to your concerns WRT the text export, it is really much faster, and
it also gives you a smaller file.

The trouble with recommending TOP with ORDER BY starts when you have
more than 1,000,000 rows, not just 130,000. It is OK for
quick-and-dirty things, but I try to avoid it at all costs. It is a bad
habit, and most other RDBMSs don't even offer it.
 

Klatuu

I don't disagree; however, the OP wants the top 50K records based on some
unspecified criteria. The TOP predicate does not work correctly without
using ORDER BY on the fields involved in the criteria. We are not talking
massive amounts of data, so I don't think this is a big issue. Certainly, it
will win no performance awards, but it will get the job done.

I'm with you on limited use of ORDER BY, but sometimes it is necessary. It
also helps to have an index on any fields in the table that are used in
the ORDER BY.

I like your hashing algorithm; it is going in my "bag of tricks". But unless
I am too thick to catch your point, I don't see how it applies to this
particular question.
 

Bob Hairgrove

I don't disagree; however, the OP wants the top 50K records based on some
unspecified criteria. The TOP predicate does not work correctly without
using ORDER BY on the fields involved in the criteria. We are not talking
massive amounts of data, so I don't think this is a big issue. Certainly, it
will win no performance awards, but it will get the job done.

I'm with you on limited use of ORDER BY, but sometimes it is necessary. It
also helps to have an index on any fields in the table that are used in
the ORDER BY.

I like your hashing algorithm; it is going in my "bag of tricks". But unless
I am too thick to catch your point, I don't see how it applies to this
particular question.

As I understand it, the only reason the OP wanted to use TOP 50,000 is
to split up his data and get around Excel's 64K-row limit? But maybe I'm
being dense here. :)

Agreed, if he is ordering by an indexed or key column, it will go
pretty fast. I think the hashing method is much cleaner, though.
 

Bob Hairgrove

I like your hashing algorithm; it is going in my "bag of tricks". But unless
I am too thick to catch your point, I don't see how it applies to this
particular question.

A lot of the time, people use AutoNumber as their primary key, and the
Mod function works well enough with that. However, there are times
when it isn't easy to use it, e.g. when multiple columns are involved
in the primary key or unique index, when the PK is textual, or when
the column used to generate the hash number contains many equal
values. This is similar to the problems which arise with TOP (as you
pointed out).

To work around that problem, it sometimes helps to store a random
number in an extra column and run the hash (or Mod) function over that
to ensure equal distribution of values. The random number should be an
integer if you want to use Mod, because that function is only meaningful
for integers. The integer range should then be 0 to Count(<table
rows>)-1. Of course, there might be a few duplicate numbers generated,
but for this purpose only the statistical distribution is meaningful.
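
A rough sketch of that (HashNum is an assumed helper column name, and 130000
is simply the row count mentioned in this thread):

    ' Add a helper column, fill it with random integers, then bucket on it.
    Dim rs As DAO.Recordset

    CurrentDb.Execute "ALTER TABLE Results ADD COLUMN HashNum LONG;", dbFailOnError

    Randomize
    Set rs = CurrentDb.OpenRecordset("SELECT HashNum FROM Results;", dbOpenDynaset)
    Do While Not rs.EOF
        rs.Edit
        rs!HashNum = Int(Rnd() * 130000)   ' random integer in 0 .. 129999
        rs.Update
        rs.MoveNext
    Loop
    rs.Close

    ' Three buckets of roughly 43,000 rows each, no ORDER BY required:
    '   SELECT * FROM Results WHERE (HashNum Mod 3) = 0   (then = 1, then = 2)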

Also, if the table doesn't already contain a column of type
AutoNumber, one can be added. Access will automatically populate it
for existing records, and it is very quick. There is also a way to
generate a sequential number using Recordset.AbsolutePosition, but
that is usually way too slow.

In general, it is often better (i.e. more precise, but not necessarily
faster) to open a recordset in code and iterate over N rows than to
use TOP, perhaps creating a temporary table from the result set and
running additional queries on that. That way, the duplicate values
don't really matter.
 
