May 7, 2010 at 3:10 am
Just to throw in my 2c, I try to stay away from techniques that do not work reasonably efficiently, or that completely break down in other ways, when the query to be paged spans multiple tables. It is far easier to design paging for a single table than it is for multiple.
That is why I like the idea of first doing a limited (TOP X) initial query that fetches all the PKs in the desired order, followed by queries that each window over the fetched keys to quickly and efficiently fetch the actual data to display. This also allows for 'browsing' over the results in a detail view if the user clicks on the grid.
If browsing is not required, you can ** maybe ** get away with recording only the keys of the page starts, and then build a (more complex) query that uses a page-start key as a reference to get the next PageSize-1 results for each page requested, assuming the data being queried has a low mutation rate.
But whatever method you use for complex queries, having a narrow index that covers the fields being filtered and sorted on is a must; other than that, there is (I think) no way to get down to the minimum possible I/O.
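To illustrate, here is a rough sketch of that key-first approach (dbo.Orders, OrderID, OrderDate and the @... variables are all hypothetical and assumed to be declared elsewhere):

-- Step 1: fetch only the first 1000 keys, already in display order.
SELECT TOP (1000)
       ROW_NUMBER() OVER (ORDER BY o.OrderDate DESC, o.OrderID) AS Seq,
       o.OrderID
INTO   #PagingKeys
FROM   dbo.Orders AS o
WHERE  o.CustomerID = @CustomerID
ORDER BY o.OrderDate DESC, o.OrderID;

-- Step 2: join one page's worth of keys back to the table for the display columns.
SELECT o.OrderID, o.OrderDate, o.TotalDue
FROM   #PagingKeys AS k
JOIN   dbo.Orders  AS o ON o.OrderID = k.OrderID
WHERE  k.Seq BETWEEN (@PageNumber - 1) * @PageSize + 1 AND @PageNumber * @PageSize
ORDER BY k.Seq;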
May 7, 2010 at 6:15 am
I quite enjoyed reading this article. I found myself going 'come on, come on, the solution, the solution', but then was pleased when the solution was presented. I find keeping the count of records within each record, at constant big-O cost, great.
May 7, 2010 at 8:21 am
Can someone give me a real-world example of when this would be used in production? I've never seen it used, but that could be because a lot of our code was written for SQL 2000 and just migrated over with minimal changes.
May 7, 2010 at 9:32 am
I haven't implemented this yet, but my thoughts on paging are these:
1) Store each unique combination of request parameters in a table (parent)
2) Run the query, store the results in another table with a foreign key to the parent table. Since this result is now read-only and there will be no provision for updating the rows, update the parent row's resultrowscount for easy access to the total.
3) Also set cache timeout values on the parent. Large result sets that took a long time to run might be given a long cache timeout, while smaller results are given shorter ones. Perhaps the data itself determines its life: a query on real-time data is stale in 3 minutes, while a first-quarter report (run in May) has a life of 30 days or more.
4) Pages are retrieved from the prefetched (and denormalized) child table.
Further requests for the same query are served from the prefetched result if the results are within the staleness tolerance. If the data is stale, the old key is marked obsolete and the process starts over. (This allows a nice audit/analysis of cache hits and results, rather than deleting old results.)
If business rules allow for Top N to prevent absurdly large results then go for it, otherwise the cache at least mitigates the need to redundantly query the results from the source system.
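A rough sketch of what those parent/child cache tables might look like (all table and column names here are just illustrative, not from the article):

-- Parent: one row per unique combination of request parameters.
CREATE TABLE dbo.QueryCache
(
    QueryCacheID    int IDENTITY(1,1) PRIMARY KEY,
    ParametersHash  varbinary(32) NOT NULL,        -- hash of the normalized request parameters
    ResultRowsCount int NULL,                      -- set once the child rows have been written
    CreatedAt       datetime NOT NULL DEFAULT GETUTCDATE(),
    ExpiresAt       datetime NOT NULL,             -- staleness tolerance for this result
    IsObsolete      bit NOT NULL DEFAULT 0
);

-- Child: the prefetched, denormalized result rows, numbered for paging.
CREATE TABLE dbo.QueryCacheRow
(
    QueryCacheID int NOT NULL REFERENCES dbo.QueryCache (QueryCacheID),
    RowSeq       int NOT NULL,                     -- position within the cached result set
    ResultData   nvarchar(max) NULL,               -- denormalized payload for display
    PRIMARY KEY (QueryCacheID, RowSeq)
);

-- A page is then a simple range scan on the child table:
-- SELECT ResultData
-- FROM   dbo.QueryCacheRow
-- WHERE  QueryCacheID = @QueryCacheID
--   AND  RowSeq BETWEEN (@PageNumber - 1) * @PageSize + 1 AND @PageNumber * @PageSize
-- ORDER BY RowSeq;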
I think a better question is to ask if users really need pages of data more than they need better results in general. "Eyeballing" long lists is a huge waste of time. Thanks to Google (et al.) we expect relevant results of search to be on the first page. If we're not sure what we're looking for, we might desperately look at page 2 or 3. After that, we hopefully have an idea how to search more specifically for better results. This makes search an iterative process of refining criteria rather than a brute-force perusal of pages of results. I was going to make this concept the focus of my seminar paper this semester but the day the topic proposal was due I read the SQLServerCentral article on genetic algorithms in T-SQL so I did that instead.
I urge anyone thinking of implementing results paging to consider how/why the user can be better served with a more conversational and iterative process. Our job should be to fetch what they really want from our database (even when they aren't sure what they want). Simply giving the user multiple pages of what they asked for makes us look lazy and makes them work much harder than they should.
May 7, 2010 at 9:34 am
To pull an idea from my ...um, thin air...
When I'm looking at the executed plan, each component knows how many rows it handled, and the component that did the initial selection shows the total number of rows.
What would be the merit of returning the plan XML to the caller and letting the caller look into the XML for the total rows? It at least lets the database just handle the results, and moves the 'total rows' workload out to the caller.
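One way to experiment with that idea is SET STATISTICS XML ON, which returns the actual execution plan (with per-operator row counts) as an extra result set for the caller to parse; the XPath in the comment below is only my rough guess at the showplan schema, and dbo.MyTable / SomeColumn are placeholders:

SET STATISTICS XML ON;

SELECT *                 -- the paging query would go here
FROM dbo.MyTable
ORDER BY SomeColumn;

SET STATISTICS XML OFF;

-- The caller would then read the per-operator row counts out of the returned plan XML,
-- e.g. something like //RelOp/RunTimeInformation/RunTimeCountersPerThread/@ActualRows.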
May 7, 2010 at 11:52 am
Nice article. Thanks for sharing.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
May 7, 2010 at 1:49 pm
I use a stored procedure to run the paged query. It runs the query twice: once to count, and once to get the paged result set. I then return the count as the return value from the stored procedure, and the result set is returned to the caller.
I also use Xml to pass in the search parameters and attributes for the paging information, but that would be a topic for another day.
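Stripped down, it looks roughly like this (a hypothetical dbo.Orders table stands in for the real query, and the real procedure also takes the search parameters):

CREATE PROCEDURE dbo.GetOrdersPaged
    @PageNumber int,
    @PageSize   int
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @TotalRows int;

    -- First pass: count the matching rows.
    SELECT @TotalRows = COUNT(*)
    FROM dbo.Orders;

    -- Second pass: return only the requested page.
    WITH Numbered AS
    (
        SELECT o.*, ROW_NUMBER() OVER (ORDER BY o.OrderDate DESC, o.OrderID) AS Seq
        FROM dbo.Orders AS o
    )
    SELECT *
    FROM Numbered
    WHERE Seq BETWEEN (@PageNumber - 1) * @PageSize + 1 AND @PageNumber * @PageSize
    ORDER BY Seq;

    -- The count goes back as the procedure's return value.
    RETURN @TotalRows;
END;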
Barry
May 7, 2010 at 3:28 pm
This article introduces a novel approach (in my experience) to tackling an old problem, and I think the concept behind it is great. However, the article could have been much better written, with greatly improved examples.
First, I would have liked to see a table in a sample database used, rather than [INFORMATION_SCHEMA].columns. This confused me at first.
Second, in the "grail" query, the second "row_number" column is aliased as "totrows". This should have been named something like rseq, since it actually represented the reverse sequence number, not the total number of rows. This was confusing too.
Third, I could not find mention of the fact that the columns in the OVER() clause of the "grail" query MUST include the primary key, and the columns must be listed in the same order in both OVER() clauses or it throws the calculation off.
Fourth, a lot of confusion could have been eliminated, and it would have been a lot more useful to the reader, if the article had also included some query examples with filter criteria to demonstrate the effect on the row count (viz. that # of rows returned was not the number of rows in the entire table).
Lastly, and this is a minor point, I was not familiar with the WITH construct. Speaking only for myself, it would have been nice if this had been explained too.
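For what it's worth, here is roughly how I would have liked to see the pattern illustrated, against a hypothetical dbo.Orders table: it shows the WITH construct (a common table expression), the two mirrored OVER() clauses including the key column, and a filter so the count reflects only the matching rows (the @... variables are assumed to be declared elsewhere):

WITH Numbered AS
(
    SELECT o.OrderID,
           o.OrderDate,
           -- both ORDER BY lists include the key column and mirror each other exactly
           ROW_NUMBER() OVER (ORDER BY o.OrderDate ASC,  o.OrderID ASC)  AS seq,
           ROW_NUMBER() OVER (ORDER BY o.OrderDate DESC, o.OrderID DESC) AS rseq
    FROM   dbo.Orders AS o
    WHERE  o.CustomerID = @CustomerID             -- filter: the count covers only matching rows
)
SELECT OrderID,
       OrderDate,
       seq + rseq - 1 AS TotalRows                -- total of the filtered set, repeated on every row
FROM   Numbered
WHERE  seq BETWEEN @StartRow AND @StartRow + @PageSize - 1
ORDER BY seq;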
I give 5 stars for the concept, and 2 stars for the writing.
May 7, 2010 at 5:14 pm
One small observation from trying this: you have to be sure that the ORDER BY fields are unique, otherwise the Ascending and Descending sequences may not mirror each other, so the sum of the UP and DOWN row_number() fields is not guaranteed to be the total row count.
With that in mind, though, the central idea here does give another way to calculate a MEDIAN: sort by row_number on the value you are interested in (possibly including an ID in the ORDER BY to separate duplicate values), then your median is the value where UP and DOWN are the same, or, if there is an even number of values, it's the average of those two where abs(UP - DOWN) = 1.
Both these cases reduce to
avg(value) ... WHERE (UP - DOWN) BETWEEN -1 AND 1
This snippet finds the median of the field enSum, in the table called Enrolments:
WITH RankedEnrol AS
(
    SELECT enSum
         , ROW_NUMBER() OVER (ORDER BY enSum, enID)           AS UP   -- the identity (enID) is included in the ORDER BY to ensure uniqueness
         , ROW_NUMBER() OVER (ORDER BY enSum DESC, enID DESC) AS DOWN
    FROM Enrolments
)
SELECT AVG(enSum) AS MedianEnrol
FROM RankedEnrol
WHERE (UP - DOWN) BETWEEN -1 AND 1
May 8, 2010 at 12:31 pm
brian lewis (5/7/2010)
... the central idea here does give another way to calculate a MEDIAN - sort for row_number on the value you are interested in (possibly including an ID in the ORDER BY to separate duplicate values) ...
Nice!
May 20, 2010 at 8:59 am
Numerous people have noted that the "holy grail" pattern does not work when there are duplicate values in the sort column.
A simple workaround for this is:
WITH CTE AS
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY nonunique ASC) AS seq
    FROM MyTable
),
Reverse AS
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY seq DESC) AS tot
    FROM CTE
)
SELECT fields, seq + tot - 1 AS totalRecs
FROM Reverse
In my tests, this did not add any noticeable overhead over the standard "Holy Grail" pattern.
I find this does work well for small sets, and contrary to another poster's statement, I disagree that all your SQL should be optimized for 1 million+ records. You should use the correct SQL and indexes for your workload, and you should have a good understanding of that workload: expected growth, access pattern, read/write ratio, etc.
If I have a system that has little data, but a high transaction count, wicked processors and a slow disk sub-system, this pattern is ideal. For large amounts of data, this pattern has been clearly shown to break down and would not necessarily be the correct choice.
I'd be interested in seeing differences in performance with real-world examples of complex queries, as paging is typically used on search results, and search queries rarely, if ever, read from a single table as most of the examples both for and against this pattern have dictated.
December 5, 2010 at 11:40 pm
Nice solution, and very nicely written too, but the 'holy grail' solution does not give the correct total count of records if the column in ROW_NUMBER() OVER (ORDER BY ...) is not unique.
December 5, 2010 at 11:46 pm
Thanks, SSC Rookie. Your solution helped me.
December 6, 2010 at 7:27 pm
Paul Muharsky-474732 (5/20/2010)
I disagree that all your SQL should be optimized to 1 million+ records.
Heh... Yeah... I know I'm responding to an old post, but the kind of attitude above is why there are so many posts on this site that start with "I have this performance problem. Can anyone help?" If you look back at Peso's code, 4 milliseconds vs. almost 6 seconds isn't something that anyone in their right mind would disagree with. 😉
Write high performance code as if your life depended on it because saying "That's not my job" may be a self-fulfilling prophecy. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2010 at 8:15 pm
It seems the dual TOP/ORDER BY was the most efficient approach at the time.
I tested the new Denali OFFSET/FETCH and I got the exact same number of reads for @StartRow = 5000, but 2 ms instead of 4 ms.
SELECT Number
FROM dbo.TallyNumbers
ORDER BY Number
OFFSET @StartRow - 1 ROWS
FETCH NEXT 50 ROWS ONLY
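For reference, the dual TOP/ORDER BY pattern referred to above can be sketched like this against the same table (my reconstruction, not the exact code from the earlier test):

-- Innermost: take the first @StartRow + 49 rows ascending;
-- middle: keep the last 50 of those (descending);
-- outer: put the page back into ascending order.
SELECT Number
FROM (
    SELECT TOP (50) Number
    FROM (
        SELECT TOP (@StartRow + 49) Number
        FROM dbo.TallyNumbers
        ORDER BY Number ASC
    ) AS d
    ORDER BY Number DESC
) AS p
ORDER BY Number ASC;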
N 56°04'39.16"
E 12°55'05.25"