Introduction
The paging and ranking functions introduced in 2005 are old news by now, but the typical ROW_NUMBER OVER() implementation only solves part of the problem.
Nearly every application that uses paging gives some indication of how many pages (or total records) are in the total result set. The challenge is to query the total number of rows, and return only the desired records with a minimum of overhead? The holy grail solution would allow you to return one page of the results and the total number of rows with no additional I/O overhead.
In this article, we're going to explore four approaches to this problem and discuss their relative strengths and weaknesses. For the purposes of comparison, we'll be using I/O as a relative benchmark.
The 'two-bites' approach
The most basic approach is the 'two-bites' approach. In this approach you, effectively, run your query twice; querying the total rows in one pass, and querying your result set in the second. The code is pretty straightforward:
DECLARE @startRow INT ; SET @startrow = 50
SELECT COUNT(*) AS TotRows
FROM [INFORMATION_SCHEMA].columns
;WITH cols
AS
(
SELECT table_name, column_name,
ROW_NUMBER() OVER(ORDER BY table_name, column_name) AS seq
FROM [INFORMATION_SCHEMA].columns
)
SELECT table_name, column_name
FROM cols
WHERE seq BETWEEN @startRow AND @startRow + 49
ORDERBY seq
It gives the desired results, but this approach doubles the cost of the query because you query your underlying tables twice:
(1 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 1, logical reads 46, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 1, logical reads 34, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(50 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 1, logical reads 46, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 1, logical reads 34, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
The temp table approach
The 'two-bites' approach is especially undesirable if your paged query is very expensive and complex. A common workaround is to write the superset into a temporary table, then query out the subset. This is also the most common way to implement paging pre-2005 (in this case, ROW_NUMBER is superfluous).
DECLARE @startRow INT ; SET @startrow = 50
CREATE TABLE #pgeResults(
id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
table_name VARCHAR(255),
column_name VARCHAR(255)
)
INSERT INTO #pgeResults(Table_name, column_name)
SELECT table_name, column_name
FROM [INFORMATION_SCHEMA].columns
ORDERBY [table_name], [column_name]
SELECT @@ROWCOUNT AS TotRows
SELECT Table_Name, Column_Name
FROM #pgeResults
WHERE id between @startrow and @startrow + 49
ORDER BY id
DROPTABLE #pgeResults
Looking at the query plan, you can see that your underlying tables are queried only once but the I/O stats show us that you take an even bigger hit populating the temporary table.
Table '#pgeResults_________________________________________________________________________________________________________000000001A9F'. Scan count 0, logical reads 5599, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 1, logical reads 46, physical reads 0, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 1, logical reads 34, physical reads 0, read-ahead reads 39, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(2762 row(s) affected)
(1 row(s) affected)
(50 row(s) affected)
Table '#pgeResults_________________________________________________________________________________________________________000000001A9F'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
In this case, it would be better to query the tables twice. Maybe some new 2005 functionality can yield a better solution.
The COUNT(*) OVER() Approach
OVER() can also be used with Aggregate Window Functions. For our purposes this means we can do a COUNT(*) without the need for a GROUP BY clause, returning the total count in our result set. The code definitely looks much cleaner and, if your application permits it, you can simply return one dataset (eliminating the overhead of writing to a temp table).
DECLARE @startRow INT ; SET @startrow = 50
;WITH cols
AS
(
SELECT table_name, column_name,
ROW_NUMBER() OVER(ORDER BY table_name, column_name) AS seq,
COUNT(*) OVER() AS totrows
FROM [INFORMATION_SCHEMA].columns
)
SELECT table_name, column_name, totrows
FROM cols
WHERE seq BETWEEN @startRow AND @startRow + 49
ORDER BY seq
Unfortunately this approach has it's own hidden overhead:
Table 'Worktable'. Scan count 3, logical reads 5724, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 1, logical reads 46, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 1, logical reads 34, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Where did that come from? In this case, SQL Server implements the COUNT(*) OVER() by dumping all the data into a hidden spool table, which it then aggregates and joins back to your main output. It does this to avoid re scanning the underlying tables. Although this approach looks the cleanest, it introduces the most overhead.
I've spent most of today cleaning up and old data-paging proc that is both very inefficient and frequently called enough for me to notice it. I've explored probably a dozen other approaches to solving this problem before I came up with the solution below. For the sake of brevity—and because they rest are pretty obscure and equally inefficient—we'll now skip to the best solution.
The Holy Grail
In theory, ROW_NUMBER() gives you all the information you need because it assigns a sequential number to every single row in your result set. It all falls down, of course, when you only return a subset of your results that don't include the highest sequential number. The solution is to return a 2nd column of sequential numbers, in the reverse order. The total number of the records will always be the sum of the two fields on any given row minus 1 (unless one of your sequences is zero-bound).
DECLARE @startRow INT ; SET @startrow = 50
;WITH cols
AS
(
SELECT table_name, column_name,
ROW_NUMBER() OVER(ORDER BY table_name, column_name) AS seq,
ROW_NUMBER() OVER(ORDER BY table_name DESC, column_name desc) AS totrows
FROM [INFORMATION_SCHEMA].columns
)
SELECT table_name, column_name, totrows + seq -1 as TotRows
FROM cols
WHERE seq BETWEEN @startRow AND @startRow + 49
ORDER BY seq
This approach gives us our page of data and the total number of rows with zero additional overhead! (well, maybe one or two ms of CPU time, but that's it) The I/O statistics are identical to querying just the subset of records.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 1, logical reads 46, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 1, logical reads 34, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Compare the stats above with the stats and query below (just returning one page of data).
;WITH cols
AS
(
SELECT table_name, column_name,
ROW_NUMBER() OVER(ORDER BY table_name, column_name) AS seq
FROM [INFORMATION_SCHEMA].columns
)
SELECT table_name, column_name
FROM cols
WHERE seq BETWEEN @startRow AND @startRow + 49
ORDER BY seq
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 1, logical reads 46, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 1, logical reads 34, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Conclusion
I have found this approach to be best suited for smaller resultsets from complex queries where I/O is the primary bottleneck. Jeff Moden, Peso and others here have pointed out that with larger resultsets, the I/O cost you save is more than outweighed by the CPU cost. You definitly want to compare different approches to find the best solution for your problem.
My real goal here was to try and figure out a way to avoid unnecessary I/O overhead. I am sure that this solution is not the last word on the subject and I greatly look forward to hearing your thoughts, experiences and ideas on this topic. Thank you all for reading and for your feedback.