Task: Update a table with the most current value from a detail table.
In this article I'm using the sys.tables / sys.columns catalog views because they are available everywhere. In reality it could be, for example, a customer table where you set the last_order_date, or a product table where you want the latest price.
For ideal performance the detail table should have an index on the main ID (object_id in the example, or customer_id in a more realistic scenario) plus the detail ID (column_id / order_id), and it should include the attribute (column name / order_date), too. This column is either part of a clustered index or listed in the INCLUDE clause of the CREATE INDEX for a nonclustered index. Without a fitting index everything will be much slower (you will get a table scan / clustered index scan) and there will always be a SORT operator, so you would not benefit that much from this test.
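For the realistic scenario such an index could look like the following sketch (table and column names are just illustrative assumptions, not an existing schema):

-- main ID plus detail ID as key columns,
-- the attribute we want to copy in the INCLUDE clause
CREATE NONCLUSTERED INDEX ix_orders_customer
    ON dbo.orders (customer_id, order_id)
       INCLUDE (order_date);

In the tests below the clustered PRIMARY KEY (object_id, column_id) on the #columns table plays this role, because a clustered index implicitly covers all columns.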
All tests were done on SQL Server 2022 CU3.
Preparation
First we prepare some test data in #temp tables, because sys.tables / sys.columns are not real tables but internal views. They also do not have the "perfect" indexes mentioned above. To make the tests comparable, I blow the tables up to 5 million rows, because when you run tests with just a few thousand rows, the overhead is too large compared to the total runtime. Ideally you run this on a database with many tables / columns (otherwise it takes a few seconds longer because of more iterations in the WHILE loop). I know, I could have done this more set-based with CROSS JOINs etc., but I was lazy :-)
--#region preparing #temp-tables
DROP TABLE IF EXISTS #tbl;
DROP TABLE IF EXISTS #columns;

CREATE TABLE #tbl (object_id   INT     NOT NULL PRIMARY KEY
                 , schema_name sysname NOT NULL
                 , table_name  sysname NOT NULL
                 , last_column sysname NULL
                  );
CREATE TABLE #columns (object_id   INT     NOT NULL
                     , column_id   INT     NOT NULL
                     , column_name sysname NOT NULL
                     , PRIMARY KEY CLUSTERED (object_id, column_id)
                      );

SET STATISTICS TIME, IO OFF;

DECLARE @i          INT = 0
      , @total_rows INT = 5000000;

WHILE @i < @total_rows
BEGIN
    INSERT INTO #tbl (object_id, schema_name, table_name, last_column)
    SELECT t.object_id + @i
         , SCHEMA_NAME(t.schema_id) AS schema_name
         , t.name                   AS table_name
         , CAST(NULL AS sysname)    AS last_column
      FROM sys.tables AS t;

    INSERT INTO #columns (object_id, column_id, column_name)
    SELECT object_id + @i
         , column_id + @i
         , name AS column_name
      FROM sys.columns AS c;

    SET @i += @@ROWCOUNT;
END;

SELECT FORMAT(COUNT(*), '#,##0.####') AS rows_in_tbl     FROM #tbl     AS t; -- 176,305
SELECT FORMAT(COUNT(*), '#,##0.####') AS rows_in_columns FROM #columns AS c; -- 5,001,344
--#endregion preparing #temp-tables
GO
There are different ways to achieve the goal, and I tested four of them. The IO statistics (logical reads / writes) are very close for all tested statements and do not really matter in this case, since 5 million rows are nothing for a modern server, particularly as all the data fits into the buffer pool. The relevant value for comparing the performance is therefore the CPU time. You can run
SET STATISTICS TIME, IO ON;
to get these metrics back (for the current session). We then use Statistics Parser to format the output better (at least if you are using a SQL Server with an English installation).
The four different tests are shown below.
LAST_VALUE()
First I tested LAST_VALUE(). Remember that you have to add DISTINCT in this case to remove the duplicate rows (the subquery returns the last value once per detail row). If you run just the SELECT part you would only see the duplicates, but in the whole UPDATE statement it would update the main table multiple times (once for every row in the detail table), which would be slow and could cause other problems, e.g. when you are using triggers or other auditing stuff. The statement would also run without the ROWS BETWEEN clause, but then the default RANGE frame applies and SQL Server uses tempdb for an internal worktable, which makes the performance much worse.
UPDATE t
   SET t.last_column = sub.column_name
  FROM #tbl AS t
 INNER JOIN (SELECT DISTINCT -- necessary to eliminate the duplicate rows
                    c.object_id
                  , LAST_VALUE(c.column_name)
                    OVER (PARTITION BY c.object_id
                              ORDER BY c.column_id
                          ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS column_name
               FROM #columns AS c
            ) AS sub
    ON sub.object_id = t.object_id
GO 10 -- run it 10 times
This was the slowest variant, because it uses both a SORT and a Window Spool operator. The average CPU time was 10.9 seconds per execution. When I compared the estimated plans of all four tested variants, this statement was estimated at 35% of the total costs.
FIRST_VALUE()
My second test was FIRST_VALUE() with a DESCending sort order. Be aware that the ROWS BETWEEN clause and of course the ORDER BY differ from the ones for LAST_VALUE() (so don't just change the function name).
UPDATE t
   SET t.last_column = sub.column_name
  FROM #tbl AS t
 INNER JOIN (SELECT DISTINCT
                    c.object_id
                  , FIRST_VALUE(c.column_name)
                    OVER (PARTITION BY c.object_id
                              ORDER BY c.column_id DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS column_name
               FROM #columns AS c
            ) AS sub
    ON sub.object_id = t.object_id
GO 10 -- run it 10 times
Regarding the performance / execution plan it was almost the same as with LAST_VALUE(). It did somewhat fewer reads (I don't know exactly why). The average CPU time was almost equal (10.3 seconds) and the estimated costs were 35%, too.
ROW_NUMBER()
The next idea was to use ROW_NUMBER() with a descending order and to filter for the rows with rn = 1.
BTW: I've often seen people using RANK() or DENSE_RANK() in similar situations. Don't do this: there is no benefit (compared to ROW_NUMBER()), but you have to care about possible duplicate ranks or holes in the ranks if your ORDER BY is not guaranteed to be unique. And of course the logic for the ranking functions is more complex than for a simple row numbering, so it is slower. Both rank functions have their use cases (often in reports), but this is not one of them.
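To illustrate the difference, here is a minimal sketch with made-up dates (the ties are invented on purpose):

-- ROW_NUMBER() vs. RANK() / DENSE_RANK() on a non-unique ORDER BY column
SELECT v.order_date
     , ROW_NUMBER() OVER (ORDER BY v.order_date DESC) AS rn   -- 1, 2, 3, 4
     , RANK()       OVER (ORDER BY v.order_date DESC) AS rnk  -- 1, 2, 2, 4 -> hole after the tie
     , DENSE_RANK() OVER (ORDER BY v.order_date DESC) AS drnk -- 1, 2, 2, 3 -> duplicate rank, no hole
  FROM (VALUES ('2023-01-03'), ('2023-01-02'), ('2023-01-02'), ('2023-01-01')
       ) AS v(order_date);

A filter on rn = 1 is guaranteed to keep exactly one row per partition, while rnk = 1 could keep several. Back to the actual test: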
UPDATE t
   SET t.last_column = sub.column_name
  FROM #tbl AS t
 INNER JOIN (SELECT c.object_id
                  , c.column_name
                  , ROW_NUMBER() OVER (PARTITION BY c.object_id
                                           ORDER BY c.column_id DESC) AS rn
               FROM #columns AS c
            ) AS sub
    ON sub.object_id = t.object_id
   AND sub.rn = 1 -- only use the most recent row (adds a filter operator)
;
GO 10 -- run it 10 times
This is much faster: the average CPU time drops to 2.65 seconds per execution, and the estimated costs were 20%. The problem is that it still adds an expensive SORT operator, because you want the last column_id (therefore the DESCending sort order), while the #columns table is clustered by object_id and column_id ASCending. SQL Server internally sorts by object_id ascending (because of the PARTITION BY) and column_id descending (as specified), and this mixed sort order prevents the index usage.
ROW_NUMBER() with explicit ORDER BY
To prevent this I forced SQL Server to explicitly order by object_id descending, too. Remember that indexes can be read forward and backward, so when both used columns (object_id and column_id) need to be ordered descending, it can simply do a backward scan on the ascending ordered index. (Before you ask: it is usually not a good idea to create an index ordered by object_id ASC and column_id DESC, because in a real-world scenario it would not be the column_id but e.g. the order_date, and such an index would suffer many page splits / index fragmentation while your customers continue to order stuff.)
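Just for illustration, such a mixed sort order index would look like this (don't actually create it on the test tables if you want to reproduce the timings below):

-- an index matching the mixed ASC / DESC sort order;
-- on real, growing data it would fragment quickly, as explained above
CREATE NONCLUSTERED INDEX ix_columns_desc
    ON #columns (object_id ASC, column_id DESC);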
Since I use a subquery for the detail table, I have to use TOP with a very big limit, otherwise SQL Server would not allow the explicit ORDER BY in the subquery.
UPDATE t
   SET t.last_column = sub.column_name
  FROM #tbl AS t
 INNER JOIN (SELECT TOP (2147483647) -- necessary, otherwise the ORDER BY is not allowed; the number is the maximum valid INT
                    c.object_id
                  , c.column_name
                  , ROW_NUMBER() OVER (PARTITION BY c.object_id
                                           ORDER BY c.column_id DESC) AS rn
               FROM #columns AS c
              ORDER BY c.object_id DESC -- the most important line (and DESC is the most important word)
              -- this "forces" SQL Server to do a backward scan on the existing index
              -- (object_id + column_id, both ascending) instead of adding a sort operator
              -- (because SQL Server stumbles over the mixed ascending / descending sort order)
            ) AS sub
    ON sub.object_id = t.object_id
   AND sub.rn = 1 -- only use the most recent row (adds a filter operator)
GO 10 -- run it 10 times
This time the execution plan no longer shows the SORT operator, which makes this statement the fastest: just 1.36 seconds average CPU time (about half of the ROW_NUMBER() without the explicit ORDER BY, and about 1/10th of LAST_VALUE() / FIRST_VALUE()). Consequently the estimated costs in the execution plan were 10% (of the total of all four tested versions).
Conclusion
Forcing a backward scan by adding TOP and ORDER BY <main_id> DESC to the subquery is the fastest solution (1.36 seconds average CPU time vs. 2.65 seconds for the plain ROW_NUMBER() and ~10 seconds for LAST_VALUE() / FIRST_VALUE()).
Possible drawback: before SQL Server 2022 a backward index scan could not go parallel, so on VERY large data sets it may be slower than the third solution (ROW_NUMBER() without TOP / ORDER BY). I tested it with ~22 million rows and version 4 was still much faster.
In SQL Server 2022 CU3, when I enforce parallelism by adding OPTION (USE HINT ('ENABLE_PARALLEL_PLAN_PREFERENCE')) to the query, the plan shows parallelism and still does not use a SORT operator, but the performance dropped back to the level of the ROW_NUMBER() without TOP / ORDER BY, so I assume that some sort of invisible sorting happens.
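For reference, this is where the hint goes (the statement from test 4, unchanged except for the appended OPTION clause):

UPDATE t
   SET t.last_column = sub.column_name
  FROM #tbl AS t
 INNER JOIN (SELECT TOP (2147483647)
                    c.object_id
                  , c.column_name
                  , ROW_NUMBER() OVER (PARTITION BY c.object_id
                                           ORDER BY c.column_id DESC) AS rn
               FROM #columns AS c
              ORDER BY c.object_id DESC
            ) AS sub
    ON sub.object_id = t.object_id
   AND sub.rn = 1
OPTION (USE HINT ('ENABLE_PARALLEL_PLAN_PREFERENCE'));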
PS: Don't forget: there is no perfect solution for all situations - it always depends, so you need to test with your own data / environment :-).