Overview
One of SQL Server's strengths is its ability to operate against many rows with a single statement, but taken to excess that strength can become a problem. For example, inserting 20 million rows will place numerous page locks, and possibly a table lock, on the target table, which can prevent others from effectively using any applications that access it. Alternatively, one could use a cursor to modify one row at a time, but cursor performance is poor and is not a good solution for processing millions of rows.
This article describes techniques that can be used to make operations against large data sets more manageable. You will be able to stop and start operations at a whim and throttle them up or down depending on how heavily you may want to use system resources at any given time.
'Tis the Season...
In some situations you can schedule operations against large data sets to occur during off-peak hours, but what if your system has to be very responsive around the clock? Or what if the operation takes so long that it can't be completed entirely before peak hours?
Imagine the following scenario. Your nightly loads of large volumes of data usually complete before peak hours start, but because today is the start of the holiday season there is suddenly twice as much data, and the process is still running when peak hours begin. OK, just wait 30-60 minutes and let it complete...
Twiddle thumbs...
Respond to emails from upset users: "Don't worry. It will be resolved soon."...
Nope. An hour later and it still hasn't completed. How much longer could it possibly take? Management makes the call - kill the job and we'll load it tomorrow. OK, you kill it. Hmmm... performance is still bad. What's going on?
Now that huge transaction you just cancelled is rolling itself back! All you can do is wait. The system could very well grind to a halt for the entire day. How can this be avoided?
Breaking it Down
The best way to manage operations on large data sets like this is to break them down into smaller pieces and perform the operation piece-by-piece. To demonstrate how this can be done, let's imagine that a daily ETL feed of millions of rows is placed into a staging table named "new_purchase" and that our operation needs to copy all of these rows into the "purchase" table. Then let's assume that inserting these rows will take a long time because the "purchase" table already has millions of rows in it, has numerous indexes on it, and is actively used by the production application.
So that we have some data to test our solution with, I will first create a "new_purchase" table and populate it with a list of fake purchases:
CREATE TABLE new_purchase (
purchase_date DATETIME,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2)
)

--Populate a fake table with 100,000 new purchases
SET NOCOUNT ON
DECLARE @i AS INT
SET @i=0
WHILE @i< 100000
BEGIN
INSERT new_purchase
SELECT Cast('2009-01-01' AS DATETIME) + Rand() AS purchase_date,
'SKU'+Cast(Cast(Rand() * 10000000 AS BIGINT) AS VARCHAR(7)) AS item,
Cast(Rand() * 10 + 1 AS INT) AS quantity,
Cast(Rand() * 100 + .01 AS DECIMAL(10,2)) AS price
SET @i = @i + 1
END
Second, I will create a "purchase" table, which is the table that holds the cumulative list of all purchases in the production system:
--Very large table that contains all purchases and is used by the production system
CREATE TABLE purchase (
id UNIQUEIDENTIFIER,
created_date DATETIME,
purchase_date datetime,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2),
total DECIMAL(14,2)
)
And finally, I will create a stored procedure to copy 1,000 to 20,000 rows at a time from the "new_purchase" table into the "purchase" table. I will also create a log table to track and measure each of the copy operations.
--Create a table for tracking
CREATE TABLE load_new_purchase_log (
start_time DATETIME,
end_time DATETIME,
row_count INT
)
GO

--Create a procedure to load new purchases in groups of 1,000-20,000 rows
CREATE PROCEDURE load_new_purchase
AS
BEGIN
DECLARE @start_time AS DATETIME
DECLARE @maximum_row AS BIGINT
DECLARE @starting_row AS BIGINT
DECLARE @rows_to_process_per_iteration AS INT
DECLARE @rows_added AS INT
DECLARE @message AS VARCHAR(100)
--Define how many rows to process in each iteration
--depending on whether the routine is running during peak hours or not
SET @rows_to_process_per_iteration =
CASE WHEN DatePart(Hour, GetDate()) BETWEEN 7 AND 17 THEN 1000 --Peak hours
ELSE 20000 END --Off-Peak hours, load a larger number of records at a time

--Determine how many rows need to be processed in total
SET @maximum_row = (SELECT Count(*) FROM new_purchase)
--If this process had already run, it may have already added some rows.
--Determine where to start.
SET @starting_row = IsNull((SELECT Sum(row_count) + 1
FROM load_new_purchase_log),1)
--Continue looping until all new_purchase records have been processed
SET NOCOUNT OFF
WHILE @starting_row <= @maximum_row
BEGIN
SET @message = 'Processing next '+
Cast(@rows_to_process_per_iteration AS VARCHAR(100))+
' records starting at record '+
Cast(@starting_row AS VARCHAR(100))+' (of '+
Cast(@maximum_row AS VARCHAR(100))+')'
PRINT @message
SET @start_time = GetDate()
--Insert the next set of rows and log it to the log table in the same transaction
BEGIN TRANSACTION
--Copy rows into purchase table from the new_purchase table
INSERT purchase
SELECT NewId() as id, GetDate() as created_date,
purchase_date, item, quantity, price, quantity * price as total
FROM
(
SELECT *,
Row_Number() OVER (ORDER BY purchase_date, item, quantity) AS rownumber
FROM new_purchase
) as new
WHERE rownumber BETWEEN @starting_row
AND @starting_row + @rows_to_process_per_iteration - 1
--Log the records that have been added
SET @rows_added = @@RowCount
INSERT load_new_purchase_log (start_time,end_time,row_count)
VALUES (@start_time, GetDate(), @rows_added)
COMMIT TRANSACTION
SET @starting_row = @starting_row + @rows_added

--Define how many rows to process in each iteration
--depending on whether the routine is running during peak hours or not
IF DatePart(Hour, GetDate()) BETWEEN 7 and 17
BEGIN
--Peak Hours
--Load a small number of rows for each iteration
SET @rows_to_process_per_iteration = 1000
--Delay 10 seconds to lighten the load even further
WAITFOR DELAY '00:00:10'
END
ELSE
BEGIN
--Off-Peak Hours
--Load a large number of rows for each iteration
SET @rows_to_process_per_iteration = 20000
END
END --end while statement
END --end stored proc
The main trick in the code is the Row_Number() function, which is available in SQL Server 2005 and later. It enables you to segment the large operation into smaller operations, each of which works on only a subset of the rows. Note that SQL Server does not allow Row_Number() to be referenced in the WHERE clause of the query that computes it, so the following would produce an error:
SELECT NewId() as id, GetDate() as created_date,
purchase_date, item, quantity, price, quantity * price as total
FROM new_purchase
WHERE Row_Number() OVER (ORDER BY purchase_date, item, quantity)
BETWEEN @starting_row
AND @starting_row + @rows_to_process_per_iteration - 1
That is why the stored procedure has to compute Row_Number() in a sub-select. Even though the sub-select reads from the entire staging table, SQL Server performs surprisingly well at filtering out just the requested records, especially if the source table has an index with the same ordering as the ORDER BY in the OVER clause.
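For example, an index along the lines of the following (the index name here is just an illustration, not part of the original scripts) would give SQL Server an ordering that matches the OVER clause, so it does not have to sort the entire staging table on every iteration:

--Hypothetical supporting index on the staging table; adjust the name
--and columns to your own environment
CREATE INDEX ix_new_purchase_rownumber
ON new_purchase (purchase_date, item, quantity)
GO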
The number of rows copied in each iteration is written to a log table. This enables the procedure to restart where it left off if it is terminated before it can process all of the rows. The main INSERT statement and the INSERT into the log table are wrapped in the same TRANSACTION to ensure that they are committed to the database as a single unit. A secondary benefit of writing to a log table is that it is helpful for reporting on the progress of the operation and for benchmarking performance.
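As a rough sketch of that kind of progress reporting (using only the tables defined above), the following query compares the number of rows logged so far against the total number of rows waiting in the staging table:

--Rough progress check: rows copied so far vs. total rows staged
SELECT (SELECT IsNull(Sum(row_count), 0) FROM load_new_purchase_log) AS rows_copied,
       (SELECT Count(*) FROM new_purchase) AS rows_staged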
You will also notice a couple of places in this code where it tests whether the current time falls during peak hours (the 7 AM through 5 PM hours). If it does, the routine loads only 1,000 rows per iteration and then pauses for 10 seconds to lighten the load even further. If the current time is outside of peak hours, the routine loads up to 20,000 rows per iteration because more system resources are available for processing.
Now, to test the stored procedure I type:
EXEC load_new_purchase
And the following output is generated:
Processing next 1000 records starting at record 1 (of 100000)
Processing next 1000 records starting at record 1001 (of 100000)
Processing next 1000 records starting at record 2001 (of 100000)
I then disconnect that query to test the procedure's ability to stop and start with ease. Then I rerun the stored procedure and let it complete. Running the following queries shows that all 100,000 rows were copied correctly to the "purchase" table.
SELECT Count(*) as record_count_in_purchase_table FROM purchase
SELECT TOP 10 * FROM purchase
SELECT TOP 10 * FROM load_new_purchase_log
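The log table also makes simple benchmarking possible. As a sketch against the schema above, the following query estimates the rows-per-second throughput of each logged batch:

--Approximate throughput of each logged batch
SELECT start_time, end_time, row_count,
       row_count / NullIf(DateDiff(second, start_time, end_time), 0) AS rows_per_second
FROM load_new_purchase_log
ORDER BY start_time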
Wrap-Up
In situations where system performance is important and one needs to operate on a large number of rows, breaking the operation down into smaller pieces has many benefits, including:
- Fewer database locks and less contention.
- Better performance for any other applications that share the affected tables.
- Adding robustness to the operation so that it can easily and efficiently be stopped and restarted at will.
- Visibility into the progress of the operation and automatic tracking of metrics that can then be used for benchmarking.
About the Author
Zach Mided is a seasoned technology executive specializing in translating between technology and business. He is currently employed by Alliance Global Services and is consulting for a rebate fulfillment company, helping them develop software that processes millions of dollars in rebates each week.