Overview
One of SQL Server's strengths is its ability to operate against many rows with a single statement, but taken to excess that strength can become a problem. For example, inserting 20 million rows will place numerous page locks, and possibly a table lock, on the target table, which can prevent others from effectively using any applications that access it. Alternatively, one could use a cursor to modify one row at a time, but cursor performance is poor and is not a good solution for processing millions of rows.
This article describes techniques that can be used to make operations against large data sets more manageable. You will be able to stop and start operations at a whim and throttle them up or down depending on how heavily you may want to use system resources at any given time.
'Tis the Season...
In some situations you can schedule operations against large data sets to occur during off-peak hours, but what if your system has to be very responsive around the clock? Or what if the operation takes so long that it can't be completed entirely before peak hours?
Imagine the following scenario. Your nightly loads of large volumes of data usually complete before peak hours start, but because today is the start of the holiday season there is suddenly twice as much data, and the process is still running when peak hours begin. OK, just wait 30-60 minutes and let it complete...
Twiddle thumbs...
Respond to emails from upset users: "Don't worry. It will be resolved soon."...
Nope. An hour later and it still hasn't completed. How much longer could it possibly take? Management makes the call - kill the job and we'll load it tomorrow. OK, you kill it. Hmmm... performance is still bad. What's going on?
Now that huge transaction you just cancelled is rolling itself back! All you can do is wait. The system could very well grind to a halt for the entire day. How can this be avoided?
Breaking it Down
The best way to manage operations on large data sets like this is to break them down into smaller pieces and perform the operation piece-by-piece. To demonstrate how this can be done, let's imagine that a daily ETL feed of millions of rows is placed into a staging table named "new_purchase" and that our operation needs to copy all of these rows into the "purchase" table. Then let's assume that inserting these rows will take a long time because the "purchase" table already has millions of rows in it, has numerous indexes on it, and is actively used by the production application.
So that we have some data to test our solution with, I will first create a "new_purchase" table and populate it with a list of fake purchases:
CREATE TABLE new_purchase (
purchase_date DATETIME,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2)
)

--Populate a fake table with 100,000 new purchases
SET NOCOUNT ON
DECLARE @i AS INT
SET @i=0
WHILE @i< 100000
BEGIN
INSERT new_purchase
SELECT Cast('2009-01-01' AS DATETIME) + Rand() AS purchase_date,
'SKU'+Cast(Cast(Rand() * 10000000 AS BIGINT) AS VARCHAR(7)) AS item,
Cast(Rand() * 10 + 1 AS INT) AS quantity,
Cast(Rand() * 100 + .01 AS DECIMAL(10,2)) AS price
SET @i = @i + 1
END
Second, I will create a "purchase" table, which is the table that holds the cumulative list of all purchases in the production system:
--Very large table that contains all purchases and is used by the production system
CREATE TABLE purchase (
id UNIQUEIDENTIFIER,
created_date DATETIME,
purchase_date datetime,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2),
total DECIMAL(14,2)
)
And finally, I will create a stored procedure to copy 1,000 to 20,000 rows at a time from the "new_purchase" table into the "purchase" table. I will also create a log table to track and measure each of the copy operations.
--Create a table for tracking
CREATE TABLE load_new_purchase_log (
start_time DATETIME,
end_time DATETIME,
row_count INT
)
GO

--Create a procedure to load new purchases in groups of 1,000-20,000 rows
CREATE PROCEDURE load_new_purchase
AS
BEGIN
DECLARE @start_time AS DATETIME
DECLARE @maximum_row AS BIGINT
DECLARE @starting_row AS BIGINT
DECLARE @rows_to_process_per_iteration AS INT
DECLARE @rows_added AS INT
DECLARE @message AS VARCHAR(100)
--Define how many rows to process in each iteration
--depending on whether the routine is running during peak hours or not
SET @rows_to_process_per_iteration =
CASE WHEN DatePart(Hour, GetDate()) BETWEEN 7 AND 17 THEN 1000 --Peak hours
ELSE 20000 END --Off-Peak hours, load a larger number of records at a time

--Determine how many rows need to be processed in total
SET @maximum_row = (SELECT Count(*) FROM new_purchase)
--If this process had already run, it may have already added some rows.
--Determine where to start.
SET @starting_row = IsNull((SELECT Sum(row_count) + 1
FROM load_new_purchase_log),1)
--Continue looping until all new_purchase records have been processed
SET NOCOUNT OFF
WHILE @starting_row <= @maximum_row
BEGIN
SET @message = 'Processing next '+
Cast(@rows_to_process_per_iteration AS VARCHAR(100))+
' records starting at record '+
Cast(@starting_row AS VARCHAR(100))+' (of '+
Cast(@maximum_row AS VARCHAR(100))+')'
PRINT @message
SET @start_time = GetDate()
--Insert the next set of rows and log it to the log table in the same transaction
BEGIN TRANSACTION
--Copy rows into purchase table from the new_purchase table
INSERT purchase
SELECT NewId() as id, GetDate() as created_date,
purchase_date, item, quantity, price, quantity * price as total
FROM
(
SELECT *,
Row_Number() OVER (ORDER BY purchase_date, item, quantity) AS rownumber
FROM new_purchase
) as new
WHERE rownumber BETWEEN @starting_row
AND @starting_row + @rows_to_process_per_iteration - 1
--Log the records that have been added
SET @rows_added = @@RowCount
INSERT load_new_purchase_log (start_time,end_time,row_count)
VALUES (@start_time, GetDate(), @rows_added)
COMMIT TRANSACTION
SET @starting_row = @starting_row + @rows_added

--Define how many rows to process in each iteration
--depending on whether the routine is running during peak hours or not
IF DatePart(Hour, GetDate()) BETWEEN 7 and 17
BEGIN
--Peak Hours
--Load a small number of rows for each iteration
SET @rows_to_process_per_iteration = 1000
--Delay 10 seconds to lighten the load even further
WAITFOR DELAY '00:00:10'
END
ELSE
BEGIN
--Off-Peak Hours
--Load a large number of rows for each iteration
SET @rows_to_process_per_iteration = 20000
END
END --end while statement
END --end stored proc
The main trick in the code is the Row_Number() function, which is available in SQL Server 2005 and later. It enables you to segment the large operation into smaller operations, each of which works on only a subset of the rows. Note that SQL Server does not allow Row_Number() to be referenced in the WHERE clause of the query that computes it, so the following would produce an error:
SELECT NewId() as id, GetDate() as created_date,
purchase_date, item, quantity, price, quantity * price as total
FROM new_purchase
WHERE Row_Number() OVER (ORDER BY purchase_date, item, quantity)
BETWEEN @starting_row
AND @starting_row + @rows_to_process_per_iteration - 1
That is why the stored procedure has to compute Row_Number() in a sub-select. Even though the sub-select reads from the entire staging table, SQL Server performs surprisingly well at filtering out just the requested records, especially if the source table has an index with the same ordering as the ORDER BY in the OVER clause.
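For example, an index along the lines of the following (the index name here is just an illustration, not part of the original scripts) would give SQL Server an ordering that matches the OVER clause, so it does not have to sort the entire staging table on every iteration:

--Hypothetical supporting index on the staging table; adjust the name
--and columns to your own environment
CREATE INDEX ix_new_purchase_rownumber
ON new_purchase (purchase_date, item, quantity)
GO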
The number of rows copied in each iteration is written to a log table. This enables the procedure to restart where it left off if it is terminated before it can process all of the rows. The main INSERT statement and the INSERT into the log table are wrapped in the same TRANSACTION to ensure that they are committed to the database as a single unit. A secondary benefit of writing to a log table is that it is helpful for reporting on the progress of the operation and for benchmarking performance.
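As a rough sketch of that kind of progress reporting (using only the tables defined above), the following query compares the number of rows logged so far against the total number of rows waiting in the staging table:

--Rough progress check: rows copied so far vs. total rows staged
SELECT (SELECT IsNull(Sum(row_count), 0) FROM load_new_purchase_log) AS rows_copied,
       (SELECT Count(*) FROM new_purchase) AS rows_staged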
You will also notice a couple of places in this code where it tests whether the current time falls during peak hours (the 7 AM through 5 PM hours). If it does, the routine loads only 1,000 rows per iteration and then pauses for 10 seconds to lighten the load even further. If the current time is outside of peak hours, the routine loads up to 20,000 rows per iteration because more system resources are available for processing.
Now, to test the stored procedure I type:
EXEC load_new_purchase
And the following output is generated:
Processing next 1000 records starting at record 1 (of 100000)
Processing next 1000 records starting at record 1001 (of 100000)
Processing next 1000 records starting at record 2001 (of 100000)
I then disconnect that query to test the procedure's ability to stop and start with ease. Then I rerun the stored procedure and let it complete. Running the following queries shows that all 100,000 rows were copied correctly to the "purchase" table.
SELECT Count(*) as record_count_in_purchase_table FROM purchase
SELECT TOP 10 * FROM purchase
SELECT TOP 10 * FROM load_new_purchase_log
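The log table also makes simple benchmarking possible. As a sketch against the schema above, the following query estimates the rows-per-second throughput of each logged batch:

--Approximate throughput of each logged batch
SELECT start_time, end_time, row_count,
       row_count / NullIf(DateDiff(second, start_time, end_time), 0) AS rows_per_second
FROM load_new_purchase_log
ORDER BY start_time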
Wrap-Up
In situations where system performance is important and one needs to operate on a large number of rows, breaking the operation down into smaller pieces has many benefits, including:
- Fewer database locks and less contention.
- Better performance for any other applications that share the affected tables.
- Adding robustness to the operation so that it can easily and efficiently be stopped and restarted at will.
- Visibility into the progress of the operation and automatic tracking of metrics that can then be used for benchmarking.
About the Author
Zach Mided is a seasoned technology executive specializing in translating between technology and business. He is currently employed by Alliance Global Services and is consulting for a rebate fulfillment company, helping them develop software that processes millions of dollars in rebates each week.