Divide and Conquer - Batching Part 1
Introduction
There are many queries and processes that a DBA may perform on a SQL Server that run quickly and efficiently with small amounts of data, but do not scale up when run on large sets of data. The power of an RDBMS such as SQL Server comes from the extremely efficient and rapid processing of sets of data. Most other programming languages operate in a row-by-row or procedural fashion, working with one piece or line of data at a time. The ability to write SQL code that operates on a batch of data allows the alteration of a large amount of data with a small amount of code. It also allows the very quick and headache-creating destruction of data with even smaller amounts of code, but that is another subject and another article.
I always preach about writing tight code that operates on sets of data rather than taking a row-by-row approach. I avoid cursors and temp tables when possible to ensure that every operation is a batch-oriented statement rather than an individual row operation. When another DBA sends me a procedure with a SET ROWCOUNT xx in it, I often return it for a rewrite to work on a batch of data. As I write this, however, I am getting ready to eat my words. There are times when the use of SET ROWCOUNT can be a great time saver and maximize the use of resources. As with most of my rules, there are exceptions, and the best tool for the job is something that should always be kept in mind. Batches are something that all SQL programmers use to operate on a set of data, and there are times when it is necessary to sub-divide your batches into sub-batches to complete the process more efficiently.
The Problem
The reason why batching a process is sometimes better may be best illustrated by an example involving the following objects:
--Transaction table
Create table MyTrans
(
AccountID int,
TransID int identity (1, 1),
TranDt datetime,
Amt float,
Balance float
)
With these items defined as follows:
MyTrans table

AccountID - Foreign key to the account table
TransID   - Primary key for this table
TranDt    - Date of transaction
Amt       - Amount of transaction
Balance   - Balance to date for this account
MyTrans has an INSERT, UPDATE trigger that updates the Balance field for each inserted ledger row with the sum of all previous amounts for the same account ID. This trigger contains the following code:
--For each inserted or updated row, set its balance to the sum of
--that account's transaction amounts up to and including its date
Update MyTrans
Set Balance = ( Select sum(t.Amt)
                From MyTrans t
                Where t.AccountID = MyTrans.AccountID
                And t.TranDt <= MyTrans.TranDt )
From inserted i
Where MyTrans.AccountID = i.AccountID
And MyTrans.TransID = i.TransID
This table contains about 5 million rows of transactions spanning 15,000 or so accounts over about a decade. Recently I needed to correct a large number of historical rows in this transaction table. The corrections involved changing the amounts of previous transactions that had been rounded incorrectly. I had received a file of data that contained the IDs, dates, and corrected amounts, and this data was loaded into a temporary table called TempSteve.
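For reference, a minimal sketch of the TempSteve table, consistent with the columns the correction code below references, might look like this (the exact definition is assumed):

--Hypothetical staging table for the corrections file
Create table TempSteve
(
TransID int,      --id of the transaction to correct
TranDt datetime,  --date of the transaction
Amt float         --corrected amount
)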
It would have taken only one line of SQL code to perform this update:

Update MyTrans
Set Amt = t.Amt
From TempSteve t
Where t.TransID = MyTrans.TransID
This code would update each existing transaction with the new amount and fire the trigger to reload the balance with the sum of the previous transactions. In theory this looks like a relatively simple process that I could run from my SQL window before leaving early for a light run before dinner.
It Didn't Work
In actuality, there is a problem with this code. It is not a bug, since the code will work and perform the updates; the problem lies in the actual execution of this code on the server. Performing a single update for any transaction, whether the most recent or one from five years ago, takes a split second on my SQL Server. Yet running the ten thousand corrections on my decent-sized SQL Server (a 4-way PP200 with 2048MB of RAM) was still going nearly twelve hours later. Why the delay?
Normally I would not have allowed this query to run so long, but it was started on a Friday evening and left to run until I found it Saturday morning. That first cup of coffee was not very enjoyable when I found this still running on my server. Let's examine what has to occur for this code to complete:
Transaction 1 is altered, producing a new inserted table that the trigger evaluates for the sum of prior transactions (zero in this case) before updating the ledger. Before this can commit, however, all the remaining transactions in the statement must be completed. The next transaction being evaluated is for the same account; its amount is updated (in memory again) and its trigger fires, but the balance update query requires the amount change from the previous update to be evaluated in order to get an accurate sum. The query processor must therefore search its current altered table in memory to get the sums, along with the rows on the physical disk. Carry this forward a few hundred transactions and you find a recursive effect where each new update for an account requires results from a pending update of the same account in this same batch.
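To make the dependency concrete, here is a small hypothetical example (the account, dates, and amounts are invented for illustration):

--Hypothetical account 42, with all three rows corrected in one update:
--
--TransID  TranDt      Old Amt  New Amt  New Balance
--  101    2000-01-01    10.00    10.50    10.50
--  102    2000-02-01    20.00    19.75    30.25  (needs row 101's new amount)
--  103    2000-03-01     5.00     5.10    35.35  (needs rows 101 and 102's new amounts)

Each corrected row's new balance depends on the new, not-yet-committed amounts of every corrected row before it.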
I suspect my SQL Server had problems with the recursive requirements of this
query because of the trigger. With my morning coffee I reluctantly
canceled the query and set about examining the process and trying to devise a
better solution before my kids woke up and came down to help.
The Solution - Batching!
Initial testing of single-row updates showed each one completing in less than a second. After a few of these queries completed and a second cup of coffee, I realized the recursive problem in this update. At that point I added a tinyint column named status to the TempSteve table to track the remaining rows, set its value to zero, and ran the following code:
Declare @t int, @i int
Set @i = 5000
Set rowcount 1
While @i > 0
Begin
    --Grab one correction that has not been processed yet
    Select @t = TransID From TempSteve Where status = 0
    --Apply that single correction, so the trigger sees only one row
    Update MyTrans
    Set Amt = t.Amt
    From TempSteve t
    Where t.TransID = MyTrans.TransID
    And t.TransID = @t
    --Mark the correction as processed
    Update TempSteve Set status = 1 Where TransID = @t
    Select @i = @i - 1
End
Set rowcount 0 --restore normal rowcount behavior
This ran for the remaining 9000+ rows in a couple of minutes, timed perfectly as I heard the kids starting to move around upstairs. By batching this very simple SQL statement and removing the requirement that the trigger implement recursion, the server could perform the update with minimal use of resources.
Conclusions
So where else does batching come in handy? Preventing locks and blocking is another area where I use this technique. When I have a process that requires a significant amount of time, like inserting new rows into a table from a large batch or deleting old archived information, I take the same approach. To prevent issues with other users, I usually use some type of flag to indicate processed or unprocessed rows, then set a batch size and process x number of rows at a time. I often also include a short WAITFOR in my loop to slow things down for a second or two and let other users get in a query on the table.
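As a rough sketch of that pattern (the table and column names here are invented, and the batch size is arbitrary), a batched delete might look something like this:

--Delete old archived rows in batches of 500, pausing between batches
--so other users can get queries in; ArchiveTable and ArchiveDt are
--hypothetical names used only for this sketch
Declare @rows int
Set @rows = 1
Set rowcount 500
While @rows > 0
Begin
    Delete From ArchiveTable
    Where ArchiveDt < '19990101'
    --@@rowcount is 0 when no qualifying rows remain
    Set @rows = @@rowcount
    --Pause for two seconds to let other queries through
    Waitfor delay '00:00:02'
End
Set rowcount 0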
I hope nobody was laughing about my "decent"-sized SQL Server. This process actually occurred a couple of years ago; now even my desktop is bigger than that server.
The next part of this series will look at an alternative method of batching data that does not use a cursor.
As always, please rate this article and feel free to send some feedback.
Steve Jones
December 2000