Divide and Conquer - Batching Part 1
Introduction
There are many queries and processes that a DBA may perform on a SQL Server that run quickly and efficiently with small amounts of data, but do not scale up when run on large sets of data. The power of an RDBMS such as SQL Server comes from the extremely efficient and rapid processing of sets of data. Most other programming languages operate in a row-by-row or procedural fashion, working with one piece or line of data at a time. The ability to write SQL code that operates on a batch of data allows the alteration of a large amount of data with a small amount of code. It also allows the very quick and headache-creating destruction of data with even smaller amounts of code, but that is another subject and another article.
I always preach about writing tight code that operates on sets of data rather than taking a row-by-row approach. I avoid cursors and temp tables when possible to ensure that every operation is a batch-oriented statement rather than an individual row operation. When another DBA sends me a procedure with a SET ROWCOUNT xx in it, I often return it for a rewrite to work on a batch of data. As I write this, however, I am getting ready to eat my words. There are times when the use of SET ROWCOUNT can be a great time saver and maximize the use of resources. As with most of my rules, there are exceptions, and the best tool for the job is something that should always be kept in mind. Batches are something that all SQL programmers use to operate on a set of data, and there are times when it is necessary to sub-divide your batches into sub-batches to complete the process more efficiently.
The Problem
The reason why batching a process is sometimes better may be best illustrated by an example involving the following objects:
--Transaction table
Create table MyTrans
(
AccountID int,
TransID int identity (1, 1),
TranDt datetime,
Amt float,
Balance float
)
With these items defined as follows:
MyTrans table

AccountID - Foreign key to the account table
TransID   - Primary key for this table
TranDt    - Date of transaction
Amt       - Amount of transaction
Balance   - Balance to date for this account
MyTrans has an INSERT, UPDATE trigger that updates the Balance field for each inserted ledger row with the sum of all previous amounts for the same account ID. This trigger contains the following code:
--For each inserted or updated row, set its balance to the sum of
--that account's transaction amounts up to and including its date
Update MyTrans
Set Balance = ( Select sum(t.Amt)
                From MyTrans t
                Where t.AccountID = MyTrans.AccountID
                And t.TranDt <= MyTrans.TranDt )
From inserted i
Where MyTrans.AccountID = i.AccountID
And MyTrans.TransID = i.TransID
This table contains about 5 million rows of transactions spanning 15,000 or so accounts over about a decade. Recently I needed to correct a large number of historical rows in this transaction table. The corrections involved changing the amounts of previous transactions that had been rounded incorrectly. I had received a file of data that contained the IDs, dates, and corrected amounts, and this data was loaded into a temporary table called TempSteve.
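For reference, a minimal sketch of the TempSteve table, consistent with the columns the correction code below references, might look like this (the exact definition is assumed):

--Hypothetical staging table for the corrections file
Create table TempSteve
(
TransID int,      --id of the transaction to correct
TranDt datetime,  --date of the transaction
Amt float         --corrected amount
)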
It would have taken only one line of SQL code to perform this update:

Update MyTrans
Set Amt = t.Amt
From TempSteve t
Where t.TransID = MyTrans.TransID
This code would update each existing transaction with the new amount and fire the trigger to reload the balance with the sum of the previous transactions. In theory this looks like a relatively simple process that I could run from my SQL window before leaving early for a light run before dinner.
It Didn't Work
In actuality, there is a problem with this code. It is not a bug, since the code will work and perform the updates; the problem lies in the actual execution of this code on the server. Performing a single update for any transaction, whether the most recent or one from five years ago, takes a split second on my SQL Server. Yet running the ten thousand corrections on my decent-sized SQL Server (a 4-way PP200 with 2048MB of RAM) was still going nearly twelve hours later. Why the delay?
Normally I would not have allowed this query to run so long, but it was started on a Friday evening and left to run until I found it Saturday morning. That first cup of coffee was not very enjoyable when I found this still running on my server. Let's examine what has to occur for this code to complete:
Transaction 1 is altered, producing a new inserted table that the trigger evaluates for the sum of prior transactions (zero in this case) before updating the ledger. Before this can commit, however, all the remaining transactions in the statement must be completed. The next transaction being evaluated is for the same account; its amount is updated (in memory again) and its trigger fires, but the balance update query requires the amount change from the previous update to be evaluated in order to get an accurate sum. The query processor must therefore search its current altered table in memory to get the sums, along with the rows on the physical disk. Carry this forward a few hundred transactions and you find a recursive effect where each new update for an account requires results from a pending update of the same account in this same batch.
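To make the dependency concrete, here is a small hypothetical example (the account, dates, and amounts are invented for illustration):

--Hypothetical account 42, with all three rows corrected in one update:
--
--TransID  TranDt      Old Amt  New Amt  New Balance
--  101    2000-01-01    10.00    10.50    10.50
--  102    2000-02-01    20.00    19.75    30.25  (needs row 101's new amount)
--  103    2000-03-01     5.00     5.10    35.35  (needs rows 101 and 102's new amounts)

Each corrected row's new balance depends on the new, not-yet-committed amounts of every corrected row before it.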
I suspect my SQL Server had problems with the recursive requirements of this
query because of the trigger. With my morning coffee I reluctantly
canceled the query and set about examining the process and trying to devise a
better solution before my kids woke up and came down to help.
The Solution - Batching!
Initial testing of single-row updates showed each one completing in less than a second. After a few of these queries completed and a second cup of coffee, I realized the recursive problem in this update. At that point I added a tinyint column named status to the TempSteve table to track the remaining rows, set its value to zero, and ran the following code:
Declare @t int, @i int
Set @i = 5000
Set rowcount 1
While @i > 0
Begin
    --Grab one correction that has not been processed yet
    Select @t = TransID From TempSteve Where status = 0
    --Apply that single correction, so the trigger sees only one row
    Update MyTrans
    Set Amt = t.Amt
    From TempSteve t
    Where t.TransID = MyTrans.TransID
    And t.TransID = @t
    --Mark the correction as processed
    Update TempSteve Set status = 1 Where TransID = @t
    Select @i = @i - 1
End
Set rowcount 0 --restore normal rowcount behavior
This ran for the remaining 9000+ rows in a couple of minutes, timed perfectly as I heard the kids starting to move around upstairs. By batching this very simple SQL statement and removing the requirement that the trigger implement recursion, the server could perform the update with minimal use of resources.
Conclusions
So where else does batching come in handy? Preventing locks and blocking is another area where I use this technique. When I have a process that requires a significant amount of time, like inserting new rows into a table from a large batch or deleting old archived information, I take the same approach. To prevent issues with other users, I usually use some type of flag to indicate processed or unprocessed rows, then set a batch size and process x number of rows at a time. I often also include a short WAITFOR in my loop to slow things down for a second or two and let other users get in a query on the table.
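As a rough sketch of that pattern (the table and column names here are invented, and the batch size is arbitrary), a batched delete might look something like this:

--Delete old archived rows in batches of 500, pausing between batches
--so other users can get queries in; ArchiveTable and ArchiveDt are
--hypothetical names used only for this sketch
Declare @rows int
Set @rows = 1
Set rowcount 500
While @rows > 0
Begin
    Delete From ArchiveTable
    Where ArchiveDt < '19990101'
    --@@rowcount is 0 when no qualifying rows remain
    Set @rows = @@rowcount
    --Pause for two seconds to let other queries through
    Waitfor delay '00:00:02'
End
Set rowcount 0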
I hope nobody was laughing about my "decent"-sized SQL Server. This process actually occurred a couple of years ago; now even my desktop is bigger than that server.
The next part of this series will look at an alternative method of batching data that does not use a cursor.
As always, please rate this article and feel free to send some feedback.
Steve Jones
December 2000