Introduction:
SQL Server is based on, what a lot of folks have come to it know as, a "set based" database engine. You've seen dozens, maybe even hundreds of posts where someone says "Use set based techniques" for faster code.
I was doing some comparisons with Oracle on a totally different subject in a post and one of the thread participants drew my attention to some example code in Books Online for the UPDATE statement (Ref: http://msdn2.microsoft.com/en-us/library/ms187326(SQL.100).aspx). I was actually shocked and mortified to see that the UPDATE "Example C" code was RBAR... on steroids. Thinking back a bit, I realized that this is an all too common problem, even for some "real experts" like the ones that wrote Books Online.
The biggest problem in this area is that many people don't know how to do joins in the UPDATE statement and they end up using correlated sub-queries, instead. Of course, that's a major form of RBAR that should be avoided in favor of the set based techniques that those hundreds of posts have been talking about. Another problem is that there is an undocumented, very difficult to reproduce, fault in the UPDATE statement that can turn an innocent 20 second UPDATE into a server paralyzing monster that will slam 4 CPU's into the wall for 2 hours with an exteme form of hidden RBAR..
The purpose of this article is to show what constitutes a RBAR UPDATE that uses correlated sub-queries, the wrong way to avoid it (the monster), and the right way to join to tables in an UPDATE.
It'll also bring another principle that I've been trying to bring to the foreground in many of my recommendations in many posts; When you have a performance problem where some code is taking too long, you won't find the solution in tuning the server. "Tune the Code... that's where the performance is!" And, make no mistake about it, sometimes (most times) you just can't tune the code... you'll need to rewrite it.
Basic Types of Updates:
There are two basic types of UPDATEs... one that updates the target table with "direct data" and one that updates the target table based on the content of other tables.
The "Direct Data" Update:
The "Direct Data" update syntax looks kind of like this...
--===== Basic syntax for a "Direct Data Update UPDATE sometable SET somecolumn = somevalue or math expression WHERE some condition exists
There's not a whole lot that needs optimizing there and that's not the subject of this article.
The "Joined" Update:
A "Joined" update is an update based on what another table has in it. A lot of folks don't know that the UPDATE statement in SQL Server has a FROM clause, so they'll do it like folks do it in, say, Oracle. They'll use a correlated sub-query which is kind of a derivative of the "Direct Data" UPDATE, but much slower because of the correlated sub-query... something like this...
--===== Basic syntax for a "Joined" Update using a -- SLOW correlated sub-query. Don't ever do this in SQL Server -- even though its a documented example in Books Online! UPDATE sometable
SET somecolumn = (SELECT somevalue
FROM sometable2 st2
WHERE st2.somecolumn = somecolumn)
WHERE some condition exists
The correct way to do the same thing in SQL Server is as follows:
--===== Correct basic syntax for a "Joined" update -- in SQL Server and is also documented in Books Online. UPDATE sometable
SET somecolumn = st2.somecolumn
FROM sometable st1 --LOOK!!! MUST BE THE OBJECT OF THE UPDATE!!!
INNER JOIN sometable2 st2
ON st1.somecolumn = st2.somecolumn
WHERE some condition exists
Notice that "sometable" in the "UPDATE sometable" and the "FROM sometable st1" are the SAME table. Some other folks noticed that, too. So, they came up with this method... it's actually undocumented in BOL, but it works very well and seems to make it less confusing for some developers. Notice the use of the table alias instead of the table name in the UPDATE?
--===== Undocumented "Aliased" method for a "Joined" update -- that also works very well. UPDATE st1
SET somecolumn = st2.somecolumn
FROM sometable st1 --LOOK!!! MUST BE THE OBJECT OF THE UPDATE!!!
INNER JOIN sometable2 st2
ON st1.somecolumn = st2.somecolumn
WHERE some condition exists
The "Death by SQL" Update:
Some other folks also try to use an undocumented method for doing "Joined" updates. This is the "Monster" I spoke of in the introduction that slammed 4 CPU's into the wall for 4 hours on what should have been a 20 second update...
--===== Not so obvious "CPU KILLER" code. Runs fine until one day... -- Then, BOOOOOM! UPDATE sometable
SET somecolumn = st2.somecolumn
FROM sometable2 st2 --LOOK!!! FROM ID MISSING OBJECT OF THE UPDATE!!!
WHERE sometable.somecolumn = st2.somecolumn
AND some condition exists
Look at this seemingly innocent devil... the FROM clause makes no mention of the target table. Only a join condition in the WHERE clause and the actual UPDATE make any mention of the target table. This, Ladies and Gentlemen, will usually work just fine... until, someday, there are just enough rows, with just the right indexes, with just the right parallelism being spawned. Then, it will attack your server when you least expect it. It looks so harmless that, like my DBA and I, you will overlook it as the problem for hours.
I can't tell you just exactly how to recreate the exact setup of data, indexes, CPU's, etc, etc, to cause this monster to reveal it's ugly side, but I can tell you what it does when it does. Every row that gets updated, causes a full recompile of the statement along with a full table scan and the required index updates one bloody row at a time. It's like a cursor on steroids when it goes bad. And, if you have a wad of update triggers, they too fire one miserable row at a time.
"Tune the Code... that's where the performance is!":
You've probably seen many posts where some poor soul is saying his/her code is running too slow and they want to know how to tune the server? You've probably also seen me suggest that 90% of the tuning is in the code and the code will simply need to be rewritten. That's still my recommendation above and beyond any server tuning you might care to try. Here's why...
Look back at the syntax examples in this article and see just how similar they look. At a quick glance, even the "monster" code looks valid. Think about the fact that even the good folks at Microsoft blew it when they wrote "Example C" in Books Online for the UPDATE statement (Ref: http://msdn2.microsoft.com/en-us/library/ms177523(SQL.100).aspx). Stop and think about all the SQL Developers that don't even know that UPDATE has a FROM clause in SQL Server. Whenever I run into a performance problem with a particular piece of slow running code, I always ask myself, "Am I absolutely sure the code is as good as I think? Maybe I need to look again."
There's a 90% chance that you'll probably find something you can fix on any code that's "taking too long to run". Tune the code... that's where the performance is! As a reminder, "Tune the code" may actually mean "Rewrite the code".
Practical Tuning Example of Update:
Ok... Let's get down to the brass-tacks and fix some code! In fact, let's fix the very Microsoft example I was talking about. It's a good, simple, practical example to demonstrate what some performance challenged code might look like, breiefly how to analyze what it does, and how to rewrite it. Uh huh, I said rewrite the code.
Yes, this is a simple example. Yes, it will take longer to do an analysis and rewrite of more complicated code. But, consider this... if we can increase the speed of this simple straight forward example by 19 times, just imagine what we can do to more complicated, more performance challenged code...
Setup:
Before we get started, the example code is for an update of an existing table in the Adventure Works database. Since I don't really want to presume that any of us want to do an actual data update on that example data base, we're first going to make a copy of the target table of the UPDATE. Here's the code to do that.
SET NOCOUNT ON
USE AdventureWorks;
--------------------------------------------------
SELECT * INTO Sales.SPTest FROM Sales.SalesPerson
GRANT UPDATE ON Sales.SPTest TO PUBLIC
--------------------------------------------------
SET NOCOUNT OFF
GO
The Microsoft Example Code:
Here's the code (with the changed target table) from "Example C" for the UPDATE statement in BOL... despite my foreboding, it looks like nice tight code and we've probably seen similar code many times in many stored procedures...
PRINT REPLICATE('=',70)
PRINT 'The correlated sub query method...'
SET STATISTICS TIME ON
--==========================================================================
-- This is the code from Example C. Looks simple enough but has a double
-- nested correlated sub-query which is going to suck the life out of this
-- simple code performance wise.
--==========================================================================
UPDATE Sales.SPTest
SET SalesYTD = SalesYTD +
(SELECT SUM(so.SubTotal)
FROM Sales.SalesOrderHeader AS so
WHERE so.OrderDate = (SELECT MAX(OrderDate)
FROM Sales.SalesOrderHeader AS so2
WHERE so2.SalesPersonID = so.SalesPersonID)
AND Sales.SPTest.SalesPersonID = so.SalesPersonID
GROUP BY so.SalesPersonID)
SET STATISTICS TIME OFF
GO
When you run that bit of computational heaven, here's what you get for timer stats...
======================================================================
The correlated sub query method...
SQL Server Execution Times:
CPU time = 313 ms, elapsed time = 408 ms.
(17 row(s) affected)
That, right there, should be the first indication of a problem... it took a whopping 313 CPU milliseconds to find the data for and update only 17 rows??? The problem is in the code... tune the code. If you can't tune it, then rewrite it!
Analyze the Code/ Define the problem:
Like the Microsoft example, there's usually no documentation in the code and it's just complex enough to have to study the code a bit before we attempt a rewrite. The way I do it, is I may have to reformat the code, but I'll set it up in such a fashion that I can write some comments in it... like this...
--===== Update each SalesPersonID YTD sales by adding the accumulated SubTotal from a given
-- date (Max Order Date) to the current value of the SalesYTD column
UPDATE Sales.SPTest
SET SalesYTD = SalesYTD +
(--==== Using the Max Orderdate for each SalesPersonID in the SalesOrderHeader table,
-- accumulate the Sums of the SubTotal for each SalesPersonID for that date
SELECT SUM(so.SubTotal)
FROM Sales.SalesOrderHeader AS so
WHERE so.OrderDate = (--==== Find the Max Orderdate for each SalesPersonID
-- in the SalesOrderHeader table
SELECT MAX(OrderDate)
FROM Sales.SalesOrderHeader AS so2
WHERE so2.SalesPersonID = so.SalesPersonID)
AND Sales.SPTest.SalesPersonID = so.SalesPersonID
GROUP BY so.SalesPersonID)
Documentation of the code really isn't as hard as it looks... you've just gotta peel one layer at a time starting with the inner most sub-queries. As you work your way out, more and more understanding will come from the query in the form of comments until you get to the very outside comment and THAT defines the overall problem.
In the process of documenting the code, we see the hidden RBAR... correlated subqueries. I've highlighted those in red. This code is NOT tunable... it will need to be rewritten. And, that, by the way, is how I normally identify performance challenged code. Is there any sign of any RBAR including the likes of correlated sub-querie? Are the joins correct? How much data is being returned just to make the joins? All identify tuning and rewrite opportunites.
Rewrite according to the comments:
The comments are the PLAN. Rewrite according to the comments.
Let's take the inner most sub-query... we can see that the WHERE clause makes reference outside itself... that's what makes it a performance killing correlated sub-query. The best thing to do, without even looking at the other code, it to rewrite the sub-query as a derived table. The only thing you really have to concern yourself with is getting the Max(OrdDate) for each SalesPersonID and making sure both those items are in the SELECT List. Think of nothing else at the movement... here we go... be sure to transfer your hard-won comment, as well...
(--==== Find Max OrderDate for each SalesPersonID
-- in the SalesOrderHeader table
SELECT SalesPersonID,
MAX(OrderDate) AS MaxOrderDate
FROM Sales.SalesOrderHeader
GROUP BY SalesPersonID
) maxorddate
Now, you can test that SELECT (skip the first "(" line and the last ")" line to verify it works and that it works fast. This is one of the best methods for creating performance enabled code that suffers no errors when you put it all together... divide, conquer, test as you go. It also gives you a glimpse of the "interim data" which you can't do with a correlated sub-query without actually changing the code.
Ok... let's continue working our way out... We need to resolve the next sub-query... we need to change it to a derived table and it's gotta include the one we just made... here goes... notice the we're using a real join instead of a WHERE IN!
(--==== Using the Max Orderdate for each SalesPersonID in the SalesOrderHeader table,
-- accumulate the Sums of the SubTotal for each SalesPersonID for that date
SELECT so.SalesPersonID, SUM(so.SubTotal) AS DayTotal
FROM Sales.SalesOrderHeader so
INNER JOIN
(--==== Find Max OrderDate for each SalesPersonID
-- in the SalesOrderHeader table
SELECT SalesPersonID,
MAX(OrderDate) AS MaxOrderDate
FROM Sales.SalesOrderHeader
GROUP BY SalesPersonID
) maxorddate
ON so.SalesPersonID = maxorddate.SalesPersonID
AND so.OrderDate = maxorddate.MaxOrderDate
GROUP BY so.SalesPersonID
) total
Notice again... the comments say we're finding the SUM() of the SubTotal for each SalesPersonID (defines the SELECT list) and that it should be for the Max(OrderDate) for that SalesPersonID (defines the join). Again, not including the very first or last line of this code, we can test this derived table right now!
Now, we have all the information we need to update the SalesYTD column with... let's put this derived table aside for the time being and work on the syntactical format of the update... let's write out the comment, start the update according to what the comment says, and add some pseudo code to match what we need to join to just to keep the clutter down... I've highlighted all of the pseudo code so it's easy to spot...
--===== Update each SalesPersonID YTD sales by adding the accumulated SubTotal from a given
-- date (Max Order Date) to the current value of the SalesYTD column
UPDATE Sales.SPTest
SET SalesYTD = spt.SalesYTD + accumulated subtotal
FROM Sales.SPTest spt --LOOK! OBJECT OF UPDATE IS IN THE FROM CLAUSE!!!!
INNER JOIN
Some derived table with the accumulated subtotal and salespersonid
ON spt.SalesPersonID = derived table SalesPersonID
Pretty easy, so far, huh? Now, let's take one piece of pseudo code at a time and replace it with real code. The first thing to do would be to bring in the derived table so we can get all the column names we need...
--===== Update each SalesPersonID YTD sales by adding the accumulated SubTotal from a given
-- date (Max Order Date) to the current value of the SalesYTD column
UPDATE Sales.SPTest
SET SalesYTD = spt.SalesYTD + accumulated subtotal
FROM Sales.SPTest spt --LOOK! OBJECT OF UPDATE IS IN THE FROM CLAUSE!!!!
INNER JOIN
(--==== Using the Max Orderdate for each SalesPersonID in the SalesOrderHeader table,
-- accumulate the Sums of the SubTotal for each SalesPersonID for that date
SELECT so.SalesPersonID,SUM(so.SubTotal) AS DayTotal
FROM Sales.SalesOrderHeader so
INNER JOIN
(--==== Find Max OrderDate for each SalesPersonID
-- in the SalesOrderHeader table
SELECT SalesPersonID,
MAX(OrderDate) AS MaxOrderDate
FROM Sales.SalesOrderHeader
GROUP BY SalesPersonID
) maxorddate
ON so.SalesPersonID = maxorddate.SalesPersonID
AND so.OrderDate = maxorddate.MaxOrderDate
GROUP BY so.SalesPersonID
) total
ON spt.SalesPersonID = derived table SalesPersonID
That only leaves 2 small pieces of pseudo code (highlighted in blue above) and they're both columns from the "total" derived table... let's replace those now...
--===== Update each SalesPersonID YTD sales by adding the accumulated SubTotal from a given
-- date (Max Order Date) to the current value of the SalesYTD column
UPDATE Sales.SPTest
SET SalesYTD = spt.SalesYTD + total.DayTotal
FROM Sales.SPTest spt --LOOK! OBJECT OF UPDATE IS IN THE FROM CLAUSE!!!!
INNER JOIN
(--==== Using the Max Orderdate for each SalesPersonID in the SalesOrderHeader table,
-- accumulate the Sums of the SubTotal for each SalesPersonID for that date
SELECT so.SalesPersonID,SUM(so.SubTotal) AS DayTotal
FROM Sales.SalesOrderHeader so
INNER JOIN
(--==== Find Max OrderDate for each SalesPersonID
-- in the SalesOrderHeader table
SELECT SalesPersonID,
MAX(OrderDate) AS MaxOrderDate
FROM Sales.SalesOrderHeader
GROUP BY SalesPersonID
) maxorddate
ON so.SalesPersonID = maxorddate.SalesPersonID
AND so.OrderDate = maxorddate.MaxOrderDate
GROUP BY so.SalesPersonID
) total
ON spt.SalesPersonID = total.SalesPersonID
... and, we're done... notice that the code is a wee bit (a lot, actually) longer than the original code. Short code isn't always the best code.
Normally, we'd stop here, but I'll throw some extra comments in, turn on some timers, and this code is ready for comparison with the original for this article. Like this..
--==========================================================================
-- This is the code I would write to be the performance enabled equivelant
-- to the code above. Notice that it uses derived tables instead of
-- correlated sub queries... we're using joins instead of RBAR here...
-- Guess what that's going to do to performance? ;-)
-- Also, the derived tables allow me to more easily troubleshoot each
-- section of the code. I can select smaller sections of the code and
-- execute them individually to ensure they are working as expected. With
-- the correlated sub queries above, there's NO change of doing that without
-- actually making a modification to the code to test it.
--==========================================================================
PRINT REPLICATE('=',70)
PRINT 'My performance enabled rewrite...'
SET STATISTICS TIME ON --===== Update each SalesPersonID YTD sales by adding the accumulated SubTotal from a given
-- date (Max Order Date) to the current value of the SalesYTD column
UPDATE Sales.SPTest
SET SalesYTD = spt.SalesYTD + total.DayTotal
FROM Sales.SPTest spt --LOOK! OBJECT OF UPDATE IS IN THE FROM CLAUSE!!!!
INNER JOIN
(--==== Using the Max Orderdate for each SalesPersonID in the SalesOrderHeader table,
-- accumulate the Sums of the SubTotal for each SalesPersonID for that date
SELECT so.SalesPersonID,SUM(so.SubTotal) AS DayTotal
FROM Sales.SalesOrderHeader so
INNER JOIN
(--==== Find Max OrderDate for each SalesPersonID
-- in the SalesOrderHeader table
SELECT SalesPersonID,
MAX(OrderDate) AS MaxOrderDate
FROM Sales.SalesOrderHeader
GROUP BY SalesPersonID
) maxorddate
ON so.SalesPersonID = maxorddate.SalesPersonID
AND so.OrderDate = maxorddate.MaxOrderDate
GROUP BY so.SalesPersonID
) total
ON spt.SalesPersonID = total.SalesPersonID SET STATISTICS TIME OFF
PRINT REPLICATE('=',70)
When I run that code, it only takes 15 milliseconds to update what used to take 313 milliseconds...
======================================================================
My performance enabled rewrite...
SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 76 ms.
(17 row(s) affected)
Tuning Opportunity?
Let's put that in terms that everyone would appreciate... because there is no triangular join in either code, we can make the assumption that the code is nearly linear in performance. For the Microsoft example, it would take 0.313/17*1000000 milliseconds or 18.4 seconds to update a million rows. Not bad... until you realize that the performance enabled version does it in 0.015/17*1000000 milliseconds or only 0.882 seconds.
Now, I don't know about you, but improving the code performance by more than 19 times on the first try is what performance improvement is all about. And, it was all in the code. "Tune the code... that's where the performance is!".
The real key is that you have to recognize the "Tuning Opportinity". Look for correlated sub-queries and other hidden RBAR like that found in triangular joins (Ref : http://www.sqlservercentral.com/articles/T-SQL/61539/). Know that, right of the top of your head, that anything having these types of hidden RBAR can and should be rewritten for a performance gain of at least 15 times up to 1000's of times.
Conclusion:
Just like we saw in this article, performance challenged code, especially in UPDATEs (and it's brother, DELETE's which also have a from clause), can look even better/shorter than what real set-based high performance code does. Performance challenged code is rather easy to overlook because all the forms of "Joined" UPDATEs look so similar.
We found out that performance tuning, even in UPDATE's, isn't really the black art that people make it out to be. It's usually just some basic programming that avoids all forms of RBAR including the seemingly innocuous correlated sub-query and the deadly set-based-looking triangular join (REF: http://www.sqlservercentral.com/articles/T-SQL/61539/ ). Although well beyond the scope of this article, 99.9% of all forms of RBAR can be avoided one way or another.
We also saw how easy it was to rewrite one piece of code to get it to run 19 times faster that it originally did and we didn't touch the server or indexes. Sometimes, you've just got to accept the fact that slow code is slow code and that nothing you can do to the server will fix it. You also have to realize that sometimes "Tuning the Code" means "Rewriting the Code."
To close, I know a fair number of DBA's who are plagued, on a daily basis, with some CIO or customer saying "The code is running too slow... tune the server." Now, if you did it right, and most DBA's indeed do it right, the server was setup with all the real tuning it will ever need. Sure, an occasional tweek here and there as some data grows but, for the most part, all the additional server tuning in the world isn't going to help performance challenged code. If you're a DBA who is plagued in this manner, you're just going to have to dig in, find a substantial piece performance challenged code (don't forget, it could be embedded in the GUI), and rewrite it along with all the before and after statistics you can generate. You need to document what you did to the code, how you found the tuning opportunities in the code (it's almost always some form of RBAR), how long it took you step by step, and then you need to make a presentation on how to improve the performance of the server. I suggest you title it..."Tune the Code... that's where the performance is!"
Thanks for listening, folks.
--Jeff Moden
"Your lack of planning DOES constitute an emergency on my part... SO PLAN BETTER! "
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row"