November 27, 2008 at 11:02 am
Lynn Pettis (11/26/2008)
Good article, but let's take another approach and see what we see.
Heh... you beat me to it, again. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
November 27, 2008 at 11:11 am
Jeff Moden (11/27/2008)
Lynn Pettis (11/26/2008)
Good article, but let's take another approach and see what we see.
Heh... you beat me to it, again. 😉
Heh... I may be old, but I'm also trainable.
November 27, 2008 at 1:06 pm
I like both approaches to the problem. They both have a refreshing spark of originality.
I'm not sure I agree with all of Hugo's objections, though I share his caution. 'Quirky Update' techniques can go wrong if you are not aware of certain 'gotchas', but they were documented, used, and supported even before Microsoft bought the product. For the 'quirky update' approach that Lynn uses to work safely, you have to remember certain rules (e.g. updates are done in the order of the clustered index, and all '@var = col = expression' variable assignments are done before any simple column updates; I hope I've remembered that the right way around!)
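For anyone who hasn't seen the form in question, here is a minimal sketch; the table, column, and index references are invented purely for illustration, and the hints shown are the ones usually quoted with this technique rather than anything from the article:
DECLARE @RunningTotal MONEY;
SET @RunningTotal = 0;

UPDATE dbo.SomeTable
   SET @RunningTotal = RunningTotal = @RunningTotal + Amount  -- variable assigned before the column is updated
  FROM dbo.SomeTable WITH (TABLOCKX, INDEX(1))                -- index 1 = the clustered index; the order is observed behaviour, not documented
OPTION (MAXDOP 1);                                            -- keep the update single-threaded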
Best wishes,
Phil Factor
November 27, 2008 at 1:51 pm
Lynn Pettis (11/27/2008)
Since the purpose of the identity field was order, if there was a gap it wouldn't have caused an issue.
Hi Lynn,
But you are using the IDENTITY values for more than just imposing an order. You use "WHERE B.ID BETWEEN A.ID - 4 AND A.ID" to find the current and four preceding rows - but if there are gaps, then fewer than four preceding rows will match this clause.
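A quick way to see the effect; the table name and values below are made up purely to demonstrate:
CREATE TABLE #GapDemo (ID INT PRIMARY KEY, Amount MONEY);

INSERT INTO #GapDemo (ID, Amount)
SELECT 1, 10 UNION ALL SELECT 2, 10 UNION ALL SELECT 3, 10
UNION ALL SELECT 7, 10 UNION ALL SELECT 8, 10;  -- IDs 4-6 are "missing", as after deletes

SELECT COUNT(*) AS RowsInWindow   -- returns 3, not the expected 5
  FROM #GapDemo B
 WHERE B.ID BETWEEN 8 - 4 AND 8;

DROP TABLE #GapDemo;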
November 27, 2008 at 1:58 pm
Jeff Moden (11/27/2008)
Hugo Kornelis (11/27/2008)
What I was referring to is the lack of ROW_NUMBER() in SQL Server 2000. This means you'll either have to take your chances with IDENTITY, at the risk of gaps, as the author of this article did; or you have to use a correlated subquery to calculate the row number on the fly, which can result in dramatic performance degradation as the number of rows grows. Plus, the queries tend to get long and hard to understand.
Nope... in SQL Server 2000, just do a SELECT INTO a temp table with the IDENTITY function and the author's code works just fine without any difficulty for length or understanding.
Hi Jeff,
You're right. When using SELECT INTO a temp table with the IDENTITY function (*), there will not be any gaps and the range of A.ID-4 up to A.ID will always have 5 rows. But unless I overlooked something, this was not in the article. To me, the article appears to imply that any IDENTITY column can be used for this. And since many tables already have an IDENTITY column, often with gaps in the sequence due to deleted data or rolled back inserts, I thought it'd be better to point out this danger.
(*) There is another potential problem here. I know that there is only one situation where Microsoft guarantees that identity values are given out in the expected order when using ORDER BY in an INSERT statement, but I can never recall if this guarantee is for SELECT INTO with the IDENTITY() function, or for INSERT ... SELECT on a table with a predefined IDENTITY column. And I can never find this particular bit of documentation when I need it. I think that SELECT INTO with the IDENTITY() function is the supported scenario, but if you are going to use this in production you'd probably better double-check first, for my memory is known to .... aaahh, what was I going to say again? 🙂
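For what it's worth, the shape of that SELECT INTO approach looks roughly like this; the source table and column names are assumptions for the sake of the example:
SELECT IDENTITY(INT, 1, 1) AS RowID,   -- gap-free sequence generated at load time
       TransactionDate,
       Amount
  INTO #Work
  FROM dbo.SourceTable
 ORDER BY TransactionDate;             -- whether this ORDER BY is guaranteed to drive the IDENTITY()
                                       -- values is exactly the documentation question raised above

-- #Work.RowID has no gaps, so RowID BETWEEN x - 4 AND x always spans five rows.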
November 27, 2008 at 2:03 pm
Hugo Kornelis (11/27/2008)
Lynn Pettis (11/27/2008)
Since the purpose of the identity field was order, if there was a gap it wouldn't have caused an issue.
Hi Lynn,
But you are using the IDENTITY values for more than just imposing an order. You use "WHERE B.ID BETWEEN A.ID - 4 AND A.ID" to find the current and four preceding rows - but if there are gaps, then fewer than four preceding rows will match this clause.
Actually, the author was using it in his code (which I included in my post for comparison purposes); I was using it in my code exclusively to provide order to the data.
November 27, 2008 at 2:40 pm
Lynn Pettis (11/27/2008)
Actually, the author was using it in his code (which I included in my post for comparison purposes); I was using it in my code exclusively to provide order to the data.
Hi Lynn,
Oops, my bad. Apologies for confusing you with the author. And also for not seeing the scrollbar and the extra code you added.
However, now that I did see your code, I must say that I like it even less than the code in the article. Your update relies on at least three assumptions:
1) The assumption that this kind of UPDATE statement will always work row by row, using the variables after assignment from row #n to update row #n+1. This may or may not be documented (I don't feel like digging through BOL at the moment), but it's definitely against the original idea of set-based updates, where all modifications are done "at once".
2) The assumption that such an UPDATE will always be processed in order of the clustered index. And I am fairly confident that this is not documented at all. Maybe the current versions of SQL Server will work like that (I admit not being able to break it in my first three tries, but with only three tries you can also see that I didn't try very hard), but who says they'll continue to do so in the next version? Or maybe even the next service pack? Or maybe even on different hardware?
For instance, as far as I know data modifications are currently never parallelized. But what if the next service pack changes that, to make better use of the increasing numbers of cores per socket and sockets per server? I can assure you that if this update runs in parallel, results will be completely different from what you want....
3) The assumption that people want to store the running totals. In most cases, you DON'T want to store them, since this would necessitate recalculation every time some data changes. Most people will want to calculate running totals on the client. And if it's really necessary to calculate them on the server, you'll want to do it in a view or stored procedure, not persist it. Unless you are in a reporting database that refreshes once or twice a day and is further used to query somewhat stale data for reporting or analysis.
Unless there is some source that I don't know of where Microsoft documents the order of processing in an UPDATE statement and commits itself to maintaining that behaviour, I would never allow this code to run in any of my production databases - and I'd urge anyone to do the same.
November 27, 2008 at 3:01 pm
Okay, can't wait to see Jeff's response. I was using a technique I picked up from him for computing running totals.
Regarding storing running totals: maybe they aren't stored in the database at all. Instead, the calculation is completed in a #temp table that is loaded in a stored procedure with a proper clustered index to ensure the order of the data, updated using an index hint on the clustered index (per Jeff's article on running totals), and the results are then returned by a select query on the #temp table.
Nothing is stored, and the calculations are done much faster than they would be with the cross-join query.
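Roughly like this - all object names below are illustrative rather than taken from the article or from Jeff's code:
CREATE TABLE #Running
(
    RowID        INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    AccountID    INT,
    Amount       MONEY,
    RunningTotal MONEY NULL
);

INSERT INTO #Running (AccountID, Amount)
SELECT AccountID, Amount
  FROM dbo.Accounts
 ORDER BY AccountID;                        -- load in the order the totals should follow

DECLARE @RunningTotal MONEY;
SET @RunningTotal = 0;

UPDATE #Running
   SET @RunningTotal = RunningTotal = @RunningTotal + Amount
  FROM #Running WITH (INDEX(1), TABLOCKX)   -- clustered index hint, per Jeff's running-totals article
OPTION (MAXDOP 1);

SELECT RowID, AccountID, Amount, RunningTotal
  FROM #Running
 ORDER BY RowID;                            -- results returned to the caller; nothing persisted in a user table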
As another aside, I ran both against a table with 32,800 entries and here are the stats for that run:
[quote]
-- Cross Join Query --
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table 'Accounts'. Scan count 32802, logical reads 104229, physical reads 0, read-ahead reads 46, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 1250 ms, elapsed time = 2138 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
-- Cross Join Query --
-- Update Query --
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table 'Accounts'. Scan count 1, logical reads 153, physical reads 0, read-ahead reads 39, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 219 ms, elapsed time = 283 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
-- Update Join Query --
-- Select After Update Query --
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table 'Accounts'. Scan count 1, logical reads 153, physical reads 1, read-ahead reads 177, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 47 ms, elapsed time = 1459 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
-- Select After Update Query --
[/quote]
November 27, 2008 at 3:21 pm
Hugo Kornelis (11/27/2008)
this is not documented at all. Maybe the current versions of SQL Server will work like that (I admit not being able to break it in my first three tries, but with only three tries you can also see that I didn't try very hard), but who says they'll continue to do so in the next version? Or maybe even the next service pack? Or maybe even on different hardware?
For instance, as far as I know data modifications are currently never parallelized. But what if the next service pack changes that, to make better use of the increasing numbers of cores per socket and sockets per server? I can assure you that if this update runs in parallel, results will be completely different from what you want....
...
Hey Hugo,
I think the consensus for those that have taken up the "Jeff-style" quirky update method is to use the MAXDOP=1 option on the UPDATE query to explicitly disable parallelism. At least I do 🙂
Regards,
Jacob
November 27, 2008 at 3:42 pm
Also, if I remember the discussions around Jeff's article on running totals, the only place he has run into difficulties with the technique is with partitioned tables. That is one I'd still like to tackle, but just haven't had the time to work with it.
November 27, 2008 at 4:02 pm
Jacob Luebbers (11/27/2008)
Hey Hugo,
I think the consensus for those who have taken up the "Jeff-style" quirky update method is to use the MAXDOP=1 option on the UPDATE query to explicitly disable parallelism. At least I do 🙂
Hi Jacob,
That mitigates one of the potential problems, but not all. There are more risks. For instance:
* The optimizer might choose a non-clustered index to drive the query
* The optimizer might choose to use an unordered scan of the clustered index instead of the current ordered scan
* The query engine might perform a merry-go-round scan
These are just three examples. There might be more risks that I don't think of at the moment.
My point is that every undocumented behaviour that you observe, no matter how consistent it appears, should be considered a coincidental side effect of how the current version of the software interacts with your current hardware configuration and current data distribution. And that is NOT a safe foundation for building production software.
November 27, 2008 at 4:20 pm
Hey Hugo,
Hugo Kornelis (11/27/2008)
...
* The optimizer might choose a non-clustered index to drive the query
...
The technique also uses an index hint specifying the clustered index, with TABLOCKX to ensure that no other modifications could occur to the table during our UPDATE.
Hugo Kornelis (11/27/2008)
...
* The query engine might perform a merry-go-round scan
...
Jeff's article covers this (following on from comments from Gail on the merry-go-round index behaviour): http://www.sqlservercentral.com/articles/Advanced+Querying/61716/.
Hugo Kornelis (11/27/2008)
...
* The optimizer might choose to use an unordered scan of the clustered index instead of the current ordered scan
...
The initial insert into the table is ordered explicitly. There was some lively discussion on whether or not the results of a subsequent unordered scan would reliably come back in that order without the guarantee of an ORDER BY... so far no-one has shown a case where they don't (to my knowledge).
Hugo Kornelis (11/27/2008)
...
My point is that every undocumented behaviour that you observe, no matter how consistent it appears, should be considered a coincidental side effect of how the current version of the software interacts with your current hardware configuration and current data distribution. And that is NOT a safe foundation for building production software.
...
Agree with you here - however, I'm willing to take that risk in certain cases. The massive performance gain this technique can deliver is compelling, and if the process using it can bear that risk (and the maintainers of that process are mindful of potential changes in this behaviour with updates to the engine) I say go for it.
Regards,
Jacob
November 27, 2008 at 5:12 pm
Jacob Luebbers (11/27/2008)
Hey Hugo,
Hugo Kornelis (11/27/2008)
...
* The optimizer might choose a non-clustered index to drive the query
...
The technique also uses an index hint specifying the clustered index, with TABLOCKX to ensure that no other modifications could occur to the table during our UPDATE.
Hi Jacob,
Point taken. Though the flipside is that TABLOCK increases the chance to get an unordered scan instead of an ordered one. More on that below.
Jacob Luebbers (11/27/2008)
Hugo Kornelis (11/27/2008)
...
* The query engine might perform a merry-go-round scan
...
Jeff's article covers this (following on from comments from Gail on the merry-go-round index behaviour): http://www.sqlservercentral.com/articles/Advanced+Querying/61716/.
I see that Jeff defends his use of ORDER BY in a subquery by referring to observed behaviour, not to any documentation. Exactly the danger I am warning about.
In fact, this behaviour of TOP 100 PERCENT ... ORDER BY changed with the release of SQL Server 2005, much to the chagrin of many people who were relying on this undocumented "trick". (In all fairness, I must admit that even some of Microsoft's tools rely on this behaviour. And apparently a very influential customer coerced Microsoft into bringing this behaviour back in SQL 2005 and SQL 2008, although it does now require a trace flag to be set.) If this doesn't prove how dangerous it is to rely on these tricks, then what does?
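For readers who haven't run into it, the pattern in question looks something like this; the view, table, and column names here are invented for illustration:
CREATE VIEW dbo.OrderedAccounts
AS
SELECT TOP 100 PERCENT AccountID, Amount
  FROM dbo.Accounts
 ORDER BY AccountID;   -- on SQL 2000 this often *appeared* to order the view's output;
                       -- from SQL 2005 onwards the ORDER BY here guarantees nothing about
                       -- the order of SELECT * FROM dbo.OrderedAccounts
GO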
Jacob Luebbers (11/27/2008)
Hugo Kornelis (11/27/2008)
...
* The optimizer might choose to use an unordered scan of the clustered index instead of the current ordered scan
...
The initial insert into the table is ordered explicitly. There was some lively discussion on whether or not the results of a subsequent unordered scan would reliably come back in that order without the guarantee of an ORDER BY... so far no-one has shown a case where they don't (to my knowledge).
I'll show you two. The first relies on how SQL Server allocates pages to tables. The first 8 pages come from mixed extents; after that all allocation is in uniform extents (i.e. a whole block of 8 pages is allocated to the table at once). I first create a filler table that allocates a mixed extent and some uniform extents. After dropping the table, the extents are free again, but they apparently keep their "mixed/uniform" mark. So then I allocate 6 tables that only use one page (all from the mixed extent), and then the actual table I will test with. The first 2 pages come from the existing mixed extent, and then a new mixed extent is allocated, after the previously allocated (and now empty) uniform extents. The rest of the table uses uniform extents, first the empty ones allocated previously (between the first and second mixed extent), then new ones (after the second mixed extent).
use tempdb;
go
create database testit;
go
use testit;
go
create table filler (a char(8000));  -- fixed-width column: one row per 8 KB page
go
insert into filler default values;
go 100
drop table filler;  -- the pages are freed, but the extents keep their mixed/uniform marking
go
-- six single-page tables, each taking one page from the freed mixed extent
create table fill1 (a int);
create table fill2 (a int);
create table fill3 (a int);
create table fill4 (a int);
create table fill5 (a int);
create table fill6 (a int);
insert into fill1 default values;
insert into fill2 default values;
insert into fill3 default values;
insert into fill4 default values;
insert into fill5 default values;
insert into fill6 default values;
go
create table testtable
(id int identity primary key, val int, filler char(4000));
declare @i int;
select @i = 1;
while @i < 5000
begin;
insert into testtable (val, filler) select @i, str(@i);
set @i = @i + 1;
end;
go
select * from testtable with (nolock) option (maxdop 1);  -- nolock permits an allocation-order scan: rows can come back out of clustered-index order
go
use tempdb;
go
drop database testit;
go
The second example mimics concurrent behaviour. While you are filling your table, someone else frees some space - space that will now be used for your table.
use tempdb;
go
create database testit;
go
use testit;
go
create table filler (a char(8000));
go
insert into filler default values;
go 100
create table testtable
(id int identity primary key, val int, filler char(4000));
declare @i int;
select @i = 1;
while @i < 5000
begin;
insert into testtable (val, filler) select @i, str(@i);
set @i = @i + 1;
-- SIMULATE CONCURRENT ACTIVITY
if @i = 2000 truncate table filler;
end;
go
select * from testtable with (nolock) option (maxdop 1);  -- again, rows may come back in allocation order rather than in id order
go
use tempdb;
go
drop database testit;
go
(Sorry for the all-lowercase by the way, but I just lost my post because I took too long and I can't be bothered to do the nice formatting again)
Jacob Luebbers (11/27/2008)
Hugo Kornelis (11/27/2008)
...
My point is that every undocumented behaviour that you observe, no matter how consistent it appears, should be considered a coincidental side effect of how the current version of the software interacts with your current hardware configuration and current data distribution. And that is NOT a safe foundation for building production software.
...
Agree with you here - however, I'm willing to take that risk in certain cases. The massive performance gain this technique can deliver is compelling, and if the process using it can bear that risk (and the maintainers of that process are mindful of potential changes in this behaviour with updates to the engine) I say go for it.
Regards,
Jacob
This is where we will have to agree to disagree. I've witnessed this too often. When you warn about the danger, management is always "prepared to take the risk" and will "of course monitor and plan follow-up action". But when the shit hits the fan, it's suddenly your fault...
Maybe you have been lucky not to have experienced this yet 😉 But I won't take any chances 😀
November 27, 2008 at 5:34 pm
Hey Hugo,
Hugo Kornelis (11/27/2008)
Point taken. Though the flipside is that TABLOCK increases the chance to get an unordered scan instead of an ordered one. More on that below.
Maybe I'm being thick, but I don't see from your comments how TABLOCKX will increase the chance of an unordered scan... am I missing something?
Hugo Kornelis (11/27/2008)
I see that Jeff defends his use of ORDER BY in a subquery by refering to observed behaviour, not to any documentation. Exactly the danger I am warning about.
Jeff's final solution (bottom of the article) doesn't use an ordered subquery - just the clustered index hint to guarantee order.
Hugo Kornelis (11/27/2008)
I'll show you two...
I'll take your word for it (I don't have the time right this sec to run through your demo code). Thanks for the examples though - I'll try to get some coherent thoughts on them over the weekend.
Cheers!
Regards,
Jacob