May 9, 2008 at 9:47 am
dfalso (5/9/2008)
Jeff Moden (5/6/2008) No... no Reporting Services or Excel required. Take a peek in Books Online for "cross-tab reports" in SQL Server 2000 or follow this URL...
http://msdn.microsoft.com/en-us/library/aa172756(sql.80).aspx
They're very effective, can be made to return more than PIVOT in 2k5, and, in the presence of the correct indexes, just absolutely fly. They're also pretty easy to make in the form of Dynamic SQL and, with the advent of VARCHAR(MAX) in 2k5, can be easily made to return more columns than you'd ever need.
{I apologize, I was away from the office}
Ah, I know of that technique (and use it), I just didn't have a formal name for it (I just think of it as a way to pivot things). Thanks for the tip.
Yes, I do appreciate that one can make a PIVOT statement effectively (in fact, that's how I currently solve this problem). I just feel that there should be a more efficient CURSOR solution. My reasoning is in two parts:
1) If you look at the process to build and run your SQL, you make one pass over the table to get all the row id values (which go into a variable). Then, you make another pass over the information_schema or sys.columns view to create the SQL for each column, and finally you actually run the SQL, which then issues a separate PIVOT statement for each column in the table (but UNIONing them all together). So you've evaluated all the rows, then all the columns, to create a statement that evaluates each row of each column. It seems like there should be a more efficient way to go there.
Actually - the "first pass" you talk about would only conceptually happen when there are NO appropriate supporting indexes. With the appropriate indexing, the top levels of the B-tree would be used to find the values.
As to the rest of the process - that's not what I've seen. The advantage of the set-based approach is that all of the column values are being accumulated AT THE SAME TIME, and NOT serially. That's what the worktable is tracking. You're still describing the procedural approach to solving the problem, which is by nature serial, rote and inefficient when compared to an implicitly much more parallel method of computing these values.
As to cursors - one of the big "weaknesses" of T-SQL cursors (versus other cursor-like objects in procedural languages) is that each row requires individual "commits". There's no way to "accumulate intermediate results, and then flush them all" with T-SQL cursors, which is ultimately why cursors don't stand a chance performance-wise against decent set-based code. The whole point of set-based is to allow you to perform operations on (large) chunks of data, and flush the changes/commit in *batches*, not row-by-row.
If you do in fact come up with a cursor method that is faster/more efficient than a set-based method for crosstabulation, please share it. There would be a HUGE group of people who would love to get their hands on more efficient ways to aggregate data like that. I think you will however find that set-based will crush the cursor methods, each and every time.
2) The problem is not a set-level problem, it's a cell-level problem. That is, there's no common operation applied to each set member. There's a common heuristic, sure, but each column is evaluated separately, and each row for that column is evaluated separately still, so in effect each cell has its own custom operation (regardless of the syntax we use). Clearly, an array-based solution is the best, so I *think* a cursor is the closest to that. (How well does CLR stored-procedure code do? Can it access SQL Server's memory directly for I/O, or is it another layer? Because if it's efficient, that's probably the best choice... I haven't done work with CLR stored procs yet, I confess.)
CLR might have some interesting prospects for beating the cursor methods, but having tried it myself, there would be precious few opportunities to flat out beat T-SQL code in this case. It would likely have to be some bizarre aggregation that SQL Server doesn't naturally provide for (which is ultimately what it takes for CLR to really shine: finding a spot where T-SQL is just plain not very good). Odd running totals, peculiarly scoped percentages, maybe - but most of the calculations would fly through T-SQL (either using aggregate functions or the new windowed function calls), and again - eclipse CLR in terms of perf. Combine that with its "challenged" resource management, and CLR is not something I would lean on overly heavily for LARGE amounts of data processing (which dovetails directly into Microsoft's own recommendations as to best practices for CLR usage).
Just my experience on it.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
May 9, 2008 at 10:45 am
Matt Miller (5/9/2008) Actually - the "first pass" you talk about would only conceptually happen when there are NO appropriate supporting indexes. With the appropriate indexing, the top levels of the B-tree would be used to find the values.
As to the rest of the process - that's not what I've seen. The advantage of the set-based approach is that all of the column values are being accumulated AT THE SAME TIME, and NOT serially. That's what the worktable is tracking. You're still describing the procedural approach to solving the problem, which is by nature serial, rote and inefficient when compared to an implicitly much more parallel method of computing these values.
To clarify: I was describing the process and issues of building and running the SQL statement. If you're using a PIVOT, your syntax is of the form
SELECT { ID values }
FROM (
SELECT {ID field}, {column to transpose}
FROM mytable
) AS sourcetab
PIVOT (
AVG({column to transpose})
FOR {ID field} IN
( {ID values} )
) AS pivotted
So, if we're building this dynamically, we'd first create the variable string for the {ID values} part (assuming we don't need to rename anything) - that's the initial SELECT over the table.
Then we need to build a PIVOT statement per {column to transpose} field, each of which uses our ID list variable. That's the information_schema SELECT.
We'll concatenate each string together, but we still have to actually run the code. And the code runs each SELECT and UNIONs them together, although committing them as a batch, like you said.
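For reference, a minimal dynamic-SQL sketch of those three steps might look like the following. It assumes a hypothetical table dbo.mytable with an INT key column RowID and numeric data columns - those names are purely illustrative, not from the article:
DECLARE @idList VARCHAR(MAX), @sql VARCHAR(MAX)
-- Pass 1: collect the ID values as a bracketed column list, e.g. [1],[2],[3]
SELECT @idList = ISNULL(@idList + ',', '') + QUOTENAME(CAST(RowID AS VARCHAR(10)))
FROM dbo.mytable
GROUP BY RowID
-- Pass 2: build one PIVOT query per data column from the catalog views, UNIONed together
SELECT @sql = ISNULL(@sql + ' UNION ALL ', '')
            + 'SELECT ''' + c.name + ''' AS ColumnName, ' + @idList
            + ' FROM (SELECT RowID, ' + QUOTENAME(c.name) + ' AS val FROM dbo.mytable) AS sourcetab'
            + ' PIVOT (AVG(val) FOR RowID IN (' + @idList + ')) AS pivotted'
FROM sys.columns c
WHERE c.object_id = OBJECT_ID('dbo.mytable')
AND c.name <> 'RowID'
-- Pass 3: run the concatenated statement
EXEC (@sql)
With a suitable index on RowID, pass 1 can be answered from the top of the index rather than a full scan of the table, which is the point Matt makes above.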
Unfortunately, the CASE WHEN technique cited earlier has the same steps necessary to construct and run the string.
Informally, I was trying to describe how, in this situation, each cell ultimately has a separate operation, because no two rows share the same operator. It's a bit like having to do something that would cause us to write a CASE WHEN statement where we have a WHEN clause for each ID in a table. Yes, the operation happens as a set, but we're not really doing set-based work, so there should be a more efficient way to accomplish the task. That was my point.
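To make that concrete, the CASE-based cross-tab equivalent of one slice of the PIVOT above might look like this (same hypothetical dbo.mytable/RowID names as before, with SomeColumn standing in for one column being transposed; in the dynamic version a statement like this is generated per column and UNIONed):
-- One row of the transposed output: one aggregated CASE per ID value
-- (AVG ignores the NULLs produced when the WHEN doesn't match)
SELECT 'SomeColumn' AS ColumnName,
       AVG(CASE WHEN RowID = 1 THEN SomeColumn END) AS [1],
       AVG(CASE WHEN RowID = 2 THEN SomeColumn END) AS [2],
       AVG(CASE WHEN RowID = 3 THEN SomeColumn END) AS [3]
FROM dbo.mytable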
As to cursors - one of the big "weaknesses" of T-SQL cursors (versus other cursor-like objects in procedural languages) is that each row requires individual "commits". There's no way to "accumulate intermediate results, and then flush them all" with T-SQL cursors, which is ultimately why cursors don't stand a chance performance-wise against decent set-based code. The whole point of set-based is to allow you to perform operations on (large) chunks of data, and flush the changes/commit in *batches*, not row-by-row.
Thanks for the insight, and that's some good information.
June 27, 2008 at 8:14 am
Chris had the most elegant solution; all he had to do was rearrange the order of his IF statements, i.e....
declare @i int
set @i = 1
while( @i < 101)
begin
if @i%3 = 0 and @i%5 = 0 print 'BizzBuzz'
else if @i%5 = 0 print 'Buzz'
else if @i%3 = 0 print 'Bizz'
else print @i
set @i = @i+1
end
Done.
June 27, 2008 at 8:32 am
Ummm... you think the use of a While loop is "elegant"? :blink: :Whistling:
--Jeff Moden
Change is inevitable... Change for the better is not.
June 27, 2008 at 9:10 am
Of the solutions printed in the article, his had the fastest computing time.
Please excuse my ignorance, but what is wrong with while loops?
June 27, 2008 at 10:07 am
zxn001 (6/27/2008)
Of the solutions printed in the article, his had the fastest computing time. Please excuse my ignorance, but what is wrong with while loops?
Nothing per se, but run this in text mode...
set nocount on
declare @start datetime, @end datetime
select @start = GetDate()
select case N%5 + N%3
when 0 then 'BizzBuzz'
else case N%5
when 0 then 'Buzz'
else case N%3
when 0 then 'Bizz'
else cast(N as varchar(10))
end
end
end
from tally t1 --tally is a table of one (int) column, containing numbers 1 - 11,000
where N < 10001
select @end = GetDate()
select datediff(ms, @start, @end) as SetTheoryLoopTime
select @start = GetDate()
declare @i int
set @i = 1
while( @i < 10001)
begin
if @i%3 = 0 and @i%5 = 0 print 'BizzBuzz'
else if @i%5 = 0 print 'Buzz'
else if @i%3 = 0 print 'Bizz'
else print @i
set @i = @i+1
end
select @end = GetDate()
select datediff(ms, @start, @end) as whileLoopTime
What do you get back in performance times?
Dave J
June 27, 2008 at 11:27 am
The set was much faster.
However, I must note that in order for the set to work, you must have the pre-existing data for it to compute, which would require either another loop, or someone to type in the data.
Is there a way to add a few lines to the SetTheoryLoop to first create and populate the table of integers, AND have it still be faster than the while loop?
But still, thanks for showing me the Set Loop; if I happen to have a pre-existing table, I'll be sure to use it rather than the WHILE.
June 27, 2008 at 11:35 am
zxn001 (6/27/2008)
The set was much faster. However, I must note that in order for the set to work, you must have the pre-existing data for it to compute, which would require either another loop, or someone to type in the data.
Is there a way to add a few lines to the SetTheoryLoop to first create and populate the table of integers, AND have it still be faster than the while loop?
But still, thanks for showing me the Set Loop; if I happen to have a pre-existing table, I'll be sure to use it rather than the WHILE.
Just so you're convinced:
if object_id('tempdb..#tally') is not null drop table #tally
set nocount on
declare @start datetime, @end datetime
select @start = GetDate()
select top 11000 identity(int,1,1) N
into #tally
from sys.all_columns sc1, sys.all_columns sc2
create unique clustered index ixn on #tally(n)
select case N%5 + N%3
when 0 then 'BizzBuzz'
else case N%5
when 0 then 'Buzz'
else case N%3
when 0 then 'Bizz'
else cast(N as varchar(10))
end
end
end
from #tally t1 --#tally is a table of one (int) column, containing numbers 1 - 11,000, built on the fly above
where N < 10001
select @end = GetDate()
select datediff(ms, @start, @end) as SetTheoryLoopTime
select @start = GetDate()
declare @i int
set @i = 1
while( @i < 10001)
begin
if @i%3 = 0 and @i%5 = 0 print 'BizzBuzz'
else if @i%5 = 0 print 'Buzz'
else if @i%3 = 0 print 'Bizz'
else print @i
set @i = @i+1
end
select @end = GetDate()
select datediff(ms, @start, @end) as whileLoopTime
Really - forget loops even exist. The likelihood that they will outperform something set-based is minute.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
June 27, 2008 at 11:38 am
Insert this into the top of the script
--===== Conditionally drop
IF OBJECT_ID('dbo.Tally') IS NOT NULL
DROP TABLE dbo.Tally
--===== Create and populate the Tally table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
No loop required! 🙂
On my old-ish server it runs in 80 ms
Hats off to Jeff M for teaching us this trick
Edit: I see Matt beat me to it, but he is spot on, forget loops if you can.
Dave
June 27, 2008 at 11:57 am
I didn't know SQL could do that.
And you made me a believer, a very confused one, but still believing.
So now I'm really curious: WHY does the set loop work faster? Is it the way SQL is implemented, or is it general?
It almost seems like the set does one comparison of @i to 3 and 5 at the same time, but bytes are binary, so I did some testing:
if you switch the 3 and 5, i.e. test 3 then 5, it increases the speed, which implies that it doesn't test them at the same time, because if it did, the order shouldn't matter. If it tests 3 first and then, if not, tests 5, the increase in speed makes sense, since multiples of 3 occur more often than multiples of 5.
So what exactly does the Set Loop do?
P.S. thanks everyone for enlightening me
June 27, 2008 at 12:14 pm
zxn001 (6/27/2008)
I didn't know SQL could do that. And you made me a believer, a very confused one, but still believing.
So now I'm really curious: WHY does the set loop work faster? Is it the way SQL is implemented, or is it general?
It almost seems like the set does one comparison of @i to 3 and 5 at the same time, but bytes are binary, so I did some testing:
if you switch the 3 and 5, i.e. test 3 then 5, it increases the speed, which implies that it doesn't test them at the same time, because if it did, the order shouldn't matter. If it tests 3 first and then, if not, tests 5, the increase in speed makes sense, since multiples of 3 occur more often than multiples of 5.
So what exactly does the Set Loop do?
P.S. thanks everyone for enlightening me
The simple answer is : SQL Server is a SET engine. That's how it thinks, and how it processes. Set-based actions occur on many rows simultaneously (assuming we stay out of SQL Server's way and let it decide how best to do things).
The set operation here isn't actually a loop (that's a misnomer). The Tally table provides us with a high-speed ordered list of numbers on which we can perform the rules. Because it's set-based - several hundred/thousand rows are being processed in one shot, instead of the one by one you get out of the loop.
The difference gets even more pronounced when you move to "bigger" sets. The advantage of a set-based approach is that performance doesn't have to decrease linearly as the set gets bigger.
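As a small illustration of that "set-based loop" on something other than BizzBuzz - this just reuses the dbo.Tally table built earlier in the thread, and the date is only an example value:
-- Generate the next 30 calendar dates: no WHILE loop, a single pass over the Tally table
SELECT DATEADD(dd, t.N - 1, '20080627') AS TheDate
FROM dbo.Tally t
WHERE t.N <= 30
Whether it's 30 rows or 30,000, it's still one statement, which is why the time doesn't climb the way the loop's does.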
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
June 27, 2008 at 12:19 pm
zxn001 (6/27/2008)
I didn't know SQL could do that. And you made me a believer, a very confused one, but still believing.
So now I'm really curious: WHY does the set loop work faster? Is it the way SQL is implemented, or is it general?
It almost seems like the set does one comparison of @i to 3 and 5 at the same time, but bytes are binary, so I did some testing:
if you switch the 3 and 5, i.e. test 3 then 5, it increases the speed, which implies that it doesn't test them at the same time, because if it did, the order shouldn't matter. If it tests 3 first and then, if not, tests 5, the increase in speed makes sense, since multiples of 3 occur more often than multiples of 5.
So what exactly does the Set Loop do?
P.S. thanks everyone for enlightening me
I happen to have a link to an article that explains much of that...
http://www.sqlservercentral.com/articles/TSQL/62867/
Most of the performance-based articles that I've written either use a Tally table to form the set-based loop, or other methods. For example, most folks use cursors or While loops or (worse) triangular joins to do running totals... they take comparatively forever to run... rather than list each of the articles individually, take a look at the set-based solutions that use some of these "set-based loops" in the following collection of articles... try the code in each of them... you'll be amazed...
http://www.sqlservercentral.com/Authors/Articles/Jeff_Moden/80567/
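For anyone wondering what a "triangular join" running total looks like, here's a rough sketch against the #tally table from the scripts above - it runs, but every row re-reads all the rows before it, so the work grows roughly with the square of the row count:
SELECT o.N,
       (SELECT SUM(i.N) FROM #tally i WHERE i.N <= o.N) AS RunningTotal
FROM #tally o
WHERE o.N <= 10000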
--Jeff Moden
Change is inevitable... Change for the better is not.
June 27, 2008 at 12:20 pm
zxn001 (6/27/2008)
I didn't know SQL could do that. And you made me a believer, a very confused one, but still believing.
So now I'm really curious: WHY does the set loop work faster? Is it the way SQL is implemented, or is it general?
It almost seems like the set does one comparison of @i to 3 and 5 at the same time, but bytes are binary, so I did some testing:
if you switch the 3 and 5, i.e. test 3 then 5, it increases the speed, which implies that it doesn't test them at the same time, because if it did, the order shouldn't matter. If it tests 3 first and then, if not, tests 5, the increase in speed makes sense, since multiples of 3 occur more often than multiples of 5.
So what exactly does the Set Loop do?
P.S. thanks everyone for enlightening me
There are some assumptions in what I'm about to say.
Behind the scenes, SQL Server is always using cursors.
When you run a set-based operation, SQL Server creates a cursor and iterates through the set to perform the requested operations. On a simple set, it will create its cursor, do the work with locking, unlocking, etc., then deliver the results to you.
1 >< round trip behind the scenes.
When you set up a loop, every iteration requires a trip behind the scenes. If you are looping through a cursor, then you are:
1. > requesting a cursor
2. < receiving a reference
3. > requesting a move
4. < receiving new references
5. > closing the cursor
6. < returning
7. > killing the cursor
8. < returning
I've left out the locking because it varies.
... and if you are just looping without a cursor, but doing some data manipulation in each loop, then each iteration's behind-the-scenes request will cause a new SQL Server cursor to be created, used, and destroyed just to resolve your request.
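A bare-bones T-SQL cursor showing those round trips in code (again borrowing the #tally table from earlier in the thread) might look like this:
DECLARE @N INT
DECLARE c CURSOR LOCAL FAST_FORWARD FOR   -- steps 1 and 2: request the cursor, receive a reference
    SELECT N FROM #tally WHERE N <= 100
OPEN c
FETCH NEXT FROM c INTO @N                 -- steps 3 and 4: request a move, receive the next row
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @N                              -- the row-by-row work happens here
    FETCH NEXT FROM c INTO @N             -- ...and every additional row costs another trip
END
CLOSE c                                   -- steps 5 and 6
DEALLOCATE c                              -- steps 7 and 8: kill the cursor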
June 27, 2008 at 12:26 pm
Tom Garth (6/27/2008)
There are some assumptions in what I'm about to say.
That's a heck of a good explanation, Tom...
To put it in layman's terms, there are two ways to buy groceries...
1. Go to the store, buy all the groceries, come home and unpack. 1 Trip (Set based... you bought a set of groceries).
... or... the "RBAR" method
1. Go to the store
2. Pick out 1 item
3. Stand in line at checkout
4. Pay for item
5. Take item home
6. Repeat until all groceries have been bought
Which one do you think will take the longest and which one do you think will be easier on your gasoline bill?
Same idea with set based technology vs RBAR.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 27, 2008 at 12:31 pm
Jeff Moden (6/27/2008)
Tom Garth (6/27/2008)
There are some assumptions in what I'm about to say. That's a heck of a good explanation, Tom...
To put it in laymen's terms, there are two ways to buy groceries...
1. Go to the store, buy all the groceries, come home and unpack. 1 Trip (Set based... you bought a set of groceries).
... or... the "RBAR" method
1. Go to the store
2. Pick out 1 item
3. Stand in line at checkout
4. Pay for item
5. Take item home
6. Repeat until all groceries have been bought
Which one do you think will take the longest and which one do you think will be easier on your gasoline bill?
Same idea with set based technology vs RBAR.
:hehe:
That's just pure poetry, Jeff.