February 15, 2008 at 6:26 pm
Jeff Moden (2/15/2008)
Christopher Ford (2/15/2008)
So I says to myself, bah, table variable...you can do it without a table variable!
... for an interesting exercise, please run this...or better yet, don't run this unless you want to lock your machine for a bit...
Sorry, Chris... after 12 minutes, it had only done 50,000 rows... I cancelled the query.
Heh... I don't mean to be too much of a smarta$$ here, but they called it "Recursion" so it would show up in the dictionary right after "RBAR" and "Real Slow". :laugh: I don't believe anything recursive will beat the direct update method although some pretty cool code is bubbling to the surface in the attempts. Thanks guys. 🙂
Anyone wanna try a CLR? 😀
Actually - Adam (Machanic) figured that the CLR might be a good solution, since a. it guarantees order and b. it doesn't necessarily require RBAR (you can make the changes in memory and push them to disk). I've been kicking it around in the old peashooter up top - not quite there yet.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 15, 2008 at 6:32 pm
Heh... I knew it would be YOU, Matt... after all the testing you did for/with me on the "other" CLR's, I should have just asked "Hey Matt... you wanna try writing a CLR for this?" 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.
February 15, 2008 at 6:41 pm
I suppose you're going to tell me I'm getting to be predictable now.... :hehe:
There's the obvious way, emulating a SQL-style cursor, which seems oh so very obvious. But the point is - .NET is so much smarter at using loops (as in - process and store the results in memory and start streaming them back), it would just be a shame not to give that a whirl and see if we can't get the old cursor method a boot in the pants. I still don't think there's a shot in hell it will beat the update method, but an honorable second... I'm not counting that out yet.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 15, 2008 at 7:02 pm
Boy, I go off to a CodeCamp dinner and come back and BAM - 15 more entries on this thread! 🙂
2 points
1) SELECT ... OVER(ORDER BY...) guarantees the order used to compute the windowed value, but does NOT (IIRC) guarantee the order of the SELECT output itself - see the sketch below.
2) use of a table variable voids the use of parallelism, which can significantly reduce performance for large dataset processing.
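A quick sketch of point 1 (dbo.SomeTable and its columns are hypothetical, just for illustration): the OVER clause pins the order in which the windowed value is computed, but only an outer ORDER BY pins the order in which the rows come back.
-- Seq is computed in (AccountID, Date) order no matter how the rows are returned...
SELECT AccountID, Date, Amount,
ROW_NUMBER() OVER (ORDER BY AccountID, Date) AS Seq
FROM dbo.SomeTable -- hypothetical table
ORDER BY AccountID, Date -- ...but only this clause guarantees the output order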
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
February 15, 2008 at 8:13 pm
1) SELECT ... OVER(ORDER BY...) does guarantee the order by in the OVER but does NOT IIRC guarantee the order of the SELECT output.
If you're referring to the CTE approach I posted, the order returned is irrelevant; the ordering is captured and preserved in the seq column in the table var: this is completely legit SQL, fully supported, no special circumstances required. The recursive CTE expression processes the table var a 'level' at a time to build the result table for the enclosing query. (A sketch of the overall shape appears at the end of this post.)
2) use of a table variable voids the use of parallelism which can significantly reduce performance for large dataset processing.
You are of course correct; I shouldn't have mentioned the parallel bit. However, this restriction may at some point be lifted (no language reasons why it couldn't be, it's just an engine implementation detail). Also, with the way SQL defines handling of table vars, it is entirely possible that further enhancements of the query engine could result in the var only being populated on demand, at first use -- and in this sort of query, a (better) optimizer could thus completely eliminate the table variable....
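For anyone who missed the earlier post, here is a minimal sketch of the shape being described (the names are assumptions, not the exact original code): the ordering is frozen into a seq column on a table variable with a clustered primary key, and the recursive CTE then walks seq one level at a time.
-- Sketch: running total via table var + recursive CTE (assumed names)
DECLARE @t TABLE
(
seq INT PRIMARY KEY CLUSTERED, -- the ordering, captured once and preserved
AccountID INT,
Amount MONEY
)
INSERT INTO @t (seq, AccountID, Amount)
SELECT ROW_NUMBER() OVER (ORDER BY AccountID, Date), AccountID, Amount
FROM dbo.JBMTestMerry
;WITH RunTot (seq, AccountID, Amount, RunBal) AS
(
SELECT seq, AccountID, Amount, Amount
FROM @t
WHERE seq = 1 -- anchor: the first row in sequence
UNION ALL
SELECT t.seq, t.AccountID, t.Amount,
CASE WHEN t.AccountID = r.AccountID THEN r.RunBal + t.Amount ELSE t.Amount END
FROM RunTot r
INNER JOIN @t t ON t.seq = r.seq + 1 -- exactly one row per recursion 'level'
)
SELECT seq, AccountID, Amount, RunBal
FROM RunTot
OPTION (MAXRECURSION 0) -- the default 100-level limit is far too low here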
-frank
February 15, 2008 at 8:41 pm
Jeff,
Sorry, I should have posted the group by time as well; it's about the same as yours, 800msec or so. I agree that using the partitioned sum as a group by is a loss; I posted it as I thought it was an interesting coding construct (combining distinct() with sum()) and interesting in that the optimizer did not detect the group by emulation.
As far as recursion, I will try to look at the situation a little closer; I am interested in the differences between the plans for the table var approach (fast) versus the all-in-one CTE approach (slow triangular join). I only glanced at the plans, but I did note that the table var approach only processes the same number of rows as the base table (each row processed once), while the all-CTE approach processes the full triangle. (Row counts in the plan steps indicate this.) I'm going to go way out on a limb (as I haven't looked at this enough to really say this), but I suspect that the plan may not really be showing enough detail to identify the full sophistication of the table var approach. I do plan on looking into this more.
The thing to note is that while this is a recursive query in the CTE, it is tail recursive, which can (in theory, and seemingly in practice here) be fully optimized into a loop. The code does look like a triangular join, and if you look at it closely you may even wonder why the results are not the full triangular result set. However, this is the magic of the recursive CTE: a recursive CTE is a totally new SQL construct, and it cannot be emulated with the obvious while loop around the same query! In this specific case, we see that when the optimizer has access to an appropriate index, it will indeed turn the triangular join into a loop, and thus be very, very fast.
The only unfortunate aspect to me is that you need the table variable here, with the clustered index, in order to get the optimizer to make the right choice; it would be vastly preferable if this could be done as two CTE tables (the table var replaced with a non-recursive query) in the WITH cte block (in a single statement query). But that's a relatively minor point here.... To me, this is a huge win in that we are limiting ourselves to fully supported language constructs and we don't have to modify source tables to make this fast: further improvements to the optimizer and query engine will only make this approach even faster, they won't break this....
Of course, as with everything in this area, it is possible that something changes somewhere and the optimizer will suddenly decide to not optimize the recursive join into a loop; that's always a possibility, but seems very unlikely here. However, the result set cannot change; our only risk with this approach is that the optimizer may cause the query to run a lot slower. And while the speed difference would be huge (ginormous even), I think it is highly significant that we do not need to worry about the result set changing. Speed problems we generally live to fix another day; accuracy errors we may not survive....
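For anyone who wants to see the row counts mentioned above for themselves, here's a quick sketch (not Frank's exact method): run each variant with per-operator statistics on and compare the Rows and Executes columns - the table-var version should show roughly one row per base row, while the all-CTE version shows the full triangle.
SET STATISTICS PROFILE ON -- returns actual per-operator row counts with each resultset
GO
-- run the table-var + recursive CTE version here, then the all-in-one CTE version,
-- and compare the Rows/Executes columns of the two profiles
SET STATISTICS PROFILE OFF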
-frank
February 16, 2008 at 11:03 am
TheSQLGuru (2/15/2008)
2) use of a table variable voids the use of parallelism which can significantly reduce performance for large dataset processing.
Kevin, do you, by any chance, have a Microsoft article that states that? Seems like I've been in a huge number of "arguments" about whether to use Temp Tables or Table Variables and, as part of the "It Depends" answer, I wouldn't mind having that particular piece of information as a verifiable source.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 16, 2008 at 11:17 am
Frank,
Heh... understood on things possibly changing...
It's a bit funny... for years, people jumped on me for using the occasional undocumented feature (xp_DirTree to be specific) because "Microsoft could change those at any time without notice". Then, on 2k, they changed the required privs for a documented feature called sp_MakeWebTask that blew a lot of people's code out of the water. 2k had some great documented features like the {f4} key. That very nice feature was removed in 2k5. The list goes on but you get the idea. Dunno what the hell they're going to mess up in 2k8 😀
The really ironic part of all this is that the "undocumented feature" that I used way back when, still hasn't changed. Go figure, huh? 😛
Anyway, thanks for all of your feedback on this thread.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 16, 2008 at 11:27 am
Matt Miller (2/15/2008)
I suppose you're going to tell me I'm getting to be predictable now.... :hehe:There's the obvious way, emulating a SQL-style cursor, which seems oh so very obvious. But the point is - .NET is so much smarter at using loops (as in - process and store the results in memory and start streaming them back), it would just be a shame not to give that a whirl and see if we can't get the old cursor method a boot in the pants. I still don't think there's a shot in hell it will beat the update method, but an honorable second... I'm not counting that out yet.
Predictable? I guess you could call it that... I was thinking more like "reliable" 🙂
I agree... I don't believe that anything, including the best written CLR, will ever be able to beat the update method, but I do think it can be written to be as fast as a firehose Cursor:w00t:... heh, of course, as soon as I say that, someone will find a way (probably you). 😀
I'm not sure, if I were you, that I'd waste the time writing the CLR for any other reason other than to have an information point where one could say, "No, we already tried that and it wasn't as fast". 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
February 16, 2008 at 9:40 pm
Jeff Moden (2/16/2008)
TheSQLGuru (2/15/2008)
2) use of a table variable voids the use of parallelism which can significantly reduce performance for large dataset processing.
Kevin, do you, by any chance, have a Microsoft article that states that? Seems like I've been in a huge number of "arguments" about whether to use Temp Tables or Table Variables and, as part of the "It Depends" answer, I wouldn't mind having that particular piece of information as a verifiable source.
2005 BOL: ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/tsqlref9/html/1ef0b60e-a64c-4e97-847b-67930e3973ef.htm
Important:
Queries that modify table variables do not generate parallel query execution plans. Performance can be affected when very large table variables, or table variables in complex queries, are modified. In these situations, consider using temporary tables instead. For more information, see CREATE TABLE (Transact-SQL). Queries that read table variables without modifying them can still be parallelized.
So the insert to populate it can't be parallelized but select access can.
I have moved several clients away from extensive use of table variables. They have either large datasets or poorly distributed data values (or both), and table variables hurt on both counts: no parallelism when populating them, and no statistics, which leads the optimizer to suboptimal plans.
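A minimal sketch of that swap (the names here are made up for illustration): the temp table version gets real statistics and, per the BOL passage above, its populate is not barred from a parallel plan.
-- Table variable: no statistics, and the INSERT cannot use a parallel plan
DECLARE @work TABLE (AccountID INT, Amount MONEY)
INSERT INTO @work (AccountID, Amount)
SELECT AccountID, Amount FROM dbo.JBMTestMerry
-- Temp table: eligible for parallelism on the populate, and the index
-- carries statistics that give the optimizer real cardinality estimates
CREATE TABLE #work (AccountID INT, Amount MONEY)
CREATE CLUSTERED INDEX cx_work ON #work (AccountID)
INSERT INTO #work (AccountID, Amount)
SELECT AccountID, Amount FROM dbo.JBMTestMerry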
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
February 16, 2008 at 9:44 pm
Off Topic
Hey Matt - I got your remote "Hello" at the Raleigh Code Camp today. Some guy came up to me and said "Hey, you're TheSQLGuru (had that on my name tag 😀 ). Matt says to tell you Hello". Boy did you miss a good event too. We had a televised luncheon (for dnrTV), lots of swag, great topics (SIX tracks!) and an unbelievable 4 Microsoft RDs and 12 MVPs among the presenters.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
February 16, 2008 at 9:49 pm
Heh... dang... thanks Kevin! I should'a looked in BOL...
--Jeff Moden
Change is inevitable... Change for the better is not.
February 17, 2008 at 1:14 am
TheSQLGuru (2/16/2008)
Off Topic: Hey Matt - I got your remote "Hello" at the Raleigh Code Camp today. Some guy came up to me and said "Hey, you're TheSQLGuru (had that on my name tag 😀 ). Matt says to tell you Hello". Boy did you miss a good event too. We had a televised luncheon (for dnrTV), lots of swag, great topics (SIX tracks!) and an unbelievable 4 Microsoft RDs and 12 MVPs among the presenters.
Bahahaahahahaa! That's my recruiter, who I have been reintroducing to .NET and SQL Server. That guy has a pair of steel-tipped ones - he can just walk up to anyone and start up a conversation. He had asked whose presentations I thought he should stop in on, so I told him the ones I wanted to see (a couple of yours, Andy's (on SSIS), Miguel's (on what the future of contracting looks like) and Diane's (on LINQ)). Your internet handle came up....
It was a REALLY hard one to pass up....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 18, 2008 at 10:47 am
Jeff Moden (2/16/2008)
Matt Miller (2/15/2008)
I suppose you're going to tell me I'm getting to be predictable now.... :hehe:There's the obvious way, emulating a SQL-style cursor, which seems oh so very obvious. But the point is - .NET is so much smarter at using loops (as in - process and store the results in memory and start streaming them back), it would just be a shame not to give that a whirl and see if we can't get the old cursor method a boot in the pants. I still don't think there's a shot in hell it will beat the update method, but an honorable second... I'm not counting that out yet.
Predictable? I guess you could call it that... I was thinking more like "reliable" 🙂
I agree... I don't believe that anything, including the best written CLR, will ever be able to beat the update method, but I do think it can be written to be as fast as a firehose Cursor:w00t:... heh, of course, as soon as I say that, someone will find a way (probably you). 😀
I'm not sure, if I were you, that I'd waste the time writing the CLR for any other reason other than to have an information point where one could say, "No, we already tried that and it wasn't as fast". 😉
Well - it might just be egg on the face time....for us both. I just built a running totals CLR Table-valued function.
What if I told you it executes in 13 seconds? No temp table needed either - it only requires a covering non-clustered index... Meaning - if you factor in building a temp table, etc... it might actually be FASTER than the update process. Of course - this all happens in memory, so I'm thinking it's murder on memory.
I knew .NET could loop its a$$ off, but wow. I didn't have enough faith in it...
I need to go dig out your solution and set these up against each other. Once I do - I will post the CLR code and the results.
It's totally not reusable right now (meaning - the column names and table name are hardcoded, so I can't just pass it parameters and have it just give me the stuff), but - nothing says that it can't be improved on.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 18, 2008 at 12:51 pm
Okay - results time....:)
First - the test scenario setup. Since the "update" method requires a new table and a clustered index to do its thing, it's necessary to throw in the creation of the temp table and clustered index. In the same way - since the CLR function requires an index on the "base" table, the time required to build this will be counted "against" the CLR function.
The base table is built per the "Merry-go-Round" scenario and is used for both tests.
--===== Create and populate a 1,000,000 row test table 100 rows at a time to
-- simulate inserting multiple sets of rows.
-- Column "RowNum" is an IDENTITY and has a range of 1 to 1,000,000 unique numbers
-- Column "AccountID" has a range of 1 to 50,000 non-unique numbers
-- Column "Amount has a range of 0.0000 to +/- 99.9900 non-unique numbers
-- Column "Date" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
--===== If the test table already exists, drop it
SET NOCOUNT ON
IF OBJECT_ID('dbo.JBMTestMerry','U') IS NOT NULL
DROP TABLE dbo.JBMTestMerry
GO
--===== Create the test table
CREATE TABLE dbo.JBMTestMerry
(
RowNum INT IDENTITY (1,1) NOT NULL Primary Key Clustered,
AccountID INT NULL,
Amount MONEY NULL,
Date DATETIME NULL,
RunBal MONEY NULL,
GrpBal MONEY NULL,
RunCnt INT NULL,
GrpCnt INT NULL
)
--===== Add the "sorting index" to the table
CREATE NONCLUSTERED INDEX IX_JBMTestMerry_AccountID_Date --not clustered for "Merry-go-Round" test
ON dbo.JBMTestMerry (AccountID, Date)
--===== Build the table 100 rows at a time to "mix things up"
DECLARE @Counter INT
SET @Counter = 0
WHILE @Counter < 1000000
BEGIN
--===== Add 100 rows to the test table
INSERT INTO dbo.JBMTestMerry
(AccountID, Amount, Date)
SELECT TOP 100
AccountID = ABS(CHECKSUM(NEWID()))%50000+1,
Amount = CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY),
Date = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME)
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== Increment the counter
SET @Counter = @Counter + 100
END
The CLR function:
Imports System
Imports System.Collections ' for IEnumerable (the TVF's return type)
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Imports Microsoft.VisualBasic ' for the Collection used to buffer rows
Imports System.Runtime.InteropServices
Partial Public Class UserDefinedFunctions
Private Class jbmrunbal
Private _rowID As SqlInt32
Private _accountid As SqlInt32
Private _date As DateTime
Private _amt As SqlMoney
Private _rb As SqlMoney
Public Property RowID() As SqlInt32
Get
Return _rowID
End Get
Set(ByVal value As SqlInt32)
_rowID = value
End Set
End Property
Public Property AccountID() As SqlInt32
Get
Return _accountid
End Get
Set(ByVal value As SqlInt32)
_accountid = value
End Set
End Property
Public Property DateCol() As DateTime
Get
Return _date
End Get
Set(ByVal value As DateTime)
_date=value
End Set
End Property
Public Property amount() As SqlMoney
Get
Return _amt
End Get
Set(ByVal value As SqlMoney)
_amt = value
End Set
End Property
Public Property running() As SqlMoney
Get
Return _rb
End Get
Set(ByVal value As SqlMoney)
_rb = value
End Set
End Property
End Class
<Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName:="FillRowRunTot", DataAccess:=DataAccessKind.Read, IsDeterministic:=True, IsPrecise:=True, TableDefinition:="RowNum int, AccountID int, date datetime, amount money, runbal money")> _
Public Shared Function RunningTotal() As IEnumerable
Dim cmd As SqlCommand
Dim rdr As SqlDataReader
Dim jm As New Collection
Dim jbmr As jbmrunbal
Dim prevacct As SqlInt32 = 0
Dim runbal As SqlMoney = 0
Using conn As New SqlConnection("context connection=true")
conn.Open()
cmd = New SqlCommand("Select rownum,accountid,date,amount from jbmtestmerry ORDER by accountid,date", conn)
rdr = cmd.ExecuteReader()
While (rdr.Read())
jbmr = New jbmrunbal
jbmr.RowID = rdr.GetInt32(0)
jbmr.AccountID =rdr.GetInt32(1)
jbmr.DateCol = rdr.GetDateTime(2)
jbmr.amount = rdr.GetSqlMoney(3)
If prevacct = jbmr.AccountID Then
runbal = runbal + jbmr.amount ' same account: keep accumulating
Else
runbal = jbmr.amount ' account changed: restart the running total
End If
jbmr.running = runbal
prevacct = jbmr.AccountID
jm.Add(jbmr)
End While
End Using
Return jm
End Function
Public Shared Sub FillRowRunTot(ByVal obj As Object, <Out()> ByRef RowID As SqlInt32, <Out()> ByRef AccountID As SqlInt32, <Out()> ByRef DateCol As SqlDateTime, <Out()> ByRef Amount As SqlMoney, <Out()> ByRef Runbal As SqlMoney)
Dim j As jbmrunbal
j = CType(obj, jbmrunbal)
With j
RowID = .RowID
AccountID = .AccountID
DateCol = .DateCol
Amount = .amount
Runbal = .running
End With
End Sub
End Class
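One thing the test script below takes for granted: the assembly has to be deployed and the TVF registered before dbo.runningtotal() can be called. A minimal sketch, assuming the class above compiles to RunningTotals.dll with no root namespace (the assembly name and path are placeholders of mine):
EXEC sp_configure 'clr enabled', 1 -- one-time server setting
RECONFIGURE
GO
CREATE ASSEMBLY RunningTotals
FROM 'C:\clr\RunningTotals.dll' -- placeholder path
WITH PERMISSION_SET = SAFE -- only the context connection is used, so SAFE is enough
GO
CREATE FUNCTION dbo.RunningTotal()
RETURNS TABLE (RowNum int, AccountID int, date datetime, amount money, runbal money)
AS EXTERNAL NAME RunningTotals.UserDefinedFunctions.RunningTotal
GO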
So - this leads to the test script:
if object_id('dbo.jbmtest','U') is not null --guard so the script is rerunnable
drop table jbmtest
declare @g datetime
--start the clock
select @g=getdate()
select * Into JBMTEST from jbmtestMerry
Create unique clustered index uci_jbmtest on jbmtest(accountID,date)
select 'Update method - create table and index',datediff(ms,@g,getdate())
go
--===== Declare the variables for the "Code Basis"
declare @g datetime
select @g=getdate()
DECLARE @PrevRunBal MONEY
SET @PrevRunBal = 0
DECLARE @PrevAcctID INT
SET @PrevAcctID = 0
--===== Apply what we know to the "Code Basis"
UPDATE dbo.jbmtest
SET @PrevRunBal = RunBal = CASE
WHEN AccountID = @PrevAcctID
THEN @PrevRunBal + Amount
ELSE Amount -- Restarts total at "0 + current amount"
END,
@PrevAcctID = AccountID
FROM dbo.JBMTest WITH (INDEX(uci_jbmtest),TABLOCKX)
select 'Update method',datediff(ms,@g,getdate())
--===== Display the results in the correct order
SELECT top 100 *
FROM dbo.JBMTest
ORDER BY AccountID, Date
go
--dbcc freeproccache
go
if object_id('dbo.memtest','U') is not null --guard so the script is rerunnable
drop table memtest
go
declare @g datetime
--start the clock
select @g=getdate()
create index ix_jbmtestmerry on jbmtestmerry(accountid,date) include (amount)
Select * into memtest from
(select * from dbo.runningtotal()) t
drop index jbmtestmerry.ix_jbmtestmerry --drop the temporary index so as to not "leave anything behind"
select 'CLR method',datediff(ms,@g,getdate())
--Verification - makes sure that they match
SELECT top 100 *
FROM dbo.memTest
ORDER BY AccountID, Date
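The top-100 selects above are only an eyeball check; here's a sketch of a full-set comparison (both EXCEPTs should come back empty if the two methods agree on every row):
-- Rows produced by the update method but not the CLR method...
SELECT RowNum, AccountID, Date, Amount, RunBal FROM dbo.jbmtest
EXCEPT
SELECT RowNum, AccountID, date, amount, runbal FROM dbo.memtest
-- ...and vice versa; zero rows from both means the results match exactly
SELECT RowNum, AccountID, date, amount, runbal FROM dbo.memtest
EXCEPT
SELECT RowNum, AccountID, Date, Amount, RunBal FROM dbo.jbmtest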
Just for a point of comparison - I also threw in a version of Adam Machanic's "Cursor version" to boot:
declare @G datetime
select @g=getdate()
create index ix_jbmtestmerry on jbmtestmerry(accountid,date) include (amount)
DECLARE RunningTotalCursor
CURSOR LOCAL FAST_FORWARD FOR
SELECT AccountID,date, Amount
FROM JBMTestMerry
ORDER BY AccountID,date
OPEN RunningTotalCursor
declare @prevaccount int
set @prevaccount=0
DECLARE @AccountID INT
Declare @date datetime
DECLARE @Amount MONEY
DECLARE @RunningTotal MONEY
SET @RunningTotal = 0
DECLARE @Results TABLE
(
AccountID INT NOT NULL,
date datetime,
Amount MONEY,
RunningTotal MONEY,
PRIMARY KEY(accountid,date)
)
FETCH NEXT FROM RunningTotalCursor
INTO @AccountID,@date, @Amount
WHILE @@FETCH_STATUS = 0
BEGIN
SET @RunningTotal = case when @prevaccount=@accountid then @RunningTotal else 0 end + @Amount
SET @prevaccount = @AccountID --without this, the total would reset on every row
INSERT @Results
VALUES (@AccountID, @date, @Amount, @RunningTotal)
FETCH NEXT FROM RunningTotalCursor
INTO @AccountID,@date, @Amount
END
CLOSE RunningTotalCursor
DEALLOCATE RunningTotalCursor
drop index jbmtestmerry.ix_jbmtestmerry --drop the temporary index so as to not "leave anything behind"
select 'Cursor method',datediff(ms,@g,getdate())
SELECT Top 100 *
FROM @Results
ORDER BY AccountID,date
And now - without further delay.... THE RESULTS (I picked what looks to be a "middle of the road" run):
Update method: 6936 ms (create temp table and index) + 4060 ms (update) = 10996 ms
CLR method: 15032 ms (add 1267 ms if you want a clustered index on the final temp table)
Adam Machanic's cursor: 50196 ms
So - the "update" method is still safe in its first place during most runs. But CLR does seem to give it a run for its money, and does seem to guarantee Order, etc... It's actually close enough that once out of 20 runs, the CLR method actually did beat the update method once.
Both blows the doors off of the Cursor method.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?