October 21, 2007 at 4:21 am
So many words instead of just saying "No".
David, I can tell you a secret.
Math is not about numbers at all.
Math uses numbers as an abstraction in an attempt to explain the LOGIC of the world.
_____________
Code for TallyGenerator
October 21, 2007 at 4:34 am
David McFarland (10/20/2007)
I'll happily run it for you once marting is done, but I'd expect them to be in the same ballpark for that simple of a query.
Logic (you would know what it is if you'd studied math) says otherwise.
Two derived tables (one generated for each OVER statement) will make the overall performance of your query worse than the standard T-SQL query with a single derived table.
You probably cannot notice the difference when you are a single user on a dedicated server with a database as small as Northwind.
So, everybody will happily accept your approach, because people tend to believe in numbers, not in logic.
But the difference will appear later, when your approach is long gone into production and there is a real-life load of data.
That's the reason why the "Performance Tuning" forum is full of stories about "Performance of my database degrades gradually".
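If you don't want to take the logic on faith, check it yourself: compare the plans and the I/O of the two forms side by side. A minimal sketch, assuming a test table shaped like the #SalesOrderHeader used later in this thread:
SET STATISTICS IO ON
SET STATISTICS TIME ON

--===== Single derived table: the aggregates cost one extra scan
SELECT H.CustomerID,
       H.TotalDue - T.Avg_TotalDue AS DifferenceFromAverage
  FROM #SalesOrderHeader H
 INNER JOIN (SELECT CustomerID, AVG(TotalDue) AS Avg_TotalDue
               FROM #SalesOrderHeader
              GROUP BY CustomerID) T
    ON T.CustomerID = H.CustomerID

--===== Two OVER() aggregates: look for extra spools/streams in the plan
SELECT CustomerID,
       TotalDue - AVG(TotalDue) OVER(PARTITION BY CustomerID) AS DifferenceFromAverage,
       TotalDue / SUM(TotalDue) OVER(PARTITION BY CustomerID) * 100.0 AS PercentageOfTotal
  FROM #SalesOrderHeader

SET STATISTICS TIME OFF
SET STATISTICS IO OFF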
_____________
Code for TallyGenerator
October 21, 2007 at 7:55 am
OK, here are some performance numbers from my machine, which has both SQL 2000 and SQL 2005:
Jeff's inner join example took 00:00:40:580 Duration (hh:mi:ss:mmm)
the OVER example, when run from Query Analyzer: 00:00:53:500 Duration (hh:mi:ss:mmm)
the OVER example, when run from SQL Man Studio: 00:00:22:687 Duration (hh:mi:ss:mmm)
If it's not obvious: Management Studio, which uses the SQL Native Client, was twice as fast as QA using SQLOLEDB.
Lowell
October 21, 2007 at 8:24 am
Lowell (10/21/2007)
OK, here are some performance numbers from my machine, which has both SQL 2000 and SQL 2005:
Jeff's inner join example took 00:00:40:580 Duration (hh:mi:ss:mmm)
the OVER example, when run from Query Analyzer: 00:00:53:500 Duration (hh:mi:ss:mmm)
the OVER example, when run from SQL Man Studio: 00:00:22:687 Duration (hh:mi:ss:mmm)
If it's not obvious: Management Studio, which uses the SQL Native Client, was twice as fast as QA using SQLOLEDB.
Thanks, Lowell!
Can you do it twice more, please... I'm concerned about the "display time" getting in the way, so I've modified the scripts to save to a temp table (one for each result set) in both cases. Would you run the whole script (below) just in 2005, please... the reason for just 2005 is that I'm trying to see which method works "better" in 2k5. And the reason I'm asking for it to be run twice, once from QA and once from Management Studio, is the difference in the connection...
Thanks, again... here's the code...
--===== If they already exist, drop the temp tables
IF OBJECT_ID('TempDB..#SalesOrderHeader') IS NOT NULL
DROP TABLE #SalesOrderHeader
IF OBJECT_ID('TempDB..#Test1Results') IS NOT NULL
DROP TABLE #Test1Results
IF OBJECT_ID('TempDB..#Test2Results') IS NOT NULL
DROP TABLE #Test2Results
--===== Create a variable to measure duration with
DECLARE @StartDate DATETIME
--===== Create a million row test table
PRINT 'Creating test table...'
SET @StartDate = GETDATE()
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
CustomerID = CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT),
SalesOrderID = CAST(RAND(CAST(NEWID() AS VARBINARY))*10000000+1 AS INT),
TotalDue = CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY)
INTO #SalesOrderHeader
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE #SalesOrderHeader
ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100
--===== Add an index to the CustomerID/TotalDue columns just for grins
CREATE INDEX tmpSalesOrderHeader_CustomerID_TotalDue
ON #SalesOrderHeader (CustomerID,TotalDue)
PRINT CONVERT(VARCHAR(12),GETDATE()-@StartDate,114) + ' Duration (hh:mi:ss:mmm)'
PRINT REPLICATE('=',78)
--===== Run duration test on INNER JOIN method
PRINT 'INNER JOIN method...'
SET @StartDate = GETDATE()
SELECT H.CustomerID
,H.SalesOrderID
,H.TotalDue
,PercentageOfTotal = H.TotalDue / T.Sum_TotalDue * 100.0
,DifferenceFromAverage = H.TotalDue - T.Avg_TotalDue
INTO #Test1Results
FROM #SalesOrderHeader H
INNER JOIN (SELECT CustomerID, SUM(TotalDue) Sum_TotalDue, AVG(TotalDue) Avg_TotalDue
FROM #SalesOrderHeader
GROUP BY CustomerID) T
ON T.CustomerID = H.CustomerID
-- ORDER BY H.CustomerID
PRINT CONVERT(VARCHAR(12),GETDATE()-@StartDate,114) + ' Duration (hh:mi:ss:mmm)'
PRINT REPLICATE('=',78)
--===== Run duration test on OVER PARTITION method
PRINT 'OverPartition method...'
SELECT CustomerID
,SalesOrderID
,TotalDue
,PercentageOfTotal = TotalDue / Sum(TotalDue) OVER(Partition BY CustomerID) * 100.0
,DifferenceFromAverage = TotalDue - Avg(TotalDue) OVER(Partition BY CustomerID)
INTO #Test2Results
FROM #SalesOrderHeader
-- ORDER BY CustomerID
PRINT CONVERT(VARCHAR(12),GETDATE()-@StartDate,114) + ' Duration (hh:mi:ss:mmm)'
PRINT REPLICATE('=',78)
--Jeff Moden
Change is inevitable... Change for the better is not.
October 21, 2007 at 8:50 am
David McFarland (10/20/2007)
Now I'm dying to ask, and please don't take it personal. What does this have to do with this thread (or even that other thread, if that's what this is about)?
My question was based on Sergiy's post saying that you didn't have a background in Math. It was an out-of-place question and I shouldn't have asked :blush: because it really doesn't matter... this isn't a "math forum" and the only "math" I really care about is which methods are faster :hehe:
I have a special request to make of you two fine warriors (both of whom I have a very high regard for, for different reasons)... please don't turn this thread into another BR thread. I don't care who started it, I don't care whether you guys like each other or not, I don't care what you think of each other. Stop slamming each other. Stick to the point...
... and here's the point...
My contention is that some of the newer tools available in SQL 2k5, much like when cursors were added to 6.5, were put there to make it easier for folks that don't have a strong knowledge of T-SQL and seek other methods to get the job done. I want to find out which parallel methods are "better" using simple duration tests. That's all.
I had no idea that my silly attempts at humor by saying things like...
Will OVER(PARTITION) be added to the list of things you can do if you don't really know T-SQL? "Only the shadow knows"
...or...
CLR's are for people who don't know T-SQL
would bring out the worst in a couple of people I have a modicum of respect for. I apologize and I'll stop that silliness... I'm also asking you two guys (David and Sergiy) to stop attacking each other on this thread, if for no other reason than the fact that someone who respects you both is asking you to quit.
Thanks in advance...
--Jeff Moden
Change is inevitable... Change for the better is not.
October 21, 2007 at 9:12 am
Here are the results of the second pass that Jeff asked for:
SQL2000/QA
Creating test table...
00:00:13:000 Duration (hh:mi:ss:mmm)
======================================
INNER JOIN method...
00:00:03:860 Duration (hh:mi:ss:mmm)
======================================
OverPartition method...
[Not Applicable on SQL2000]
======================================
SQL2005/QA
Creating test table...
00:00:48:953 Duration (hh:mi:ss:mmm)
======================================
INNER JOIN method...
00:00:03:377 Duration (hh:mi:ss:mmm)
======================================
OverPartition method...
00:00:12:470 Duration (hh:mi:ss:mmm)
======================================
SQL2005/SSMS
Creating test table...
00:00:10:203 Duration (hh:mi:ss:mmm)
======================================
INNER JOIN method...
00:00:03:390 Duration (hh:mi:ss:mmm)
======================================
OverPartition method...
00:00:12:250 Duration (hh:mi:ss:mmm)
======================================
Lowell
October 21, 2007 at 10:46 am
Outstanding! And on a Sunday, no less! You ARE the MAN, Lowell!
I don't know about anyone else, but while a lousy 9 second difference on a lousy million rows doesn't seem like much, a batch process that may have hundreds of such 9 second differences per million rows makes it worthwhile for me to spend the tiny bit of extra time to use the Inner Join method instead of the more convenient OVER(PARTITION) method.
A very wise friend of mine once said "A developer must NOT guess... a developer must KNOW!" (Thanks, Sergiy 😉 ) Well, now I know and, when we migrate to 2005 or 2008, I'm going to teach my guys the faster method. I don't ever want to become one of those poor buggers who seem to frequently post "It ran fine yesterday but our 10 minute runs are suddenly taking hours to run and we have a water cooled server with a pod wad of dual-core CPUs and a bazillion bytes of RAM." They didn't think 9 seconds would make much of a difference either... especially when they had a tight schedule to meet. Those problems also seem to occur at the most inconvenient times... month end runs where you have, maybe, a 24 hour window to do the run (think Federal and State required tax runs) and the bad run just ate up 8 hours of that and is still running 😉
My mantra is "Make it work, make it fast, and make it pretty... and it ain't done 'til it's pretty," even in the face of a short schedule, because I know if I don't follow that, I'm gonna pay in spades later on. I've also gotten a lot better at estimates and sometimes saying "No... can't do it right that quickly", which helps a lot! 😛
Heh... what shall we test next? How about RANKING or ROWNUM? :w00t:
... and thanks to all you folks for providing code to test with and running the code that I just don't have the resources to execute.
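To prime the pump on the ranking test, here's a rough sketch of the two parallel methods against the same #SalesOrderHeader test table... fair warning that they aren't exact twins: the 2000-style correlated subquery counts ties the way RANK() does, while ROW_NUMBER() breaks ties...
--===== SQL 2000 style: correlated subquery "rank"
SELECT o.CustomerID, o.TotalDue,
       (SELECT COUNT(*)
          FROM #SalesOrderHeader i
         WHERE i.CustomerID = o.CustomerID
           AND i.TotalDue >= o.TotalDue) AS RankWithinCustomer
  FROM #SalesOrderHeader o

--===== SQL 2005 style: ROW_NUMBER()
SELECT CustomerID, TotalDue,
       ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY TotalDue DESC) AS RankWithinCustomer
  FROM #SalesOrderHeader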
--Jeff Moden
Change is inevitable... Change for the better is not.
October 21, 2007 at 12:13 pm
Here are the results of the first script provided on my testbed machine when it's not doing anything else:
Creating test table...
(1000000 row(s) affected)
00:00:11:250 Duration (hh:mi:ss:mmm)
==============================================================================
INNER JOIN method...
(1000000 row(s) affected)
00:00:30:780 Duration (hh:mi:ss:mmm)
==============================================================================
OverPartition method...
(1000000 row(s) affected)
00:00:59:923 Duration (hh:mi:ss:mmm)
==============================================================================
At first glance, you'd think the OverPartition method took twice as long. First glances aren't always right, and in this case they definitely aren't: both methods took the same amount of time in my test.
Look at the code again, but this time play a game of "Find the missing subtrahend" while doing so. 🙂
I haven't had time to look at your second test yet, but will check it out later, Jeff. By the way, if we want to test the functionality of some of the other stuff in 2005, I'd add things like PIVOT and recursive CTEs to the list. DMVs would also be good candidates for the "test against the equivalent in 2000" tests.
October 21, 2007 at 12:28 pm
Look at the code again, but this time play a game of "Find the missing subtrahend" while doing so.
Thanks for the test, David...
Haven't "studied" the code... what do you mean by "Find the missing subtrahend"... like I said, I don't have 2k5 so I can't run a comparison test on the results and I sure don't wanna guess at this point 😉 If there's something wrong with the code, let's fix it and run another test...:)
--Jeff Moden
Change is inevitable... Change for the better is not.
October 21, 2007 at 12:32 pm
I'd add things like PIVOT
Cool... Once we figure out what's meant by the "missing subtrahend" in the previous code and retest, I'd love to do a test on the very popular PIVOT function... Again, I don't have 2k5, so the only code I'd be able to provide is the 2k method. I'll be very happy to wrap whatever code someone would be kind enough to provide in a duration test harness, including the test data, for someone else to run.
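To get the ball rolling, here's a rough, untested sketch of the two parallel cross-tab methods against a hypothetical #Sales table with an OrderYear column... someone with 2k5 will have to verify the PIVOT half...
--===== SQL 2000 style: cross tab with CASE
SELECT CustomerID,
       SUM(CASE WHEN OrderYear = 2006 THEN TotalDue ELSE 0 END) AS [2006],
       SUM(CASE WHEN OrderYear = 2007 THEN TotalDue ELSE 0 END) AS [2007]
  FROM #Sales
 GROUP BY CustomerID

--===== SQL 2005 style: PIVOT
SELECT CustomerID, [2006], [2007]
  FROM (SELECT CustomerID, OrderYear, TotalDue FROM #Sales) AS S
 PIVOT (SUM(TotalDue) FOR OrderYear IN ([2006], [2007])) AS P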
--Jeff Moden
Change is inevitable... Change for the better is not.
October 21, 2007 at 1:01 pm
Lowell (10/21/2007)
If it's not obvious: Management Studio, which uses the SQL Native Client, was twice as fast as QA using SQLOLEDB.
Do you know that the SQLOLEDB driver contains a cursor in it?
I discovered it by accident when I was updating some big table with heavy triggers on it, setting "Some Date" = GETDATE() via an OPENROWSET remote call (it was not that simple, but that's not important now).
In theory (;)) all updated rows should have the same updated time value.
In practice there were 5 - 30 ms intervals between the GETDATE values for different rows.
And if you think about it, it's quite logical.
The SQLOLEDB driver is just a program passing data from one point to another.
Which language was it written in?
Yeah, right.
So, what should we expect? It can be a row-by-row operation and only a row-by-row operation.
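A minimal sketch of the kind of check that exposed it (the table and column names here are made up, and the OPENROWSET plumbing is left out):
--===== After a single set-based UPDATE, every row should carry the same
--===== GETDATE() value; more than one distinct value means something
--===== processed the rows one at a time
UPDATE dbo.BigTable
   SET SomeDate = GETDATE()

SELECT COUNT(DISTINCT SomeDate) AS DistinctTimestamps
  FROM dbo.BigTable
--===== Run directly, expect 1; through the SQLOLEDB/OPENROWSET path
--===== the values spread out by 5 - 30 ms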
_____________
Code for TallyGenerator
October 21, 2007 at 1:13 pm
Jeff Moden (10/21/2007)
I don't know about anyone else, but while a lousy 9 second difference on a lousy million rows doesn't seem like much, a batch process that may have hundreds of such 9 second differences per million rows makes it worthwhile for me to spend the tiny bit of extra time to use the Inner Join method instead of the convenient method of OVER(PARTITION).
It's not 9 seconds, it's almost 4 times.
It's almost 4 times more time to execute, almost 4 times more server resources taken, almost 4 times less capacity, almost 4 times less scalability.
Good enough, yeah?
_____________
Code for TallyGenerator
October 21, 2007 at 1:14 pm
Jeff Moden (10/21/2007)
My question was based on Sergiy's post saying that you didn't have a background in Math. It was an out-of-place question and I shouldn't have asked :blush: because it really doesn't matter... this isn't a "math forum" and the only "math" I really care about is which methods are faster :hehe:
I care about two things, and which one matters most to me depends primarily on the frequency of the executed code. I'll always opt for the absolutely fastest method possible for just about anything in an OLTP environment, no matter how wonky the code looks. If the clean method runs in 5 milliseconds and the ugly method runs in 4.9, ugly is going to be the way we go. I'll opt for the cleanest, easiest-to-read method if it's a once-a-day process, such as a marting process, unless the other method is significantly faster. In other words, if one process in that environment takes 45 minutes, and a much cleaner version runs in 46 minutes, I'll go with the latter.
As with most things in the SQL world, my answer to the question as to which is "better" is "It depends". I, like many of us here, can determine which route to take almost instinctively in most cases, and in those situations where it's questionable, I test. When it's for a report that will be accessed ten times a day, and the query returns in a second, I don't care if I could have spent an additional hour making it run in half a second. When it's a procedure that will run every time someone so much as breathes on a button on one of our websites, I'll spend as much time and energy optimizing it as possible, until I've either satisfied myself that it's golden or business requires that I move on. Our database grows by hundreds of millions of rows a day, so trust me when I say that I'm very careful to make sure that things that need to run fast do so.
I have a special request to make of you two fine warriors (both of whom I have a very high regard for, for different reasons)... please don't turn this thread into another BR thread. I don't care who started it, I don't care whether you guys like each other or not, I don't care what you think of each other. Stop slamming each other. Stick to the point...
I do care who started it. Personal attacks have no place in a forum such as this (in fact, some of the other boards I mentioned have special forums just for the purpose of airing grievances, to keep them out of the topical forums), and shouldn't be acceptable. In fact, on those boards, they issue warnings and eventually bans when the rules are broken. Maybe we should have such a forum. Then you can stick to factual, evidence-based debates in the appropriate forums, while calling each other a poopy-head in the other forum, and users can choose to avoid the poo-flinging monkeys if they want.
My contention is that some of the newer tools available in SQL 2k5, much like when cursors were added to 6.5, were put there to make it easier for folks that don't have a strong knowledge of T-SQL and seek other methods to get the job done. I want to find out which parallel methods are "better" using simple duration tests. That's all.
I agree that many of the new tools allow you to create code that probably should have been created elsewhere, but I don't think that was the intention. If MS ever states that the CLR stuff is the recommended method, instead of T-SQL, for querying data, I'll slam them hard. I personally think they are there for those situations where they are warranted. A good example on my end is that I needed to implement Blowfish in SQL Server. With the new assemblies, it was pretty easy to do in 2005, so I took that route. Sure, it could have been done as an Extended Stored Proc in 2000, and if you were really masochistic, it could be done in pure T-SQL, but no thanks. I don't think that means that MS intended for assemblies to be used in place of stored procs. It's just an additional tool which comes in handy for some people on some occasions, while others might never find a need for it in their entire careers.
Cursors are the same way. I avoid them like the plague, but I've found them indispensable in a couple of situations. A real-world example was an update statement at a previous contract that took days to complete. Their network infrastructure was so wonky that it kept failing for the person running it, and they weren't allowed to run it directly on the box to avoid that. They called me over for advice, and after a few minutes, I told them to use a cursor to split the updates into smaller batches (we created a table of the keys used to separate the batches, and the cursor was on that, not on the billions of rows in the main table, so we had a few hundred updates, each marked complete as its batch was done, instead of one). When it failed, they just picked up where it left off. Hardly elegant, but they'd spent weeks trying to get it to work, and my method ended up taking just a few hours longer than it would have taken had the simple UPDATE solution worked the first time. A quick 5 minute solution and the problem was solved. Sure, there were other options, but that one worked fine for their situation, and got the job done now, instead of spending even more time testing another method.
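For what it's worth, here's a stripped-down sketch of that batching pattern. The table and column names are made up, not the actual code from that contract:
--===== Key ranges drive the cursor, so a failed run can be restarted
--===== at the first batch not yet marked complete
DECLARE @KeyLo INT, @KeyHi INT
DECLARE cBatches CURSOR LOCAL FAST_FORWARD FOR
 SELECT KeyLo, KeyHi
   FROM dbo.BatchKeys
  WHERE Completed = 0
  ORDER BY KeyLo
OPEN cBatches
FETCH NEXT FROM cBatches INTO @KeyLo, @KeyHi
WHILE @@FETCH_STATUS = 0
 BEGIN
        --===== One small(ish) update instead of one monster update
        UPDATE dbo.BigTable
           SET SomeColumn = 1 --placeholder for the real update
         WHERE KeyColumn BETWEEN @KeyLo AND @KeyHi

        --===== Mark the batch done so a rerun skips it
        UPDATE dbo.BatchKeys
           SET Completed = 1
         WHERE KeyLo = @KeyLo

        FETCH NEXT FROM cBatches INTO @KeyLo, @KeyHi
 END
CLOSE cBatches
DEALLOCATE cBatches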
This is true of many things in SQL Server. Cursors? Avoid them, but don't pretend they don't exist. UDFs? Avoid them, but sometimes they are the best way to accomplish something. CLR? Well, I haven't actually found a situation where I'd use it yet, and I'd strongly warn my devs away from using it, but I'll certainly use it if I ever do find myself in a situation where it's the best bet. I've been contracted in many places where I followed some "Ninja team of SQL best-practice mavens" that put out documents with statements such as "UDFs are not permitted, under any circumstances, in production environments". The poor SQL devs (and these were some pretty sharp devs) were wondering how in the hell they were going to do their jobs under such restrictions (UDFs actually were the ideal solution for a few of their problems), so I had to convince the business to ignore the best-practices documentation that they had just spent a million dollars on. Sometimes, business intelligence is truly an oxymoron.
I had no idea that my silly attempts at humor by saying things like...
Will OVER(PARTITION) be added to the list of things you can do if you don't really know T-SQL? "Only the shadow knows"
...or...
CLR's are for people who don't know T-SQL
would bring out the worst in a couple of people I have a modicum of respect for. I apologize and I'll stop that silliness... I'm also asking you two guys (David and Sergiy) to stop attacking each other on this thread, if for no other reason than the fact that someone who respects you both is asking you to quit.
Why would you think that would remain unquestioned, when it was simply a repeat of the personal attack that, again, has no place in this forum?
October 21, 2007 at 1:16 pm
Jeff Moden (10/21/2007)
Haven't "studied" the code... what do you mean by "Find the missing subtrahend"... like I said, I don't have 2k5 so I can't run a comparison test on the results and I sure don't wanna guess at this point 😉 If there's something wrong with the code, let's fix it and run another test...:)
Your test doesn't reset the start time before running the Over() portion, hence the missing subtrahend. 🙂
October 21, 2007 at 1:23 pm
OMG... what a complete rookie copy'n'paste mistake on my part... :blush: I'm so sorry...
Lowell, here's the code, one more time... would you and David please do the honors? Thanks, guys...
--===== If they already exist, drop the temp tables
IF OBJECT_ID('TempDB..#SalesOrderHeader') IS NOT NULL
DROP TABLE #SalesOrderHeader
IF OBJECT_ID('TempDB..#Test1Results') IS NOT NULL
DROP TABLE #Test1Results
IF OBJECT_ID('TempDB..#Test2Results') IS NOT NULL
DROP TABLE #Test2Results
--===== Create a variable to measure duration with
DECLARE @StartDate DATETIME
--===== Create a million row test table
PRINT 'Creating test table...'
SET @StartDate = GETDATE()
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
CustomerID = CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT),
SalesOrderID = CAST(RAND(CAST(NEWID() AS VARBINARY))*10000000+1 AS INT),
TotalDue = CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY)
INTO #SalesOrderHeader
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE #SalesOrderHeader
ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100
--===== Add an index to the CustomerID/TotalDue columns just for grins
CREATE INDEX tmpSalesOrderHeader_CustomerID_TotalDue
ON #SalesOrderHeader (CustomerID,TotalDue)
PRINT CONVERT(VARCHAR(12),GETDATE()-@StartDate,114) + ' Duration (hh:mi:ss:mmm)'
PRINT REPLICATE('=',78)
--===== Run duration test on INNER JOIN method
PRINT 'INNER JOIN method...'
SET @StartDate = GETDATE()
SELECT H.CustomerID
,H.SalesOrderID
,H.TotalDue
,PercentageOfTotal = H.TotalDue / T.Sum_TotalDue * 100.0
,DifferenceFromAverage = H.TotalDue - T.Avg_TotalDue
INTO #Test1Results
FROM #SalesOrderHeader H
INNER JOIN (SELECT CustomerID, SUM(TotalDue) Sum_TotalDue, AVG(TotalDue) Avg_TotalDue
FROM #SalesOrderHeader
GROUP BY CustomerID) T
ON T.CustomerID = H.CustomerID
-- ORDER BY H.CustomerID
PRINT CONVERT(VARCHAR(12),GETDATE()-@StartDate,114) + ' Duration (hh:mi:ss:mmm)'
PRINT REPLICATE('=',78)
--===== Run duration test on OVER PARTITION method
PRINT 'OverPartition method...'
SET @StartDate = GETDATE()
SELECT CustomerID
,SalesOrderID
,TotalDue
,PercentageOfTotal = TotalDue / Sum(TotalDue) OVER(Partition BY CustomerID) * 100.0
,DifferenceFromAverage = TotalDue - Avg(TotalDue) OVER(Partition BY CustomerID)
INTO #Test2Results
FROM #SalesOrderHeader
-- ORDER BY CustomerID
PRINT CONVERT(VARCHAR(12),GETDATE()-@StartDate,114) + ' Duration (hh:mi:ss:mmm)'
PRINT REPLICATE('=',78)
--Jeff Moden
Change is inevitable... Change for the better is not.