November 26, 2008 at 8:20 am
Adam Machanic (11/26/2008)
I have also seen solutions (and I am still using one, but looking for a way out) that use 'master.sys.All_Columns' as a seed table and generate numbers using row_number().
When I need something quick and dirty I'll use master.dbo.spt_values, which exists all the way back to SQL Server 7.0. It has a column called Number which -- amazingly enough -- contains numbers. Filter on type='P' and you'll get numbers from 0-255 in SQL Server 7.0 and 2000, and from 0-2047 in SQL Server 2005 and 2008.
select number
from master.dbo.spt_values
where type = 'P'
Thanks for your reply Adam,
This trick is certainly one of the options I will inspect further. My demands on a function that is generic, scalable, well-performing and widely deployable are high. I can therefore use every input I can find in threads like this.
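To give an idea of the direction I am thinking in, here is a minimal sketch of a set-based generator built on that same row_number()-over-a-cross-join idea (the name dbo.GetNums is made up for illustration):
CREATE FUNCTION dbo.GetNums (@Count int)
RETURNS TABLE
AS
RETURN
    -- cross join a catalog view with itself to get plenty of rows,
    -- number them, and keep only as many as were asked for
    SELECT TOP (@Count)
           ROW_NUMBER() OVER (ORDER BY c1.object_id) AS Number
    FROM master.sys.all_columns c1
    CROSS JOIN master.sys.all_columns c2
-- example use: SELECT Number FROM dbo.GetNums(10000)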
November 27, 2008 at 9:20 am
jacroberts (11/25/2008)
timothyawiseman (11/24/2008)
>>The problem is that your example did not show a case where efficiency is not a concern.
>>But, in your example you say there is no need to worry about the efficiency of a query that runs with no user waiting, but I disagree.
If you had read to the end of the sentence I wrote you would have seen that efficiency is not a concern in this example. So I repeat:
If you have a daily report that is run in batch, with no user waiting at the end of a button for the result, it doesn't matter if the SQL takes 3.2 seconds instead of 3.1 seconds to run.
Yes, I know what you said, but I respectfully disagree.
Once again, at that point in time, the developer-time required to optimize it may simply be a higher price than it is worth, but it will always matter how long it takes to run.
When I was newer to programming, I thought there were times when it would never matter, so I often didn't worry about it. Then the datasets would grow and small inefficiencies would become magnified until they mattered a great deal.
And even though it's normally a nightly report that runs when no one is around, as the company grows you may find that the time when no one is around begins to shrink quickly. Also, there may be times when management decides they need a refresh with the latest data in the middle of the day.
I have written procedures meant to run in the middle of the night only, but then later had someone rushing up saying they needed it produced now (in my case, it was mostly copying data from production to a reporting server).
And even if you genuinely know for whatever reason that the length of time that the procedure takes to run will actually never matter, remember that humans are creatures of habit. If a programmer is in the habit of writing efficient code, they will do it even when they are not trying to, but if they are not they may find it a very hard thing to do even when they have to.
There are times when other concerns, such as developer-time or robustness with more validation and error correction, outweigh and override the drive for efficiency, but I believe that it is always a concern.
---
Timothy A Wiseman
SQL Blog: http://timothyawiseman.wordpress.com/
November 27, 2008 at 9:31 am
jacroberts (11/25/2008)
Adam Machanic (11/25/2008)
If you have a daily report that is run in batch, with no user waiting at the end of a button for the result, it doesn't matter if the SQL takes 3.2 seconds instead of 3.1 seconds to run.
>>But what if we have not a "user", but lots of "users"? What if we have a batch system that needs to process--and send--a few hundred thousand reports each day?
Anyone can take any example and change it so that performance is important.
Precisely. That is because in reality it is so easy for a situation to change from one where performance does not seem to be important to one where it is vitally important.
---
Timothy A Wiseman
SQL Blog: http://timothyawiseman.wordpress.com/
November 27, 2008 at 9:43 am
timothyawiseman (11/27/2008)
jacroberts (11/25/2008)
timothyawiseman (11/24/2008)
>>The problem is that your example did not show a case where efficiency is not a concern.
>>But, in your example you say there is no need to worry about the efficiency of a query that runs with no user waiting, but I disagree.
If you had read to the end of the sentence I wrote you would have seen that efficiency is not a concern in this example. So I repeat:
If you have a daily report that is run in batch, with no user waiting at the end of a button for the result, it doesn't matter if the SQL takes 3.2 seconds instead of 3.1 seconds to run.
Yes, I know what you said, but I respectfully disagree.
Once again, at that point in time, the developer-time required to optimize it may simply be a higher price than it is worth, but it will always matter how long it takes to run.
When I was newer to programming, I thought there were times when it would never matter, so I often didn't worry about it. Then the datasets would grow and small inefficiencies would become magnified until they mattered a great deal.
And even though it's normally a nightly report that runs when no one is around, as the company grows you may find that the time when no one is around begins to shrink quickly. Also, there may be times when management decides they need a refresh with the latest data in the middle of the day.
I have written procedures meant to run in the middle of the night only, but then later had someone rushing up saying they needed it produced now (in my case, it was mostly copying data from production to a reporting server).
And even if you genuinely know for whatever reason that the length of time that the procedure takes to run will actually never matter, remember that humans are creatures of habit. If a programmer is in the habit of writing efficient code, they will do it even when they are not trying to, but if they are not they may find it a very hard thing to do even when they have to.
There are times when other concerns, such as developer-time or robustness with more validation and error correction, outweigh and override the drive for efficiency, but I believe that it is always a concern.
Ok, I have recently written some code that uses an RBAR TVF to increase the efficiency and speed of a query.
Here's my example:
We had a table of dates on the system called RefDates that reports use in their SQL. The table RefDates contained every day in a 5 year period or approximately 2,000 rows. We had another table of events that had to be reported against; approximately 1 event was generated every minute. The report was generated for a period of 1 month (31 days max). The query was very slow as there were a lot of rows in the RefDates lookup table.
A typical report had the following line:
RIGHT JOIN RefDates C
ON B.Date = C.Date
WHERE C.Date >= @StartDateTime
AND C.Date < @EndDateTime
Instead I put a RBAR table valued function:
CREATE FUNCTION [dbo].[GenRefDates]
(
@StartDate datetime,
@EndDate datetime
)
RETURNS @table TABLE (Date datetime)
AS BEGIN
DECLARE @tmpDate datetime
SET @tmpDate = Convert(varchar, @StartDate, 112)
SET @EndDate = Convert(varchar, @EndDate, 112)
WHILE @tmpDate <= @EndDate
BEGIN
INSERT INTO @table VALUES (@tmpDate)
SET @tmpDate = DateAdd(dd, 1, @tmpDate)
END
RETURN
END
The query was then changed to:
RIGHT JOIN GenRefDates(@StartDateTime, @EndDateTime) C
ON B.Date = C.Date
This changed the run time of the query from about a minute to a few seconds. So this shows that an RBAR table-valued function can be much more efficient than using a lookup ref table. BTW the time taken to generate the TVF with 31 days is a few microseconds and is totally insignificant compared to the rest of the query.
So there is an example where an RBAR TVF is much more efficient than a lookup table.
November 27, 2008 at 9:58 am
jacroberts (11/27/2008)
timothyawiseman (11/27/2008)
jacroberts (11/25/2008)
timothyawiseman (11/24/2008)
>>The problem is that your example did not show a case where efficiency is not a concern.
>>But, in your example you say there is no need to worry about the efficiency of a query that runs with no user waiting, but I disagree.
If you had read to the end of the sentence I wrote you would have seen that efficiency is not a concern in this example. So I repeat:
If you have a daily report that is run in batch, with no user waiting at the end of a button for the result, it doesn't matter if the SQL takes 3.2 seconds instead of 3.1 seconds to run.
Yes, I know what you said, but I respectfully disagree.
Once again, at that point in time, the developer-time required to optimize it may simply be a higher price than it is worth, but it will always matter how long it takes to run.
When I was newer to programming, I thought there were times when it would never matter, so I often didn't worry about it. Then the datasets would grow and small inefficiencies would become magnified until they mattered a great deal.
And even though it's normally a nightly report that runs when no one is around, as the company grows you may find that the time when no one is around begins to shrink quickly. Also, there may be times when management decides they need a refresh with the latest data in the middle of the day.
I have written procedures meant to run in the middle of the night only, but then later had someone rushing up saying they needed it produced now (in my case, it was mostly copying data from production to a reporting server).
And even if you genuinely know for whatever reason that the length of time that the procedure takes to run will actually never matter, remember that humans are creatures of habit. If a programmer is in the habit of writing efficient code, they will do it even when they are not trying to, but if they are not they may find it a very hard thing to do even when they have to.
There are times when other concerns, such as developer-time or robustness with more validation and error correction, outweigh and override the drive for efficiency, but I believe that it is always a concern.
Ok, I have recently written some code that uses an RBAR TVF to increase the efficiency and speed of a query.
Here's my example:
We had a table of dates on the system called RefDates that reports use in their SQL. The table RefDates contained every day in a 5 year period or approximately 2,000 rows. We had another table of events that had to be reported against; approximately 1 event was generated every minute. The report was generated for a period of 1 month (31 days max). The query was very slow as there were a lot of rows in the RefDates lookup table.
A typical report had the following line:
RIGHT JOIN RefDates C
ON B.Date = C.Date
WHERE C.Date >= @StartDateTime
AND C.Date < @EndDateTime
Instead I put a RBAR table valued function:
CREATE FUNCTION [dbo].[GenRefDates]
(
@StartDate datetime,
@EndDate datetime
)
RETURNS @table TABLE (Date datetime)
AS BEGIN
DECLARE @tmpDate datetime
SET @tmpDate = Convert(varchar, @StartDate, 112)
SET @EndDate = Convert(varchar, @EndDate, 112)
WHILE @tmpDate <= @EndDate
BEGIN
INSERT INTO @table VALUES (@tmpDate)
SET @tmpDate = DateAdd(dd, 1, @tmpDate)
END
RETURN
END
The query was then changed to:
RIGHT JOIN GenRefDates(@StartDateTime, @EndDateTime) C
ON B.Date = C.Date
This changed the run time of the query from about a minute to a few seconds. So this shows that an RBAR table-valued function can be much more efficient than using a lookup ref table. BTW the time taken to generate the TVF with 31 days is a few microseconds and is totally insignificant compared to the rest of the query.
So there is an example where an RBAR TVF is much more efficient than a lookup table.
This is true, for a small number of rows. Your RBAR TVF will become inefficient should requirements change and the number of rows it has to generate become quite large. This is what we all have been talking about in this thread. Try it: see how long your RBAR TVF takes to generate 10,000 rows versus the same function written in a set-based manner.
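For comparison, an inline, set-based version of the same date generator might look something like this (just a sketch; the function name and the use of sys.all_columns as a row source are only illustrative):
CREATE FUNCTION dbo.GenRefDatesSet
(
    @StartDate datetime,
    @EndDate datetime
)
RETURNS TABLE
AS
RETURN
    -- one row per day between the two dates (times stripped), produced by
    -- numbering the rows of a cross join instead of looping and inserting
    SELECT TOP (DATEDIFF(dd, @StartDate, @EndDate) + 1)
           DATEADD(dd, ROW_NUMBER() OVER (ORDER BY c1.object_id) - 1,
                   DATEADD(dd, DATEDIFF(dd, 0, @StartDate), 0)) AS Date
    FROM master.sys.all_columns c1
    CROSS JOIN master.sys.all_columns c2
Generating 31 rows or 10,000 rows is then a single set-based statement instead of that many separate inserts into a table variable.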
November 27, 2008 at 10:03 am
This changed the run time of the query from about a minute to a few seconds. So this shows that an RBAR table-valued function can be much more efficient than using a lookup ref table. BTW the time taken to generate the TVF with 31 days is a few microseconds and is totally insignificant compared to the rest of the query.
So there is an example where an RBAR TVF is much more efficient than a lookup table.
We have to take your word for it, as it is impossible to verify that the initial table was set up properly. It would need to be as small as possible and be indexed on the field you match on to be any good. Also, the datatype of the field you match against must be the same as that of the variable, but I will assume you know all that.
If you don't, consider this:
What might seem like a constant to us humans with respect to the query will not be so for the optimizer, as it plans at the batch level and reuses those plans. Thus any local variable or parameter is NOT a constant in your query plan! SQL Server is improving in this respect, but it remains tricky and counter-intuitive. What probably happened in your case is that each row of the input table is first converted before the comparison (check the query plan)! With a TVF you only have to do such a conversion once, and the comparison for each record takes place against your generated table, which has a matching type as output.
RIGHT JOIN RefDates C
ON B.Date = C.Date
WHERE C.Date >= @StartDateTime
AND C.Date < @EndDateTime
Why did you not use something like this in the original query?
RIGHT JOIN RefDates C
ON B.Date = C.Date and C.Date >= @StartDateTime AND C.Date < @EndDateTime
But in all honesty, I simply think you ran out of luck. I had it once with a stored procedure that had a parameterised query. No matter what I did, it was extremely slow compared to a test query with constants I had run before. All types matched and still I had no luck. Matching against parameters or local variables sometimes has these effects (I have only seen it twice, I think, with such a big effect). I did the counter-intuitive thing and created a dynamic SQL statement so all the parameters became constants, and ran that instead, and it was rocking (and ugly)!
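To make that trick concrete, the statement was built roughly like this (a simplified sketch; the table and column names are invented for the example):
DECLARE @StartDateTime datetime, @EndDateTime datetime, @sql nvarchar(max)
SELECT @StartDateTime = '20080501', @EndDateTime = '20080601'
SET @sql = N'SELECT SomeColumn FROM dbo.SomeTable WHERE SomeDate >= '''
         + CONVERT(char(8), @StartDateTime, 112)
         + N''' AND SomeDate < '''
         + CONVERT(char(8), @EndDateTime, 112) + N''''
-- the boundaries are now literals in the statement text, so the plan is
-- built for these exact values instead of for unknown parameters
EXEC (@sql)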
As for RBAR defeating a persistent table, that is a conclusion I would not draw based on your experience. You can of course still use a numbers table to do only one insert within your stored procedure, and it would be faster than RBAR. The true improvement comes from something else, and with some tweaking you will find that to be the case.
Also, not long ago I optimised a complex query that was using a list of numbers that needed splitting. I currently have two versions of the appropriate function: an inline version, and one that creates a small table in tempdb, as it is not inline. Both use a numbers function and both were much faster than the procedural version I once used, but the non-inline version was still faster in this particular query; sometimes you have to break queries down a little to help the optimizer make sound decisions.
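The inline version is essentially the classic numbers-table splitter, along these lines (a bare sketch, assuming a dbo.Numbers table with an int column N that covers the maximum list length):
CREATE FUNCTION dbo.SplitList (@List varchar(8000), @Delim char(1))
RETURNS TABLE
AS
RETURN
    -- each qualifying number marks the start of an item; the item runs up
    -- to (but not including) the next delimiter or the end of the list
    SELECT SUBSTRING(@List, n.N,
                     CHARINDEX(@Delim, @List + @Delim, n.N) - n.N) AS Item
    FROM dbo.Numbers n
    WHERE n.N <= LEN(@List)
      AND SUBSTRING(@Delim + @List, n.N, 1) = @Delim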
November 27, 2008 at 10:41 am
Lynn Pettis (11/27/2008)
jacroberts (11/27/2008)
timothyawiseman (11/27/2008)
jacroberts (11/25/2008)
timothyawiseman (11/24/2008)
This is true, for a small number of rows. Your RBAR TVF will become inefficient should requirements change and the number of rows it has to generate become quite large. This is what we all have been talking about in this thread. Try it: see how long your RBAR TVF takes to generate 10,000 rows versus the same function written in a set-based manner.
But it's not for 10,000 rows; it's for a max of 31 days in a month. Even if it were generating 10,000 rows, which incidentally would take about 2 seconds, it would still be insignificant compared to the run time of the query.
November 27, 2008 at 10:48 am
jacroberts (11/27/2008)
Lynn Pettis (11/27/2008)
jacroberts (11/27/2008)
timothyawiseman (11/27/2008)
jacroberts (11/25/2008)
timothyawiseman (11/24/2008)
This is true, for a small number of rows. Your RBAR TVF will become inefficient should requirements change and the number of rows it has to generate become quite large. This is what we all have been talking about in this thread. Try it: see how long your RBAR TVF takes to generate 10,000 rows versus the same function written in a set-based manner.
But it's not for 10,000 rows; it's for a max of 31 days in a month. Even if it were generating 10,000 rows, which incidentally would take about 2 seconds, it would still be insignificant compared to the run time of the query.
You are missing the major points. One, requirements change, and what is efficient now won't be any longer. And two, someone is going to come along and see your code in your RBAR TVF and use it somewhere else, and it won't be efficient at all.
If you take the time to write solid, efficient, and scalable code all the time, you don't have to worry about those two things occurring as much. That is something I have learned since becoming a member of this awesome site and getting tips and tricks from the likes of Jeff Moden, Gail Shaw, etc.
November 27, 2008 at 11:10 am
Lynn Pettis (11/27/2008)
And two, someone is going to come along and see your code in your RBAR TVF and use it somewhere else, and it won't be efficient at all.
Heh... job security for him... if he keeps planting timebombs like this, the customer is sure to call back for more "professional tuning services". 😛
If you take the time to write solid, efficient, and scalable code all the time, you don't have to worry about those two things occurring as much.
If you compare the code, which takes longer to write? The cross join method or the WHILE loop? Nah... like I said before, it takes no extra time to do it right the first time.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 27, 2008 at 12:40 pm
Lynn Pettis (11/27/2008)
jacroberts (11/27/2008)
Lynn Pettis (11/27/2008)
jacroberts (11/27/2008)
timothyawiseman (11/27/2008)
jacroberts (11/25/2008)
timothyawiseman (11/24/2008)
This is true, for a small number of rows. Your RBAR TVF will become inefficient should requirements change and the number of rows it has to generate become quite large. This is what we all have been talking about in this thread. Try it: see how long your RBAR TVF takes to generate 10,000 rows versus the same function written in a set-based manner.
But it's not for 10,000 rows; it's for a max of 31 days in a month. Even if it were generating 10,000 rows, which incidentally would take about 2 seconds, it would still be insignificant compared to the run time of the query.
You are missing the major points. One, requirements change, and what is efficient now won't be any longer. And two, someone is going to come along and see your code in your RBAR TVF and use it somewhere else, and it won't be efficient at all.
If you take the time to write solid, efficient, and scalable code all the time, you don't have to worry about those two things occurring as much. That is something I have learned since becoming a member of this awesome site and getting tips and tricks from the likes of Jeff Moden, Gail Shaw, etc.
It's horses for courses: there are times when it's good and times when it's bad to use different methods. If you think using a lookup table will always make your queries run faster, then you are wrong. It is quite possible to use a lookup table in a query that runs quickly on a small amount of data; then, when requirements change, or simply with the progression of time, more data is added to the tables or the lookup table grows larger, and the query becomes horrendously slow and needs someone to come in and optimise it. In the example I gave, the number of days reported on never changed from the number of days in a month; what did change was the number of days in the lookup table and the number of rows in the tables being reported on. That is the reason the report slowed down, and using an RBAR TVF fixed that particular problem. I'm not saying that you should use them indiscriminately, but there are times when they are suitable. Not only did the report run faster, but it also removed the need to regularly insert more rows into the RefDates table, as the new TVF would work for any dates, not just the ones in the lookup table.
November 27, 2008 at 8:25 pm
jacroberts (11/27/2008)
Ok, I have recently written some code that uses an RBAR TVF to increase the efficiency and speed of a query. Here's my example:
We had a table of dates on the system called RefDates that reports use in their SQL. The table RefDates contained every day in a 5 year period or approximately 2,000 rows. We had another table of events that had to be reported against; approximately 1 event was generated every minute. The report was generated for a period of 1 month (31 days max). The query was very slow as there were a lot of rows in the RefDates lookup table.
A typical report had the following line:
RIGHT JOIN RefDates C
ON B.Date = C.Date
WHERE C.Date >= @StartDateTime
AND C.Date < @EndDateTime
Instead I put a RBAR table valued function:
CREATE FUNCTION [dbo].[GenRefDates]
(
@StartDate datetime,
@EndDate datetime
)
RETURNS @table TABLE (Date datetime)
AS BEGIN
DECLARE @tmpDate datetime
SET @tmpDate = Convert(varchar, @StartDate, 112)
SET @EndDate = Convert(varchar, @EndDate, 112)
WHILE @tmpDate <= @EndDate
BEGIN
INSERT INTO @table VALUES (@tmpDate)
SET @tmpDate = DateAdd(dd, 1, @tmpDate)
END
RETURN
END
The query was then changed to:
RIGHT JOIN GenRefDates(@StartDateTime, @EndDateTime) C
ON B.Date = C.Date
This changed the run time of the query from about a minute to a few seconds. So this shows that an RBAR table-valued function can be much more efficient than using a lookup ref table. BTW the time taken to generate the TVF with 31 days is a few microseconds and is totally insignificant compared to the rest of the query.
So there is an example where an RBAR TVF is much more efficient than a lookup table.
I'm not sure what other change you may have made to the code, but shifting to your RBAR function instead of using the RefDates table they had wasn't what improved the performance. No form of RBAR will ever beat proper set based code, but let me prove it to you... first, some test code to simulate your table of events... I realize that a million rows constitutes only 1.9012852687655030 years of data according to your specs above, but it should suffice for the test...
USE TempDB
GO
--===== Create and populate a 1,000,000 row test table.
-- Column "EventID" has a range of 1 to 1,000,000 unique numbers
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique dates with midnight times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
EventID = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(ABS(CHECKSUM(NEWID()))%3653+36524 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.Event
FROM Master.sys.All_Columns t1
CROSS JOIN Master.sys.All_Columns t2
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE dbo.Event
ADD PRIMARY KEY CLUSTERED (EventID)
CREATE INDEX IX_Event_SomeDate_SomeInt ON dbo.Event (SomeDate,SomeInt) INCLUDE (SomeMoney)
--===== Delete all the data for 15 May 2008 for testing
DELETE dbo.Event
WHERE SomeDate >= '20080515'
AND SomeDate < '20080516'
We also need a RefDate table... you said the one they had only had about 2000 rows in it and that THAT was the main source of the performance problem... what the heck, let's build one with 11,000 rows just so you don't think that's a problem anymore... 😛
USE TempDB
GO
--===== Create and populate the RefDate table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
DATEADD(dd,ROW_NUMBER() OVER (ORDER BY sc1.ID),'20000101') AS Date
INTO dbo.RefDate
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
ALTER TABLE dbo.RefDate
ALTER COLUMN Date DATETIME NOT NULL
GO
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.RefDate
ADD CONSTRAINT PK_Date_Date
PRIMARY KEY CLUSTERED (Date) WITH FILLFACTOR = 100
GO
--===== Allow the general public to use it
GRANT SELECT ON dbo.RefDate TO PUBLIC
And now, the test... each section of code is "stand-alone" with all its own variables and all... I'm using your RBAR function in the second section...
USE TempDB
GO
--===== Set based method with Date table
PRINT '===== Set based method with Date table ====='
SET STATISTICS TIME ON
DECLARE @MonthYear DATETIME,
@MonthStart DATETIME,
@MonthEnd DATETIME,
@NextMonthStart DATETIME
SELECT @MonthYear = 'May 2008',
@MonthStart = @MonthYear,
@NextMonthStart = DATEADD(mm,1,@MonthStart),
@MonthEnd = @NextMonthStart-1
SET NOCOUNT ON
SELECT d.Date, e.SomeInt, SUM(e.SomeMoney) AS Total
FROM dbo.Event e
RIGHT OUTER JOIN
dbo.RefDate d
ON e.SomeDate = d.Date
WHERE d.Date >= @MonthStart
AND d.Date < @NextMonthStart
GROUP BY d.Date, e.SomeInt
ORDER BY d.Date, e.SomeInt
SET STATISTICS TIME OFF
GO
PRINT REPLICATE('=',100)
GO
--===== Method with RBAR looping function
PRINT '===== Method with RBAR looping function ====='
SET STATISTICS TIME ON
DECLARE @MonthYear DATETIME,
@MonthStart DATETIME,
@MonthEnd DATETIME,
@NextMonthStart DATETIME
SELECT @MonthYear = 'May 2008',
@MonthStart = @MonthYear,
@NextMonthStart = DATEADD(mm,1,@MonthStart),
@MonthEnd = @NextMonthStart-1
SET NOCOUNT ON
SELECT d.Date, e.SomeInt, SUM(e.SomeMoney) AS Total
FROM dbo.Event e
RIGHT OUTER JOIN
dbo.GenRefDates(@MonthStart,@MonthEnd) d
ON e.SomeDate = d.Date
GROUP BY d.Date, e.SomeInt
ORDER BY d.Date, e.SomeInt
SET STATISTICS TIME OFF
GO
-------------------------------------------------------------
GO
I don't know about you, but on my humble 6 year old, single 1.8 Ghz CPU, 1GB ram, IDE hard drive system running SQL Server 2005 Developer's Edition sp2, here's what I get for runtimes...
===== Set based method with Date table =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 390 ms, elapsed time = 930 ms.
====================================================================================================
===== Method with RBAR looping function =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 19078 ms, elapsed time = 20878 ms.
Like I said, I don't know what else you changed in the report code or what condition their RefDate table was in or whatever other improper thing they may have had going on, but your RBAR function will never be faster than proper set based code and neither will any other form of RBAR.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 28, 2008 at 4:36 am
Jeff Moden (11/27/2008)
I don't know about you, but on my humble 6 year old, single 1.8 Ghz CPU, 1GB ram, IDE hard drive system running SQL Server 2005 Developer's Edition sp2, here's what I get for runtimes...
===== Set based method with Date table =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 390 ms, elapsed time = 930 ms.
====================================================================================================
===== Method with RBAR looping function =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 19078 ms, elapsed time = 20878 ms.
Like I said, I don't know what else you changed in the report code or what condition their RefDate table was in or whatever other improper thing they may have had going on, but your RBAR function will never be faster than proper set based code and neither will any other form of RBAR.
Considering that for me just writing a post (let alone a focused one) takes a lot of time, reading all the work you have put into your argument makes me feel "a bit" humble :). Did you even sleep?
Either way, you made a strong case here that something else must have been wrong in the original solution, be it the queries themselves, the modelling/indexing, or even out-of-date statistics.
November 28, 2008 at 5:51 am
Jeff Moden (11/27/2008)
jacroberts (11/27/2008)
Ok, I have recently written some code that uses an RBAR TVF to increase the efficiency and speed of a query. Here's my example:
We had a table of dates on the system called RefDates that reports use in their SQL. The table RefDates contained every day in a 5 year period or approximately 2,000 rows. We had another table of events that had to be reported against; approximately 1 event was generated every minute. The report was generated for a period of 1 month (31 days max). The query was very slow as there were a lot of rows in the RefDates lookup table.
A typical report had the following line:
RIGHT JOIN RefDates C
ON B.Date = C.Date
WHERE C.Date >= @StartDateTime
AND C.Date < @EndDateTime
Instead I put a RBAR table valued function:
CREATE FUNCTION [dbo].[GenRefDates]
(
@StartDate datetime,
@EndDate datetime
)
RETURNS @table TABLE (Date datetime)
AS BEGIN
DECLARE @tmpDate datetime
SET @tmpDate = Convert(varchar, @StartDate, 112)
SET @EndDate = Convert(varchar, @EndDate, 112)
WHILE @tmpDate <= @EndDate
BEGIN
INSERT INTO @table VALUES (@tmpDate)
SET @tmpDate = DateAdd(dd, 1, @tmpDate)
END
RETURN
END
The query was then changed to:
RIGHT JOIN GenRefDates(@StartDateTime, @EndDateTime) C
ON B.Date = C.Date
This changed the run time of the query from about a minute to a few seconds. So this shows that an RBAR table-valued function can be much more efficient than using a lookup ref table. BTW the time taken to generate the TVF with 31 days is a few microseconds and is totally insignificant compared to the rest of the query.
So there is an example where an RBAR TVF is much more efficient than a lookup table.
I'm not sure what other change you may have made to the code, but shifting to your RBAR function instead of using the RefDates table they had wasn't what improved the performance. No form of RBAR will ever beat proper set based code, but let me prove it to you... first, some test code to simulate your table of events... I realize that a million rows constitutes only 1.9012852687655030 years of data according to your specs above, but it should suffice for the test...
USE TempDB
GO
--===== Create and populate a 1,000,000 row test table.
-- Column "EventID" has a range of 1 to 1,000,000 unique numbers
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique dates with midnight times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
EventID = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(ABS(CHECKSUM(NEWID()))%3653+36524 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.Event
FROM Master.sys.All_Columns t1
CROSS JOIN Master.sys.All_Columns t2
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE dbo.Event
ADD PRIMARY KEY CLUSTERED (EventID)
CREATE INDEX IX_Event_SomeDate_SomeInt ON dbo.Event (SomeDate,SomeInt) INCLUDE (SomeMoney)
--===== Delete all the data for 15 May 2008 for testing
DELETE dbo.Event
WHERE SomeDate >= '20080515'
AND SomeDate < '20080516'
We also need a RefDate table... you said the one they had only had about 2000 rows in it and that THAT was the main source of the performance problem... what the heck, let's build one with 11,000 rows just so you don't think that's a problem anymore... 😛
USE TempDB
GO
--===== Create and populate the RefDate table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
DATEADD(dd,ROW_NUMBER() OVER (ORDER BY sc1.ID),'20000101') AS Date
INTO dbo.RefDate
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
ALTER TABLE dbo.RefDate
ALTER COLUMN Date DATETIME NOT NULL
GO
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.RefDate
ADD CONSTRAINT PK_Date_Date
PRIMARY KEY CLUSTERED (Date) WITH FILLFACTOR = 100
GO
--===== Allow the general public to use it
GRANT SELECT ON dbo.RefDate TO PUBLIC
And now, the test... each section of code is "stand-alone" with all its own variables and all... I'm using your RBAR function in the second section...
USE TempDB
GO
--===== Set based method with Date table
PRINT '===== Set based method with Date table ====='
SET STATISTICS TIME ON
DECLARE @MonthYear DATETIME,
@MonthStart DATETIME,
@MonthEnd DATETIME,
@NextMonthStart DATETIME
SELECT @MonthYear = 'May 2008',
@MonthStart = @MonthYear,
@NextMonthStart = DATEADD(mm,1,@MonthStart),
@MonthEnd = @NextMonthStart-1
SET NOCOUNT ON
SELECT d.Date, e.SomeInt, SUM(e.SomeMoney) AS Total
FROM dbo.Event e
RIGHT OUTER JOIN
dbo.RefDate d
ON e.SomeDate = d.Date
WHERE d.Date >= @MonthStart
AND d.Date < @NextMonthStart
GROUP BY d.Date, e.SomeInt
ORDER BY d.Date, e.SomeInt
SET STATISTICS TIME OFF
GO
PRINT REPLICATE('=',100)
GO
--===== Method with RBAR looping function
PRINT '===== Method with RBAR looping function ====='
SET STATISTICS TIME ON
DECLARE @MonthYear DATETIME,
@MonthStart DATETIME,
@MonthEnd DATETIME,
@NextMonthStart DATETIME
SELECT @MonthYear = 'May 2008',
@MonthStart = @MonthYear,
@NextMonthStart = DATEADD(mm,1,@MonthStart),
@MonthEnd = @NextMonthStart-1
SET NOCOUNT ON
SELECT d.Date, e.SomeInt, SUM(e.SomeMoney) AS Total
FROM dbo.Event e
RIGHT OUTER JOIN
dbo.GenRefDates(@MonthStart,@MonthEnd) d
ON e.SomeDate = d.Date
GROUP BY d.Date, e.SomeInt
ORDER BY d.Date, e.SomeInt
SET STATISTICS TIME OFF
GO
-------------------------------------------------------------
GO
I don't know about you, but on my humble 6 year old, single 1.8 Ghz CPU, 1GB ram, IDE hard drive system running SQL Server 2005 Developer's Edition sp2, here's what I get for runtimes...
===== Set based method with Date table =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 390 ms, elapsed time = 930 ms.
====================================================================================================
===== Method with RBAR looping function =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 19078 ms, elapsed time = 20878 ms.
Like I said, I don't know what else you changed in the report code or what condition their RefDate table was in or whatever other improper thing they may have had going on, but your RBAR function will never be faster than proper set based code and neither will any other form of RBAR.
Thank you for going to all that effort, but your query shows nothing about the speed of an RBAR TVF compared to an inline approach. It is different from my query, as there is a complicated join in mine.
But my point is that if your queries are intended to show the difference between the two methods, they should be equivalent in every other way, like this:
--===== Set based method with Date table
PRINT '===== Set based method with Date table ====='
SET STATISTICS TIME ON
DECLARE @MonthYear DATETIME,
@MonthStart DATETIME,
@MonthEnd DATETIME,
@NextMonthStart DATETIME
SELECT @MonthYear = 'May 2008',
@MonthStart = @MonthYear,
@NextMonthStart = DATEADD(mm,1,@MonthStart),
@MonthEnd = @NextMonthStart-1
SET NOCOUNT ON
SELECT d.Date, e.SomeInt, SUM(e.SomeMoney) AS Total
FROM dbo.Event e
RIGHT OUTER JOIN
dbo.RefDates d
ON e.SomeDate = d.Date
WHERE d.Date >= @MonthStart
AND d.Date < @NextMonthStart
GROUP BY d.Date, e.SomeInt
ORDER BY d.Date, e.SomeInt
SET STATISTICS TIME OFF
GO
PRINT REPLICATE('=',100)
GO
--===== Method with RBAR looping function
PRINT '===== Method with RBAR looping function ====='
SET STATISTICS TIME ON
DECLARE @MonthYear DATETIME,
@MonthStart DATETIME,
@MonthEnd DATETIME,
@NextMonthStart DATETIME
SELECT @MonthYear = 'May 2008',
@MonthStart = @MonthYear,
@NextMonthStart = DATEADD(mm,1,@MonthStart),
@MonthEnd = @NextMonthStart-1
SET NOCOUNT ON
SELECT d.Date, e.SomeInt, SUM(e.SomeMoney) AS Total
FROM dbo.Event e
RIGHT OUTER JOIN
dbo.GenRefDates(@MonthStart,@MonthEnd) d
ON e.SomeDate = d.Date
WHERE d.Date >= @MonthStart
AND d.Date < @NextMonthStart
GROUP BY d.Date, e.SomeInt
ORDER BY d.Date, e.SomeInt
SET STATISTICS TIME OFF
GO
I've tested it on my machine and the results were:
===== Set based method with Date table =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 312 ms, elapsed time = 417 ms.
====================================================================================================
===== Method with RBAR looping function =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 299 ms, elapsed time = 376 ms.
The RBAR was actually quicker.
The reason for the difference you found is nothing to do with the RBAR function but rather the difference in the WHERE clause:
RIGHT OUTER JOIN
dbo.RefDate d
ON e.SomeDate = d.Date
WHERE d.Date >= @MonthStart
AND d.Date < @NextMonthStart
vs.
RIGHT OUTER JOIN
dbo.GenRefDates(@MonthStart,@MonthEnd) d
ON e.SomeDate = d.Date
I'm really sorry to say that your example doesn't stand up to even the quickest analysis.
The performance difference you are measuring is due to a completely different matter.
November 28, 2008 at 8:06 am
I'm not sure what's going on... I ran the code you modified and here's what I got...
===== Set based method with Date table =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 375 ms, elapsed time = 931 ms.
====================================================================================================
===== Method with RBAR looping function =====
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 10531 ms, elapsed time = 11474 ms.
In other words, on my machine, your changes helped the RBAR solution a bit, but the set based solution still ate it up.
I also noticed that you changed my table reference from RefDate to RefDates, which means you didn't use the same demo table as I did, which, of course, means another difference that I can't see. This is going to be a tough one to resolve.
It would be interesting to see what other folks get as a result from "our" test code.
Anyone care to assist?
--Jeff Moden
Change is inevitable... Change for the better is not.
November 28, 2008 at 8:10 am
Joe Celko (11/28/2008)
Let me make an appeal to simple logic. You build a look-up table once (and if you did a proper calendar table, you also benefit from other uses). It probably has a clustered index or key to take advantage of the sequential nature of time. It probably has other constraints to ensure data integrity.
You have a table-valued function that creates a new table every time, without an index or key. And certainly without constraints for the optimizer to use.
There are (n) sessions in the DB. The Calendar table is cached in main storage -- 20, 50 or even 100 years is small. The function is re-computed over and over.
So before you get to the query, the look-up table has won on both speed and integrity.
There is more than one way to skin a cat.