January 31, 2008 at 9:02 am
I used
CREATE CLUSTERED INDEX IX_Yak ON TestData (StaffName, ReviewDate)
Please reread my previous post with timings and execution plans involved.
My suggestion only needs 32 reads whereas Sergiy needs 4385 reads for doing the same job (137 times as much).
My suggestion averages at 2 ms whereas Sergiy averages at 395 ms (198 times as much).
And you still have two more things to do with Sergiy's suggestion.
1) Format the dates (could be done at crystal or reportings services report design)
2) Decide what to put in the columns where NULL is present (could be done at crystal or reporting services report design)
N 56°04'39.16"
E 12°55'05.25"
January 31, 2008 at 9:29 am
Interesting.
I am going to modify your suggestion as required for my needs and run it on production data against my modified Sergiy method and see what happens.
Null fields and Date formatting I'm not concerned with handled in Crystal Report.
The actual data returned is a concern though. When I run the code from your original post I get the following results:
[font="Courier New"]Blow, Joe 2006-03-23 2007-03-13 2008-01-24
Doe, Jane 2006-05-16 2007-03-12 2008-01-22
Dunno, I 2006-12-10 2007-07-05
2006-07-18 2007-06-28
2006-11-19 2007-06-11
2007-12-18
2007-12-21
2007-12-16 [/font]
Which makes it appear as if Joe Blow and Jane Doe have only 3 reviews each, and the remainder were I Dunno's. I'll check your procedure to see if can modify accordingly though.
January 31, 2008 at 10:50 am
It appears I may need to modify my production code *sigh* thanks to Peter...
After modifying his original post for SQL2000 method to sort properly it works great!
Here's the modification:
[font="Courier New"]Declare @Y0 int; Set @Y0 = Year(GetDate())
Declare @Y1 int; Set @Y1 = @Y0-1
Declare @Y2 int; Set @Y2 = @Y0-2
Declare @MinYear varchar(12); Set @MinYear = Convert(varchar,@Y0-2)+'-01-01'
Declare @MaxYear varchar(12); Set @MaxYear = Convert(varchar,@Y0+1)+'-01-01'
Select Case When t3.theIndex=0 Then t3.StaffName Else '' End As StaffName,
t3.[Prior_Year_2], t3.[Prior_Year_1], t3.[Current_Year]
From (
Select Top 100 Percent
u.theIndex,
u.StaffName,
Max(Case When Year(u.ReviewDate)=@Y2 Then u.ReviewDate Else '' End) As [Prior_Year_2],
Max(Case When Year(u.ReviewDate)=@Y1 Then u.ReviewDate Else '' End) As [Prior_Year_1],
Max(Case When Year(u.ReviewDate)=@Y0 Then u.ReviewDate Else '' End) As [Current_Year]
FROM (
SELECT t1.StaffName,
CONVERT(CHAR(10), t1.ReviewDate, 120) AS ReviewDate,
( SELECT COUNT(ReviewDate)
FROM #TestData AS t2
WHERE t2.StaffName = t1.StaffName
AND YEAR(t2.ReviewDate) = YEAR(t1.ReviewDate)
AND t2.ReviewDate < t1.ReviewDate
AND t2.ReviewDate >= @MinYear
AND t2.ReviewDate < @MaxYear) As theIndex
FROM #TestData AS t1
) AS u
Group By u.StaffName, u.theIndex
Order By u.StaffName, u.theIndex
) As t3
Go[/font]
January 31, 2008 at 12:32 pm
Mike Nuessler (1/31/2008)
Interesting.I am going to modify your suggestion as required for my needs and run it on production data against my modified Sergiy method and see what happens.
When you tested, did you notice a significant difference between them?
N 56°04'39.16"
E 12°55'05.25"
January 31, 2008 at 12:42 pm
Peter Larsson (1/31/2008)
That TABLE SPOOL looks nasty to me. And 10 clustered index scans?
I did not have TABLE SPOOL in my test runs.
Are you sure you copied my code as it was?
_____________
Code for TallyGenerator
January 31, 2008 at 12:59 pm
Peter Larsson (1/31/2008)
When you tested, did you notice a significant difference between them?
I just created a stored procedure for Peter's method and executed the two of them.
Sergiy's proc ran first, and the Query cost relative to the batch was 97.3%
Peter's proc's cost relative to the batch was obviously the other 2.7%
With the ridiculously small amount of data this is being run on either would work for my purposes.
(for that matter so does my original While..Continue temp table methods)
However from what I'm gathering here from you guys who know more about this than me
and the staggering differences in query costs (97% to 3%) it seems I should go with the
one from Peter.
That and there's no need for tally table or extra function, or the need to increase the Tally
table join condition (which really would "never" happen in this particular environment but still)
I see the table spool as well. Not a clue what that is though I need to do some reading..!
January 31, 2008 at 1:25 pm
Mike Nuessler (1/31/2008)I see the table spool as well.
Did you copy HASH JOIN?
_____________
Code for TallyGenerator
January 31, 2008 at 1:49 pm
Sergiy (1/31/2008)
Mike Nuessler (1/31/2008)I see the table spool as well.
Did you copy HASH JOIN?
I just tried it again. Copied straight from your post, pasted into QA, and ran it.
Got one Table Spool. Out of curiousity I removed all "HASH" and ran it again and got 4 Table Spools. What it means - couldn't tell you.
Either way I have gained a few tricks from everything posted and I think the end result was a bit from everyone. I still have to read up on Table Spools (links anyone?) which are apparently evil, and go over a bunch of older code to try out some of Carl's, Sergiy's, and Peter's methods.
{edit - sorry, missed Gova - still like that little trick!}
January 31, 2008 at 1:50 pm
Well - while the bases seem pretty well covered, so just remember this - costs aren't always what they're cracked up to be. They represent just one component for deciding what's "best". At least in my mind - that "cost" is usually an indicator of how much CPU effort it might take, although the reads tends to balance that out somewhat. Also - certain things (like DBCC system maintenance functions) are completely not counted in the costs, so they're not always telling the truth, or at least - not the whole truth.
Another way to get a different read on that is to turn on IO statistics, which will show you metrics on the IO resources being used.
finally - another side would be pure duration.
depending on what your server is lacking most at the time, it may be better to go with a more "costly" query that executes faster, or one with somewhat higher IO. So - a low cost, high IO solution would stink if the drives are getting hammered by something else: a high-cost scenario might suck if the processor is busy all of the time, etc...
In short - "best" is in the eye of the DBA. Your mileage may vary.
That being said - both solutions you've been given look to be plenty viable...
(you didn't expect this would an easy answer, did you?)
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 31, 2008 at 2:36 pm
Mike Nuessler (1/31/2008)With the ridiculously small amount of data this is being run on either would work for my purposes.
(for that matter so does my original While..Continue temp table methods)
However from what I'm gathering here from you guys who know more about this than me
and the staggering differences in query costs (97% to 3%) it seems I should go with the
one from Peter.
Key word is ridiculously small amount of data.
My script is designed to be scalable up to 5k entries per person per year.
Performance for hundreds reviews will be about the same as for 5.
Peter's query builds "tally table" on fly and limits it to the actual number of reviews for the Staff Member. So, it's better for "ridiculously small amount of data".
But problem with this approach is it uses correlated subquery.
They name it "hidden cursor".
As for any cursor performance is quite OK on beginning stage, when there are up to hundreds of rows.
But expense of such query increases exponentially with increasing number of rows.
Fortunately for those contractors they are far away from the company by that time (problems appear usually after 2-3 years) and it's not their problem. Current DBA is the one to blame for not keeping performance of good working system on required level.
Yes, I agree, for small amounts those 5000 spare lines look like some kind of wasting resources.
But I use to have this Tally table pinned in memory, so no actual reads happened here.
And if you keen to remove those extra lines you may use this:
[Code]DECLARE @N int
SELECT @N = MAX (CNT)
FROM (select COUNT(*) CNT
FROM dbo.tTestData
GROUP BY StaffName
) DT
select CASE WHEN T.N = 1 THEN S.StaffName ELSE '' END,
T.N, Y2006.ReviewDate [2006_Dates], Y2007.ReviewDate [2007_Dates], Y2008.ReviewDate [2008_Dates]
from dbo.tTestData S
INNER JOIN dbo.Tally T ON T.N > 0 AND T.N < @N
-- Now number of rows retrieved from Tally table is limited to the total number of reviews for every staff member
LEFT HASH JOIN dbo.tReviewsOfYear ('2006-01-01') Y2006 ON S.StaffName = Y2006.StaffName AND T.N = Y2006.ReviewNo
LEFT HASH JOIN dbo.tReviewsOfYear ('2007-01-01') Y2007 ON S.StaffName = Y2007.StaffName AND T.N = Y2007.ReviewNo
LEFT HASH JOIN dbo.tReviewsOfYear ('2008-01-01') Y2008 ON S.StaffName = Y2008.StaffName AND T.N = Y2008.ReviewNo
WHERE Y2006.ReviewNo IS NOT NULL OR Y2007.ReviewNo IS NOT NULL OR Y2008.ReviewNo IS NOT NULL
GROUP BY S.StaffName, T.N, Y2006.ReviewDate , Y2007.ReviewDate , Y2008.ReviewDate
ORDER BY S.StaffName, T.N
[/Code]
P.S. And I checked execution plan.
When I tested my script it was limitation to 1000 entries, not 5000. I did not check it for 5k thoroughly, and it appears there are some hidden reasons why table spool happens for 5000 lines and does not happen for 1000.
_____________
Code for TallyGenerator
January 31, 2008 at 3:06 pm
Here is another approach with no ridiculus tally table pinned into memory, where more vital information can be stored and used.
With the original sample data, there is now only 67 reads. It is not as good as 32 reads but still much better than 4385 reads.
My original solution has a hidden triangular join (which Sergiy writes), which makes the algorithm equal to n*n + n internal iterations.
This algorithm has 3*n internal iterations. It's quite better for a large number of sample data.
I tested with 365 records per Staffname per year, 100 StaffNames per year and 20 years in total.
It works VERY nice!
Sergiy's solution is somewhat better than my original. It has n * m internal iterations.
m is maximum number of entries for a staff and n is number of staff. So both his suggestion and my original tends to be power of 2 internal iterations.
But this is about linear, 3 * n.
And it is about 5 * n if you want to include all staff even if they have no reviews in the wanted date range.
CREATE PROCEDURE dbo.uspGetStatistics
(
@BaseYear SMALLINT = NULL,
@IncludeAllStaff BIT = 0
)
AS
SET NOCOUNT ON
DECLARE@MinDate SMALLDATETIME,
@MaxDate SMALLDATETIME
IF @BaseYear IS NULL
SET @BaseYear = DATEPART(YEAR, GETDATE())
SELECT@MaxDate = DATEADD(YEAR, @BaseYear - 1899, '19000101'),
@MinDate = DATEADD(YEAR, -3, @MaxDate)
CREATE TABLE#Stage
(
RowID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
StaffName VARCHAR(40) NOT NULL,
ReviewDate VARCHAR(10) NOT NULL
)
INSERT#Stage
(
StaffName,
ReviewDate
)
SELECT DISTINCTStaffName,
CONVERT(CHAR(10), ReviewDate, 120)
FROMTestData
WHEREReviewDate >= @MinDate
AND ReviewDate < @MaxDate
ORDER BYStaffName,
CONVERT(CHAR(10), ReviewDate, 120) DESC
IF @IncludeAllStaff = 1
INSERT#Stage
(
StaffName,
ReviewDate
)
SELECT DISTINCTt.StaffName,
''
FROMTestData AS t
LEFT JOIN#Stage AS s ON s.StaffName = t.StaffName
WHEREs.StaffName IS NULL
SELECTCASE WHEN u.theIndex = 0 THEN u.StafName ELSE '' END AS StaffName,
MAX(CASE WHEN u.theYear = @BaseYear - 2 THEN u.ReviewDate ELSE '' END) AS y2,
MAX(CASE WHEN u.theYear = @BaseYear - 1 THEN u.ReviewDate ELSE '' END) AS y1,
MAX(CASE WHEN u.theYear = @BaseYear THEN u.ReviewDate ELSE '' END) AS y0
FROM(
SELECTs.StaffName AS StafName,
m.theYear,
s.ReviewDate,
s.RowID - m.minRowID AS theIndex
FROM#Stage AS s
INNER JOIN(
SELECTStaffName,
LEFT(ReviewDate, 4) AS theYear,
MIN(RowID) AS minRowID
FROM#Stage
GROUP BYStaffName,
LEFT(ReviewDate, 4)
) AS m ON m.StaffName = s.StaffName
WHEREs.ReviewDate LIKE m.theYear + '%'
) AS u
GROUP BYu.StafName,
u.theIndex
ORDER BYu.StafName,
u.theIndex
DROP TABLE#Stage
N 56°04'39.16"
E 12°55'05.25"
January 31, 2008 at 4:11 pm
Peter, my solution has 3 internal iterations: one per year.
It does not iterate neither over staff members, nor over reviews.
It builds 3 temp tables for each year and joins them together.
_____________
Code for TallyGenerator
January 31, 2008 at 9:15 pm
These are my favourite kind of discussion. I'd be happy to say that 99% of SQL Server users probably couldn't have come up with either approach. The use of a tally table opens all kinds of possibilities and it often pays to consider it to solve this sort of problem. Having said that the other solution presented also looks elegant. If I had the time I'd try to do some benchmarking and do a thorough comparison of the approaches - but I don't. I'll certainly keep an eye on the thread though! 😀
It's unfortunate that you can't get your reporting tool to do some of the heavy lifting to lay out the data? Had you considered making the report side of things slightly more complex to allow for easy SQL? As it stands now though you'd be crazy to make the report more complex given the nice SQL output you're now getting, but for a more complex problem in future sometimes it's easier/faster to fiddle with the reporting tool rather than getting SQL to do some of the formatting.... my 2c
January 31, 2008 at 9:15 pm
And Mike, don't trust much those "Query cost" numbers in QA.
Every time verify it with simple but 100% reliable check:
[Code]
DECLARE @Time datetime
SET @Time = GETDATE()
Set of queries No.1 here
PRINT CONVERT(varchar(20), GETDATE() - @Time, 114)
SET @Time = GETDATE()
Set of queries No.2 here
PRINT CONVERT(varchar(20), GETDATE() - @Time, 114)
[/Code]
I ran my code in line with Peter's code.
In Execution plan my query took 99.69% of total cost.
Guess what time meter indicated?
00:00:00:093 - Sergiy's
vs.
00:00:00:157 - Peter's
Here is the code I executed (just in case you can see a mistake I made somewhere):
[Code]
DBCC FREEPROCCACHE
DECLARE @Time datetime
SET @Time = GETDATE()
DECLARE @N int
SELECT @N = MAX (CNT)
FROM (select COUNT(*) CNT
FROM dbo.tTestData
GROUP BY StaffName
) DT
select CASE WHEN T.N = 1 THEN S.StaffName ELSE '' END,
T.N, Y2006.ReviewDate [2006_Dates], Y2007.ReviewDate [2007_Dates], Y2008.ReviewDate [2008_Dates]
from dbo.tTestData S
INNER JOIN dbo.Tally T ON T.N > 0 AND T.N < @N
-- If you expect more than 5000 reviews per year per customer increase this number
LEFT HASH JOIN dbo.tReviewsOfYear ('2006-11-01') Y2006 ON S.StaffName = Y2006.StaffName AND T.N = Y2006.ReviewNo
LEFT HASH JOIN dbo.tReviewsOfYear ('2007-01-01') Y2007 ON S.StaffName = Y2007.StaffName AND T.N = Y2007.ReviewNo
LEFT HASH JOIN dbo.tReviewsOfYear ('2008-01-01') Y2008 ON S.StaffName = Y2008.StaffName AND T.N = Y2008.ReviewNo
WHERE Y2006.ReviewNo IS NOT NULL OR Y2007.ReviewNo IS NOT NULL OR Y2008.ReviewNo IS NOT NULL
GROUP BY S.StaffName, T.N, Y2006.ReviewDate , Y2007.ReviewDate , Y2008.ReviewDate
ORDER BY S.StaffName, T.N
PRINT CONVERT(varchar(20), GETDATE() - @Time, 114)
SET @Time = GETDATE()
-- CREATE PROCEDURE dbo.uspGetStatistics
-- (
-- @BaseYear SMALLINT = NULL,
-- @IncludeAllStaff BIT = 0
-- )
-- AS
--
-- SET NOCOUNT ON
DECLARE @BaseYear SMALLINT , @IncludeAllStaff BIT
DECLARE@MinDate SMALLDATETIME,
@MaxDate SMALLDATETIME
IF @BaseYear IS NULL
SET @BaseYear = DATEPART(YEAR, GETDATE())
SELECT@MaxDate = DATEADD(YEAR, @BaseYear - 1899, '19000101'),
@MinDate = DATEADD(YEAR, -3, @MaxDate)
CREATE TABLE#Stage
(
RowID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
StaffName VARCHAR(40) NOT NULL,
ReviewDate VARCHAR(10) NOT NULL
)
INSERT#Stage
(
StaffName,
ReviewDate
)
SELECT DISTINCTStaffName,
CONVERT(CHAR(10), ReviewDate, 120)
FROMtTestData
WHEREReviewDate >= @MinDate
AND ReviewDate < @MaxDate
ORDER BYStaffName,
CONVERT(CHAR(10), ReviewDate, 120) DESC
IF @IncludeAllStaff = 1
INSERT#Stage
(
StaffName,
ReviewDate
)
SELECT DISTINCTt.StaffName,
''
FROMtTestData AS t
LEFT JOIN#Stage AS s ON s.StaffName = t.StaffName
WHEREs.StaffName IS NULL
SELECTCASE WHEN u.theIndex = 0 THEN u.StafName ELSE '' END AS StaffName,
MAX(CASE WHEN u.theYear = @BaseYear - 2 THEN u.ReviewDate ELSE '' END) AS y2,
MAX(CASE WHEN u.theYear = @BaseYear - 1 THEN u.ReviewDate ELSE '' END) AS y1,
MAX(CASE WHEN u.theYear = @BaseYear THEN u.ReviewDate ELSE '' END) AS y0
FROM(
SELECTs.StaffName AS StafName,
m.theYear,
s.ReviewDate,
s.RowID - m.minRowID AS theIndex
FROM#Stage AS s
INNER JOIN(
SELECTStaffName,
LEFT(ReviewDate, 4) AS theYear,
MIN(RowID) AS minRowID
FROM#Stage
GROUP BYStaffName,
LEFT(ReviewDate, 4)
) AS m ON m.StaffName = s.StaffName
WHEREs.ReviewDate LIKE m.theYear + '%'
) AS u
GROUP BYu.StafName,
u.theIndex
ORDER BYu.StafName,
u.theIndex
DROP TABLE#Stage
PRINT CONVERT(varchar(20), GETDATE() - @Time, 114)
SET @Time = GETDATE()
[/Code]
_____________
Code for TallyGenerator
February 1, 2008 at 12:15 am
Sergiy (1/31/2008)
Peter, my solution has 3 internal iterations: one per year.It does not iterate neither over staff members, nor over reviews.
It builds 3 temp tables for each year and joins them together.
from dbo.tTestData SINNER JOIN dbo.Tally T ON T.N > 0 AND T.N < @N
N 56°04'39.16"
E 12°55'05.25"
Viewing 15 posts - 16 through 30 (of 38 total)
You must be logged in to reply to this topic. Login to reply