February 1, 2008 at 12:56 am
Here is how I both created the sample data and time-tested the suggestions...
SET NOCOUNT ON
-- Create sample data
CREATE TABLE TestData
(
StaffName VARCHAR(40),
ReviewDate SMALLDATETIME
)
GO
-- Populate the table with 512000 records
INSERT TestData
(
StaffName,
ReviewDate
)
SELECT sn.Number,
36524 + ABS(CHECKSUM(NEWID())) % 7304
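-- (36524 is the number of days from 1900-01-01 to 2000-01-01, so the expression above yields
--  a random SMALLDATETIME between 2000-01-01 and roughly the end of 2019)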
FROM master..spt_values AS sn
INNER JOIN master..spt_values AS rd ON rd.Type = 'p'
INNER JOIN master..spt_values AS y ON y.Type = 'p'
WHERE sn.Type = 'p'
AND sn.Number < 100
AND rd.Number < 256
AND y.Number < 20
GO
CREATE CLUSTERED INDEX IX_TestData_StaffName_ReviewDate ON TestData (StaffName, ReviewDate)
GO
-- Create Sergiy's tally table with 65536 records
CREATE TABLE Tally
(
N INT PRIMARY KEY CLUSTERED
)
INSERT Tally
(
N
)
SELECT DISTINCT 256 * v1.Number + v2.Number
FROM master..spt_values AS v1
INNER JOIN master..spt_values AS v2 ON v2.Type = 'p'
WHERE v1.Type = 'p'
AND v1.Number < 256
AND v2.Number < 256
GO
-- Create Sergiy's function
CREATE FUNCTION dbo.tReviewsOfYear
(
@ReviewYear SMALLDATETIME
)
RETURNS @reviews TABLE (
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
StaffName VARCHAR(40),
ReviewNo int NULL,
ReviewDate SMALLDATETIME NOT NULL
)
AS
BEGIN
-- To make sure we deal with "year only" value
SET @ReviewYear = DATEADD(YY, DATEDIFF(YY, 0, @ReviewYear), 0)
DECLARE @StaffName VARCHAR(200)
DECLARE @Count INT
INSERT INTO @reviews (StaffName, ReviewDate)
SELECT StaffName, ReviewDate
FROM TestData
WHERE ReviewDate >= @ReviewYear
AND ReviewDate < DATEADD(YY, 1, @ReviewYear)
ORDER BY StaffName, ReviewDate
-- Sequential update driven by PRIMARY KEY
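-- (@Count restarts at 1 whenever StaffName changes, so ReviewNo numbers each person's reviews in ReviewDate order)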
UPDATE R
SET @Count = ReviewNo = CASE WHEN @StaffName = StaffName THEN @Count + 1 ELSE 1 END,
@StaffName = StaffName
FROM @reviews R
RETURN
END
GO
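-- Side note on the "year only" line in the function above:
-- DATEADD(YY, DATEDIFF(YY, 0, @ReviewYear), 0) rounds any date down to January 1st of its year,
-- so the function can be called with any date in the wanted year, e.g.
-- SELECT DATEADD(YY, DATEDIFF(YY, 0, '2007-06-15'), 0)   -- gives 2007-01-01 00:00:00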
-- Prepare time testing
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
DECLARE @Time datetime
SET @Time = GETDATE()
-- Now we time-test Sergiy's suggestion
DECLARE @N int
SELECT @N = MAX (CNT)
FROM (select COUNT(*) CNT
FROM TestData
GROUP BY StaffName
) DT
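-- @N is now the largest number of reviews any single StaffName has, which is used to cap the Tally join below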
select CASE WHEN T.N = 1 THEN S.StaffName ELSE '' END,
T.N, Y2006.ReviewDate [2006_Dates], Y2007.ReviewDate [2007_Dates], Y2008.ReviewDate [2008_Dates]
from TestData S
INNER JOIN Tally T ON T.N > 0 AND T.N < @N
-- If you expect more than 5000 reviews per year per customer increase this number
LEFT HASH JOIN dbo.tReviewsOfYear ('2006-11-01') Y2006 ON S.StaffName = Y2006.StaffName AND T.N = Y2006.ReviewNo
LEFT HASH JOIN dbo.tReviewsOfYear ('2007-01-01') Y2007 ON S.StaffName = Y2007.StaffName AND T.N = Y2007.ReviewNo
LEFT HASH JOIN dbo.tReviewsOfYear ('2008-01-01') Y2008 ON S.StaffName = Y2008.StaffName AND T.N = Y2008.ReviewNo
WHERE Y2006.ReviewNo IS NOT NULL OR Y2007.ReviewNo IS NOT NULL OR Y2008.ReviewNo IS NOT NULL
GROUP BY S.StaffName, T.N, Y2006.ReviewDate , Y2007.ReviewDate , Y2008.ReviewDate
ORDER BY S.StaffName, T.N
-- Now we display the time for Sergiy's suggestion
PRINT 'Sergiy time taken ' + CONVERT(varchar(20), DATEDIFF(MILLISECOND, @Time, GETDATE())) + ' milliseconds.'
-- Prepare time testing
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
-- Now we time-test Peso's suggestion
SET @Time = GETDATE()
DECLARE @BaseYear SMALLINT , @IncludeAllStaff BIT
DECLARE @MinDate SMALLDATETIME,
@MaxDate SMALLDATETIME
IF @BaseYear IS NULL
SET @BaseYear = DATEPART(YEAR, GETDATE())
SELECT @MaxDate = DATEADD(YEAR, @BaseYear - 1899, '19000101'),
@MinDate = DATEADD(YEAR, -3, @MaxDate)
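-- (@MaxDate is January 1 of the year after @BaseYear, and @MinDate is three years before that,
--  so the range covers the years @BaseYear-2 through @BaseYear)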
CREATE TABLE #Stage
(
RowID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
StaffName VARCHAR(40) NOT NULL,
ReviewDate VARCHAR(10) NOT NULL
)
INSERT #Stage
(
StaffName,
ReviewDate
)
SELECT DISTINCT StaffName,
CONVERT(CHAR(10), ReviewDate, 120)
FROM TestData
WHERE ReviewDate >= @MinDate
AND ReviewDate < @MaxDate
ORDER BY StaffName,
CONVERT(CHAR(10), ReviewDate, 120) DESC
IF @IncludeAllStaff = 1
INSERT #Stage
(
StaffName,
ReviewDate
)
SELECT DISTINCT t.StaffName,
''
FROM TestData AS t
LEFT JOIN #Stage AS s ON s.StaffName = t.StaffName
WHERE s.StaffName IS NULL
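-- theIndex (RowID - minRowID) numbers the reviews within each StaffName/year group from 0 and up,
-- so the MAX(CASE ...) pivot below lines up the Nth latest review of each year on the same output row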
SELECT CASE WHEN u.theIndex = 0 THEN u.StaffName ELSE '' END AS StaffName,
MAX(CASE WHEN u.theYear = @BaseYear - 2 THEN u.ReviewDate ELSE '' END) AS y2,
MAX(CASE WHEN u.theYear = @BaseYear - 1 THEN u.ReviewDate ELSE '' END) AS y1,
MAX(CASE WHEN u.theYear = @BaseYear THEN u.ReviewDate ELSE '' END) AS y0
FROM (
SELECT s.StaffName,
m.theYear,
s.ReviewDate,
s.RowID - m.minRowID AS theIndex
FROM #Stage AS s
INNER JOIN (
SELECT StaffName,
LEFT(ReviewDate, 4) AS theYear,
MIN(RowID) AS minRowID
FROM #Stage
GROUP BY StaffName,
LEFT(ReviewDate, 4)
) AS m ON m.StaffName = s.StaffName
WHERE s.ReviewDate LIKE m.theYear + '%'
) AS u
GROUP BY u.StaffName,
u.theIndex
ORDER BY u.StaffName,
u.theIndex
DROP TABLE #Stage
-- Now we display the time for Pesos suggestion
PRINT 'Peso time taken ' + CONVERT(varchar(20), DATEDIFF(MILLISECOND, @Time, GETDATE())) + ' milliseconds.'
Well... how can I write this in a nice way?
I ran my code 60 times, and the average time on my computer is 2,200 milliseconds (yes, only 2.2 seconds!).
The average number of reads was 115,000.
Meanwhile, Sergiy's suggestion is still running its first execution. I promise I will get back and post its time taken and average reads.
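For those wondering how the averages are produced, a minimal sketch of such a timing loop could look like the one below (dbo.QueryUnderTest is only a placeholder for whichever script is being measured):
DECLARE @i INT, @Start DATETIME
DECLARE @Timings TABLE (Run INT, ms INT)
SET @i = 1
WHILE @i <= 60
BEGIN
    -- start from a cold cache on every run
    DBCC FREEPROCCACHE
    DBCC DROPCLEANBUFFERS
    SET @Start = GETDATE()
    EXEC dbo.QueryUnderTest -- placeholder for the script being measured
    INSERT @Timings (Run, ms) VALUES (@i, DATEDIFF(MILLISECOND, @Start, GETDATE()))
    SET @i = @i + 1
END
SELECT AVG(ms) AS AvgMilliseconds FROM @Timings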
N 56°04'39.16"
E 12°55'05.25"
February 1, 2008 at 1:25 am
I have to break my promise.
After over 30 minutes of the first run of Sergiy's suggestion, I had to cancel the query.
Profiler then showed me this result for Sergiy:
Duration - 1,874,907
Reads - 10,773,782
N 56°04'39.16"
E 12°55'05.25"
February 1, 2008 at 1:33 am
I took the liberty of optimizing Sergiy's suggestion a little.
Two things:
SELECT @N = MAX (CNT)
FROM (select COUNT(*) CNT
FROM TestData
GROUP BY StaffName , year(reviewdate)
) DT
and
INNER JOIN Tally T ON T.N > 0 AND T.N <= @N
Not much improvement, only 10%.
N 56°04'39.16"
E 12°55'05.25"
February 1, 2008 at 1:50 am
I did some tests with SET STATISTICS IO ON and got these results for the original 15 records:
Sergiy
Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 48, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tally'. Scan count 2, logical reads 6, physical reads 1, read-ahead reads 7, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#336AA144'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#318258D2'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#2F9A1060'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Sergiy time taken 63 milliseconds.
Peso
Table '#Stage'. Scan count 0, logical reads 31, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Stage'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Peso time taken 46 milliseconds.
My STATISTICS IO result for 512,000 records of sample data is:
Table 'TestData'. Scan count 3, logical reads 1543, physical reads 0, read-ahead reads 16, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Stage'. Scan count 0, logical reads 111250, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Stage'. Scan count 2, logical reads 424, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Peso time taken 2250 milliseconds.
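(The figures above come from simply wrapping each script, roughly like this:)
SET STATISTICS IO ON
-- run the script under test here
SET STATISTICS IO OFF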
N 56°04'39.16"
E 12°55'05.25"
February 1, 2008 at 2:16 am
And just for the fun of it (to prove Sergiy wrong about my original suggestion with the hidden triangular join, which he didn't like and wasn't afraid to say so), I time-tested that too.
Over 40 runs, the time taken averaged 12.1 seconds with about 820,000 reads.
Still a hell of a lot better than Sergiy's suggestion with its tally table, function, and join hints!
The STATISTICS IO results for my original suggestion with 512,000 sample records were:
Table 'TestData'. Scan count 55103, logical reads 295419, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 131558, logical reads 520202, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
N 56°04'39.16"
E 12°55'05.25"
February 1, 2008 at 4:28 am
I tried to rerun Sergiy's suggestion, and this time my server died after 2 hours and 7 minutes.
There was no longer any space to expand my tempdb!
sp_spaceused testdata
name       rows     reserved   data       index_size   unused
TestData   512000   11272 KB   11160 KB   80 KB        32 KB
When checking the size of tempdb, I realized it was now 124,000 MB!
For a table only 11 MB in size.
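For anyone who wants to watch tempdb grow while such a query runs, one quick way (a sketch; there are other ways) is to check the file sizes directly:
SELECT name,
       size * 8 / 1024 AS size_MB
FROM tempdb.sys.database_files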
N 56°04'39.16"
E 12°55'05.25"
February 1, 2008 at 5:04 am
I have now come to the conclusion that I must create less sample data, so I started over with 4,000 records:
INSERT TestData
(
StaffName,
ReviewDate
)
SELECT sn.Number,
36524 + ABS(CHECKSUM(NEWID())) % 7304
FROM master..spt_values AS sn
INNER JOIN master..spt_values AS rd ON rd.Type = 'p'
INNER JOIN master..spt_values AS y ON y.Type = 'p'
WHERE sn.Type = 'p'
AND sn.Number < 20
AND rd.Number < 20
AND y.Number < 10
Now Sergiy's suggestion finished with an average of 5.1 s and 15,000 reads.
My suggestion averaged 0.2 s and 1,600 reads.
Then I went on to create 18,000 sample records like this:
INSERT TestData
(
StaffName,
ReviewDate
)
SELECT sn.Number,
36524 + ABS(CHECKSUM(NEWID())) % 7304
FROM master..spt_values AS sn
INNER JOIN master..spt_values AS rd ON rd.Type = 'p'
INNER JOIN master..spt_values AS y ON y.Type = 'p'
WHERE sn.Type = 'p'
AND sn.Number < 30
AND rd.Number < 30
AND y.Number < 20
Now Sergiy's suggestion averaged 120 s and 143,000 reads.
My suggestion averaged 0.3 seconds and 5,600 reads.
As you can see, Sergiy's suggestion doesn't scale well. Far from it.
4.5 times the sample data (from 4,000 to 18,000) makes his suggestion need 9.5 times the reads and 24 times the time.
And with 512,000 sample records (128 times more than 4,000 records) my suggestion took 115,000 reads and 2.2 seconds.
That is only 72 times the reads and 11 times the time taken.
And with 512,000 sample records (about 28 times more than 18,000 records) my suggestion took 115,000 reads and 2.2 seconds.
That is only 20 times the reads and 7 times the time taken.
So my suggestion scales very well! In practice it scales close to linearly: double the sample data, double the time!
What if we could estimate the time taken for Sergiy's suggestion with 512,000 sample records?
To do this, we need a predictable set of sample data. I created it with:
INSERT TestData
(
StaffName,
ReviewDate
)
SELECT sn.Number,
36524 + ABS(CHECKSUM(NEWID())) % 7304
FROM master..spt_values AS sn
INNER JOIN master..spt_values AS rd ON rd.Type = 'p'
INNER JOIN master..spt_values AS y ON y.Type = 'p'
WHERE sn.Type = 'p'
AND sn.Number < 50
AND rd.Number < 40
AND y.Number < 20
That is double the number of sample records (40,000 records, which made tempdb expand to 2,300 MB!). And as I suspected, Sergiy's suggestion now took 308 s and used 400,000 reads.
That is 40,000 sample records (a little more than 2 times the 18,000 records).
Which is 2.5 times the time taken and about 3 times the reads.
For a sample set of 512,000 records, Sergiy's suggestion would take about 23 hours...
Now I just wait for Sergiy to tell me I am using "a ridiculously large amount of sample data".
Peso algorithm
Sample records   Time taken   Used reads
------------------------------------------------
4,000               0.2 s        1,600 reads
18,000              0.3 s        5,600 reads
512,000             2.2 s      115,000 reads
1,600,000           5.6 s      235,000 reads   (same time as Sergiy had, but for 4,000 records only)
40,000,000        126.0 s    3,266,000 reads   (same time as Sergiy had, but for 18,000 records only)
N 56°04'39.16"
E 12°55'05.25"
February 1, 2008 at 5:05 am
Sergiy, I think you are reasonably skilled at many things in Microsoft SQL Server.
But on this specific topic, I think I have proved that it is not one of your strengths.
N 56°04'39.16"
E 12°55'05.25"
February 4, 2008 at 9:40 am
I think Peter has proven his case here.
Thank you to everyone for the suggestions; in the long run, every one of them is valuable for improving my SQL skills.
Cheers, Mike