June 25, 2009 at 5:06 pm
Hi Chris,
First of all, thank you for taking the time to get the data together and post it here!
Great job! :Wow:
When looking at the numbers in your table I started scratching my head. I'm bald by now. :crazy:
It looks very strange that you'd get only a 20% performance gain going from no index to an index on Acct.
Would you mind posting the actual execution plans for both scenarios? Note: please make sure to include the actual, not the estimated, execution plan.
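In case it helps, one way to capture the actual plan from a query window (just a suggestion - any method that returns the actual plan is fine):

-- Returns the actual execution plan (as XML) after the query runs
SET STATISTICS XML ON;
-- ... run the query under test here ...
SET STATISTICS XML OFF;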
I'm not at the point where I can follow Paul's conclusion (that PIVOT is slightly more efficient) yet, since the total numbers don't make sense to me. Not yet. It looks like one of the performance gurus around here should have a look at it. Maybe we're overlooking the obvious...
But one of the first questions from the gurus - aside from the execution plans - would be for the table definition together with any index definitions (besides the ones that are part of the "challenge").
Do you think you'll find the time to provide that information? That would really be great!
Note: it always keeps me from sleeping well when there are issues around that I don't understand at all and that contradict my current understanding - in this case I don't understand why there's so little difference between no index and the variety of tested indexes...
June 26, 2009 at 7:34 pm
lmu92 (6/22/2009)
You don't need the subselect:
SELECT  ACCT_DEBTOR,
        MAX(CASE WHEN OCCURRENCE = 1 THEN LANDLINE_CONTACT_NO ELSE NULL END) AS LandLineNumber1,
        MAX(CASE WHEN OCCURRENCE = 2 THEN LANDLINE_CONTACT_NO ELSE NULL END) AS LandLineNumber2,
        MAX(CASE WHEN OCCURRENCE = 3 THEN LANDLINE_CONTACT_NO ELSE NULL END) AS LandLineNumber3
FROM    Post_File082_Landline_No
GROUP BY ACCT_DEBTOR
I've had great success with a little trick that Peter Larson taught me a while back. I actually incorporated his method into one of my articles. Peter calls it "pre-aggregation". In the presence of the correct indexes, the method will usually beat PIVOT.
[font="Courier New"] SELECT preagg.Acct_Debtor,
MAX(CASE WHEN preagg.Occurance=1 THEN preagg.Max_LandLine_Contact_No ELSE NULL END) AS LandLineNumber1,
MAX(CASE WHEN preagg.Occurance=2 THEN preagg.Max_LandLine_Contact_No ELSE NULL END) AS LandLineNumber2,
MAX(CASE WHEN preagg.Occurance=3 THEN preagg.Max_LandLine_Contact_No ELSE NULL END) AS LandLineNumber3
FROM
(--==== Pre-aggregate the data. This will obviously work much better with the correct index
SELECT Acct_Debtor, Occurance, MAX(LandLine_Contact_No) AS Max_LandLine_Contact_No
FROM dbo.Post_File082_Landline_No
GROUP BY Acct_Debtor, Occurance
)preagg
GROUP BY preagg.Acct_Debtor[/font]
I don't know what your indexes are, but I would think that the following index would be beneficial to the task...
[font="Courier New"] CREATE INDEX IX_Post_File082_Landline_No_Cover
ON dbo.Post_File082_Landline_No (Acct_Debtor, Occurance)
INCLUDE (LandLine_Contact_No)[/font]
Of course, I don't have your data to test with so I don't guarantee the code but it should be fairly close.
_____________________________________________________________________________________________________________________________
--Jeff Moden
Change is inevitable... Change for the better is not.
June 26, 2009 at 7:43 pm
Paul White (6/25/2009)
Cool stuff Chris. Of course it leaves me wondering about the hash aggregate versus stream aggregate thing, but hey. So PIVOT can be slightly more efficient - if you can be bothered to fight the syntax, and if the query you need suits it.
😎
Heh... not if you do it right. 😉
http://www.sqlservercentral.com/articles/T-SQL/63681/
--Jeff Moden
Change is inevitable... Change for the better is not.
June 26, 2009 at 7:56 pm
Heh... I almost forgot... you can get a tiny bit more speed out of it if you move the pre-aggregation from the derived table to a CTE like this...
[font="Courier New"];WITH
ctePreAgg AS
(--==== Pre-aggregate the data. This will obviously work much better with the correct index
SELECT Acct_Debtor, Occurance, MAX(LandLine_Contact_No) AS Max_LandLine_Contact_No
FROM dbo.Post_File082_Landline_No
GROUP BY Acct_Debtor, Occurance
)
SELECT preagg.Acct_Debtor,
MAX(CASE WHEN preagg.Occurance=1 THEN preagg.Max_LandLine_Contact_No ELSE NULL END) AS LandLineNumber1,
MAX(CASE WHEN preagg.Occurance=2 THEN preagg.Max_LandLine_Contact_No ELSE NULL END) AS LandLineNumber2,
MAX(CASE WHEN preagg.Occurance=3 THEN preagg.Max_LandLine_Contact_No ELSE NULL END) AS LandLineNumber3
FROM ctePreAgg AS preagg
GROUP BY preagg.Acct_Debtor[/font]
--Jeff Moden
Change is inevitable... Change for the better is not.
June 27, 2009 at 1:01 am
Here is a test rig to try the various methods on the traditional one million rows.
(By the way, moving a sub-query like this to a common table expression generally results in an identical plan - there is no speed-up. It does look better though - at least it does to me.)
-- Everyone has one of these
USE tempdb;
GO
-- Don't print x row(s) affected messages
SET NOCOUNT ON;
GO
-- Conditional drops
IF OBJECT_ID(N'#CaseVersion', N'P') IS NOT NULL DROP PROCEDURE #CaseVersion;
IF OBJECT_ID(N'#PreAggVersion', N'P') IS NOT NULL DROP PROCEDURE #PreAggVersion;
IF OBJECT_ID(N'#PivotVersion', N'P') IS NOT NULL DROP PROCEDURE #PivotVersion;
IF OBJECT_ID(N'dbo.ClientContact', N'U') IS NOT NULL DROP TABLE dbo.ClientContact;
GO
-- Test table
CREATE TABLE dbo.ClientContact
(
    client_id   VARCHAR(10) NOT NULL,
    sequence_id INTEGER     NOT NULL,
    telephone   VARCHAR(10) NOT NULL,
CONSTRAINT [PK dbo.ClientContact client_id, sequence_id]
PRIMARY KEY CLUSTERED (client_id, sequence_id)
);
GO
-- Add 999,999 test rows
-- (this may take 30 seconds or so)
WITH Numbers (client_id)
AS (
    SELECT TOP (333333)
            'A' + RIGHT(1000000000 + ROW_NUMBER() OVER (ORDER BY C2.[object_id]), 9)
    FROM    master.sys.columns C1, master.sys.columns C2
)
INSERT  dbo.ClientContact
        (client_id, sequence_id, telephone)
SELECT  client_id,
        Sequence.id,
        Telephone.tel
FROM    Numbers
CROSS APPLY (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3) Sequence (id)
CROSS APPLY (SELECT RIGHT(10000000000 + CONVERT(INT, RAND(CHECKSUM(NEWID())) * 123456789), 10)) Telephone (tel);
GO
-- CASE-based version
CREATE PROCEDURE #CaseVersion AS
BEGIN
SET NOCOUNT ON;
    IF OBJECT_ID(N'#DumpTable', N'U') IS NOT NULL DROP TABLE #DumpTable;
    SELECT  client_id,
            MAX(CASE WHEN sequence_id = 1 THEN telephone ELSE NULL END) AS LandLineNumber1,
            MAX(CASE WHEN sequence_id = 2 THEN telephone ELSE NULL END) AS LandLineNumber2,
            MAX(CASE WHEN sequence_id = 3 THEN telephone ELSE NULL END) AS LandLineNumber3
    INTO    #DumpTable
    FROM    dbo.ClientContact
    GROUP BY client_id
    ORDER BY client_id
    OPTION  (MAXDOP 1);
END;
GO
-- Pre-aggregated CASE version
CREATE PROCEDURE #PreAggVersion AS
BEGIN
SET NOCOUNT ON;
    IF OBJECT_ID(N'#DumpTable', N'U') IS NOT NULL DROP TABLE #DumpTable;
    SELECT  preagg.client_id,
            MAX(CASE WHEN preagg.sequence_id = 1 THEN preagg.max_telephone ELSE NULL END) AS LandLineNumber1,
            MAX(CASE WHEN preagg.sequence_id = 2 THEN preagg.max_telephone ELSE NULL END) AS LandLineNumber2,
            MAX(CASE WHEN preagg.sequence_id = 3 THEN preagg.max_telephone ELSE NULL END) AS LandLineNumber3
    INTO    #DumpTable
    FROM    (--==== Pre-aggregate the data. This will obviously work much better with the correct index
             SELECT client_id, sequence_id, MAX(telephone) AS max_telephone
             FROM   dbo.ClientContact
             GROUP BY client_id, sequence_id
            ) preagg
    GROUP BY preagg.client_id
    ORDER BY preagg.client_id
    OPTION  (MAXDOP 1);
END;
GO
-- PIVOT version
CREATE PROCEDURE #PivotVersion AS
BEGIN
SET NOCOUNT ON;
    IF OBJECT_ID(N'#DumpTable', N'U') IS NOT NULL DROP TABLE #DumpTable;
    SELECT  PVT.client_id,
            PVT.[1] AS [LandLineNumber1],
            PVT.[2] AS [LandLineNumber2],
            PVT.[3] AS [LandLineNumber3]
    INTO    #DumpTable
    FROM    (SELECT client_id, sequence_id, telephone FROM dbo.ClientContact) CC
    PIVOT   (MAX(telephone) FOR sequence_id IN ([1],[2],[3])) PVT
    OPTION  (MAXDOP 1);
END;
GO
-- Show actual execution plans with row counts, in text form
SET STATISTICS PROFILE ON;
EXECUTE #CaseVersion;
EXECUTE #PreAggVersion;
EXECUTE #PivotVersion;
GO
SET STATISTICS PROFILE OFF;
GO
-- Clear cached plans
DBCC FREESYSTEMCACHE('Object Plans');
WAITFOR DELAY '00:00:05';
GO
-- Run each procedure five times
EXECUTE #CaseVersion;
GO 5
EXECUTE #PreAggVersion;
GO 5
EXECUTE #PivotVersion;
GO 5
-- Results summary
SELECT  [text],
        execution_count,
        min_worker_time / 1000 AS min_worker_time_ms,
        max_worker_time / 1000 AS max_worker_time_ms,
        total_worker_time / execution_count / 1000 AS avg_worker_time_ms,
        total_elapsed_time / execution_count / 1000 AS avg_elapsed_time_ms,
        plan_generation_num,
        query_plan
FROM    sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text ([sql_handle])
CROSS APPLY sys.dm_exec_query_plan ([plan_handle])
WHERE   [text] LIKE '--%#%Version%'
AND     [text] NOT LIKE '-- Results summary%'
ORDER BY creation_time ASC;
GO
June 27, 2009 at 10:04 am
Hi Paul,
thanx for the test scenario!
When I ran your scenario (1 mill rows, 333333 client_ids) on my laptop I got approx. 1.1sec for all three versions (CASE, aggregate and PIVOT).
I extended it to 9 mill rows, 333333 client_ids, ending up with 7.4sec for CASE and aggregate and 7.3sec for PIVOT (I'd still call this a tie between the three...)
After that I reviewed the table provided by Chris. (http://www.sqlservercentral.com/Forums/Attachment3400.aspx)
I stumbled over the number of Accounts (9mill) vs. the total number of rows (12mill).
Based on those numbers I think the pre-aggregation Jeff's example uses most probably won't reduce the number of rows to be processed significantly. It's still a great way to go in a scenario where you'd have several rows with identical Acct_Debtor and Occurance. Unfortunately, it doesn't look like the data distribution in this scenario is like that.
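A quick way to check that hunch - just a sketch, assuming the table and column names from earlier in the thread - is to look at the average number of rows per (Acct_Debtor, Occurance) pair; if it comes back close to 1, the inner GROUP BY of the pre-agg method can't remove much:

-- Average rows per (Acct_Debtor, Occurance) pair.
-- A value near 1.0 means pre-aggregation barely reduces the row count.
SELECT AVG(rows_per_pair * 1.0) AS avg_rows_per_pair
FROM (
      SELECT COUNT(*) AS rows_per_pair
      FROM   dbo.Post_File082_Landline_No
      GROUP BY Acct_Debtor, Occurance
     ) d;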
I then repeated your scenario with 9 mill rows and 3 mill client_ids, and it took 12sec for CASE and aggregate and 11.8sec for PIVOT, which is still 10 times faster compared to what Chris is getting.
I'm not sure how the data are distributed over Acct_Debtor and Occurance, but I assume we have one row per client_id with sequence_id = 1.
If that's the case, it may help to eliminate those 9mill rows from the pivot and just work with the remaining rows (code can be included in your test scenario):
CREATE PROCEDURE #seq1Version AS
BEGIN
    SET NOCOUNT ON;
    IF OBJECT_ID(N'#DumpTable', N'U') IS NOT NULL DROP TABLE #DumpTable;
    WITH seq1
    AS (
        SELECT  client_id AS client_id,
                telephone AS LandLineNumber1
        FROM    dbo.ClientContact
        WHERE   sequence_id = 1
    ),
    seqNot1
    AS (
        SELECT  client_id AS client_id,
                MAX(CASE WHEN sequence_id = 2 THEN telephone ELSE NULL END) AS LandLineNumber2,
                MAX(CASE WHEN sequence_id = 3 THEN telephone ELSE NULL END) AS LandLineNumber3
        FROM    dbo.ClientContact
        WHERE   sequence_id > 1
        GROUP BY client_id
    )
    SELECT  seq1.client_id AS client_id,
            seq1.LandLineNumber1 AS LandLineNumber1,
            seqNot1.LandLineNumber2 AS LandLineNumber2,
            seqNot1.LandLineNumber3 AS LandLineNumber3
    INTO    #DumpTable -- dump to a temp table like the other test procedures
    FROM    seq1
    LEFT OUTER JOIN seqNot1 ON seq1.client_id = seqNot1.client_id
    OPTION  (MAXDOP 1);
END;
When I ran this with 1mill rows and a data distribution of 900k client_id with seq_id=1 and 100k client_id with seq_id=2 I got 1.7sec for the three versions you evaluated and 0.9sec for the version above.
-- generate 1mill rows with 900k rows with sequence_id=1 and 100k rows with sequence_id=2
;WITH Numbers (client_id)
AS (
    SELECT TOP (900000)
            'A' + RIGHT(1000000000 + ROW_NUMBER() OVER (ORDER BY C2.[object_id]), 9)
    FROM    master.sys.columns C1, master.sys.columns C2
)
INSERT  dbo.ClientContact
        (client_id, sequence_id, telephone)
SELECT  client_id,
        1,
        Telephone.tel
FROM    Numbers
CROSS APPLY (SELECT RIGHT(10000000000 + CONVERT(INT, RAND(CHECKSUM(NEWID())) * 123456789), 10)) Telephone (tel);

;WITH Numbers (client_id)
AS (
    SELECT TOP (100000)
            'A' + RIGHT(1000000000 + ROW_NUMBER() OVER (ORDER BY C3.[object_id]), 9)
    FROM    master.sys.columns C1, master.sys.columns C2, master.sys.columns C3
)
INSERT  dbo.ClientContact
        (client_id, sequence_id, telephone)
SELECT  client_id,
        2,
        Telephone.tel
FROM    Numbers
CROSS APPLY (SELECT RIGHT(10000000000 + CONVERT(INT, RAND(CHECKSUM(NEWID())) * 123456789), 10)) Telephone (tel);
So, I think if it's possible to move a large number of rows from the pivot to a separate subquery, the performance should improve.
Edit: Note: the solution above will only be faster under a special data distribution (in this scenario: more than 65% of the rows can be moved to the subquery). It should not be considered a general solution.
June 27, 2009 at 8:40 pm
Hey Lutz,
My results were the same as yours; and I agree with your assessment.
Clearly, in this case, and with this data distribution, there is nothing to choose between the three methods.
PIVOT is a bit of an odd duck: I don't find the syntax particularly intuitive, and it struggles with multiple aggregations, but it can produce more compact code, and it produces the same basic plan as the CASE code (for basic tasks), leading to comparable performance in many instances.
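To illustrate the 'struggles with multiple aggregations' point: a single PIVOT operator only accepts one aggregate, whereas the CASE cross-tab happily mixes different aggregates in one pass. A quick sketch against the dbo.ClientContact test table above:

-- Two different aggregates in one scan - a single PIVOT can't express this.
SELECT  client_id,
        MAX(CASE WHEN sequence_id = 1 THEN telephone ELSE NULL END) AS LandLineNumber1,
        MAX(CASE WHEN sequence_id = 2 THEN telephone ELSE NULL END) AS LandLineNumber2,
        COUNT(CASE WHEN sequence_id > 1 THEN 1 ELSE NULL END)       AS AdditionalNumbers
FROM    dbo.ClientContact
GROUP BY client_id;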
When I get a free moment I might take another look at some of the more complex examples floating about on this site, to see if the 'pre-aggregation' can be incorporated into the PIVOT method too. I use the quotes there since, in the plans I have seen so far, the aggregation still follows the evaluation of the CASE expression for each row (in the Compute Scalar), despite the way the query is written.
Paul
June 27, 2009 at 8:57 pm
Paul White (6/27/2009)
Clearly, in this case, and with this data distribution, there is nothing to choose between the three methods.
I'll probably stick to using the Cross-Tab code (one method or the other depending on the data, but it will usually be the pre-agg method) for a couple of reasons...
1. Since it's a bit faster in the other cases, I don't want to get into the habit of writing PIVOTs.
2. Since it's a bit faster in the other cases, I don't want other people to read my code and use a PIVOT I may have written for one of those other cases.
3. Admittedly a personal preference, but I prefer the readability of Cross-Tabs over that of PIVOTs even for the simple stuff.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 28, 2009 at 1:21 am
Jeff Moden (6/27/2009)
I'll probably stick to using the Cross-Tab code
Me too. I'm really quite anti-PIVOT myself (as I have said at least once earlier on this thread) but I do like to give it a fair go whenever I come across it...and also, I didn't want to not post my test rig results just because they didn't prove what I wanted 🙂
I'm mostly against PIVOT because it is such a missed opportunity (in terms of functionality and performance), has a clumsy syntax (despite Books Online's protestations to the contrary!) and the SUM...CASE construction which we have all used for many years does the job better anyway.
As an example of my willingness to give PIVOT a 'fair go' take another look at my post in the discussion of your first article on this subject. That shows how to do a nested PIVOT (rather than a join) to achieve the same performance as the 'pre-aggregated' cross-tab. I don't make a big thing about that since the syntax is just so horrible.
Also, I have just noticed that 2008 has broken the code I posted there. As written it produces:
[font="Courier New"]Msg 8624, Level 16, State 21, Line x
Internal Query Processor Error: The query processor could not produce a query plan. For more information, contact Customer Support Services.
[/font]
...which is just fantastic :hehe:
I think this is a bug - since the PIVOT consumes not just the column you would expect, but also any columns aliased to it. If you have a copy of 2008 lying around, the following illustrates this 'bug':
DECLARE @T TABLE (A INT NOT NULL, B INT NOT NULL);

INSERT  @T (A, B)
SELECT  1, 1 UNION ALL
SELECT  2, 2 UNION ALL
SELECT  3, 3;

WITH    Base AS
        (SELECT A, A AS A2, B FROM @T)
SELECT  A2, [1], [2], [3]
FROM    Base
PIVOT   (SUM(B) FOR A IN ([1],[2],[3])) AS pvt;
Substituting CONVERT(INT, A) AS A2 for 'A AS A2' makes it work again :sigh:
In fact, any dumb no-op like 1 * A AS A2 also works.
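For completeness, here is the repro again with the no-op applied (same table variable, same batch as above):

-- Same query - the 1 * A no-op breaks the alias/column tie,
-- so the PIVOT no longer consumes A2 and the plan compiles:
WITH    Base AS
        (SELECT A, 1 * A AS A2, B FROM @T)
SELECT  A2, [1], [2], [3]
FROM    Base
PIVOT   (SUM(B) FOR A IN ([1],[2],[3])) AS pvt;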
Optimizers eh?
To 'unbreak' my nested-PIVOT code for 2008, change '[Quarter] AS Q2' to '1 * [Quarter] AS Q2', or similar.
Paul
June 28, 2009 at 2:25 am
Paul White (6/27/2009)
Clearly, in this case, and with this data distribution, there is nothing to choose between the three methods. ...Paul
But how about the method I posted above (let's call it "Sub-Cross-Tab")? With the data distribution as shown the three options you compared take twice as long.
This thread and Jeff's last post lead me to the conclusion that I'll stop playing around with PIVOT. It has just been proven that a database cannot be turned into Excel (where pivot works fine) 🙂
Edit1: One more reason to stop using it: PIVOT cannot be used for a dynamic number of columns. A dynamic Cross Tab can do it.
Edit2: Have to partially retract my first edit note since it's simply not true. I'm sorry. :blush:
June 28, 2009 at 4:12 am
lmu92 (6/28/2009)
Paul White (6/27/2009)
Clearly, in this case, and with this data distribution, there is nothing to choose between the three methods. ...Paul
But how about the method I posted above (let's call it "Sub-Cross-Tab")?
That's why I stressed 'in this case, and with this data distribution' 😛
lmu92 (6/28/2009)
With the data distribution as shown the three options you compared take twice as long.
Well yes - but then we aren't comparing apples with apples anymore. Your method is specifically optimized for the data distribution you set up - and it is faster simply because it doesn't do any aggregation at all for the 900K rows with sequence_id = 1. The number of rows which do go through the stream aggregate (100K instead of 1M) drops the cost enough so that introducing the extra table scan and merge join still results in a cheaper overall plan.
The PIVOT can be optimized the same way, and runs 50ms faster on my laptop:
CREATE PROCEDURE #PivotVersion2 AS
BEGIN
    SET NOCOUNT ON;
    IF OBJECT_ID(N'#DumpTable', N'U') IS NOT NULL DROP TABLE #DumpTable;
    WITH seq1
    AS (
        SELECT  client_id AS client_id,
                telephone AS LandLineNumber1
        FROM    dbo.ClientContact
        WHERE   sequence_id = 1
    ),
    seqNot1
    AS (
        SELECT  PVT.client_id,
                PVT.[1] AS [LandLineNumber1],
                PVT.[2] AS [LandLineNumber2],
                PVT.[3] AS [LandLineNumber3]
        FROM    (SELECT client_id, sequence_id, telephone FROM dbo.ClientContact WHERE sequence_id > 1) CC
        PIVOT   (MAX(telephone) FOR sequence_id IN ([1],[2],[3])) PVT
    )
    SELECT  seq1.client_id AS client_id,
            seq1.LandLineNumber1 AS LandLineNumber1,
            seqNot1.LandLineNumber2 AS LandLineNumber2,
            seqNot1.LandLineNumber3 AS LandLineNumber3
    INTO    #DumpTable
    FROM    seq1
    LEFT OUTER JOIN seqNot1 ON seq1.client_id = seqNot1.client_id
    ORDER BY client_id
    OPTION  (MAXDOP 1);
END;
lmu92 (6/28/2009)
This thread and Jeff's last post lead me to the conclusion that I'll stop playing around with PIVOT. It has just been proven that a database cannot be turned into Excel (where pivot works fine) 🙂 Edit: One more reason to stop using it: PIVOT cannot be used for a dynamic number of columns. A dynamic Cross Tab can do it.
Jeff, could you let me have the number of your handrail-installation team please? 😀
PIVOT can be used with dynamic SQL, just the same as the CASE...SUM idea.
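For example - a minimal dynamic-PIVOT sketch against the test table, building the column list from the data first (just a sketch, not production code):

-- Build the IN (...) list from the distinct sequence_id values,
-- then run the PIVOT through dynamic SQL.
DECLARE @cols NVARCHAR(MAX),
        @sql  NVARCHAR(MAX);

SELECT  @cols = STUFF(
        (SELECT DISTINCT ',' + QUOTENAME(CONVERT(VARCHAR(12), sequence_id))
         FROM   dbo.ClientContact
         FOR XML PATH('')), 1, 1, '');

SET     @sql = N'SELECT PVT.client_id, ' + @cols + N'
        FROM (SELECT client_id, sequence_id, telephone FROM dbo.ClientContact) CC
        PIVOT (MAX(telephone) FOR sequence_id IN (' + @cols + N')) PVT;';

EXECUTE (@sql);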
Paul
June 28, 2009 at 4:32 am
Paul White (6/28/2009)
... Well yes - but then we aren't comparing apples with apples anymore. Your method is specifically optimized for the data distribution you set up - and it is faster simply because it doesn't do any aggregation at all for the 900K rows with sequence_id = 1. The number of rows which do go through the stream aggregate (100K instead of 1M) drops the cost enough so that introducing the extra table scan and merge join still results in a cheaper overall plan.
You're definitely right: the solution I came up with is faster only under a special kind of data distribution. The break-even is at around 65% of the rows (if you can eliminate more than 65% of your rows, you'll start being faster than the standard pivot methods). My interpretation of the data Chris provided in one of his posts gave me the idea of eliminating the largest number of rows before even starting the pivot.
So it is a solution tailored to the special request of the OP in this thread, not a general solution. Sorry if I left that impression.
My goal was to give Chris another option to play with to get his special case resolved. 😉
I'll edit my original post to clarify.
June 28, 2009 at 4:44 am
Paul White (6/28/2009)
lmu92 (6/28/2009)
Paul White (6/27/2009)
Clearly, in this case, and with this data distribution, there is nothing to choose between the three methods. ...Paul
But how about the method I posted above (let's call it "Sub-Cross-Tab")?
That's why I stressed 'in this case, and with this data distribution' 😛
I might have been a little confused about what you were referring to as "this case, and with this data distribution".
I thought you were talking about the scenario Chris described. But after reviewing your posts it becomes clear that you were talking about the test scenario you set up. Sorry for the misinterpretation and the confusion I caused. :Whistling:
June 28, 2009 at 4:44 am
lmu92 (6/28/2009)
You're definitely right: the solution I came up with is faster only under a special kind of data distribution. The break-even is at around 65% of the rows (if you can eliminate more than 65% of your rows, you'll start being faster than the standard pivot methods). My interpretation of the data Chris provided in one of his posts gave me the idea of eliminating the largest number of rows before even starting the pivot. So it is a solution tailored to the special request of the OP in this thread, not a general solution. Sorry if I left that impression.
Hey Lutz,
I make allowances for the NEAFLs :w00t: 😀 :w00t:
It's all good. I just had to respond with the equivalent PIVOT code - for balance 😉
Paul
June 28, 2009 at 5:13 am
What makes me wonder, though, is what Chris (the OP who started this thread in the first place) ended up using to tune his query... The subject he was talking about got a little out of focus, I think.
I hope we didn't scare him away...
@chris-2: Are you still with us??
Edit: Btw: Congrats, Paul for being "TenCenturies" now!! :Wow: Well deserved!