October 5, 2015 at 12:47 pm
First, thanks to all who have taken an interest in this.
I've played around with this code. One observation - any form of UNIQUE constraint on the temp table results in the poor execution plan. Leaving the temp table as a heap gives the quickest performance. I've boiled the case down to essentials; here is the SQL code in two parts. The first part generates the test data; the second part runs the query. You'll need a tally table to help generate the test data.
-- Note: You will need a tally table to create the test data
-- Early data is monthly; later data is daily
-- Define a mapping table [DateMap] that maps early monthly data to daily dates
-- and later daily data to daily dates so data can be pulled for daily
-- dates over the entire date range
IF OBJECT_ID('dbo.DateMap') IS NOT NULL drop table dbo.DateMap
CREATE TABLE [dbo].[DateMap]
(
[adate] [smalldatetime] NOT NULL,
[calcdate] [smalldatetime] NULL)
INSERT INTO dbo.DateMap(adate)
SELECT CAST(N AS SMALLDATETIME)
FROM dbo.tally2 -- First use of Tally Table
WHERE N BETWEEN 29249 AND 42275
UPDATE dbo.DateMap
SET calcdate = adate
WHERE adate >= '31-aug-2008'
-- Define two CTEs
;WITH datecalc AS(
SELECT *, y= Year(adate), m= Month(adate), rn = ROW_NUMBER()OVER(ORDER BY adate)
FROM dbo.DateMap)
, monthends AS(
SELECT DC1.*
FROM datecalc DC1
JOIN datecalc DC2
ON DC1.rn = DC2.rn-1
WHERE DC1.m <> DC2.m)
UPDATE DM
SET calcDate = X.adate
FROM dbo.dateMap DM
OUTER APPLY( SELECT TOP 1 adate
FROM monthends M
WHERE DM.adate >= M.adate
ORDER BY M.adate DESC
) X
WHERE DM.calcDate IS NULL
ALTER TABLE dbo.DateMap ADD CONSTRAINT PK_DateMap PRIMARY KEY CLUSTERED(Adate ASC)
CREATE NONCLUSTERED INDEX IX_DateMap_calcdate_adate ON dbo.DateMap(CalcDate ASC, Adate ASC)
-- Create the test data
IF OBJECT_ID('dbo.TestCalcs') IS NOT NULL drop table dbo.TestCalcs
CREATE TABLE dbo.TestCalcs(adate SMALLDATETIME NOT NULL,
hid INT NOT NULL,
value FLOAT NULL,
PRIMARY KEY (adate ASC, hid ASC))
INSERT INTO dbo.testCalcs
SELECT X.adate,Y.hid, RAND(Y.hid)
FROM (select distinct calcdate as adate from dbo.dateMap) X
CROSS JOIN (select top 4000 N as hid from dbo.tally2 order by N) Y -- second use of tally table
CREATE NONCLUSTERED INDEX IX_testCalcs ON dbo.testCalcs(hid ASC, adate ASC)
-- Create the view that joins the test data to the mapping table
IF OBJECT_ID('dbo.TestCalcsView') IS NOT NULL drop view dbo.TestCalcsView
GO
CREATE VIEW dbo.testCalcsView AS
SELECT
DM.adate,
CL.hid,
CL.value
FROM
rc_temp.dbo.DateMap DM
JOIN
rc_temp.dbo.testCalcs CL
ON DM.CalcDate = CL.Adate
GO
Once the test data is generated, here is the test run code.
-- Test Run
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
-- Temp tables
IF OBJECT_ID('tempdb..#hidvals') IS NOT NULL DROP TABLE #hidvals
CREATE TABLE #hidvals(hid INT NOT NULL,
-- hid INT NOT NULL UNIQUE,
extel2code CHAR(2),
start SMALLDATETIME,
finish SMALLDATETIME,
value1 FLOAT,
value2 FLOAT)
-- CONSTRAINT hidvals_unique_hid UNIQUE (hid))
-- PRIMARY KEY CLUSTERED(hid))
--CREATE UNIQUE CLUSTERED INDEX IDX_hidvals ON #hidvals(hid)
INSERT INTO #hidvals(hid, start, finish)
SELECT hid ,'17-sep-2015', '18-sep-2015'
FROM testCalcs
WHERE adate = '18-sep-2015'
-- UPDATE STATISTICS #hidvals -- this makes no differen
-- Here's the statement with the slow/fast plan (depending)
UPDATE H
SET value1 = X1.[value], value2 = X2.[value]
FROM #hidvals H
LEFT JOIN rc_temp.dbo.testCalcsView X1
ON H.hid = X1.hid AND H.start = X1.adate
LEFT JOIN rc_temp.dbo.testCalcsView X2
ON H.hid = X2.hid AND H.finish = X2.adate
October 8, 2015 at 7:19 pm
Kristen-173977 (9/29/2015)
rchantler (9/28/2015)
CREATE CLUSTERED INDEX IDX_hidvals ON #hidvals(hid)That is not declared as UNIQUE, so not quite the same thing as a PRIMARY KEY's Clustered Index (SQL will add a tie-break value which will make the key size wider).
Also, Primary Key will not allow NULLs which might effect how the optimiser treats the index - although you have declared [hid] as NOT NULL so maybe that is as broad as it is long ...
Maybe I am missing something, but I think Kristen hit the real issue here. A Clustered Index does NOT imply uniqueness, while a Primary Key does. So when you are trying to update the table with a Primary Key definition, SQL Server must test each new record for non-existence in the table.
Simply defining an index as Clustered just causes efficient organization of the key/data. SQL Server doesn't need to test each new value to see if it is already in the table.
Unless I missed a test in there somewhere, did you try defining that index as a UNIQUE CLUSTERED index? I would expect that to have the same performance as the Primary Key.
October 9, 2015 at 2:06 am
Kim Crosser (10/8/2015)
Kristen-173977 (9/29/2015)
rchantler (9/28/2015)
CREATE CLUSTERED INDEX IDX_hidvals ON #hidvals(hid)That is not declared as UNIQUE, so not quite the same thing as a PRIMARY KEY's Clustered Index (SQL will add a tie-break value which will make the key size wider).
Also, Primary Key will not allow NULLs which might effect how the optimiser treats the index - although you have declared [hid] as NOT NULL so maybe that is as broad as it is long ...
Maybe I am missing something, but I think Kristen hit the real issue here. A Clustered Index does NOT imply uniqueness, while a Primary Key does. So when you are trying to update the table with a Primary Key definition, SQL Server must test each new record for non-existence in the table.
Simply defining an index as Clustered just causes efficient organization of the key/data. SQL Server doesn't need to test each new value to see if it is already in the table.
Unless I missed a test in there somewhere, did you try defining that index as a UNIQUE CLUSTERED index? I would expect that to have the same performance as the Primary Key.
The script posted by rchantler (thanks, very handy) demonstrates the effect nicely. I've made a number of changes to it to eliminate some chaff and get a better idea of the problem domain. Both queries now use nonclustered indexes except of course for the index on the temp table, and the problem remains. The index on the temp table is now just a clustered index. If you comment out either one of the two left joins, or change either of them to inner joins, the problem goes. If you add [start] or [finish] (or both) to the key list of the clustered index on the temp table as the second key, the problem changes but it's not fixed.
Note that #hidvals.start has the same value for every row, as does #hidvals.finish.
Creating statistics on [start] / [finish] doesn't change anything, the problem remains. Here's the modified script:
-- Note: You will need a tally table to create the test data
-- Early data is monthly; later data is daily
-- Define a mapping table [DateMap] that maps early monthly data to daily dates
-- and later daily data to daily dates so data can be pulled for daily
-- dates over the entire date range
-- create tally table
IF OBJECT_ID('TempDB..#tally2') IS NOT NULL drop table #tally2
;WITH Ten AS (SELECT * FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n))
SELECT n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) INTO #tally2 FROM Ten a, Ten b, Ten c, Ten d, Ten e
CREATE UNIQUE CLUSTERED INDEX cx_tally ON #tally2 (n)
GO
IF OBJECT_ID('dbo.DateMap') IS NOT NULL drop table dbo.DateMap
CREATE TABLE [dbo].[DateMap]
(
[adate] [smalldatetime] NOT NULL,
[calcdate] [smalldatetime] NULL,
Guid1 UNIQUEIDENTIFIER,
Guid2 UNIQUEIDENTIFIER,
Guid3 UNIQUEIDENTIFIER
)
INSERT INTO dbo.DateMap(adate, Guid1, Guid2, Guid3 )
SELECT CAST(N AS SMALLDATETIME), NEWID(), NEWID(), NEWID() -- make table wider to encourage use of index CJM
FROM #tally2 -- First use of Tally Table
WHERE N BETWEEN 29249 AND 42275
UPDATE dbo.DateMap
SET calcdate = adate
WHERE adate >= '31-aug-2008'
-- Define two CTEs
;WITH datecalc AS(
SELECT *, y= Year(adate), m= Month(adate), rn = ROW_NUMBER()OVER(ORDER BY adate)
FROM dbo.DateMap)
, monthends AS(
SELECT DC1.*
FROM datecalc DC1
JOIN datecalc DC2
ON DC1.rn = DC2.rn-1
WHERE DC1.m <> DC2.m)
UPDATE DM
SET calcDate = X.adate
FROM dbo.dateMap DM
OUTER APPLY( SELECT TOP 1 adate
FROM monthends M
WHERE DM.adate >= M.adate
ORDER BY M.adate DESC
) X
WHERE DM.calcDate IS NULL
--ALTER TABLE dbo.DateMap ADD CONSTRAINT PK_DateMap PRIMARY KEY CLUSTERED(Adate ASC)
CREATE UNIQUE CLUSTERED INDEX UCX_DateMap ON dbo.DateMap (Guid1)
CREATE UNIQUE NONCLUSTERED INDEX IX_DateMap_calcdate_adate ON dbo.DateMap (CalcDate ASC, Adate ASC)
CREATE UNIQUE NONCLUSTERED INDEX IX_DateMap_adate ON dbo.DateMap (Adate ASC) INCLUDE (CalcDate)
GO
-- Create the test data
IF OBJECT_ID('dbo.TestCalcs') IS NOT NULL drop table dbo.TestCalcs
CREATE TABLE dbo.TestCalcs(adate SMALLDATETIME NOT NULL,
hid INT NOT NULL,
value FLOAT NULL)
INSERT INTO dbo.testCalcs
SELECT X.adate,Y.hid, RAND(Y.hid)
FROM (select distinct calcdate as adate from dbo.dateMap) X
CROSS JOIN (select top 4000 N as hid from #tally2 order by N) Y -- second use of tally table
CREATE UNIQUE CLUSTERED INDEX UCX_testCalcs_adate_hid ON dbo.testCalcs (adate, hid)
CREATE UNIQUE NONCLUSTERED INDEX IX_testCalcs_hid_adate ON dbo.testCalcs(hid ASC, adate ASC) INCLUDE (value)
CREATE UNIQUE NONCLUSTERED INDEX IX_testCalcs_adate_value ON dbo.testCalcs(adate ASC, value) INCLUDE (hid)
-- Create the view that joins the test data to the mapping table
IF OBJECT_ID('dbo.TestCalcsView') IS NOT NULL drop view dbo.TestCalcsView
GO
CREATE VIEW dbo.testCalcsView AS
SELECT
DM.adate,
CL.hid,
CL.value
FROM dbo.DateMap DM
JOIN dbo.testCalcs CL
ON DM.CalcDate = CL.Adate
GO
SELECT COUNT(*) FROM dbo.testCalcsView -- 52,108,000 rows
--=============================================================================================================================================
-- Temp table
IF OBJECT_ID('tempdb..#hidvals') IS NOT NULL DROP TABLE #hidvals
CREATE TABLE #hidvals(hid INT NOT NULL,
extel2code CHAR(2),
start SMALLDATETIME,
finish SMALLDATETIME,
value1 FLOAT,
value2 FLOAT)
INSERT INTO #hidvals(hid, start, finish)
SELECT hid ,'17-sep-2015', '18-sep-2015'
FROM testCalcs
WHERE adate = '18-sep-2015'
CREATE UNIQUE CLUSTERED INDEX ucx_hidvals ON #hidvals(hid)
CREATE STATISTICS stx_finish ON #hidvals (finish)
CREATE STATISTICS stx_start ON #hidvals (start)
CREATE STATISTICS stx_startfinish ON #hidvals (start,finish)
UPDATE STATISTICS testCalcs
UPDATE STATISTICS DateMap
--=======================================================================================================
DROP INDEX ucx_hidvals ON #hidvals
SELECT -- Q1, no index
value1 = X1.[value],
value2 = X2.[value]
FROM #hidvals H
LEFT JOIN dbo.testCalcsView X1
ON H.hid = X1.hid AND H.start = X1.adate
left JOIN dbo.testCalcsView X2
ON H.hid = X2.hid AND H.finish = X2.adate
--CREATE UNIQUE CLUSTERED INDEX ucx_hidvals ON #hidvals(hid)
CREATE UNIQUE CLUSTERED INDEX ucx_hidvals ON #hidvals(hid,finish)
SELECT -- Q2, unique clustered index
value1 = X1.[value],
value2 = X2.[value]
FROM #hidvals H
LEFT JOIN dbo.testCalcsView X1
ON H.hid = X1.hid AND H.start = X1.adate
left JOIN dbo.testCalcsView X2
ON H.hid = X2.hid AND H.finish = X2.adate
--=======================================================================================================
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 9, 2015 at 7:34 am
Yes, there are at least 2 questions - 1. why the poor performance with adding a UNIQUE constraint and 2. why does switching from left join to inner join result in good performance even with the UNIQUE constraint.
Seems to me 'the answer' would explain both questions.
October 9, 2015 at 7:57 am
rchantler (10/9/2015)
Yes, there are at least 2 questions - 1. why the poor performance with adding a UNIQUE constraint and 2. why does switching from left join to inner join result in good performance even with the UNIQUE constraint.Seems to me 'the answer' would explain both questions.
It doesn't have to be a constraint. It doesn't have to be a clustered index either - an ordinary index (with no clustered index on the table) like this reproduces the effect too:
CREATE UNIQUE INDEX ucx_hidvals ON #hidvals(hid) INCLUDE (start, finish)
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 9, 2015 at 8:02 am
You can even define the field to be UNIQUE in the table definition to get the effect.
October 9, 2015 at 8:03 am
Anything that results in uniqueness, it seems.
October 9, 2015 at 8:14 am
rchantler (10/9/2015)
You can even define the field to be UNIQUE in the table definition to get the effect.
That's enforced by a UNIQUE index, so it's the same as we have now. A unique index, clustered or not.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 9, 2015 at 8:31 am
ChrisM@Work (10/9/2015)
rchantler (10/9/2015)
You can even define the field to be UNIQUE in the table definition to get the effect.That's enforced by a UNIQUE index, so it's the same as we have now. A unique index, clustered or not.
Also, you don't need an index on the temp table (or a temp table, it can be permanent) to switch the effect on and off - you can use statistics objects on a heap.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 9, 2015 at 11:28 am
It looks to me like the following is occurring.
Where there is no Unique index (or constraint) on the HID column (and/or the statistics don't show that the HID column is a nice flat histogram), the query optimizer decides to do a table scan on the table, which appears to be pretty efficient. (i.e., read a row, compare all three fields and update in one pass.) The table shown is pretty compact, and without knowing the probable percentage of rows to be updated in advance, a table scan may be the best solution.
However, when the Uniqueness constraint is defined, or the statistics show the flat distribution of HID, the optimizer decides that the index is worth using. Unfortunately, this causes lots of index seeks, followed by row fetches of a lot of rows in order to compare the datetime columns (it may even wind up doing two data row fetches per actual data row as a result of the two separate left joins on the different datetime columns.
I would bet that either of the following would perform very well:
create unique index IDX_Hidvals_hid_start on #hidvals(hid,start);
create unique index IDX_Hidvals_hid_finish on #hidvals(hid,finish);
or
create unique index IDX_Hidvals_hid_start on #hidvals(hid) include (start);
create unique index IDX_Hidvals_hid_finish on #hidvals(hid) include (finish);
Either of these would eliminate the need to do the data row fetches to validate the "start" or "finish" criteria, and should be at least as fast as the non-unique clustered index (probably a lot faster - assuming that the update isn't updating a large percentage of the table rows).
Clustering one of the two indexes might or might not make a difference - what is ordered well by HID+Start may not be ordered well for a HID+Finish query, although if the dates "tend" to be similarly ordered, then I would cluster the HID+Start index.
October 9, 2015 at 1:20 pm
Thanks Kim,
With either of both indexes in your first set of 2 I see quick execution i.e. the good plan.
With either of both indexes in your second set of 2 I get slow execution i.e. the poor plan.
Based on the estimated subtree cost the heap is still the fastest.
October 9, 2015 at 1:23 pm
That last reply should say 'with either or both" not "of both"
October 9, 2015 at 1:31 pm
I should have been clearer - I would recommend creating BOTH indexes in either of the two sets, not just one or the other. Because the query has two outer joins, one on HID+Start and one on HID+Finish, I would create the pair of indexes (in either style - although the first ones with the composite keys are probably always going to be better in this case).
(edited)
BTW - the (possibly?) optimum configuration is probably:
create unique clustered index IDX_Hidvals_hid_start on #hidvals(hid,start);
create unique index IDX_Hidvals_hid_finish on #hidvals(hid,finish);
This will organize the data by HID+Start as well as creating efficient indexes.
October 23, 2015 at 2:47 pm
I think I have at least a partial explanation re the slow/fast query.
The query produces two different execution plans. If table #hidvals is a heap, the optimizer generates plan A. If table #hidvals has any form of UNIQUE constraint on the hid field the optimizer generates plan B.
Using the original test data plan A runs far quicker than plan B. The question is, given the optimizer has enough scope to choose either plan A or B, why does it choose the much slower plan B when #hidvals is not a heap?
The explanation is the original data in table #hidvals is not random. I.e. #hidvals has 2 date fields, start and finish. In the original formulation finish has the same single value for each row in the table and start has the same single value (the date before finish) for each row.
The optimizer does not seem to use the fact that each date field contains only one value.
Once the optimizer creates the ‘best heap plan’, when it runs against the special case dates it finishes in an instant. However, the heap plan does not perform so well against randomized dates.
The ‘best non-heap plan’ has comparable performance whether the dates are special case or random.
In fact the ‘best non-heap plan’ outperforms the ‘best heap plan’ (by some measures) running against randomized dates.
October 26, 2015 at 2:37 am
rchantler (10/23/2015)
I think I have at least a partial explanation re the slow/fast query.The query produces two different execution plans. If table #hidvals is a heap, the optimizer generates plan A. If table #hidvals has any form of UNIQUE constraint on the hid field the optimizer generates plan B.
Using the original test data plan A runs far quicker than plan B. The question is, given the optimizer has enough scope to choose either plan A or B, why does it choose the much slower plan B when #hidvals is not a heap?
The explanation is the original data in table #hidvals is not random. I.e. #hidvals has 2 date fields, start and finish. In the original formulation finish has the same single value for each row in the table and start has the same single value (the date before finish) for each row.
The optimizer does not seem to use the fact that each date field contains only one value.
Once the optimizer creates the ‘best heap plan’, when it runs against the special case dates it finishes in an instant. However, the heap plan does not perform so well against randomized dates.
The ‘best non-heap plan’ has comparable performance whether the dates are special case or random.
In fact the ‘best non-heap plan’ outperforms the ‘best heap plan’ (by some measures) running against randomized dates.
Eirikur and I have spent more than a little time investigating this and we haven't finished yet. It's almost certainly a cardinality estimator bug since a) the estimates appear to be off by three orders of magnitude when using the unique index vs the non-unique index (or equivalent statistics objects) and b) the same effect isn't observed in SQL Server 2014, which has a new cardinality estimator.
We have a simplified pair of plans (queries) which show the effect more clearly than the original but need a little more time to reduce noise from the sample data set.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 15 posts - 16 through 30 (of 30 total)
You must be logged in to reply to this topic. Login to reply