April 6, 2010 at 2:30 am
Paul White NZ (4/6/2010)
yingchai (4/6/2010)
I'm not so concerned about the I/O performance, as I predict the Cost fact table will grow to approximately 1 million records in 10 years. Currently, the database server is running on 16 GB RAM. Should be no problem, right?
Given correct indexing, it will be fine.
Hi Paul,
Can you elaborate on that, or link me to some articles on the matter?
Currently, none of my Fact tables have Primary Keys, and I'm not sure whether one is needed for indexing...
April 6, 2010 at 4:37 am
yingchai (4/6/2010)
Currently, all my Fact tables do not have Primary Keys...
Your fact table should always have a Primary Key - whether it is a surrogate key, or based on your dimension foreign keys. Anyway, that's a different discussion.
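For illustration only (dbo.FactCost and the column names below are placeholders, not your actual schema), a composite key over the dimension foreign keys might look something like this:
-- Hypothetical fact table and column names
ALTER TABLE dbo.FactCost
ADD CONSTRAINT PK_FactCost
PRIMARY KEY CLUSTERED (product_id, entity_id, period);
A surrogate key (an IDENTITY column) plus a unique constraint on those three columns is the other common approach; which is better depends on your load and query patterns.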
As far as indexing on your UPDATE query is concerned:
USE tempdb;
GO
IF OBJECT_ID(N'dbo.Sample1')
IS NOT NULL
DROP TABLE dbo.Sample1;
GO
CREATE TABLE dbo.Sample1
(
product_id CHAR(7) NOT NULL,
entity_id CHAR(3) NOT NULL,
period INTEGER NOT NULL,
cost SMALLMONEY NULL
);
GO
INSERT dbo.Sample1
(product_id, entity_id, period, cost)
SELECT 'CMP1000', 'PM1', 20090101, 100 UNION ALL
SELECT 'CMP1000', 'PM1', 20090201, 80 UNION ALL
SELECT 'CMP1000', 'PM1', 20090301, NULL UNION ALL
SELECT 'CMP1000', 'PM1', 20090401, 110 UNION ALL
SELECT 'CMP1000', 'PM1', 20090501, NULL UNION ALL
SELECT 'CMP1000', 'PM1', 20090601, NULL UNION ALL
SELECT 'CMP1000', 'PM1', 20090701, NULL;
-- Index required for good performance
CREATE UNIQUE NONCLUSTERED INDEX nc1
ON dbo.Sample1
(product_id, entity_id, period DESC)
INCLUDE (cost);
GO
UPDATE S
SET cost = CA.cost
FROM Sample1 S
CROSS
APPLY (
SELECT TOP (1)
S2.cost
FROM Sample1 S2
WHERE S2.product_id = S.product_id
AND S2.entity_id = S.entity_id
AND S2.cost IS NOT NULL
AND S2.period <= S.period
ORDER BY
S2.period DESC
) CA
WHERE S.cost IS NULL;
GO
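If you want to sanity-check the carry-forward behaviour after running that, a simple query will do (just an illustration, not part of the solution):
SELECT product_id, entity_id, period, cost
FROM dbo.Sample1
ORDER BY product_id, entity_id, period;
The rows that started out NULL should now show the most recent earlier cost: 80 for the 20090301 row, and 110 for 20090501 onwards.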
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 6, 2010 at 6:53 am
yingchai (4/6/2010)
I'm not so concerned about the I/O performance, as I predict the Cost fact table will grow to approximately 1 million records in 10 years. Currently, the database server is running on 16 GB RAM. Should be no problem, right?
Heh... if you say so. Me? I have a real problem with only 10,000 rows causing a quarter million reads (even in the presence of proper indexing), taking 36 seconds without the proper indexing and 4 seconds even with it, on a 4-processor, 8 GB system. It's called "technical debt", and every time I've seen someone build in such technical debt, they've found some way to regret it in the very near future.
4 seconds per 10,000 rows is the worst-case scenario of having to smear a single row of data over 10,000 other rows, and the problem grows on the order of (N²+N)/2. Maybe you'll get lucky and the worst case won't actually happen (and there's a good chance in this scenario that it won't, but it's still a "chance"). Maybe you'll get lucky and no one will copy the code and use its essence on something where the worst case will happen.
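Just to put a rough number on that (a back-of-the-envelope illustration, not a measurement): at N = 10,000 rows, (N² + N)/2 = (100,000,000 + 10,000)/2 = 50,005,000 row combinations in the worst case, which is why the read counts climb so quickly.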
There's no doubt that the code is clever and that it'll work reasonably well (albeit with some unnecessary extra reads) when a worst case doesn't happen... I just have a hard time with taking such a gamble.
Heh... ok. I'll get off the soapbox now. It's up to you.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 6, 2010 at 7:11 am
Please do read Jeff's article on Triangular Joins so you understand what an important concept it is.
It will also help you to understand why the code I posted does not suffer from the problem, given the index provided.
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 6, 2010 at 6:14 pm
Paul White NZ (4/6/2010)
Please do read Jeff's article on Triangular Joins so you understand what an important concept it is.It will also help you to understand why the code I posted does not suffer from the problem, given the index provided.
Maybe this will help you to understand why the code you posted does suffer from the same problem even given the index provided... 😉
First, a bit more test data... a paltry 10K rows just like I said previously...
USE tempdb;
GO
IF OBJECT_ID(N'dbo.Sample1')
IS NOT NULL
DROP TABLE dbo.Sample1;
GO
CREATE TABLE dbo.Sample1
(
product_id CHAR(7) NOT NULL,
entity_id CHAR(3) NOT NULL,
period DATETIME NOT NULL,
cost SMALLMONEY NULL
);
GO
--===== Populate the test table with 10k rows of similar data
WITH
cteTally AS
(
SELECT ROW_NUMBER() OVER (ORDER BY ac1.Object_ID) AS N
FROM Master.sys.All_Columns ac1
CROSS JOIN Master.sys.All_Columns ac2
)
INSERT INTO dbo.Sample1
(product_id, entity_id, period, cost)
SELECT 'CMP1000', 'PM1', DATEADD(dd,t.N-1,'20100101'), NULL
FROM cteTally t
WHERE t.N BETWEEN 1 AND 10000;
--===== Seed the first row with data
UPDATE dbo.Sample1
SET Cost = 123 WHERE Period = '20100101';
-- Index required for good performance
CREATE UNIQUE NONCLUSTERED INDEX nc1
ON dbo.Sample1
(product_id, entity_id, period DESC)
INCLUDE (cost);
GO
... and now the test...
SET NOCOUNT ON;
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
PRINT '********** Simple UPDATE for comparison **********';
UPDATE dbo.Sample1
SET Cost = NULL;
PRINT '********** Triangular Join **********';
UPDATE S
SET cost = CA.cost
FROM Sample1 S
CROSS
APPLY (
SELECT TOP (1)
S2.cost
FROM Sample1 S2
WHERE S2.product_id = S.product_id
AND S2.entity_id = S.entity_id
AND S2.cost IS NOT NULL
AND S2.period <= S.period
ORDER BY
S2.period DESC
) CA
WHERE S.cost IS NULL;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
SET NOCOUNT OFF;
...and, of course, the results...
********** Simple UPDATE for comparison **********
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'Sample1'. Scan count 1, logical reads 41, physical reads 0, read-ahead reads 0,
lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 104 ms.
********** Triangular Join **********
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'Sample1'. Scan count 10001, logical reads 253067, physical reads 0, read-ahead reads 3,
lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0,
lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
[font="Arial Black"]CPU time = 39094 ms, elapsed time = 43389 ms[/font].
Like I admitted, because of the nature of the data for this problem, you might get lucky and not have such a thing happen but the code still has the problem. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
April 6, 2010 at 8:08 pm
Jeff Moden (4/6/2010)
Maybe this will help you to understand why the code you posted does suffer from the same problem even given the index provided.
Consider this code my apple sauce to your pork chop 😉
I took the liberty of changing the test order so that the 'simple UPDATE for comparison' did not overwrite the strategically placed single non-NULL cost value before my code ran.
Test data
USE tempdb;
GO
IF OBJECT_ID(N'dbo.Sample1')
IS NOT NULL
DROP TABLE dbo.Sample1;
GO
CREATE TABLE dbo.Sample1
(
product_id CHAR(7) NOT NULL,
entity_id CHAR(3) NOT NULL,
period DATETIME NOT NULL,
cost SMALLMONEY NULL,
-- Computed column (not persisted yet)
flag_null AS CASE WHEN cost IS NULL THEN 1 ELSE 0 END
);
GO
--===== Populate the test table with 10k rows of similar data - cost all NULL
WITH Numbers (n)
AS (
SELECT ROW_NUMBER()
OVER (
ORDER BY (SELECT 0))
FROM master.sys.all_columns ac1,
master.sys.all_columns ac2
)
INSERT dbo.Sample1
(product_id, entity_id, period, cost)
SELECT 'CMP1000', 'PM1', DATEADD(DAY, N.n - 1, '2010-01-01T00:00:00'), NULL
FROM Numbers N
WHERE N.n BETWEEN 1 AND 10000;
--===== Seed the first row with data
UPDATE dbo.Sample1
SET Cost = 123
WHERE Period = '20100101';
-- Index required for good performance
CREATE UNIQUE CLUSTERED INDEX nc1
ON dbo.Sample1
(flag_null, product_id, entity_id, period DESC);
GO
The test
SET NOCOUNT ON;
SET STATISTICS IO, TIME ON;
PRINT '********** Triangular Join? **********';
UPDATE S
SET cost = CA.cost
FROM Sample1 S
CROSS
APPLY (
SELECT TOP (1)
S2.cost
FROM Sample1 S2
WHERE S2.product_id = S.product_id
AND S2.entity_id = S.entity_id
AND S2.flag_null = 0
AND S2.period <= S.period
ORDER BY
S2.period DESC
) CA
WHERE S.flag_null = 1;
PRINT '********** Simple UPDATE for comparison **********';
UPDATE dbo.Sample1
SET Cost = NULL;
SET STATISTICS IO, TIME OFF;
SET NOCOUNT OFF;
Results:
********** Triangular Join? **********
Table 'Sample1'. Scan count 10000, logical reads 61782
Table 'Worktable'. Scan count 1, logical reads 20348
CPU time = 282 ms, elapsed time = 280 ms.
********** Simple UPDATE for comparison **********
Table 'Sample1'. Scan count 1, logical reads 40477
Table 'Worktable'. Scan count 1, logical reads 20290
CPU time = 250 ms, elapsed time = 298 ms.
The issue with the previous code was the indexing. It is not possible to perform two range seeks in one index operation, so the cost IS NOT NULL test was applied as a residual predicate on every row the seek touched. The solution above redresses that oversight, and produces excellent performance regardless of the NULL distribution.
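As an aside, and only a sketch (it assumes SQL Server 2008 or later and the same Sample1 schema, and the optimizer would still have to choose it), a filtered index is another way to keep the non-NULL rows seekable for the APPLY lookup without adding a computed column:
-- Filtered index covering only the rows the inner query can return
CREATE NONCLUSTERED INDEX nc_cost_notnull
ON dbo.Sample1 (product_id, entity_id, period DESC)
INCLUDE (cost)
WHERE cost IS NOT NULL;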
The above repro still represents the worst case. Best case is when there are no NULL values:
Table 'Sample1'. Scan count 1, logical reads 2
Table 'Worktable'. Scan count 1, logical reads 0
CPU time = 0 ms, elapsed time = 0 ms.
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 6, 2010 at 9:28 pm
VERY clever. Seriously. And, removing the following line doesn't hurt much...
WHERE S.flag_null = 1
In that same vein (playing around with your code), it also doesn't seem to hurt if the calculated column is a real column defaulted to 0.
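For anyone playing along at home, that variant might look roughly like this (a sketch only; the constraint name is made up, and you would have to keep the flag in sync with cost yourself, for example in the load process):
-- Hypothetical real-column variant of flag_null
CREATE TABLE dbo.Sample1
(
product_id CHAR(7) NOT NULL,
entity_id CHAR(3) NOT NULL,
period DATETIME NOT NULL,
cost SMALLMONEY NULL,
flag_null TINYINT NOT NULL CONSTRAINT DF_Sample1_flag_null DEFAULT (0)
);
CREATE UNIQUE CLUSTERED INDEX nc1
ON dbo.Sample1 (flag_null, product_id, entity_id, period DESC);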
Yes, it's still a Triangular Join although it's an incredibly fast one. Nicely done. You've also given me an idea that I have to try. If it works out, I'll let you know.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 6, 2010 at 9:34 pm
Jeff Moden (4/6/2010)
VERY clever. Seriously.
Thanks! :blush:
I have my moments...and you are a pain for making me spend half an hour finding that optimization!
Just kidding - I loved the challenge, actually.
Yes, it's still a Triangular Join although it's an incredibly fast one. Nicely done. You've also given me an idea that I have to try. If it works out, I'll let you know.
Technically, I suppose it is, on deeper reflection.
Paul
edit: Changed my mind about the triangle thing.
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi