January 14, 2012 at 6:52 am
Table_1 (custid, store, amount)
Insert into Table_1 values (1, 'store1', 1000)
Insert into Table_1 values (2, 'store2', 2000)
Insert into Table_1 values (3, 'store3', 3000)
Insert into Table_1 values (4, 'store4', 4000)
Table_2 (store,product,col3,col4...)
Insert into Table_2 values('store1','prod1',value)
Insert into Table_2 values('store1','prod2',value)
Insert into Table_2 values('store1','prod3',value)
Insert into Table_2 values('store2','prod1',value)
Insert into Table_2 values('store2','prod5',value)
Insert into Table_2 values('store3','prod4',value)
Insert into Table_2 values('store3','prod2',value)
Insert into Table_2 values('store3','prod6',value)
Insert into Table_2 values('store4','prod6',value)
Insert into Table_2 values('store4','prod1',value)
Insert into Table_2 values('store4','prod2',value)
select custid,store,Amount
from Table_1 where store not in (select store from table_2 where product != 'prod6')
custid, store, amount
---------------------
1,store1,1000
2,store2,2000
I need a query that produces the same output in a more optimized way, because I have 500 million rows in Table_1 and 700 million rows in Table_2.
Please help me out. It is a production issue and needs to be fixed ASAP.
January 14, 2012 at 6:52 am
Please post table definitions, index definitions and execution plan, as per http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 14, 2012 at 9:46 am
A quick educated guess, before the requested details turn up:
CREATE VIEW dbo.T2V
WITH SCHEMABINDING AS
SELECT
t.product,
t.store,
COUNT_BIG(*) AS cnt
FROM dbo.Table_2 AS t
GROUP BY
t.store,
t.product
GO
CREATE UNIQUE CLUSTERED INDEX i
ON dbo.T2V (product, store)
GO
CREATE INDEX i
ON dbo.Table_1 (store)
INCLUDE (custid, amount)
If this is not Enterprise Edition, you would need to replace FROM Table_2 in the query with: FROM T2V WITH (NOEXPAND). Be aware that all access to Table_2, even reading, will be blocked while the indexed view is created.
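To make that concrete, here is a sketch of how the original query might be rewritten against the indexed view (assuming the view and indexes above have been created; the NOEXPAND hint is what forces non-Enterprise editions to use the view's index):

```sql
-- Sketch: the original NOT IN query pointed at the indexed view.
-- On Enterprise Edition the optimizer can match dbo.Table_2 to the
-- view automatically; on other editions, reference the view directly
-- with NOEXPAND so its clustered index is actually used.
SELECT t1.custid, t1.store, t1.amount
FROM dbo.Table_1 AS t1
WHERE t1.store NOT IN
(
    SELECT v.store
    FROM dbo.T2V AS v WITH (NOEXPAND)
    WHERE v.product <> 'prod6'
);
```

The view pre-aggregates Table_2 to one row per (store, product) pair, so the subquery scans at most that many rows instead of 700 million.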
January 14, 2012 at 10:00 am
In the meantime, you will get some mileage from this
SELECT custid, store, Amount
FROM Table_1 t1
WHERE NOT EXISTS (SELECT 1 FROM table_2 t2 WHERE t2.store = t1.store AND t2.product != 'prod6')
It will work ok if [store] is indexed in both tables - and it's kinda standard (I reckon).
Gail & Paul are likely to come up with something more appropriate for tables of this size.
For better assistance in answering your questions, please read this.
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop - Jeff Moden
January 14, 2012 at 10:18 am
ChrisM@home (1/14/2012)
In the meantime, you will get some mileage from this
SELECT custid, store, Amount
FROM Table_1 t1
WHERE NOT EXISTS (SELECT 1 FROM table_2 t2 WHERE t2.store = t1.store AND t2.product != 'prod6')
It will work ok if [store] is indexed in both tables - and it's kinda standard (I reckon).
Gail & Paul are likely to come up with something more appropriate for tables of this size.
That query only has the same semantics as the original if the store columns are defined as NOT NULL. In my tests, with store constrained to be NOT NULL, both forms (the original and yours above) produce exactly the same plan.
With both columns allowing NULLs (even if none actually exist), the possibility of different results necessarily produces a different plan (again, for both textual forms).
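The semantic difference is easy to demonstrate with a minimal sketch (hypothetical two-row table variables, not the OP's tables):

```sql
-- Sketch: why NOT IN and NOT EXISTS differ when NULLs are possible.
-- If the subquery returns even one NULL, x NOT IN (...) evaluates to
-- UNKNOWN for every non-matching outer row, so NOT IN returns nothing;
-- NOT EXISTS simply never matches the NULL and keeps the row.
DECLARE @a TABLE (store varchar(9) NULL);
DECLARE @b TABLE (store varchar(9) NULL);
INSERT @a VALUES ('store1'), ('store2');
INSERT @b VALUES ('store1'), (NULL);

SELECT a.store
FROM @a AS a
WHERE a.store NOT IN (SELECT b.store FROM @b AS b);   -- returns 0 rows

SELECT a.store
FROM @a AS a
WHERE NOT EXISTS (SELECT 1 FROM @b AS b
                  WHERE b.store = a.store);           -- returns store2
```

With NOT NULL constraints on both columns, the optimizer knows the UNKNOWN case cannot arise, which is why the two forms then compile to the same plan.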
January 14, 2012 at 10:44 am
Thanks Paul - looks like it's time to revisit Gail's NOT EXISTS vs etc blog 😉
Edit: you've got a 700 million row sample data set on your home lappy? Good lord - I need an upgrade.
January 14, 2012 at 10:47 am
ChrisM@home (1/14/2012)
Edit: you've got a 700 million row sample data set on your home lappy? Good lord - I need an upgrade.
No, I just ran:
UPDATE STATISTICS dbo.Table_1
WITH ROWCOUNT = 500000000, PAGECOUNT = 5000000
UPDATE STATISTICS dbo.Table_2
WITH ROWCOUNT = 700000000, PAGECOUNT = 7000000
January 14, 2012 at 11:01 am
SQL Kiwi (1/14/2012)
ChrisM@home (1/14/2012)
Edit: you've got a 700 million row sample data set on your home lappy? Good lord - I need an upgrade.
No, I just ran:
UPDATE STATISTICS dbo.Table_1
WITH ROWCOUNT = 500000000, PAGECOUNT = 5000000
UPDATE STATISTICS dbo.Table_2
WITH ROWCOUNT = 700000000, PAGECOUNT = 7000000
Cheat. I'm so disappointed.
January 15, 2012 at 1:29 am
How many distinct stores are there in each table?
January 15, 2012 at 6:11 am
Sample data for testing:
DROP TABLE Table_1
CREATE TABLE Table_1(
RowID INT IDENTITY (1,1) PRIMARY KEY CLUSTERED,
custid INT,
store VARCHAR(9) NOT NULL,
amount INT)
INSERT INTO Table_1 (custid, store, amount) -- 2 million rows
SELECT [custid], [store], [Amount]
FROM (SELECT TOP 1000 [custid] = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.columns a, sys.columns b) customers
CROSS JOIN (SELECT [store] = 'store' + CAST(rn AS VARCHAR(4)) FROM (
SELECT TOP 1000 rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.columns a, sys.columns b)
d) stores
CROSS JOIN (SELECT [Amount] = 1000 UNION ALL SELECT 2000) Amounts
CREATE NONCLUSTERED INDEX [IX_Store] ON Table_1 (Store)
DROP TABLE Table_2
CREATE TABLE Table_2 (
RowID INT IDENTITY (1,1),
store VARCHAR(9) NOT NULL,
product VARCHAR(8),
value INT)
INSERT INTO Table_2 (store, product, value) -- 7920040 rows
SELECT store, product, value
FROM (SELECT [store] = 'store' + CAST(rn AS VARCHAR(4)) FROM (
SELECT TOP 1000 rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.columns a, sys.columns b)
d) s
CROSS JOIN (SELECT [product] = 'prod' + CAST(rn AS VARCHAR(4)) FROM (
SELECT TOP 2000 rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.columns a, sys.columns b)
d) p
CROSS JOIN (SELECT [Value] = 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) v
WHERE NOT (product != 'prod6' AND store IN ('store126', 'store185', 'store284', 'store299', 'store359', 'store667', 'store682', 'store793', 'store947', 'store957'))
/*
If I create table_2 with "RowID INT IDENTITY (1,1) PRIMARY KEY CLUSTERED, ", then the following error occurs:
The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.
The error doesn't occur if there are only 2 values for [value] i.e. table half the size
*/
CREATE UNIQUE CLUSTERED INDEX [CX_RowID] ON Table_2 (RowID)
CREATE NONCLUSTERED INDEX [IX_Store_product] ON Table_2 (Store, product)
Testing:
DECLARE @custid INT, @store VARCHAR(9), @amount INT
SET STATISTICS IO,TIME ON
DBCC DROPCLEANBUFFERS
PRINT 'Query 1 -----------------------------------------------------------------'
--DECLARE @custid INT, @store VARCHAR(9), @amount INT
select @custid = custid, @store = store, @amount = Amount
from Table_1 where store not in (select store from table_2 where product != 'prod6')
PRINT 'Query 1 ================================================================='
DBCC DROPCLEANBUFFERS
PRINT 'Query 2 -----------------------------------------------------------------'
--DECLARE @custid INT, @store VARCHAR(9), @amount INT
;WITH Table_2_Stores AS
(
SELECT store = MIN(T.store)
FROM dbo.Table_2 T
WHERE product != 'prod6'
UNION ALL
SELECT R.store
FROM (
SELECT T.store,
rn = ROW_NUMBER() OVER (ORDER BY T.store)
FROM dbo.Table_2 T
JOIN Table_2_Stores R
ON R.store < T.store
WHERE T.product != 'prod6'
) R
WHERE R.rn = 1
)
SELECT @custid = custid, @store = store, @amount = Amount
FROM Table_1 t1
WHERE store NOT IN (SELECT store FROM Table_2_Stores)
OPTION (MAXRECURSION 0);
PRINT 'Query 2 ================================================================='
Results:
Query 1 -----------------------------------------------------------------
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'Table_2'. Scan count 3, logical reads 31014, physical reads 78, read-ahead reads 7124, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table_1'. Scan count 3, logical reads 8245, physical reads 43, read-ahead reads 8176, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 7456 ms, elapsed time = 6335 ms.
Query 2 -----------------------------------------------------------------
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'Worktable'. Scan count 6, logical reads 17823, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table_2'. Scan count 2973, logical reads 8922, physical reads 227, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table_1'. Scan count 3, logical reads 8245, physical reads 43, read-ahead reads 8176, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 2247 ms, elapsed time = 3330 ms.
Subjectively, it's different - Q2 runs in about a second, Q1 in about 8 seconds. <-- Edit: corrected
Q2 is an idea nicked from here, another of Paul White's excellent queries.
January 15, 2012 at 8:49 am
Hi Chris,
Yes, this could well be a perfect use for the super-fast DISTINCT recursive CTE - well spotted!
The error about the engine not being able to obtain a LOCK resource is a consequence of a server with a small amount of configured memory, and an unfixed regression bug in SQL Server 2008+ where lock escalation does not work for INSERT statements: SQL Server continues to take row- or page-level locks, and runs out of memory (sigh). The workaround is a locking hint: INSERT TableName WITH (TABLOCKX) (column list). I do this automatically these days when inserting a significant number of rows, especially if the table is empty. From memory, I think this bug relates to changes made to allow minimally-logged INSERT in SQL Server 2008.
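Applied to the test script above, the workaround looks like this (a sketch; the #staging source table is hypothetical, standing in for whatever SELECT feeds the insert):

```sql
-- Sketch of the TABLOCKX workaround described above: one exclusive
-- table lock up front instead of millions of row/page locks, which
-- sidesteps the lock-memory exhaustion during very large inserts.
INSERT INTO dbo.Table_2 WITH (TABLOCKX) (store, product, value)
SELECT store, product, value
FROM #staging;  -- hypothetical source of the 8M generated rows
```

The same hint is also a prerequisite for minimally-logged inserts into an empty target, so it tends to pay off twice on bulk loads.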
I'm normally a fan of DBCC DROPCLEANBUFFERS for testing (though one should issue a CHECKPOINT immediately beforehand to ensure dirty pages are flushed too). In this case, though, all we end up testing is which query issues read-ahead most efficiently, and how fast the I/O system is. Assuming we all generally test on laptops and home PCs with a single relatively slow disk, this may not give a fair assessment of how the query might be expected to perform in reality.
Another consideration is: do we really expect *none* of the tables' pages to be in memory when this query typically runs? Clearly for an important query, one would hope that all the non-leaf index pages would be in buffer pool, and probably a significant proportion of the leaf index and data pages too. For these reasons, I would give more weight to tests run with all pages in memory than tests run with a completely cold cache.
Finally, what does "subjectively" mean here: "Subjectively, it's different - Q1 runs in about a second, Q2 in about 8 seconds"? I cannot get query 1 to run in a second, even with parallelism and zero disk reads...
January 15, 2012 at 9:23 am
SQL Kiwi (1/15/2012)
The error about the engine not being able to obtain a LOCK resource is a consequence of a server with a small amount of configured memory, and an unfixed regression bug in SQL Server 2008+ where lock escalation does not work for INSERT statements: SQL Server continues to take row- or page-level locks, and runs out of memory (sigh). The workaround is a locking hint: INSERT TableName WITH (TABLOCKX) (column list). I do this automatically these days when inserting a significant number of rows, especially if the table is empty. From memory, I think this bug relates to changes made to allow minimally-logged INSERT in SQL Server 2008.
Since it worked with around 4M rows and failed with around 8M rows, I figured it must be a bug. Thanks for the info.
I'm normally a fan of DBCC DROPCLEANBUFFERS for testing (though one should issue a CHECKPOINT immediately beforehand to ensure dirty pages are flushed too). In this case, though, all we end up testing is which query issues read-ahead most efficiently, and how fast the I/O system is. Assuming we all generally test on laptops and home PCs with a single relatively slow disk, this may not give a fair assessment of how the query might be expected to perform in reality.
Another consideration is: do we really expect *none* of the tables' pages to be in memory when this query typically runs? Clearly for an important query, one would hope that all the non-leaf index pages would be in buffer pool, and probably a significant proportion of the leaf index and data pages too. For these reasons, I would give more weight to tests run with all pages in memory than tests run with a completely cold cache.
There are too many unknowns. We don't know how often the tables are accessed, or the configuration of the server. I tested with and without a clean data cache, and from the perspective of comparing the queries the results were consistent: Q2 is about 4x faster. This lappy has two HDDs, data on one, tempdb and log on the other. I kinda assume this helps it mimic a real situation more closely than a single drive would.
Finally, what does subjectively mean here "Subjectively, it's different - Q1 runs in about a second, Q2 in about 8 seconds". I cannot get query 1 to run in a second, even with parallelism and zero disk reads...?
My mistake, corrected. Q2 runs in about a second on its own (as opposed to in a batch with Q1 and other statements), with 4M rows or 8M rows. Q1 runs in about 4s with 4M rows and about 7.5s with 8M rows.
I concentrated effort on the subquery on the assumption that the number of stores (qualifying or not) would be relatively small resulting in significant duplication, which the "ultra-fast distinct" excels at. Improving the LHS (table_1) will almost certainly involve invoking sneaky tricks like indexed views.
January 16, 2012 at 2:47 am
calibar2k (1/14/2012)
Table_1 (custid, store, amount)
Insert into Table_1 values (1, 'store1', 1000)
Insert into Table_1 values (2, 'store2', 2000)
Insert into Table_1 values (3, 'store3', 3000)
Insert into Table_1 values (4, 'store4', 4000)
Table_2 (store, product, col3, col4...)
Insert into Table_2 values('store1','prod1',value)
Insert into Table_2 values('store1','prod2',value)
Insert into Table_2 values('store1','prod3',value)
Insert into Table_2 values('store2','prod1',value)
Insert into Table_2 values('store2','prod5',value)
Insert into Table_2 values('store3','prod4',value)
Insert into Table_2 values('store3','prod2',value)
Insert into Table_2 values('store3','prod6',value)
Insert into Table_2 values('store4','prod6',value)
Insert into Table_2 values('store4','prod1',value)
Insert into Table_2 values('store4','prod2',value)
select custid,store,Amount
from Table_1 where store not in (select store from table_2 where product != 'prod6')
custid, store, amount
---------------------
1,store1,1000
2,store2,2000
Hello,
Reading your post, I see that your example results are erroneous. Your inner SELECT,
select store from table_2 where product != 'prod6'
will return store1, store2, store3, store4, so your whole SELECT,
select custid,store,Amount
from Table_1 where store not in (select store from table_2 where product != 'prod6')
will return nothing. Where is the mistake: in the results or in the SELECT?
Anyway, I prefer coding joins rather than subselects; your query can be rewritten as:
select custid,store,Amount
from Table_1
left join table_2 on Table_1.store = table_2.store and product != 'prod6'
where table_2.store is null
Regards,
Francesc
January 16, 2012 at 3:14 am
You are right about the sample data not matching the output.
frfernan (1/16/2012)
Anyway I prefer code joins rather than subselects...
The LEFT JOIN...IS NULL construction does not have quite the same semantics as the original query (as mentioned to Chris earlier), and will frequently be less efficient than an anti-semi-join (however it is expressed in T-SQL). See http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/
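A minimal sketch of the semantic gap, using hypothetical two-row table variables: with a NULL present, the LEFT JOIN form keeps the unmatched row (like NOT EXISTS), while the original NOT IN would return nothing.

```sql
-- Sketch: LEFT JOIN ... IS NULL vs the original NOT IN when NULLs
-- are possible. The NULL row in @t2 matches nothing (NULL = x is
-- UNKNOWN), so the outer join still preserves the unmatched row.
DECLARE @t1 TABLE (store varchar(9) NULL);
DECLARE @t2 TABLE (store varchar(9) NULL);
INSERT @t1 VALUES ('store1'), ('store2');
INSERT @t2 VALUES ('store1'), (NULL);

SELECT t1.store
FROM @t1 AS t1
LEFT JOIN @t2 AS t2 ON t2.store = t1.store
WHERE t2.store IS NULL;   -- returns store2; NOT IN would return 0 rows
```

So the rewrite changes results whenever the subquery side can produce a NULL, which is exactly why the plans differ too.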
January 16, 2012 at 4:01 am
I believed that a LEFT JOIN performed better than a NOT IN clause and, more generally, that a JOIN usually performs better than a subselect. You say that is not always so. Many thanks for your explanations and links.
Francesc