Select Query - need optimization

Question

Select Query - need optimization

calibar2k

SSC Rookie

Points: 44
More actions
January 14, 2012 at 6:52 am

#254881

Table_1(custid,store,amont)
Insert into Table_1 values (1,store1,1000)
Insert into Table_1 values (2,store2,2000)
Insert into Table_1 values (3,store3,3000)
Insert into Table_1 values (4,store4,4000)
Table_2 (store,product,col3,col4...)
Insert into Table_2 values(store1,prod1,value)
Insert into Table_2 values(store1,prod2,value)
Insert into Table_2 values(store1,prod3,value)
Insert into Table_2 values(store2,prod1,value)
Insert into Table_2 values(store2,prod5,value)
Insert into Table_2 values(store3,prod4,value)
Insert into Table_2 values(store3,prod2,value)
Insert into Table_2 values(store3,prod6,value)
Insert into Table_2 values(store4,prod6,value)
Insert into Table_2 values(store4,prod1,value)
Insert into Table_2 values(store4,prod2,value)
select custid,store,Amount
from Table_1 where store not in (select store from table_2 where product != 'prod6')
custid, store, amount
---------------------
1,store1,1000
2,store2,2000
I need a query to get the same output in a more optimized way....because i have 500 millions in Table_1 and 700 millions of rows in Table_2
Please help me out it is a production issue, need to fix this asap

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 1

Please post table definitions, index definitions and execution plan, as per http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Paul White SSC Guru Points: 150467 More actions · Answer 2

A quick educated guess, before the requested details turn up:

CREATE VIEW dbo.T2V

WITH SCHEMABINDING AS

SELECT

t.product,

t.store,

COUNT_BIG(*) AS cnt

FROM dbo.Table_2 AS t

GROUP BY

t.store,

t.product

GO

CREATE UNIQUE CLUSTERED INDEX i

ON T2V (product, store)

GO

CREATE INDEX i

ON dbo.Table_1 (store)

INCLUDE (custid, amount)

If this is not Enterprise Edition, you would need to replace FROM Table_2 in the query with: FROM T2V WITH (NOEXPAND). Be aware that all access to Table_2, even reading, will be blocked while the indexed view is created.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 3

In the meantime, you will get some mileage from this

SELECT custid, store, Amount

FROM Table_1 t1

WHERE NOT EXISTS (SELECT 1 FROM table_2 t2 WHERE t2.store = t1.store AND t2.product != 'prod6')

It will work ok if [store] is indexed in both tables - and it's kinda standard (I reckon).

Gail & Paul are likely to come up with something more appropriate for tables of this size.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Paul White SSC Guru Points: 150467 More actions · Answer 4

ChrisM@home (1/14/2012)
In the meantime, you will get some mileage from this
SELECT custid, store, Amount
FROM Table_1 t1
WHERE NOT EXISTS (SELECT 1 FROM table_2 t2 WHERE t2.store = t1.store AND t2.product != 'prod6')
It will work ok if [store] is indexed in both tables - and it's kinda standard (I reckon).
Gail & Paul are likely to come up with something more appropriate for tables of this size.

That query only has the same semantics as the original if the store columns are defined as NOT NULL. In my tests, with store constrained to be NOT NULL, both forms (original and yours above) produce exactly the same plan:

With both columns allowing NULLs (even if none actually exist) the possibility of different results necessarily results in a different plan (again, for both textual forms):

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 5

Thanks Paul - looks like it's time to revisit Gail's NOT EXISTS vs etc blog 😉

Edit: you've got a 700 million row sample data set on your home lappy? Good lord - I need an upgrade.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Paul White SSC Guru Points: 150467 More actions · Answer 6

ChrisM@home (1/14/2012)
Edit: you've got a 700 million row sample data set on your home lappy? Good lord - I need an upgrade.

No, I just ran:

UPDATE STATISTICS dbo.Table_1

WITH ROWCOUNT = 500000000, PAGECOUNT = 5000000

UPDATE STATISTICS dbo.Table_2

WITH ROWCOUNT = 700000000, PAGECOUNT = 7000000

See http://msdn.microsoft.com/en-us/library/ms187348.aspx

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 7

SQL Kiwi (1/14/2012)
ChrisM@home (1/14/2012)
Edit: you've got a 700 million row sample data set on your home lappy? Good lord - I need an upgrade.
No, I just ran:
UPDATE STATISTICS dbo.Table_1
WITH ROWCOUNT = 500000000, PAGECOUNT = 5000000
UPDATE STATISTICS dbo.Table_2
WITH ROWCOUNT = 700000000, PAGECOUNT = 7000000
See http://msdn.microsoft.com/en-us/library/ms187348.aspx

Cheat. I'm so disappointed.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 8

How many distinct stores are there in each table?

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 9

Sample data for testing:

DROP TABLE Table_1

CREATE TABLE Table_1(

RowID INT IDENTITY (1,1) PRIMARY KEY CLUSTERED,

custid INT,

store VARCHAR(9) NOT NULL,

amount INT)

INSERT INTO Table_1 (custid, store, amount) -- 2 million rows

SELECT [custid], [store], [Amount]

FROM (SELECT TOP 1000 [custid] = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.columns a, sys.columns b) customers

CROSS JOIN (SELECT [store] = 'store' + CAST(rn AS VARCHAR(4)) FROM (

SELECT TOP 1000 rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.columns a, sys.columns b)

d) stores

CROSS JOIN (SELECT [Amount] = 1000 UNION ALL SELECT 2000) Amounts

CREATE NONCLUSTERED INDEX [IX_Store] ON Table_1 (Store)

DROP TABLE Table_2

CREATE TABLE Table_2 (

RowID INT IDENTITY (1,1),

store VARCHAR(9) NOT NULL,

product VARCHAR(8),

value INT)

INSERT INTO Table_2 (s.store, p.product, v.value) -- 7920040 rows

SELECT store, product, value

FROM (SELECT [store] = 'store' + CAST(rn AS VARCHAR(4)) FROM (

SELECT TOP 1000 rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.columns a, sys.columns b)

d) s

CROSS JOIN (SELECT [product] = 'prod' + CAST(rn AS VARCHAR(4)) FROM (

SELECT TOP 2000 rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.columns a, sys.columns b)

d) p

CROSS JOIN (SELECT [Value] = 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) v

WHERE NOT (product != 'prod6' AND store IN ('store126', 'store185', 'store284', 'store299', 'store359', 'store667', 'store682', 'store793', 'store947', 'store957'))

/*

If I create table_2 with "RowID INT IDENTITY (1,1) PRIMARY KEY CLUSTERED, ", then the following error occurs:

The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

The error doesn't occur if there are only 2 values for [value] i.e. table half the size

*/

CREATE UNIQUE CLUSTERED INDEX [CX_RowID] ON Table_2 (RowID)

CREATE NONCLUSTERED INDEX [IX_Store_product] ON Table_2 (Store, product)

Testing:

DECLARE @custid INT, @store VARCHAR(9), @amount INT

SET STATISTICS IO,TIME ON

DBCC DROPCLEANBUFFERS

PRINT 'Query 1 -----------------------------------------------------------------'

--DECLARE @custid INT, @store VARCHAR(9), @amount INT

select @custid = custid, @store = store, @amount = Amount

from Table_1 where store not in (select store from table_2 where product != 'prod6')

PRINT 'Query 1 ================================================================='

DBCC DROPCLEANBUFFERS

PRINT 'Query 2 -----------------------------------------------------------------'

--DECLARE @custid INT, @store VARCHAR(9), @amount INT

;WITH Table_2_Stores AS

(

SELECT store = MIN(T.store)

FROM dbo.Table_2 T

WHERE product != 'prod6'

UNION ALL

SELECT R.store

FROM (

SELECT T.store,

rn = ROW_NUMBER() OVER (ORDER BY T.store)

FROM dbo.Table_2 T

JOIN Table_2_Stores R

ON R.store < T.store

WHERE T.product != 'prod6'

) R

WHERE R.rn = 1

)

SELECT @custid = custid, @store = store, @amount = Amount

FROM Table_1 t1

WHERE store NOT IN (SELECT store FROM Table_2_Stores)

OPTION (MAXRECURSION 0);

PRINT 'Query 2 ================================================================='

Results:

Query 1 -----------------------------------------------------------------

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

Table 'Table_2'. Scan count 3, logical reads 31014, physical reads 78, read-ahead reads 7124, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Table_1'. Scan count 3, logical reads 8245, physical reads 43, read-ahead reads 8176, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 7456 ms, elapsed time = 6335 ms.

Query 2 -----------------------------------------------------------------

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

Table 'Worktable'. Scan count 6, logical reads 17823, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Table_2'. Scan count 2973, logical reads 8922, physical reads 227, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Table_1'. Scan count 3, logical reads 8245, physical reads 43, read-ahead reads 8176, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 2247 ms, elapsed time = 3330 ms.

Subjectively, it's different - Q2 runs in about a second, Q1 in about 8 seconds. <-- Edit: corrected

Q2 is an idea nicked from here, another of Paul White's excellent queries.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Paul White SSC Guru Points: 150467 More actions · Answer 10

Hi Chris,

Yes, this could well be a perfect use for the super-fast DISTINCT recursive CTE - well spotted!

The error about the engine not being able to obtain a LOCK resource is a consequence of a server with a small amount of configured memory, and an unfixed regression bug in SQL Server 2008+ where lock escalation does not work for INSERT statements. SQL Server continues to take row or page level locks, and runs out of memory (sigh). The workaround is to use a locking hint: INSERT TableName WITH (TABLOCKX) (column list). I do this automatically these days when inserting a significant number of rows, especially if the table is empty table. From memory, I think this bug relates to changes made to allow minimally-logged INSERT in SQL Server 2008.

I'm normally a fan of DBCC DROPCLEANBUFFERS for testing (though one should issue a CHECKPOINT immediately beforehand to ensure dirty pages are flushed too). In this case, though, all we end up testing is which query issues read-ahead most efficiently, and how fast the I/O system is. Assuming we all generally test on laptops and home PCs with a single relatively slow disk, this may not give a fair assessment of how the query might be expected to perform in reality.

Another consideration is: do we really expect *none* of the tables' pages to be in memory when this query typically runs? Clearly for an important query, one would hope that all the non-leaf index pages would be in buffer pool, and probably a significant proportion of the leaf index and data pages too. For these reasons, I would give more weight to tests run with all pages in memory than tests run with a completely cold cache.

Finally, what does subjectively mean here "Subjectively, it's different - Q1 runs in about a second, Q2 in about 8 seconds". I cannot get query 1 to run in a second, even with parallelism and zero disk reads...?

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 11

SQL Kiwi (1/15/2012)
The error about the engine not being able to obtain a LOCK resource is a consequence of a server with a small amount of configured memory, and an unfixed regression bug in SQL Server 2008+ where lock escalation does not work for INSERT statements. SQL Server continues to take row or page level locks, and runs out of memory (sigh). The workaround is to use a locking hint: INSERT TableName WITH (TABLOCKX) (column list). I do this automatically these days when inserting a significant number of rows, especially if the table is empty table. From memory, I think this bug relates to changes made to allow minimally-logged INSERT in SQL Server 2008.

Since it worked with around 4M rows and failed with around 8M rows, I figured it must be a bug. Thanks for the info.

I'm normally a fan of DBCC DROPCLEANBUFFERS for testing (though one should issue a CHECKPOINT immediately beforehand to ensure dirty pages are flushed too). In this case, though, all we end up testing is which query issues read-ahead most efficiently, and how fast the I/O system is. Assuming we all generally test on laptops and home PCs with a single relatively slow disk, this may not give a fair assessment of how the query might be expected to perform in reality.
Another consideration is: do we really expect *none* of the tables' pages to be in memory when this query typically runs? Clearly for an important query, one would hope that all the non-leaf index pages would be in buffer pool, and probably a significant proportion of the leaf index and data pages too. For these reasons, I would give more weight to tests run with all pages in memory than tests run with a completely cold cache.

There are too many unknowns. We don't know how often the tables are accessed or the configuration of the server. I tested with and without a clean data cache and the results from the perspective of comparing the queries were consistent - Q2 is about 4x faster. This lappy has two hdd's, data on one, tempdb and log on the other - I kinda assume this helps it mimic a real situation more closely than with only one hdd.

Finally, what does subjectively mean here "Subjectively, it's different - Q1 runs in about a second, Q2 in about 8 seconds". I cannot get query 1 to run in a second, even with parallelism and zero disk reads...?

My mistake, corrected. Q2 runs in about a second on its own (as opposed to in a batch with Q1 and other statements), with 4M rows or 8M rows. Q1 runs in about 4s with 4M rows and about 7.5s with 8M rows.

I concentrated effort on the subquery on the assumption that the number of stores (qualifying or not) would be relatively small resulting in significant duplication, which the "ultra-fast distinct" excels at. Improving the LHS (table_1) will almost certainly involve invoking sneaky tricks like indexed views.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

gofrancesc Hall of Fame Points: 3269 More actions · Answer 12

calibar2k (1/14/2012)
Table_1(custid,store,amont)
Insert into Table_1 values (1,store1,1000)
Insert into Table_1 values (2,store2,2000)
Insert into Table_1 values (3,store3,3000)
Insert into Table_1 values (4,store4,4000)
Table_2 (store,product,col3,col4...)
Insert into Table_2 values(store1,prod1,value)
Insert into Table_2 values(store1,prod2,value)
Insert into Table_2 values(store1,prod3,value)
Insert into Table_2 values(store2,prod1,value)
Insert into Table_2 values(store2,prod5,value)
Insert into Table_2 values(store3,prod4,value)
Insert into Table_2 values(store3,prod2,value)
Insert into Table_2 values(store3,prod6,value)
Insert into Table_2 values(store4,prod6,value)
Insert into Table_2 values(store4,prod1,value)
Insert into Table_2 values(store4,prod2,value)
select custid,store,Amount
from Table_1 where store not in (select store from table_2 where product != 'prod6')
custid, store, amount
---------------------
1,store1,1000
2,store2,2000

Hello,

reading your post I see your example results are erroneous. Your inner SELECT,

select store from table_2 where product != 'prod6'

will return store1, store2, store3, store4, so your whole SELECT,

select custid,store,Amount

from Table_1 where store not in (select store from table_2 where product != 'prod6')

will return nothing. Where the mistake is, in the results or in the SELECT?.

Anyway I prefer code joins rather than subselects, your query can be rewritten as

select custid,store,Amount

from Table_1

left join table_2 on Table_1.store = table_2.store and product != 'prod6'

where table_2.store is null

Regards,

Francesc

Paul White SSC Guru Points: 150467 More actions · Answer 13

You are right about the sample data not matching the output.

frfernan (1/16/2012)
Anyway I prefer code joins rather than subselects...

The LEFT JOIN...IS NULL construction does not have the quite same semantic as the original query (as mentioned to Chris earlier), and will frequently be less efficient than an anti-semi-join (however it is expressed in T-SQL). See http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

gofrancesc Hall of Fame Points: 3269 More actions · Answer 14

I was believing that a LEFT JOIN was better in performance that a NOT IN clause, more generically, I accepted that a JOIN is generally better in performance than a SUBSELECT. You say that not always, many thanks for your explanations and links.

Francesc