December 1, 2008 at 9:14 am
Hi All,
Which would you consider to be a better query:
I would go with query 2 because it uses a join, whereas query 1 uses a correlated subquery.
What are your thoughts?
--QUERY 1
Select Count(*) as ClientsWithoutOrder
from Client A
Where not exists (Select *
From Order B
Where B.ClientId = A.Id
)
--QUERY 2
Select Count(*) as ClientsWithoutOrder
from Client A
LEFT JOIN Order B
ON B.ClientId = A.Id
WHERE B.ClientId IS NULL
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
Posting Best Practices
Numbers / Tally Tables
December 1, 2008 at 9:26 am
In this case (and I tested it out in AdventureWorks) the exists is better because by using that you're telling SQL that it doesn't actually need to get the rows from the Order, all you're interested in is the presence or absence of a match. SQL does what's called a semi-join (or, in this particular case, an anti-semi-join).
Whereas, in the join, it has to fetch the data from Orders and join it in and then filter out the unwanted rows afterwards.
As always, running it is the proof. I can attach exec plans if anyone wants.
Query 1 shows a scan on Customer, a scan and stream aggregate on SalesOrderHeader, then a merge join (left anti-semi-join) and another stream aggregate. The QO puts that at 48% of the batch cost.
Query 2 shows scans of Customer and SalesOrderHeader, then a merge join (left outer join), a filter and finally a stream aggregate. The QO puts that at 52% of the batch cost.
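If you want to see the plans for yourself, turn on plan capture before running the script below (either of these works; this is just the assumed setup, not part of the original script):
-- Graphical plan: turn on 'Include Actual Execution Plan' in Management Studio before executing
-- Or return the plan as XML along with the query results:
SET STATISTICS XML ON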
USE Adventureworks
GO
--QUERY 1
Select Count(*) as ClientsWithoutOrder
from Sales.Customer A
Where not exists (Select *
From Sales.SalesOrderHeader B
Where B.CustomerID = A.CustomerID
)
--QUERY 2
Select Count(*) as ClientsWithoutOrder
from Sales.Customer A
LEFT JOIN Sales.SalesOrderHeader B
ON B.CustomerID = A.CustomerID
WHERE B.CustomerID IS NULL
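(The IO and timing figures below were captured with the statistics switches on around the two queries; the wrapper isn't shown above, so here's the assumed harness:)
SET STATISTICS IO ON
SET STATISTICS TIME ON
-- run QUERY 1 and QUERY 2 from the script above
SET STATISTICS TIME OFF
SET STATISTICS IO OFF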
Query 1:
Table 'SalesOrderHeader'. Scan count 1, logical reads 45
Table 'Customer'. Scan count 1, logical reads 123
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 14 ms.
Query 2:
Table 'SalesOrderHeader'. Scan count 1, logical reads 45
Table 'Customer'. Scan count 1, logical reads 123
SQL Server Execution Times:
CPU time = 78 ms, elapsed time = 71 ms.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 1, 2008 at 9:35 am
Thanks Gail,
Now that is what I call taking the time to explain something with proof!
I could see from the plan it was better, but just wasn't sure why...
Thanks again, that's an awesome reply!
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
Posting Best Practices
Numbers / Tally Tables
December 1, 2008 at 9:43 am
Gail, did you notice in your testing if there was significant benefit in using
Where not exists (Select 1
instead of Where not exists (Select *
in Query 1? I'm sure I've come across this recently, with SELECT 1 claiming top performance.
A significant difference between the two statements is that Query 1 will always preserve the cardinality of the Customer table, whereas Query 2 would typically require a GROUP BY on CustomerID.
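To see what I mean, here's the positive-existence version (just a rough sketch against the same AdventureWorks tables, not something from Gail's test):
-- EXISTS: never more than one row per customer, however many orders they have
SELECT COUNT(*) AS ClientsWithOrder
FROM Sales.Customer A
WHERE EXISTS (SELECT 1
              FROM Sales.SalesOrderHeader B
              WHERE B.CustomerID = A.CustomerID)
-- JOIN: one row per matching order, so a DISTINCT (or GROUP BY) is needed to get back to one row per customer
SELECT COUNT(DISTINCT A.CustomerID) AS ClientsWithOrder
FROM Sales.Customer A
INNER JOIN Sales.SalesOrderHeader B
        ON B.CustomerID = A.CustomerID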
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 1, 2008 at 11:01 am
Glad Gail answered this one. I have always personally preferred option 2 because I find the LEFT JOIN syntax easier to read, but I have learned that the EXISTS/NOT EXISTS performs better for the reasons Gail explains (better than I ever could have).
I believe that, as used in an EXISTS/NOT EXISTS, there is no performance difference between SELECT * and SELECT 1. I read it somewhere but can't remember where. I believe this is because of what Gail said:
the exists is better because by using that you're telling SQL that it doesn't actually need to get the rows from the Order, all you're interested in is the presence or absence of a match. SQL does what's called a semi-join (or, in this particular case, an anti-semi-join)
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
December 1, 2008 at 11:17 am
I'm with Jack. #2 is easier to read, but I'd suspect that #1 works better.
December 1, 2008 at 1:19 pm
Chris Morris (12/1/2008)
Gail, did you notice in your testing if there was significant benefit in using
Where not exists (Select 1
instead of
Where not exists (Select *
in Query 1? I'm sure I've come across this recently, with SELECT 1 claiming top performance.
There's no performance difference. (1)
Exists doesn't return any of the columns and what column you specify is immaterial. I use 1 just to make it very clear to me and to anyone reading that there's no column access and the column (if any) mentioned is immaterial. It's a readability thing.
(1) During query processing, the expansion of SELECT * into the list of columns comes before the removal of unnecessary columns, such as those inside the EXISTS clause. Hence there may be a minuscule metadata overhead. Certainly not enough to worry about, though.
Ref: http://sqlskills.com/BLOGS/CONOR/post/EXISTS-Subqueries-SELECT-1-vs-SELECT-*.aspx
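A quick throwaway demo of that point (mine, not from Conor's post): since the select list inside EXISTS is never evaluated, even a divide by zero in there doesn't error.
-- The select list is never evaluated, so the 1/0 never raises a divide-by-zero error
SELECT COUNT(*) AS ClientsWithoutOrder
FROM Sales.Customer A
WHERE NOT EXISTS (SELECT 1/0
                  FROM Sales.SalesOrderHeader B
                  WHERE B.CustomerID = A.CustomerID)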
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 2, 2008 at 9:58 am
Thanks for the explanation, Gail.
I use 1 just to make it very clear to me and to anyone reading that there's no column access and the column (if any) mentioned is immaterial. It's a readability thing.
I like this too. The little guy perches on your shoulder and nags you to make the code more readable as you go along. It doesn't take up any more time but boy does it make a difference.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 4, 2008 at 12:15 pm
I learned about using conventional joins by reading these articles.
Joins Clarified By Sanket Naik http://www.sqlservercentral.com/articles/Miscellaneous/joinsclarified/1333/
A Refresher on Joins By Jambu Krishnamurthy http://www.sqlservercentral.com/articles/Basic+Querying/2937/
Hidden RBAR: Triangular Joins By Jeff Moden http://www.sqlservercentral.com/articles/T-SQL/61539/
I found this article on how Oracle uses semi-joins and anti semi-joins.
I am ready for the next step. Can anyone recommend SQL Server Central article(s) for learning about the differences/similarities of semi-joins, INTERSECT, EXISTS, anti-semi-joins, and EXCEPT? I do not know when to use each of them, and in the case of the (anti-)semi-joins I don't even know how.
Paul DB
December 4, 2008 at 1:02 pm
You can't explicitly use the semi-joins. They're used by SQL when it processes an IN or an EXISTS. Both of those require that the contents be checked, but not joined, hence the term semi-join.
INTERSECT and EXCEPT aren't joins. They're used the same way as UNION/UNION ALL, to operate on two result sets.
INTERSECT returns the rows the two result sets have in common (it's a set intersection). EXCEPT gives all rows in the first result set that aren't in the second (it's a set subtraction).
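A quick sketch of the shape, using the same AdventureWorks tables as earlier:
-- INTERSECT: customer IDs that appear in both result sets
SELECT CustomerID FROM Sales.Customer
INTERSECT
SELECT CustomerID FROM Sales.SalesOrderHeader
-- EXCEPT: customer IDs in Customer with no row in SalesOrderHeader
-- (the same customers the NOT EXISTS query near the top counts)
SELECT CustomerID FROM Sales.Customer
EXCEPT
SELECT CustomerID FROM Sales.SalesOrderHeader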
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 5, 2008 at 5:57 pm
Steve Jones - Editor (12/1/2008)
I'm with Jack. #2 is easier to read, but I'd suspect that #1 works better.
While I know this is a matter of opinion, I found #1 to be easier to read. It shows that the subquery is there merely as a conditional, and it makes it obvious that it returns no data. I have found many times, when modifying procedures written by others, that knowing whether a joined table returns data or is only used as a conditional (or, in some cases, was just left there pointlessly by an earlier modification) can matter a great deal.
In #2, it is less obvious that the second table is just a conditional.
---
Timothy A Wiseman
SQL Blog: http://timothyawiseman.wordpress.com/
December 5, 2008 at 6:05 pm
timothyawiseman (12/5/2008)
Steve Jones - Editor (12/1/2008)
I'm with Jack. #2 is easier to read, but I'd suspect that #1 works better.
While I know this is a matter of opinion, I found #1 to be easier to read. It shows that the subquery is there merely as a conditional, and it makes it obvious that it returns no data. I have found many times, when modifying procedures written by others, that knowing whether a joined table returns data or is only used as a conditional (or, in some cases, was just left there pointlessly by an earlier modification) can matter a great deal.
In #2, it is less obvious that the second table is just a conditional.
I see what you are saying, but I "grew up" using the LEFT JOIN and WHERE column IS NULL, so I see the exclusion immediately. I never really used EXISTS/NOT EXISTS until the last year or so, thus I am just getting used to it.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
December 6, 2008 at 6:22 pm
Christopher Stobbs (12/1/2008)
Hi All, which would you consider to be a better query:
I would go with query 2 because it uses a join, whereas query 1 uses a correlated subquery.
What are your thoughts?
--QUERY 1
Select Count(*) as ClientsWithoutOrder
from Client A
Where not exists (Select *
From Order B
Where B.ClientId = A.Id
)
--QUERY 2
Select Count(*) as ClientsWithoutOrder
from Client A
LEFT JOIN Order B
ON B.ClientId = A.Id
WHERE B.ClientId IS NULL
Alright... how'd ya do it? How'd you get that right parenthesis to hang out in mid air like that without it turning into a smiley face?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2008 at 6:51 pm
Actually, I'll take door #3... performance is the same as WHERE EXISTS and it's even easier to read, especially for newbies that may be on the team...
SET STATISTICS IO ON
SET STATISTICS TIME ON
--===== QUERY 1
SELECT COUNT(*) AS ClientsWithoutOrder
FROM #Client c
WHERE NOT EXISTS (SELECT *
FROM #Order o
WHERE c.Id = o.ClientId)
--===== QUERY 2
SELECT COUNT(*) AS ClientsWithoutOrder
FROM #Client c
LEFT OUTER JOIN #Order o
ON c.Id = o.ClientId
WHERE o.ClientId IS NULL
--===== QUERY 3
SELECT COUNT(*) AS ClientsWithoutOrder
FROM #Client c
WHERE c.ID NOT IN (SELECT b.ClientID
FROM #Order b)
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
Heh... Yes, of course I tested it. You can, too! Here're the data generators...
--===== Create and populate a 1,000,000 row ORDER test table.
-- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
-- Column "ClientID" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
ClientID = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO #Order
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE #Order
ADD PRIMARY KEY CLUSTERED (RowNum)
--===== Add an index on ClientID
CREATE INDEX IX_#Order_ClientID
ON #Order (ClientID)
--===== Ensure that 1000 clients have no entries
DELETE #Order
WHERE ClientID % 50 = 0
--===== Create a CLIENT test table containing every possible client
SELECT TOP 50000
IDENTITY(INT,1,1) AS ID
INTO #Client
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== Add the primary key
ALTER TABLE #Client
ADD PRIMARY KEY CLUSTERED (ID)
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2008 at 6:57 pm
Sorry... forgot to post the results I got... single P4 1.8GHz, 1GB RAM, 80GB IDE hard drive, SQL Server 2005 Dev Edition SP2...
(1 row(s) affected)
Table '#Client_____________________________________________________________________________________________________________00000000006E'.
Scan count 1, logical reads 83, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Order______________________________________________________________________________________________________________00000000006D'.
Scan count 1, logical reads 1738, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 563 ms, elapsed time = 602 ms.
================================================================================
(1 row(s) affected)
Table '#Order______________________________________________________________________________________________________________00000000006D'.
Scan count 1, logical reads 1738, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Client_____________________________________________________________________________________________________________00000000006E'.
Scan count 1, logical reads 83, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 641 ms, elapsed time = 669 ms.
================================================================================
(1 row(s) affected)
Table '#Client_____________________________________________________________________________________________________________00000000006E'.
Scan count 1, logical reads 83, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Order______________________________________________________________________________________________________________00000000006D'.
Scan count 1, logical reads 1738, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 562 ms, elapsed time = 601 ms.
--Jeff Moden
Change is inevitable... Change for the better is not.