December 1, 2008 at 9:14 am
Hi All,
Which would you consider to be a better query:
I would go with query 2 because it uses a join, whereas query 1 uses a correlated subquery.
What are your thoughts?
--QUERY 1
Select Count(*) as ClientsWithoutOrder
from Client A
Where not exists (Select *
From Order B
Where B.ClientId = A.Id
)
--QUERY 2
Select Count(*) as ClientsWithoutOrder
from Client A
LEFT JOIN Order B
ON B.ClientId = A.Id
WHERE B.ClientId IS NULL
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
Posting Best Practices
Numbers / Tally Tables
December 1, 2008 at 9:26 am
In this case (and I tested it out in AdventureWorks) the exists is better because by using that you're telling SQL that it doesn't actually need to get the rows from the Order, all you're interested in is the presence or absence of a match. SQL does what's called a semi-join (or, in this particular case, an anti-semi-join).
Whereas, in the join, it has to fetch the data from Orders and join it in and then filter out the unwanted rows afterwards.
As always, running it is the proof. I can attach exec plans if anyone wants.
Query 1 shows a scan on Customer, a scan and stream aggregate on SalesOrderHeader, then a merge join (left anti-semi-join) and another stream aggregate. The QO puts that at 48% of the batch cost.
Query 2 shows scans of Customer and SalesOrderHeader, then a merge join (left outer join), a filter and finally a stream aggregate. The QO puts that at 52% of the batch cost.
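If you want to see the plans for yourself, turn on plan capture before running the script below (either of these works; this is just the assumed setup, not part of the original script):
-- Graphical plan: turn on 'Include Actual Execution Plan' in Management Studio before executing
-- Or return the plan as XML along with the query results:
SET STATISTICS XML ON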
USE Adventureworks
GO
--QUERY 1
Select Count(*) as ClientsWithoutOrder
from Sales.Customer A
Where not exists (Select *
From Sales.SalesOrderHeader B
Where B.CustomerID = A.CustomerID
)
--QUERY 2
Select Count(*) as ClientsWithoutOrder
from Sales.Customer A
LEFT JOIN Sales.SalesOrderHeader B
ON B.CustomerID = A.CustomerID
WHERE B.CustomerID IS NULL
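(The IO and timing figures below were captured with the statistics switches on around the two queries; the wrapper isn't shown above, so here's the assumed harness:)
SET STATISTICS IO ON
SET STATISTICS TIME ON
-- run QUERY 1 and QUERY 2 from the script above
SET STATISTICS TIME OFF
SET STATISTICS IO OFF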
Query 1:
Table 'SalesOrderHeader'. Scan count 1, logical reads 45
Table 'Customer'. Scan count 1, logical reads 123
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 14 ms.
Query 2:
Table 'SalesOrderHeader'. Scan count 1, logical reads 45
Table 'Customer'. Scan count 1, logical reads 123
SQL Server Execution Times:
CPU time = 78 ms, elapsed time = 71 ms.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 1, 2008 at 9:35 am
Thanks Gail,
Now that is what I call taking the time to explain something with proof!
I could see from the plan it was better, but just wasn't sure why...
Thanks again, that's an awesome reply!
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
Posting Best Practices
Numbers / Tally Tables
December 1, 2008 at 9:43 am
Gail, did you notice in your testing if there was significant benefit in using
Where not exists (Select 1
instead of Where not exists (Select *
in Query 1? I'm sure I've come across this recently, with SELECT 1 claiming top performance.
A significant difference between the two statements is that Query 1 will always preserve the cardinality of the Customer table, whereas Query 2 would typically require a GROUP BY on CustomerID.
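To see what I mean, here's the positive-existence version (just a rough sketch against the same AdventureWorks tables, not something from Gail's test):
-- EXISTS: never more than one row per customer, however many orders they have
SELECT COUNT(*) AS ClientsWithOrder
FROM Sales.Customer A
WHERE EXISTS (SELECT 1
              FROM Sales.SalesOrderHeader B
              WHERE B.CustomerID = A.CustomerID)
-- JOIN: one row per matching order, so a DISTINCT (or GROUP BY) is needed to get back to one row per customer
SELECT COUNT(DISTINCT A.CustomerID) AS ClientsWithOrder
FROM Sales.Customer A
INNER JOIN Sales.SalesOrderHeader B
        ON B.CustomerID = A.CustomerID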
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 1, 2008 at 11:01 am
Glad Gail answered this one. I have always personally preferred option 2 because I find the LEFT JOIN syntax easier to read, but I have learned that the EXISTS/NOT EXISTS performs better for the reasons Gail explains (better than I ever could have).
I believe that, as used in an EXISTS/NOT EXISTS, there is no performance difference between SELECT * and SELECT 1. I read it somewhere but can't remember where. I believe this is because of what Gail said:
the exists is better because by using that you're telling SQL that it doesn't actually need to get the rows from the Order, all you're interested in is the presence or absence of a match. SQL does what's called a semi-join (or, in this particular case, an anti-semi-join)
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
December 1, 2008 at 11:17 am
I'm with Jack. #2 is easier to read, but I'd suspect that #1 works better.
December 1, 2008 at 1:19 pm
Chris Morris (12/1/2008)
Gail, did you notice in your testing if there was significant benefit in using
Where not exists (Select 1
instead of
Where not exists (Select *
in Query 1? I'm sure I've come across this recently, with SELECT 1 claiming top performance.
There's no performance difference. (1)
Exists doesn't return any of the columns and what column you specify is immaterial. I use 1 just to make it very clear to me and to anyone reading that there's no column access and the column (if any) mentioned is immaterial. It's a readability thing.
(1) During query processing, the expansion of SELECT * into the list of columns comes before the removal of unnecessary columns, such as those inside the EXISTS clause. Hence there may be a minuscule metadata overhead. Certainly not enough to worry about, though.
Ref: http://sqlskills.com/BLOGS/CONOR/post/EXISTS-Subqueries-SELECT-1-vs-SELECT-*.aspx
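A quick throwaway demo of that point (mine, not from Conor's post): since the select list inside EXISTS is never evaluated, even a divide by zero in there doesn't error.
-- The select list is never evaluated, so the 1/0 never raises a divide-by-zero error
SELECT COUNT(*) AS ClientsWithoutOrder
FROM Sales.Customer A
WHERE NOT EXISTS (SELECT 1/0
                  FROM Sales.SalesOrderHeader B
                  WHERE B.CustomerID = A.CustomerID)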
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 2, 2008 at 9:58 am
Thanks for the explanation, Gail.
I use 1 just to make it very clear to me and to anyone reading that there's no column access and the column (if any) mentioned is immaterial. It's a readability thing.
I like this too. The little guy perches on your shoulder and nags you to make the code more readable as you go along. It doesn't take up any more time but boy does it make a difference.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 4, 2008 at 12:15 pm
I learned about using conventional joins by reading these articles.
Joins Clarified By Sanket Naik http://www.sqlservercentral.com/articles/Miscellaneous/joinsclarified/1333/
A Refresher on Joins By Jambu Krishnamurthy http://www.sqlservercentral.com/articles/Basic+Querying/2937/
Hidden RBAR: Triangular Joins By Jeff Moden http://www.sqlservercentral.com/articles/T-SQL/61539/
I found this article on how Oracle uses semi-joins and anti semi-joins.
I am ready for the next step. Can anyone recommend SQL Server Central article(s) for learning about the differences/similarities of semi-joins, INTERSECT, EXISTS, anti-semi-joins, and EXCEPT? I do not know when to use each of them, and in the case of the (anti-)semi-joins I don't even know how.
Paul DB
December 4, 2008 at 1:02 pm
You can't explicitly use the semi-joins. They're used by SQL when it processes an IN or an EXISTS. Both of those require that the contents be checked, but not joined, hence the term semi-join.
INTERSECT and EXCEPT aren't joins. They're used the same way as UNION/UNION ALL, to operate on two result sets.
INTERSECT returns the rows the two result sets have in common (it's a set intersection). EXCEPT gives all rows in the first result set that aren't in the second (it's a set subtraction).
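A quick sketch of the shape, using the same AdventureWorks tables as earlier:
-- INTERSECT: customer IDs that appear in both result sets
SELECT CustomerID FROM Sales.Customer
INTERSECT
SELECT CustomerID FROM Sales.SalesOrderHeader
-- EXCEPT: customer IDs in Customer with no row in SalesOrderHeader
-- (the same customers the NOT EXISTS query near the top counts)
SELECT CustomerID FROM Sales.Customer
EXCEPT
SELECT CustomerID FROM Sales.SalesOrderHeader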
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 5, 2008 at 5:57 pm
Steve Jones - Editor (12/1/2008)
I'm with Jack. #2 is easier to read, but I'd suspect that #1 works better.
While I know this is a matter of opinion, I found #1 to be easier to read. It shows that the subquery is there merely as a conditional, and it makes it obvious that it returns no data. I have found many times, when modifying procedures written by others, that knowing whether a joined table returns data or is only used as a conditional (or, in some cases, was just left there pointlessly by an earlier modification) can matter a great deal.
In #2, it is less obvious that the second table is just a conditional.
---
Timothy A Wiseman
SQL Blog: http://timothyawiseman.wordpress.com/
December 5, 2008 at 6:05 pm
timothyawiseman (12/5/2008)
Steve Jones - Editor (12/1/2008)
I'm with Jack. #2 is easier to read, but I'd suspect that #1 works better.
While I know this is a matter of opinion, I found #1 to be easier to read. It shows that the subquery is there merely as a conditional, and it makes it obvious that it returns no data. I have found many times, when modifying procedures written by others, that knowing whether a joined table returns data or is only used as a conditional (or, in some cases, was just left there pointlessly by an earlier modification) can matter a great deal.
In #2, it is less obvious that the second table is just a conditional.
I see what you are saying, but I "grew up" using the LEFT JOIN and WHERE column IS NULL, so I see the exclusion immediately. I never really used EXISTS/NOT EXISTS until the last year or so, thus I am just getting used to it.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
December 6, 2008 at 6:22 pm
Christopher Stobbs (12/1/2008)
Hi All, which would you consider to be a better query:
I would go with query 2 because it uses a join, whereas query 1 uses a correlated subquery.
What are your thoughts?
--QUERY 1
Select Count(*) as ClientsWithoutOrder
from Client A
Where not exists (Select *
From Order B
Where B.ClientId = A.Id
)
--QUERY 2
Select Count(*) as ClientsWithoutOrder
from Client A
LEFT JOIN Order B
ON B.ClientId = A.Id
WHERE B.ClientId IS NULL
Alright... how'd ya do it? How'd you get that right parenthesis to hang out in mid air like that without it turning into a smiley face?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2008 at 6:51 pm
Actually, I'll take door #3... performance is the same as WHERE EXISTS and it's even easier to read, especially for newbies that may be on the team...
SET STATISTICS IO ON
SET STATISTICS TIME ON
--===== QUERY 1
SELECT COUNT(*) AS ClientsWithoutOrder
FROM #Client c
WHERE NOT EXISTS (SELECT *
FROM #Order o
WHERE c.Id = o.ClientId)
--===== QUERY 2
SELECT COUNT(*) AS ClientsWithoutOrder
FROM #Client c
LEFT OUTER JOIN #Order o
ON c.Id = o.ClientId
WHERE o.ClientId IS NULL
--===== QUERY 3
SELECT COUNT(*) AS ClientsWithoutOrder
FROM #Client c
WHERE c.ID NOT IN (SELECT b.ClientID
FROM #Order b)
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
Heh... Yes, of course I tested it. You can, too! Here're the data generators...
--===== Create and populate a 1,000,000 row ORDER test table.
-- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
-- Column "ClientID" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
ClientID = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO #Order
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE #Order
ADD PRIMARY KEY CLUSTERED (RowNum)
--===== Add an index on ClientID
CREATE INDEX IX_#Order_ClientID
ON #Order (ClientID)
--===== Ensure that 1000 clients have no entries
DELETE #Order
WHERE ClientID % 50 = 0
--===== Create a CLIENT test table containing every possible client
SELECT TOP 50000
IDENTITY(INT,1,1) AS ID
INTO #Client
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== Add the primary key
ALTER TABLE #Client
ADD PRIMARY KEY CLUSTERED (ID)
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2008 at 6:57 pm
Sorry... forgot to post the results I got... single P4 1.8GHz, 1GB RAM, 80GB IDE hard drive, SQL Server 2005 Dev Edition SP2...
(1 row(s) affected)
Table '#Client_____________________________________________________________________________________________________________00000000006E'.
Scan count 1, logical reads 83, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Order______________________________________________________________________________________________________________00000000006D'.
Scan count 1, logical reads 1738, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 563 ms, elapsed time = 602 ms.
================================================================================
(1 row(s) affected)
Table '#Order______________________________________________________________________________________________________________00000000006D'.
Scan count 1, logical reads 1738, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Client_____________________________________________________________________________________________________________00000000006E'.
Scan count 1, logical reads 83, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 641 ms, elapsed time = 669 ms.
================================================================================
(1 row(s) affected)
Table '#Client_____________________________________________________________________________________________________________00000000006E'.
Scan count 1, logical reads 83, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Order______________________________________________________________________________________________________________00000000006D'.
Scan count 1, logical reads 1738, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 562 ms, elapsed time = 601 ms.
--Jeff Moden
Change is inevitable... Change for the better is not.