October 31, 2008 at 9:10 pm
Most of the time, if I want to retrieve a single value from another table, I use a subquery, but if I want many values (more than two columns) from another table, I use a join. I don't know much about how different the two methods really are, though, especially in speed as the tables grow larger and larger.
Which is the correct way to use?
I know both approaches have their own advantages, but what are they?
Thanks for any supporting ideas and documents.
October 31, 2008 at 9:18 pm
There's virtually no difference between a JOIN and an IN (subquery)... they will both create the same execution plan and execute within milliseconds of each other... unless there's a correlation in the sub-query... then, it depends on whether you've used an equality in the correlation or not. If you've used an inequality based on something temporal (IDENTITY or DATETIME column, etc), then you could end up with a "Triangular Join"... a form of RBAR on steroids.
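Just to show the shape of the problem, here's a minimal sketch of a triangular join... the table and columns (dbo.SomeTable, SomeID, Amount) are hypothetical, purely for illustration:
--===== Hypothetical "triangular join"... a running total done with a correlated
     -- subquery and an inequality. For every row, the subquery rescans all the
     -- earlier rows, so the work grows at roughly (N^2+N)/2... RBAR on steroids.
SELECT  st.SomeID,
        st.Amount,
        (SELECT SUM(st2.Amount)
           FROM dbo.SomeTable st2
          WHERE st2.SomeID <= st.SomeID) AS RunningTotal
  FROM dbo.SomeTable st
 ORDER BY st.SomeID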
Please see the following article...
http://www.sqlservercentral.com/articles/T-SQL/61539/
--Jeff Moden
Change is inevitable... Change for the better is not.
October 31, 2008 at 9:52 pm
I read somewhere (I think it was AskTom) that, depending on how you want your data or on the business logic (the way you write your query when using a subquery or a join), your results may vary...
I'm not the expert, but I remember reading it and seeing the example, so I thought I would share that... not hoping to confuse... lol
John
October 31, 2008 at 10:07 pm
jsteinbeck (10/31/2008)
I read somewhere (I think it was AskTom) that, depending on how you want your data or on the business logic (the way you write your query when using a subquery or a join), your results may vary... I'm not the expert, but I remember reading it and seeing the example, so I thought I would share that... not hoping to confuse... lol
John
First, "Ask Tom" is about Oracle, not SQL Server... two different engines... two different worlds.
INNER JOIN and WHERE IN are exactly the same in SQL Server... there are a ton of posts on these forums that prove it. Some have shown that certain uses (not many) of WHERE EXISTS can be a bit faster, but normally the rest of the code is such crap that folks should concentrate more on writing the other parts of the query correctly. Correlated SUBqueries that are based on equalities will also normally resolve to an INNER JOIN... sometimes they resolve to something as bad as or far worse than a cursor.
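To put that in code, here's a minimal sketch of the three forms on a pair of hypothetical tables (dbo.Orders and dbo.Customers, with CustomerID unique in Customers)... with reasonable indexes, all three will normally get the same or nearly the same execution plan:
--===== Hypothetical tables only... INNER JOIN form
SELECT  o.OrderID, o.CustomerID
  FROM dbo.Orders o
 INNER JOIN dbo.Customers c ON c.CustomerID = o.CustomerID

--===== WHERE IN form... normally resolves to the same plan as the join
SELECT  o.OrderID, o.CustomerID
  FROM dbo.Orders o
 WHERE o.CustomerID IN (SELECT c.CustomerID FROM dbo.Customers c)

--===== WHERE EXISTS form... occasionally a bit faster, usually identical
SELECT  o.OrderID, o.CustomerID
  FROM dbo.Orders o
 WHERE EXISTS (SELECT 1 FROM dbo.Customers c WHERE c.CustomerID = o.CustomerID)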
Of course, having the wrong criteria in an INNER JOIN can also be far worse than a cursor...
... and if you use a cursor to process RBAR instead of using it to control the processing of sets of information, then you're doing something wrong. Chances are, if you're using a CLR (with very few exceptions like RegExReplace), you're also doing something wrong.
Ok... duck! Here come the zealots that will insist you can use a cursor or CLR if you're pressed for time or don't know T-SQL well enough to do it the right way. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
October 31, 2008 at 10:37 pm
If I can use either a correlated subquery or a join to two tables,
which way should I choose for optimal performance when there is a large number of rows?
For Example
--************ Retrieve a value from one other table ****************************
--JOIN
SELECT  col.*,
        obj.Name
  FROM syscolumns AS col
 INNER JOIN sysobjects AS obj ON col.id = obj.id
 ORDER BY obj.Name

--Correlated SubQuery
SELECT  col.*,
        (SELECT obj.Name FROM sysobjects AS obj WHERE obj.id = col.id) AS XXName
  FROM syscolumns AS col
 ORDER BY XXName
--************ Retrieve values from two other tables ****************************
--JOIN
SELECT  col.*,
        obj.Name,
        type.Name AS TypeName
  FROM syscolumns AS col
 INNER JOIN sysobjects AS obj ON col.id = obj.id
 INNER JOIN systypes AS type ON col.xtype = type.xtype AND type.xtype <> 231
 ORDER BY obj.Name

--Correlated SubQuery
SELECT  col.*,
        (SELECT obj.Name FROM sysobjects AS obj WHERE obj.id = col.id) AS XXName,
        (SELECT type.Name FROM systypes AS type WHERE type.xtype = col.xtype AND type.xtype <> 231) AS TypeName
  FROM syscolumns AS col
 ORDER BY XXName
Thanks for your reply.
November 1, 2008 at 11:06 am
ry.rith (10/31/2008)
If I can use either a correlated subquery or a join to two tables, which way should I choose for optimal performance when there is a large number of rows?
For Example ...
For the type of queries you wrote, there's no question that the INNER JOIN method would be the one that I would use. But, just to demonstrate, let's build a larger test table than the 4000-5000 rows normally available in the Master database of SQL Server 2000 and see... 100,000 rows should do it...
--=============================================================================
-- Setup to test on more data than available in SysColumns
--=============================================================================
--===== Identify the database to use (NOTHING will be written here)
USE Master
--===== Create a larger test table than SysColumns -- 22 Seconds
SELECT TOP 100000 sc1.*
INTO #BigTable
FROM dbo.SysColumns sc1
CROSS JOIN dbo.SysColumns sc2
--===== Give it the best index we can
CREATE CLUSTERED INDEX CX_BigTable_ID ON #BigTable (ID) -- 4 seconds
... and, now... your first set of queries... run them (in the same window as above) with and without the ORDER BY, and study both the Estimated and Actual Execution Plans... IGNORE THE % OF BATCH because it lies... really badly, sometimes...
--=============================================================================
-- These are the 1st set of tests like what you had in your example code
--=============================================================================
--===== Begin measuring durations
SET STATISTICS TIME ON
--************ Retrieve a value from one other table ****************************
--===== JOIN
SELECT col.*,
obj.Name
FROM #BigTable col
INNER JOIN dbo.SysObjects obj ON col.ID = obj.ID
ORDER BY obj.Name
PRINT REPLICATE('=',80)
--===== Correlated SubQuery
SELECT col.*,
(SELECT obj.Name FROM dbo.SysObjects obj WHERE obj.ID = col.ID) AS XXName
FROM #BigTable col
ORDER BY xxName
--===== Turn off duration measurements
SET STATISTICS TIME OFF
Here's the performance output I get on my box (SQL Server 2000 Developer's Edition). Clearly, the INNER JOIN method wins especially where precious CPU time is involved...
[font="Courier New"]SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(100000 row(s) affected)
SQL Server Execution Times:
CPU time = 891 ms, elapsed time = 5684 ms.
================================================================================
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 13 ms.
(100000 row(s) affected)
SQL Server Execution Times:
CPU time = 2000 ms, elapsed time = 6250 ms.[/font]
Now... let's do something similar with your second set of queries... are you in for a surprise!
--=============================================================================
-- These are the 2nd set of tests like what you had in your example code
--=============================================================================
--===== Begin measuring durations
SET STATISTICS TIME ON
--************ Retrieve values from two other tables ****************************
--===== JOIN
SELECT col.*,
obj.Name,
type.Name AS TypeName
FROM #BigTable col
INNER JOIN dbo.SysObjects obj ON col.ID = obj.ID
INNER JOIN dbo.SysTypes type ON col.XType = type.XType AND type.XType <> 231
ORDER BY obj.Name
PRINT REPLICATE('=',80)
--===== JOIN with proper use of criteria
SELECT col.*,
obj.Name,
type.Name AS TypeName
FROM #BigTable col
INNER JOIN dbo.SysObjects obj ON col.ID = obj.ID
INNER JOIN dbo.SysTypes type ON col.XType = type.XType
WHERE type.XType <> 231
ORDER BY obj.Name
PRINT REPLICATE('=',80)
--===== Correlated SubQuery (returns a different number of rows than either of the above)
SELECT col.*,
(SELECT obj.Name FROM dbo.SysObjects obj WHERE obj.ID = col.ID) AS XXName,
(SELECT type.Name FROM dbo.SysTypes type WHERE type.XType = col.XType AND type.XType <> 231) AS TypeName
FROM #BigTable col
ORDER BY xxName
--===== Turn off duration measurements
SET STATISTICS TIME OFF
First, notice that I added an extra code snippet to put the filter criteria where it belongs. As you can see below, it DOES make a difference in performance...
[font="Courier New"]
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(95044 row(s) affected)
SQL Server Execution Times:
CPU time = 2828 ms, elapsed time = 7660 ms.
================================================================================
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(95044 row(s) affected)
SQL Server Execution Times:
CPU time = 2672 ms, elapsed time = 7343 ms.
================================================================================
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(100000 row(s) affected)
SQL Server Execution Times:
CPU time = 2094 ms, elapsed time = 6986 ms.
[/font]
Hmmm... also notice that your query returns ALL the rows instead of being filtered... that's a common mistake when folks use sub-queries... criteria in a correlated subquery in the Select list have NO effect on the number of rows returned.
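For example, to get the filter actually applied to the rows, it has to move to the outer WHERE clause... just a sketch of the idea (not part of the timing runs above, and it still won't drop rows that have no match in SysObjects or SysTypes the way the INNER JOINs do):
--===== Correlated SubQuery with the row filter moved to the outer WHERE clause
SELECT  col.*,
        (SELECT obj.Name  FROM dbo.SysObjects obj  WHERE obj.ID = col.ID) AS XXName,
        (SELECT type.Name FROM dbo.SysTypes   type WHERE type.XType = col.XType) AS TypeName
  FROM #BigTable col
 WHERE col.XType <> 231
 ORDER BY XXName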
Now, for the biggest "surprise" of them all... look at the output of the 3 queries above... see anything wrong? Anything horribly wrong? Heh... [font="Arial Black"]the 3rd output is NOT sorted correctly even though there is a very obvious ORDER BY in the code![/font] That's because the optimizer processes by COLUMN... not by ROW... it resolves the RBAR join between #BigTable and SysObjects first and sorts it, and then finally resolves the RBAR join between that result and SysTypes, which, of course, results in the wrong sort order. That's also why it appears that the correlated sub-query is faster... the sort isn't working quite as hard.
The bottom line here is that a correlated sub-query in the Select list is a form of RBAR, and databases just weren't meant to use RBAR. Because of that, correlated sub-queries can sometimes create unexpected results and, in other cases, can be a lot slower than good set-based joins.
My recommendation would be this... with maybe 1 or 2 very special-case exceptions, don't use correlated sub-queries in the Select list... and since a correlated sub-query with equalities as criteria in the WHERE clause resolves to an INNER JOIN, there's almost never a reason to use them in the WHERE clause either. The exception to that might be that WHERE NOT EXISTS and WHERE NOT IN can resolve a bit faster than a LEFT OUTER JOIN with an IS NULL check to do the same thing.
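To make that last bit concrete, here's a minimal sketch of the two "rows with no match" patterns on the same test tables (an illustration, not a benchmark... ID is unique and not nullable in SysObjects, so both return the same rows):
--===== NOT EXISTS form... rows in #BigTable with no match in SysObjects
SELECT  col.*
  FROM #BigTable col
 WHERE NOT EXISTS (SELECT 1 FROM dbo.SysObjects obj WHERE obj.ID = col.ID)

--===== LEFT OUTER JOIN form... same result, found by keeping only the unmatched rows
SELECT  col.*
  FROM #BigTable col
  LEFT OUTER JOIN dbo.SysObjects obj ON obj.ID = col.ID
 WHERE obj.ID IS NULL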
And, don't try to justify the use of correlated sub-queries based on row counts. Estimated usage can change drastically and, even if they don't, some poor slob will copy your code and use it on something with much larger row counts.
Correlated sub-queries ARE a form of RBAR and should be avoided at all costs, with only 1 or 2 exceptions that I've ever made. The one exception I've used is finding "gaps" in serial numbers (heh... don't ask ;)).
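Just to illustrate that kind of exception (this is one common pattern for it, not necessarily the exact code I used), a gap hunt can be written with a correlated NOT EXISTS against a hypothetical dbo.Widgets table that has an integer SerialNo column:
--===== Hypothetical example... find the first missing number after each gap
SELECT  w.SerialNo + 1 AS GapStart
  FROM dbo.Widgets w
 WHERE NOT EXISTS (SELECT 1 FROM dbo.Widgets w2 WHERE w2.SerialNo = w.SerialNo + 1)
   AND w.SerialNo < (SELECT MAX(SerialNo) FROM dbo.Widgets)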
--Jeff Moden
Change is inevitable... Change for the better is not.
November 1, 2008 at 4:44 pm
Jeff, are you putting your articles in thread posts again? 🙂
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
November 1, 2008 at 11:01 pm
Heh... it would appear so... maybe this one would make a good article. Thanks, Barry.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 3, 2008 at 6:33 am
Hi Jeff,
Why don't you publish it as an article? That way it will be useful to a lot of folks like me.
karthik
November 3, 2008 at 2:02 pm
Hi Jeff, thanks for the excellent explanation, much appreciated. I prefer replies like these to the "ya, perhaps, maybe, try this, it depends" kind.
Regards
Kevin
November 3, 2008 at 2:05 pm
clive (11/3/2008)
Hi Jeff, thanks for the excellent explanation, much appreciated. I prefer replies like these to the "ya, perhaps, maybe, try this, it depends" kind. Regards
Kevin
When was the last time you yourself gave such an explanation to somebody else?
😉
_____________
Code for TallyGenerator
November 3, 2008 at 6:11 pm
I appreciate the comments, guys. Thank you very much. Perhaps I will write it up as an article with a couple more "gotcha" examples... Thanks again.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 22, 2011 at 5:25 am
Subqueries can slow SQL Server down, so use them as little as possible... a join is usually the better choice.
Ranjan
Developer in http://www.Hamarashehar.com
August 22, 2011 at 7:13 am
Please note: 3 year old thread.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability