Semi joins

Question

Semi joins

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply

ta.bu.shi.da.yu Hall of Fame Points: 3985 More actions · Answer 1

Well here is something interesting. Unfortunately, I'm still learning about the underlying database engine (I'm only half way through Karen Delaney's excellent Query Tuning and Optimization book for SQL Server 2005...) but this is what I'm seeing when I use Jeff's test data tables (btw, thanks Jeff!):

SET STATISTICS IO ON

SET STATISTICS TIME ON

SELECT o.SomeInt

FROM #MyInTable AS o

WHERE o.SomeInt in

(

SELECT top 1 p.SomeInt FROM JBMTest AS p

WHERE p.SomeInt = o.SomeInt

)

SET STATISTICS IO OFF

SET STATISTICS TIME OFF

(1000 row(s) affected)

Table 'Worktable'. Scan count 1000, logical reads 2920331, physical reads 0, read-ahead reads 0,

lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'JBMTest'. Scan count 1, logical reads 15186, physical reads 0, read-ahead reads 0, lob logical reads 0,

lob physical reads 0, lob read-ahead reads 0.

Table '#MyInTable ___ 000000000081'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0,

lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:

CPU time = 4469 ms, elapsed time = 4478 ms.

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 1 ms.

SET STATISTICS IO ON

SET STATISTICS TIME ON

SELECT o.SomeInt

FROM #MyInTable AS o

WHERE exists --o.SomeInt in

(

SELECT top 1 p.SomeInt FROM JBMTest AS p

WHERE p.SomeInt = o.SomeInt

)

SET STATISTICS IO OFF

SET STATISTICS TIME OFF

(1000 row(s) affected)

Table '#MyInTable___000000000081'. Scan count 9, logical reads 7, physical reads 0, read-ahead reads 0,

lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'JBMTest'. Scan count 9, logical reads 15296, physical reads 0, read-ahead reads 0, lob logical reads 0,

lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0,

lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:

CPU time = 406 ms, elapsed time = 59 ms.

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 1 ms.

Look at the logical reads on the first query as compared to the second query!!!!

Random Technical Stuff[/url]

ta.bu.shi.da.yu Hall of Fame Points: 3985 More actions · Answer 2

Ummmm... apologies, it seems I've accidently widened the page

Anyone know how to fix the post?

Random Technical Stuff[/url]

Jeff Moden SSC Guru Points: 1004581 More actions · Answer 3

You'd have to shorten the lines in the code box you made... but you're ok.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ta.bu.shi.da.yu Hall of Fame Points: 3985 More actions · Answer 4

Fixed... sorry about that.

I'll post some observations tomorrow.

Random Technical Stuff[/url]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 5

One thing that I noticed is that you have reversed the roles of the tables in Jeff's example. Originally, Jeff had #MyInTable as the subordinate table, but you are using it as the primary table.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

ta.bu.shi.da.yu Hall of Fame Points: 3985 More actions · Answer 6

RBarryYoung (12/30/2008)
One thing that I noticed is that you have reversed the roles of the tables in Jeff's example. Originally, Jeff had #MyInTable as the subordinate table, but you are using it as the primary table.

I did this at work... I'll go back to my workstation tomorrow and have a look to see I did everything correctly... but I thought that's what I said 100,000 people in the OrgUnit... where #MyInTable is 1000 records (thus it is OrgUnit) and 100,000 records for JBMTest (Person table).

I may have been a tad unclear.

Random Technical Stuff[/url]

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 7

Peculiar... on our 2008 development box (which is pretty quiet this week), I ran Jeff's original test and all three ran almost neck and neck. EXISTS squeaked out a win over Join and IN came in third.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Jeff Moden SSC Guru Points: 1004581 More actions · Answer 8

Bob Hovious (12/30/2008)
Peculiar... on our 2008 development box (which is pretty quiet this week), I ran Jeff's original test and all three ran almost neck and neck. EXISTS squeaked out a win over Join and IN came in third.

Whew! Glad I'm not the only one!

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ta.bu.shi.da.yu Hall of Fame Points: 3985 More actions · Answer 9

It's interesting looking at the execution plans for the two queries:

SELECT o.SomeInt

FROM #MyInTable AS o

WHERE o.SomeInt in

(

SELECT top 1 p.SomeInt FROM JBMTest AS p

WHERE p.SomeInt = o.SomeInt

)

|--Nested Loops(Left Semi Join, OUTER REFERENCES: ([o].[SomeInt]))

|--Index Scan(OBJECT: ([tempdb].[dbo].[#MyInTable] AS [o]))

|--Filter(WHERE: ([ISS_V8_CS].[dbo].[JBMTest].[SomeInt] as [p].[SomeInt]=

[tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]))

|--Top(TOP EXPRESSION: ((1)))

|--Index Spool(SEEK: ([p].[SomeInt]=

[tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]))

|--Clustered Index Scan

(OBJECT: ([ISS_V8_CS].[dbo].[JBMTest].[PK__JBMTest__7EAD8B99] AS [p]))

SELECT o.SomeInt

FROM #MyInTable AS o

WHERE EXISTS

(

SELECT top 1 p.SomeInt FROM JBMTest AS p

WHERE p.SomeInt = o.SomeInt

)

|--Parallelism(Gather Streams)

|--Hash Match(Left Semi Join, HASH: ([o].[SomeInt])=([p].[SomeInt]),

RESIDUAL: ([ISS_V8_CS].[dbo].[JBMTest].[SomeInt] as [p].[SomeInt]

=[tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]))

|--Bitmap(HASH: ([o].[SomeInt]), DEFINE: ([Bitmap1004]))

| |--Parallelism(Repartition Streams, Hash Partitioning,

PARTITION COLUMNS: ([o].[SomeInt]))

| |--Index Scan(OBJECT: ([tempdb].[dbo].[#MyInTable] AS [o]))

|--Parallelism(Repartition Streams, Hash Partitioning,

PARTITION COLUMNS: ([p].[SomeInt]), WHERE: (PROBE([Bitmap1004])=TRUE))

|--Clustered Index Scan

(OBJECT: ([ISS_V8_CS].[dbo].[JBMTest].[PK__JBMTest__7EAD8B99] AS [p]))

So what I'm seeing is that a hash match is being run on the WHERE EXISTS version, and a nested loop is being used for the IN version. Because I've not set MAXDOP hint to 0 in the WHERE EXISTS version, this is trying to setup parallelism. I'm not really clear what the Bitmap operator is doing, though.

My understanding of things is that high degrees of parallelism is in general a good thing for data warehouses, but not necessarily something you want to have occur in a heavy use OLTP situation. However, I also thought that a hash match is probably a better idea for large amounts of data, but then again the nested join doesn't block where the hash match does.

So I guess it's all a massive trade off, and depends on your environment. For a heavy use OLTP system, I think I'd use the IN statement for the semi join, or alternatively if it was a large amount of data (as in the case of 100,000 records in the outer query) I'd use the WHERE EXISTS version but with a restricted MAXDOP hint.

What do people thinK?

Random Technical Stuff[/url]

ta.bu.shi.da.yu Hall of Fame Points: 3985 More actions · Answer 10

OK, now I'm slightly confused. It actually looks like the TOP 1 in the IN statement does make a difference!

Check out the execution plan:

SELECT o.SomeInt

FROM #MyInTable AS o

WHERE o.SomeInt in

(

SELECT p.SomeInt FROM JBMTest AS p

WHERE p.SomeInt = o.SomeInt

)

|--Parallelism(Gather Streams)

|--Hash Match(Left Semi Join, HASH: ([o].[SomeInt])=([p].[SomeInt]),

RESIDUAL: ([ISS_V8_CS].[dbo].[JBMTest].[SomeInt] as [p].[SomeInt]=

[tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]

AND [ISS_V8_CS].[dbo].[JBMTest].[SomeInt] as [p].[SomeInt]=

[tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]))

|--Bitmap(HASH: ([o].[SomeInt]), DEFINE: ([Bitmap1004]))

| |--Parallelism(Repartition Streams, Hash Partitioning,

PARTITION COLUMNS: ([o].[SomeInt]))

| |--Index Scan(OBJECT: ([tempdb].[dbo].[#MyInTable] AS [o]))

|--Parallelism(Repartition Streams, Hash Partitioning,

PARTITION COLUMNS: ([p].[SomeInt]), WHERE: (PROBE([Bitmap1004])=TRUE))

|--Clustered Index Scan(

OBJECT: ([ISS_V8_CS].[dbo].[JBMTest].[PK__JBMTest__7EAD8B99] AS [p]))

This exhibits pretty much the same behaviour now as the WHERE EXISTS query!

I would love to know why this occurs.

Random Technical Stuff[/url]

Semi joins

Cookies on SQLServerCentral