Correlated Sub Query (RBAR WTF?)

Question

Post reply

Correlated Sub Query (RBAR WTF?)

souLTower

SSCarpal Tunnel

Points: 4026
More actions
February 17, 2009 at 6:07 am

#72431

Here's an oddball. My software has the following query:
select A.fldA, A.fldB from tableA A where A.fldKey IN
(
select AA.fldKey from tableA AA inner join tableB B ON
AA.fldKey = B.fldKey inner join tableC C ON B.fldChild = C.fldChild
)
OR A.fldKey in
(
select AAA.fldKey from tableA AAA inner join tableX X ON
AAA.fldKey = X.fldKey inner join tableC C ON X.fldChild = C.fldChild
)
The app was hanging and when I checked statistics I found that tableB was being read 4 million times. The table contains about 100K worth of data. As a test I changed the query to read like this:
select A.fldA, A.fldB from tableA A where A.fldKey IN
(
select AA.fldKey from tableA AA inner join tableB B ON
AA.fldKey = B.fldKey inner join tableC C ON B.fldChild = C.fldChild
)
OR A.fldKey = 7
I got the same results. I know how to change the query so this doesn't happen but am curious as to why the DB is reading the table so many times. It seems that the optimizer sees that the correlated sub query can be treated as an inner join. Then when you add the additional conditional (I like how that sounds) it checks the condition against every row and subsequently performs a join for every row.
Any thoughts.
Good Monday Morning
(gets more coffee)

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Jack Corbett SSC Guru Points: 184394 More actions · Answer 1

Try this:

select

A.fldA,

A.fldB

from

tableA A

where

EXISTS ( select

1

FROM

tableB B inner join

tableC C

ON B.fldChild = C.fldChild

WHERE

A.fldKey = B.fldKey ) OR

Exists ( select

1

from

tableX X inner join

tableC C

ON X.fldChild = C.fldChild

WHERE

A.fldKey = X.fldKey )

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

souLTower SSCarpal Tunnel Points: 4026 More actions · Answer 2

Thanks, Jack. I have lots of options as to how to correct the issue. I was interested in a discussion as to why the query would create the horrible performance in the first place. At first glance I didn't think i was a bad query (although I have no idea why a person would have used this methodology to get the data??).

To reiterate, the following had fine performance:

select A.fldA, A.fldB from tableA A where A.fldKey in

(

-- correlated sub query here

)

However, simply adding the additional condition caused the performance issue:

select A.fldA, A.fldB from tableA A where A.fldKey in

(

-- correlated sub query here

) OR A.fldKey = 7

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

You don't need tableA bulking up the two subselects:

select A.fldA, A.fldB

from tableA A

where A.fldKey IN

(

select B.fldKey

from tableB B

inner join tableC C ON B.fldChild = C.fldChild

)

OR A.fldKey in

(

select X.fldKey

from tableX X

inner join tableC C ON X.fldChild = C.fldChild

)

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

souLTower SSCarpal Tunnel Points: 4026 More actions · Answer 4

Unfortunately it did need tableA in the two sub queries. As I was posting pseudo code I didn't include the many other joins in the sub queries. I have since modified the code as (pseudo):

select myFields from

tableA A inner join tableNecessary N on A.nKey = N.nKey

left join tableB B on A.fldKey = B.fldKey

left join tableX X on A.fldKey = X.fldKey

Again. Does anyone know why the database would have acted in this way?

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 5

What's the result of running the highlighted part?

select A.fldA, A.fldB from tableA A where A.fldKey IN

(

select AA.fldKey from tableA AA inner join tableB B ON

AA.fldKey = B.fldKey inner join tableC C ON B.fldChild = C.fldChild

)

OR A.fldKey in

(

select AAA.fldKey from tableA AAA inner join tableX X ON

AAA.fldKey = X.fldKey inner join tableC C ON X.fldChild = C.fldChild

)

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Ronald H SSCarpal Tunnel Points: 4053 More actions · Answer 6

This totally confuses me. Why not post the full query, replacing real names with fake ones where ever you want? I can't match your last piece of code at all with your first. Also, I still think that you don't need TableA in your subquery, since any selection you make in your subquery, you also can make in the main query.

Also, please add some table description and possibly sample data, use the link in my signature below for that. It's of course hard work, but that way we can test our own solutions and we may be triggered to answer your question without misunderstandings...

Ronald HensbergenHelp us, help yourself... Post data so we can read and use it: http://www.sqlservercentral.com/articles/Best+Practices/61537/-------------------------------------------------------------------------2+2=5 for significant large values of 2

Ken McKelvey SSCoach Points: 18842 More actions · Answer 7

My experience of joins in subqueries is that they always end up as nested loops. This can be very inefficient with large tables. (Check your query plans to confirm this.)

I always avoid joins in subqueries. Joins in derived tables seem to be OK.

souLTower SSCarpal Tunnel Points: 4026 More actions · Answer 8

Ken is correct. I apologize that my original queries were incomplete. Here is a query which demonstrates my point:

set statistics io OFF

set nocount on

CREATE TABLE #foo (

fldID int identity(1, 1),

fldNum int not null

)

-- Get a few records started

insert into #foo(fldNum)

select 1 as X union all

select 1 as X

while (select count(fldID) from #foo) < 100000

insert into #foo(fldNum) select fldID from #foo

select count(fldID) from #foo

set statistics io ON

-- Get statistics on sub query

select top 500 fldID from #foo

-- Results Table #foo Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

-- Get statistics on the main query and the sub query

select A.fldID from #foo A where A.fldID IN

(

select top 500 B.fldID from #foo B

)

-- Results Table #foo Scan count 2, logical reads 256, physical reads 0, read-ahead reads 0.

-- Get statistics on the second condition

select A.fldID from #foo A where A.fldID = 1

-- Results Table #foo Scan count 1, logical reads 254, physical reads 0, read-ahead reads 0.

-- Get statistics on the whole thing

select A.fldID from #foo A where A.fldID IN

(

select top 500 B.fldID from #foo B

) OR A.fldID = 1

-- Results Table #foo Scan count 3, logical reads 256, physical reads 0, read-ahead reads 0.

-- Table 'Worktable'. Scan count 498, logical reads 997, physical reads 0, read-ahead reads 0.

-- Table 'Worktable'. Scan count 500, logical reads 1001, physical reads 0, read-ahead reads 0.

-- Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

You can see that the addition of a second conditional causes the DB to create additional work tables with more reads than the original table required. In looking at the execution plan there are indeed many nested loops. I don't normally use this type of query but it does exist in the application I work on.

The point to the post is to help others see that what looks fairly straightforward can indeed cause bottlenecks.

Max-146500 SSCarpal Tunnel Points: 4691 More actions · Answer 9

I understand this is purely an academic post and that you have resolved the performance issue with the left outer joins.

When you say "reads", do you mean index scans or physical disk reads?

I'm quite interested to know how many records are in table A and the two derived datasets in the where clause, I have a sneaky suspicion that the 4000k scans you are referring to could possibly the number of records in table A and two derived datasets in the where clause multiplied by each other or the number of samples in the statistics. If you update usage on those tables, does the count change?

Along with a couple of the other posts here, I would probably stick my neck out here are suggest strongly trying to avoid using derived datasets in your where clause. Always try and include any such distillation in the from clause, ideally, perhaps, determining and distilling the recordsets as you go along. Including joins enhances the use of the query optimiser and will allow far better control of the performance.

Max