February 17, 2009 at 6:07 am
Here's an oddball. My software has the following query:
select A.fldA, A.fldB from tableA A where A.fldKey IN
(
select AA.fldKey from tableA AA inner join tableB B ON
AA.fldKey = B.fldKey inner join tableC C ON B.fldChild = C.fldChild
)
OR A.fldKey in
(
select AAA.fldKey from tableA AAA inner join tableX X ON
AAA.fldKey = X.fldKey inner join tableC C ON X.fldChild = C.fldChild
)
The app was hanging and when I checked statistics I found that tableB was being read 4 million times. The table contains about 100K worth of data. As a test I changed the query to read like this:
select A.fldA, A.fldB from tableA A where A.fldKey IN
(
select AA.fldKey from tableA AA inner join tableB B ON
AA.fldKey = B.fldKey inner join tableC C ON B.fldChild = C.fldChild
)
OR A.fldKey = 7
I got the same results. I know how to change the query so this doesn't happen but am curious as to why the DB is reading the table so many times. It seems that the optimizer sees that the correlated sub query can be treated as an inner join. Then when you add the additional conditional (I like how that sounds) it checks the condition against every row and subsequently performs a join for every row.
Any thoughts.
Good Monday Morning
(gets more coffee)
February 17, 2009 at 7:25 am
Try this:
select
A.fldA,
A.fldB
from
tableA A
where
EXISTS ( select
1
FROM
tableB B inner join
tableC C
ON B.fldChild = C.fldChild
WHERE
A.fldKey = B.fldKey ) OR
Exists ( select
1
from
tableX X inner join
tableC C
ON X.fldChild = C.fldChild
WHERE
A.fldKey = X.fldKey )
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
February 17, 2009 at 7:30 am
Thanks, Jack. I have lots of options as to how to correct the issue. I was interested in a discussion as to why the query would create the horrible performance in the first place. At first glance I didn't think i was a bad query (although I have no idea why a person would have used this methodology to get the data??).
To reiterate, the following had fine performance:
select A.fldA, A.fldB from tableA A where A.fldKey in
(
-- correlated sub query here
)
However, simply adding the additional condition caused the performance issue:
select A.fldA, A.fldB from tableA A where A.fldKey in
(
-- correlated sub query here
) OR A.fldKey = 7
February 17, 2009 at 7:46 am
You don't need tableA bulking up the two subselects:
select A.fldA, A.fldB
from tableA A
where A.fldKey IN
(
select B.fldKey
from tableB B
inner join tableC C ON B.fldChild = C.fldChild
)
OR A.fldKey in
(
select X.fldKey
from tableX X
inner join tableC C ON X.fldChild = C.fldChild
)
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
February 17, 2009 at 8:10 am
Unfortunately it did need tableA in the two sub queries. As I was posting pseudo code I didn't include the many other joins in the sub queries. I have since modified the code as (pseudo):
select myFields from
tableA A inner join tableNecessary N on A.nKey = N.nKey
left join tableB B on A.fldKey = B.fldKey
left join tableX X on A.fldKey = X.fldKey
Again. Does anyone know why the database would have acted in this way?
February 17, 2009 at 8:20 am
What's the result of running the highlighted part?
select A.fldA, A.fldB from tableA A where A.fldKey IN
(
select AA.fldKey from tableA AA inner join tableB B ON
AA.fldKey = B.fldKey inner join tableC C ON B.fldChild = C.fldChild
)
OR A.fldKey in
(
select AAA.fldKey from tableA AAA inner join tableX X ON
AAA.fldKey = X.fldKey inner join tableC C ON X.fldChild = C.fldChild
)
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
February 17, 2009 at 8:24 am
This totally confuses me. Why not post the full query, replacing real names with fake ones where ever you want? I can't match your last piece of code at all with your first. Also, I still think that you don't need TableA in your subquery, since any selection you make in your subquery, you also can make in the main query.
Also, please add some table description and possibly sample data, use the link in my signature below for that. It's of course hard work, but that way we can test our own solutions and we may be triggered to answer your question without misunderstandings...
Ronald HensbergenHelp us, help yourself... Post data so we can read and use it: http://www.sqlservercentral.com/articles/Best+Practices/61537/-------------------------------------------------------------------------2+2=5 for significant large values of 2
February 17, 2009 at 8:30 am
My experience of joins in subqueries is that they always end up as nested loops. This can be very inefficient with large tables. (Check your query plans to confirm this.)
I always avoid joins in subqueries. Joins in derived tables seem to be OK.
February 17, 2009 at 8:54 am
Ken is correct. I apologize that my original queries were incomplete. Here is a query which demonstrates my point:
set statistics io OFF
set nocount on
CREATE TABLE #foo (
fldID int identity(1, 1),
fldNum int not null
)
-- Get a few records started
insert into #foo(fldNum)
select 1 as X union all
select 1 as X union all
select 1 as X union all
select 1 as X union all
select 1 as X union all
select 1 as X union all
select 1 as X
while (select count(fldID) from #foo) < 100000
insert into #foo(fldNum) select fldID from #foo
select count(fldID) from #foo
set statistics io ON
-- Get statistics on sub query
select top 500 fldID from #foo
-- Results Table #foo Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.
-- Get statistics on the main query and the sub query
select A.fldID from #foo A where A.fldID IN
(
select top 500 B.fldID from #foo B
)
-- Results Table #foo Scan count 2, logical reads 256, physical reads 0, read-ahead reads 0.
-- Get statistics on the second condition
select A.fldID from #foo A where A.fldID = 1
-- Results Table #foo Scan count 1, logical reads 254, physical reads 0, read-ahead reads 0.
-- Get statistics on the whole thing
select A.fldID from #foo A where A.fldID IN
(
select top 500 B.fldID from #foo B
) OR A.fldID = 1
-- Results Table #foo Scan count 3, logical reads 256, physical reads 0, read-ahead reads 0.
-- Table 'Worktable'. Scan count 498, logical reads 997, physical reads 0, read-ahead reads 0.
-- Table 'Worktable'. Scan count 500, logical reads 1001, physical reads 0, read-ahead reads 0.
-- Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.
You can see that the addition of a second conditional causes the DB to create additional work tables with more reads than the original table required. In looking at the execution plan there are indeed many nested loops. I don't normally use this type of query but it does exist in the application I work on.
The point to the post is to help others see that what looks fairly straightforward can indeed cause bottlenecks.
February 17, 2009 at 8:56 am
I understand this is purely an academic post and that you have resolved the performance issue with the left outer joins.
When you say "reads", do you mean index scans or physical disk reads?
I'm quite interested to know how many records are in table A and the two derived datasets in the where clause, I have a sneaky suspicion that the 4000k scans you are referring to could possibly the number of records in table A and two derived datasets in the where clause multiplied by each other or the number of samples in the statistics. If you update usage on those tables, does the count change?
Along with a couple of the other posts here, I would probably stick my neck out here are suggest strongly trying to avoid using derived datasets in your where clause. Always try and include any such distillation in the from clause, ideally, perhaps, determining and distilling the recordsets as you go along. Including joins enhances the use of the query optimiser and will allow far better control of the performance.
Max
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply