I was reading through Stackoverflow today and saw an interesting question. SQL inner join vs subquery. The user was comparing 3 queries and wondered why the first took significantly longer than the other 2.
Here are the queries:
Query 1: SELECT * From TabA INNER JOIN TabB on TabA.Id=TabB.Id Query 2: SELECT * From TabA WHERE Id in (SELECT Id FROM TabB) Query 3: SELECT TabA.* From TabA INNER JOIN TabB on TabA.Id=TabB.Id
My answer was that the Query 1 is pulling all of the data from both tables and therefore did more reads. I also suggested that the user use SET STATISTICS IO and SET STATISTICS TIME to test the theory. Well, I got curious and decided to run a similar test for myself.
Not having the OPs tables I decided to use 2 from AdventureWorks2008. Sales.Customer and Person.Person.
So first turning on the statistics.
SET STATISTICS IO ON SET STATISTICS TIME ON
Next, here are the queries using the same format as the OPs.
SELECT * FROM Sales.Customer INNER JOIN Person.Person ON Sales.Customer.PersonID = Person.Person.BusinessEntityID SELECT * FROM Sales.Customer WHERE Sales.Customer.PersonID IN (SELECT Person.Person.BusinessEntityID FROM Person.Person) SELECT Sales.Customer.* FROM Sales.Customer INNER JOIN Person.Person ON Sales.Customer.PersonID = Person.Person.BusinessEntityID
I’m trimming down the output for the sake of brevity.
-- Query 1 (19119 row(s) affected) Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Person'. Scan count 1, logical reads 3816, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Customer'. Scan count 1, logical reads 123, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 1594 ms, elapsed time = 11237 ms. -- Query 2 (19119 row(s) affected) Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Customer'. Scan count 1, logical reads 123, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Person'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 859 ms, elapsed time = 2193 ms. -- Query 3 (19119 row(s) affected) Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Person'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Customer'. Scan count 1, logical reads 123, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 875 ms, elapsed time = 2796 ms.
If you review IO sections you will notice that all physical reads are 0. This means both tables are cached at the time and all of the reads are coming out of memory. So right off the bat we can exclude caching issues.
Also Queries 2 and 3 are close enough in time to be just normal run variations, and the IO section while in a different order has identical numbers. The query plans are also very similar. Oddly enough even though all of the processes in those query plans are the same, the percentages are different. Unfortunately I don’t know enough about query plans to really say why.
Query 1 has a CPU time almost double the other two queries and an elapsed time of 4-5 times greater. Well we expected that from the original question. The Customer Table has 123 logical reads for all three queries. However if you compare the logical reads for the Person table you will see 59 each for Queries 2 & 3 and 3816 for Query 1! That is a lot of extra reads!
So now let’s compare the query plans
If you look closely, you will notice that the plans are very similar. The only difference (other than %s) is the Clustered Index Scan for Person.Person on Query 1 and the Index Scan on Queries 2 and 3. The Clustered Index Scan makes perfect sense to me. The clustered index contains all of the data for the table in the leaf nodes and all of the data is being returned. So it makes sense to scan across the entire table, returning everything.
It took me awhile to figure out why Queries 2 and 3 are doing an Index Scan on AK_Person_rowguid. (And honestly if anyone would like to confirm this I would greatly appreciate it.) That particular index is on a column called rowguid which is a uniqueidentifier and isn’t referenced anywhere in the query. So why use it? My understanding is that this particular index is being used for the following two reasons.
First this is the smallest index on the table, the other two being the clustered index and an index on the first, last and middle names. Second (and the real key here) is that the clustered index key has to be in the leaf nodes of every non-clustered index. Assuming a clustered index exists of course. In this particular case the clustered index is on BusinessEntityID which is the value we need for our query. So the optimizer decided that it would be faster to read in the whole non-clustered index, pulling the BusinessEntityID from the leaf nodes, rather than using the index nodes of the clustered index.
Now I’m not sure if the situation from the post I saw is the same. It relies on the indexes available, and I don’t have that information. But I do think this is a pretty good indication of what was happening. If nothing else, for me at least, it was a good exercise in reading the query plans and trying to figure out exactly what was going on.
Filed under: Microsoft SQL Server, Query Plans, SQLServerPedia Syndication Tagged: microsoft sql server, query plans