A discussion of how joins significantly outperform subqueries and how this is more evident when OR conditions are involved.
SQL queries can be written in a number of ways and be functionally equivalent but significantly different speed-wise. A great example of this is the use of cross joins instead of a while loop for populating a numbers table. The cross join, being set-based, was far more efficient. The insert statements used later in this post depend on the existence of a table ‘numbers’, which is used in place of a Student table. Here’s the code for this – pulled from the cross joins post [this only takes a few seconds instead of over a minute to run like the while loop - cross joins do have a place in SQL development]:
CREATE TABLE numbers (num INT) ----- Use a table variable DECLARE @numbers TABLE ( num INT ) ----- Populate a table with numbers 0 to 9 DECLARE @counter INT SET @counter = 0 While @counter < 10 BEGIN INSERT INTO @numbers VALUES(@counter) SET @counter = @counter + 1 END INSERT INTO numbers (num) SELECT ( n100000place.num*100000 + n10000place.num*10000 + n1000place.num*1000 + n100place.num*100 + n10place.num*10 + n1place.num ) AS 'number' FROM @numbers n1place CROSS JOIN @numbers n10place CROSS JOIN @numbers n100place CROSS JOIN @numbers n1000place CROSS JOIN @numbers n10000place CROSS JOIN @numbers n100000place WHERE n100000place.num < 8 ----- < 800000 ORDER BY NUMBER
Today I’d like to talk about subqueries and OR conditions in a sql statement. Before discussing the particular queries, let’s set up a test table with some dummy data. For this, I’m inserting a significant number of rows [4 million] into a table so any performance issues become more evident than they would with a trivial amount of data.
--Note: These inserts in total take 23 seconds to run CREATE TABLE StudentTestData (StudentNumber VARCHAR(8), testid VARCHAR(8), TestScore VARCHAR(3)) INSERT INTO StudentTestData SELECT numbers.num, 'AAABBBCC', numbers.num % 55 FROM numbers INSERT INTO StudentTestData SELECT numbers.num, 'AAABBBDD', numbers.num % 44 FROM numbers INSERT INTO StudentTestData SELECT numbers.num, 'FFFBBBDD', numbers.num % 33 FROM numbers INSERT INTO StudentTestData SELECT numbers.num, 'FFFRRRDD', numbers.num % 66 FROM numbers INSERT INTO StudentTestData SELECT numbers.num, 'CCCMMMDD', numbers.num % 55 FROM numbers
This table is similar in structure and size to a table we have at Boston Public Schools for storing student test results. Using the StudentTestData table, let’s say we needed to write a query to get a list of students who met the following conditions:
Got over a 10 on test FFFBBBDD AND over a 20 on test FFFRRRDD AND Got over a 16 on test CCCMMMDD AND over a 30 on test AAABBBCC AND Got over a 25 on test AAABBBDD
In this case, all 5 test score requirements must be met.
Here are two queries that return the results, the first based on joins and the second based on subqueries. I’ve used left joins because of a slight modification I’m going to make to the requirements later in the post – for the following query, inner joins would work fine.
-- Join-based query. 3 seconds the first time, < 1 seconds after that. -- SQL Server was able to determine a good execution plan to speed up the query. SELECT DISTINCT Numbers.num FROM Numbers LEFT JOIN StudentTestData TestAAABBBCC ON TestAAABBBCC.StudentNumber = Numbers.num AND TestAAABBBCC.testid = 'AAABBBCC' LEFT JOIN StudentTestData TestFFFBBBDD ON TestFFFBBBDD.StudentNumber = Numbers.num AND TestFFFBBBDD.testid = 'FFFBBBDD' LEFT JOIN StudentTestData TestFFFRRRDD ON TestFFFRRRDD.StudentNumber = Numbers.num AND TestFFFRRRDD.testid = 'FFFRRRDD' LEFT JOIN StudentTestData TestCCCMMMDD ON TestCCCMMMDD.StudentNumber = Numbers.num AND TestCCCMMMDD.testid = 'CCCMMMDD' LEFT JOIN StudentTestData TestAAABBBDD ON TestAAABBBDD.StudentNumber = Numbers.num AND TestAAABBBDD.testid = 'AAABBBDD' WHERE -- Condition 1 ( (TestFFFBBBDD.TestScore > 10 AND TestFFFRRRDD.TestScore > 20) AND (TestCCCMMMDD.TestScore > 16 AND TestAAABBBCC.TestScore > 30) ) -- Condition 2 AND ( TestAAABBBDD.TestScore > 25 ) -- Here's a query that returns the same result but uses subqueries instead of joins: -- 5 seconds the first time, 5 seconds after that. -- No apparent improvement can be made from the execution plan. SELECT DISTINCT Numbers.num FROM Numbers WHERE ( ( num IN (SELECT studentNumber FROM StudentTestData WHERE TestScore > 10 AND testid = 'FFFBBBDD') AND num IN(SELECT studentNumber FROM StudentTestData WHERE TestScore > 20 AND testid = 'FFFRRRDD') ) AND ( num IN (SELECT studentNumber FROM StudentTestData WHERE TestScore > 16 AND testid = 'CCCMMMDD') AND num IN(SELECT studentNumber FROM StudentTestData WHERE TestScore > 30 AND testid = 'AAABBBCC') ) ) AND ( num IN (SELECT studentNumber FROM StudentTestData WHERE TestScore > 25 AND testid = 'AAABBBDD') )
As I hinted would be the case in my previous post, the results clearly are in favor of the join-based query. While the speed differential wasn’t too significant for the first run of the query [5 seconds compared to 3 seconds], the differential became much more important for subsequent query runs. SQL Server appears to be much better at saving the execution plan for join-based queries than subquery-based queries.
Let’s say we throw a few OR conditions into the requirements. OR conditions will really illustrate why the join-based version is superior. Here’s the new requirements:
Provide a list of student numbers for students who met one of the following two conditions:
Condition 1: Got over a 10 on test FFFBBBDD OR over a 20 on test FFFRRRDD AND Got over a 16 on test CCCMMMDD OR over a 30 on test AAABBBCC Condition 2: Got over a 25 on test AAABBBDD
From a SQL query perspective, the queries are almost identical to the ones above, except that we needed to change AND to OR in a few cases and add parentheses when needed so precedence rules match the requirements. Also, the left joins are used in case some students didn’t take a test [using inner joins would remove students from the result if they didn't take all 5 tests].
-- Join-based statement. 3 seconds initially, 1.5 seconds afterwards. SELECT DISTINCT Numbers.num FROM Numbers LEFT JOIN StudentTestData TestAAABBBCC ON TestAAABBBCC.StudentNumber = Numbers.num AND TestAAABBBCC.testid = 'AAABBBCC' LEFT JOIN StudentTestData TestFFFBBBDD ON TestFFFBBBDD.StudentNumber = Numbers.num AND TestFFFBBBDD.testid = 'FFFBBBDD' LEFT JOIN StudentTestData TestFFFRRRDD ON TestFFFRRRDD.StudentNumber = Numbers.num AND TestFFFRRRDD.testid = 'FFFRRRDD' LEFT JOIN StudentTestData TestCCCMMMDD ON TestCCCMMMDD.StudentNumber = Numbers.num AND TestCCCMMMDD.testid = 'CCCMMMDD' LEFT JOIN StudentTestData TestAAABBBDD ON TestAAABBBDD.StudentNumber = Numbers.num AND TestAAABBBDD.testid = 'AAABBBDD' WHERE -- Condition 1 ( (TestFFFBBBDD.TestScore > 10 OR TestFFFRRRDD.TestScore > 20) AND (TestCCCMMMDD.TestScore > 16 OR TestAAABBBCC.TestScore > 30) ) -- Condition 2 OR ( TestAAABBBDD.TestScore > 25 ) -- Stopped after a 90 seconds - did not finish. SELECT DISTINCT Numbers.num FROM Numbers WHERE ( ( num IN (SELECT studentNumber FROM StudentTestData WHERE TestScore > 10 AND testid = 'FFFBBBDD') OR num IN(SELECT studentNumber FROM StudentTestData WHERE TestScore > 20 AND testid = 'FFFRRRDD') ) AND ( num IN (SELECT studentNumber FROM StudentTestData WHERE TestScore > 16 AND testid = 'CCCMMMDD') OR num IN(SELECT studentNumber FROM StudentTestData WHERE TestScore > 30 AND testid = 'AAABBBCC') ) ) OR ( num IN (SELECT studentNumber FROM StudentTestData WHERE TestScore > 25 AND testid = 'AAABBBDD') )
Just by adding the OR conditions to the requirements, the time differential between the join-based query and the subquery version skyrocketed. While the join-based query was slightly slower, the subquery version didn’t finish. One of these nights I’ll let the subquery version run overnight to see how long it takes.