Why OR conditions in join statements should be avoided and an example fix.
OR conditions and join statements – some things just don’t go well together. Similar to how OR conditions can significantly impair queries having subqueries, they can wreak havoc on joins as well.
Take, for example, the following [admittedly unrealistic] query. This assumes a numbers table [numbers] has been set up. [See my cartesian join post for a query to set up a numbers table.] Assume the numbers table only has 10,000 records.
SELECT n1.num, n2.num, n3.num FROM numbers n1 INNER JOIN numbers n2 ON ( (n1.num = n2.num - 400) OR (n1.num = n2.num + 400)) INNER JOIN numbers n3 ON n2.num = n3.num - 300 ORDER BY n1.num, n2.num, n3.num
This query looks simple enough. However, even though the source table only has 10,000 rows and the result only has 18,900 records, it takes 5 seconds to complete and Management Studio [and frankly my entire computer] freezes while it is running. This query spiked CPU usage to 100%. That’s not good – joins and OR conditions simply do not mix [at least on SQL Server].
I encountered this exact type of scenario at BPS. We have a DTS package for bringing over student data from the source location to another database [with a different schema] where the school assignments are done. One of the steps took 20 minutes, and I simply lived with it the first few times it ran. But because it’s best to ask if there’s a better way, I decided to investigate the step. With a 20 minute runtime, I fully expected to see some highly complex tangled web of sql with 30 joins and 150 lines. I couldn’t have been more wrong – it was a simple query resembling the one above [with an OR condition in the join]. In under 5 minutes I’d broken apart the query into two separate steps and what previously took 20 minutes to run only took 2 seconds.
For the query above, here’s one solution [using a UNION] for breaking apart the OR. There are likely other ways and I invite comments for other approaches to removing the OR condition.
SELECT n1.num, n2.num, n3.num FROM numbers n1 INNER JOIN numbers n2 ON n1.num = n2.num - 400 INNER JOIN numbers n3 ON n2.num = n3.num - 300 UNION SELECT n1.num, n2.num, n3.num FROM numbers n1 INNER JOIN numbers n2 ON n1.num = n2.num + 400 INNER JOIN numbers n3 ON n2.num = n3.num - 300 ORDER BY n1.num, n2.num, n3.num
This query runs instantly with a numbers table having 10,000 rows. It finishes in 8 seconds for a table having 800,000 rows. If I used the larger table with the first query, I’d probably have to reboot.