I’ve been working with OUTER JOIN’s for almost twenty years now and still stumble every now and again. I’ve learned to be very careful in how I use them and in fact I’ve limited how I use them at times to make sure that I don’t create more problems than I’m trying to solve. I recently had the opportunity to help out one of my co-workers with an OUTER JOIN and in the process figured out a big part of my confusion over the years. Below I’m going to try to explain what they are and how to use them in clear easy steps.
Let’s start with the definition of an OUTER JOIN. When a LEFT OUTER JOIN is specified you want all of the rows from the “left” or “first” table and only the data from those rows that match on the “right” or “second” table. On any rows where the “right” table doesn’t have a match the “left” then in the results the columns from that table are NULL. Below I have an example of an OUTER JOIN with a mistake I frequently see along with a query that has a very strange result. Hopefully there is enough of an explanation for everyone to understand the whys of this structure.
I’m going to use LEFT OUTER JOINs in my examples below because they are the most common. All of the queries could certainly be re-written as RIGHT OUTER JOINs. First some setup.
CREATE TABLE Professor ( Id Int , ProfessorName varchar(50), HasTenure Bit ) CREATE TABLE Class ( Id Int NOT NULL IDENTITY(1,1), ProfessorId Int, ClassName varchar(50), ClassYear Int, ClassSemester TinyInt ) INSERT INTO Professor (Id, ProfessorName, HasTenure) VALUES (1, 'Dr Coke','True') INSERT INTO Professor (Id, ProfessorName, HasTenure) VALUES (2, 'Dr Sprite','True') INSERT INTO Professor (Id, ProfessorName, HasTenure) VALUES (3, 'Dr 7-up','True') INSERT INTO Professor (Id, ProfessorName, HasTenure) VALUES (4, 'Mr Pepper','False') INSERT INTO Professor (Id, ProfessorName, HasTenure) VALUES (5, 'Mr Fanta','False') INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (4, 'Soda 101', 2010, 1) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (4, 'Soda 101', 2010, 2) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (4, 'Soda 101', 2011, 1) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (1, 'Soda 101', 2011, 2) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (4, 'Soda 101', 2012, 1) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (NULL, 'Soda 101', 2012, 2) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (1, 'Advanced Soda', 2010, 1) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (1, 'Advanced Soda', 2010, 2) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (1, 'Advanced Soda', 2011, 1) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (4, 'Advanced Soda', 2011, 2) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (1, 'Advanced Soda', 2012, 1) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (NULL, 'Advanced Soda', 2012, 2) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (5, 'Seltzer for fun and profit', 2010, 1) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (5, 'Seltzer for fun and profit', 2010, 2) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (5, 'Seltzer for fun and profit', 2011, 1) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (5, 'Seltzer for fun and profit', 2011, 2) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (5, 'Seltzer for fun and profit', 2012, 1) INSERT INTO Class (ProfessorId, ClassName, ClassYear, ClassSemester) VALUES (NULL, 'Seltzer for fun and profit', 2012, 2)
The Problem: The Dean is working on the latest schedule and wants a list of all professors with tenure and the classes they have taught since 2011.
The query I most commonly see from people who aren’t experienced in OUTER JOINs is this
SELECT Professor.Id AS [Professor.Id], Professor.ProfessorName, CASE Professor.HasTenure WHEN 1 THEN 'True' WHEN 0 THEN 'False' ELSE NULL END AS [Has Tenure], Class.ProfessorId AS [Class.ProfessorId], Class.ClassName, Class.ClassYear, Class.ClassSemester FROM Professor LEFT OUTER JOIN Class ON Professor.Id = Class.ProfessorId WHERE Professor.HasTenure = 'True' AND Class.ClassYear >= 2011
With a result set of
They come to me and ask something along the lines of “Now wait just a minute. What happened to Dr. Sprite and Dr. 7-up? They have tenure and I used a LEFT OUTER JOIN!”. OK, that's not really what happens. Let’s face it, they come to me and say “It’s broke, why?”
Let’s start at the beginning
SELECT Professor.Id AS [Professor.Id], Professor.ProfessorName, CASE Professor.HasTenure WHEN 1 THEN 'True' WHEN 0 THEN 'False' ELSE NULL END AS [Has Tenure], Class.ProfessorId AS [Class.ProfessorId], Class.ClassName, Class.ClassYear, Class.ClassSemester FROM Professor LEFT OUTER JOIN Class ON Professor.Id = Class.ProfessorId
This results in:
Now we need to exclude any professor that doesn’t have tenure. So we add a WHERE clause.
SELECT Professor.Id AS [Professor.Id], Professor.ProfessorName, CASE Professor.HasTenure WHEN 1 THEN 'True' WHEN 0 THEN 'False' ELSE NULL END AS [Has Tenure], Class.ProfessorId AS [Class.ProfessorId], Class.ClassName, Class.ClassYear, Class.ClassSemester FROM Professor LEFT OUTER JOIN Class ON Professor.Id = Class.ProfessorId WHERE Professor.HasTenure = 'True'
This trims our result set a bit.
Looking good so far but we still have entries for classes in 2010. And we saw before if we add
AND Class.ClassYear >= 2011
to the WHERE clause we lose Dr Sprite and Dr 7-up.
“Why is that! “ I hear some of you cry. The rest are split between 75% who already know the answer, 15% who don’t care, and the remaining 10% are still trying to figure out what an OUTER JOIN is and why I’m going on about it.
If you look at the last output you will notice that rows 6 & 7 have NULLs for any field from the Class table. As we all know, the expression NULL = any value is always false. (I am NOT getting into ANSI_NULLS discussions here.)
Let’s try moving the ClassYear condition into the ON clause.
SELECT Professor.Id AS [Professor.Id], Professor.ProfessorName, CASE Professor.HasTenure WHEN 1 THEN 'True' WHEN 0 THEN 'False' ELSE NULL END AS [Has Tenure], Class.ProfessorId AS [Class.ProfessorId], Class.ClassName, Class.ClassYear, Class.ClassSemester FROM Professor LEFT OUTER JOIN Class ON Professor.Id = Class.ProfessorId AND Class.ClassYear >= 2011 WHERE Professor.HasTenure = 'True'
We are now down to 5 rows of results.
Woo Hoo! We have a solution.
Unfortunately I really didn’t understand why this worked until one day I was discussing OUTER JOINS with one of my co-workers and he handed an interesting query similar to the one below.
SELECT Professor.Id AS [Professor.Id], Professor.ProfessorName, CASE Professor.HasTenure WHEN 1 THEN 'True' WHEN 0 THEN 'False' ELSE NULL END AS [Has Tenure], Class.ProfessorId AS [Class.ProfessorId], Class.ClassName, Class.ClassYear, Class.ClassSemester FROM Professor LEFT OUTER JOIN Class ON Professor.Id = Class.ProfessorId AND Class.ClassYear >= 2011 AND Professor.HasTenure = 'True'
This gives us seven rows.
Now that’s strange. How on earth are “Mr Pepper” and “Mr Fanta” showing up? Surely Professor.HasTenure = 'True' should have eliminated the results in rows 6 and 7.
After much pondering, staring at my computer, and even a bit of swearing I tried changing from a LEFT OUTER JOIN to a FULL JOIN in order to see more of what is going on.
SELECT Professor.Id AS [Professor.Id], Professor.ProfessorName, CASE Professor.HasTenure WHEN 1 THEN 'True' WHEN 0 THEN 'False' ELSE NULL END AS [Has Tenure], Class.ProfessorId AS [Class.ProfessorId], Class.ClassName, Class.ClassYear, Class.ClassSemester FROM Professor FULL JOIN Class ON Professor.Id = Class.ProfessorId AND Class.ClassYear >= 2011 AND Professor.HasTenure = 'True'
I can examine the 17 results and look for a pattern.
Finally I was starting to get somewhere. By using the FULL JOIN I was now able to see the larger picture. You will note a fair number of the classes that had professors are not actually matched up. For example rows 8-11 have a ProfessorId of 4, which we might reasonably expect to match up with row 6, “Mr Pepper”. Why aren’t they?
The reason for this is the same reason the final “correct” query worked. It lies in the difference between the ON clause and the WHERE clause. The WHERE clause eliminates rows from the result set, or really determines which results to include, but the net result is the same. The ON clause is the join between the two tables.
In the case of a LEFT OUTER JOIN the ON clause will have absolutely no effect on what rows from the left hand table will get returned. For some reason I had always persisted in thinking of the ON clause as having two separate parts. The part that does the join and the part that eliminates rows from the result set. In reality the ON clause is of course one piece that only determines the JOIN between the referenced tables. That is it. No more, no less. The WHERE clause is the part that determines which rows are returned and which aren’t. I should probably mention that the ON clause can increase the number of rows returned in a one to many or many to many situation.
Back to the LEFT OUTER JOIN version of the query.
Note again that the ON clause only affects how the two tables are joined together. In other words, I only want to join the two tables together if the Id’s match, Tenure is TRUE and ClassYear is 2011 or greater. The LEFT OUTER JOIN will eliminate rows 8 (see previous image) and above from the result set since there is no match from Professor (the left table). Then since there is no WHERE clause no other rows are eliminated. Leaving us not only with the rows we expected (1-5) but with the previously confusing rows 6 and 7.
…
And all the way back to our “solution”
SELECT Professor.Id AS [Professor.Id], Professor.ProfessorName, CASE Professor.HasTenure WHEN 1 THEN 'True' WHEN 0 THEN 'False' ELSE NULL END AS [Has Tenure], Class.ProfessorId AS [Class.ProfessorId], Class.ClassName, Class.ClassYear, Class.ClassSemester FROM Professor LEFT OUTER JOIN Class ON Professor.Id = Class.ProfessorId AND Class.ClassYear >= 2011 WHERE Professor.HasTenure = 'True'
We are back to five results
In this particular case the WHERE clause gets rid of any professors who don’t have tenure. The ON clause is only going to join the two tables together if the ids match and the ClassYear >= 2010. Finally the LEFT OUTER JOIN keeps all of the professors that meet the WHERE clause, and gets rid of any classes that don’t have a match with Professor.
OUTER JOINs are very useful and highly utilized tools. Unfortunately they can also cause a great deal of confusion in even relatively simple queries. If you can keep in mind the true use of the ON clause (the join) VS the true use of the WHERE clause (eliminating rows from the result set) then hopefully some of the confusion can be eliminated. Of course it should also be said that the difference between ON and WHERE is the same for all types of JOINs, it just isn’t always as obvious as it can be with OUTER JOINs.