November 11, 2013 at 11:20 pm
Comments posted to this topic are about the item Outer Join – What am I missing (SQL Spackle)
November 12, 2013 at 3:23 am
Hi there,
Thx for the article about what often turns out to be a brain-breaker for a lot of people starting out with SQL.. but... (there's always a but =)
Although the LEFT OUTER JOIN ... WHERE <somefield> IS NULL gets the job done I tend to discourage its use for three reasons
1. It isn't very readable for people not into SQL
2. I've seen it 'fail' when people use a <somefield> that turns out to be NULL-able
3. Not applicable in your example, but the JOIN might cause doubles/triples/etc in the results depending on the data-structure... which they then 'fix' again by using DISTINCT etc... ( "ahh, the horror" =)
The (IMHO) much easier to understand (and code) method is by using WHERE NOT EXISTS() which pretty much circumvents all three issues. Ironically it usually ends up with the same query plan anyway. Or, (as is the case here!) sometimes even comes up with a better one. But most of all, I think it is a lot easier to read which (on the long run) makes it a more favourable option to me.
/*List of employees and courses they have not completed*/
SELECT Employee.EmployeeId
,Employee.FirstName
,Employee.LastName
,Course.CourseId
,Course.CourseName
FROM #Employee AS Employee
CROSS JOIN #Course AS Course
WHERE NOT EXISTS ( SELECT *
FROM #CourseCompleted AS CourseCompleted
WHERE CourseCompleted.EmployeeId = Employee.EmployeeId
AND CourseCompleted.CourseId = Course.CourseId )
ORDER BY Employee.EmployeeId
My 2 cents.
Roby
November 12, 2013 at 5:21 am
Grasshopper beat me to it 🙂
November 12, 2013 at 5:58 am
I have on my clipboard the identical query posted by deroby (I hope I'm pronouncing that correctly.) LEFT OUTER JOIN ... WHERE <somefield> IS NULL is a good step in learning how to solve these types I queries. Although shouldn't that step be followed by learning about WHERE NOT EXISTS?
To answer the Patrick's question "What am I missing?", you are missing the opportunity to show others how to properly use WHERE NOT EXISTS!
If the article ended with deroby's post, It would be a 5 star article. It has been my experience that WHERE NOT EXISTS always meets or beats a LEFT OUTER JOIN ... WHERE <somefield> IS NULL query.
Patrick, I do really like your test data generation queries. I learn something new everyday.
November 12, 2013 at 8:20 am
brymen (11/12/2013)
...If the article ended with deroby's post, It would be a 5 star article. It has been my experience that WHERE NOT EXISTS always meets or beats a LEFT OUTER JOIN ... WHERE <somefield> IS NULL query...
I agree. It took me a while to come around on this one, because it seems like you might be creating a hidden loop with WHERE NOT EXISTS, but often the query plan is identical to LEFT OUTER JOIN. In cases where there could be many matches (eg. employees could take a course many times), WHERE NOT EXISTS is faster because the server can stop "looking" when it finds the first match, but my understanding is that LEFT OUTER JOIN retrieves all matches first and then applies the "WHERE <somefield> IS NULL" to filter them out.
Does anyone know if this has always been true, or was there a time in the dark ages of MS SQL server when WHERE NOT EXISTS was slower than a LEFT OUTER JOIN?
November 12, 2013 at 10:21 am
thx for sharing, Patrick
I always complain how hard it is to find practical uses of the CROSS JOIN and you showed us a nice example.
November 12, 2013 at 10:52 am
The (IMHO) much easier to understand (and code) method is by using WHERE NOT EXISTS() which pretty much circumvents all three issues. Ironically it usually ends up with the same query plan anyway. Or, (as is the case here!) sometimes even comes up with a better one. But most of all, I think it is a lot easier to read which (on the long run) makes it a more favourable option to me.
I agree that using WHERE NOT EXISTS is easier to read, the purpose of the article was to show how to get the results using the outer join. But I do agree that I could have shown other solutions and it would have made the article better.
November 12, 2013 at 11:05 am
wagner crivelini (11/12/2013)
thx for sharing, PatrickI always complain how hard it is to find practical uses of the CROSS JOIN and you showed us a nice example.
I agree. A fine example of Cross Join usage.
November 12, 2013 at 11:10 am
the purpose of the article was to show how to get the results using the outer join
I think it was a good choice as it made the example easy to understand because it kept it linear. Adding another element (optimization) would have just made it more confusing and then you would have had people complaining that you didn't post any optimization proof (screenshots from the optimizer).
November 12, 2013 at 1:43 pm
I'm starting to feel bad here ...
Patrick Cahill (11/12/2013)
I agree that using WHERE NOT EXISTS is easier to read, the purpose of the article was to show how to get the results using the outer join. But I do agree that I could have shown other solutions and it would have made the article better.
I most certainly didn't want to downplay your explanation & examples; I (and obviously everyone else here) liked them a lot, but given the likely-hood someone might land on this page from a search for 'outer join & what's missing' I felt like a note about WHERE NOT EXISTS() was in order. (**)
(**: mostly because people often don't look further than the first thing 'that works'; worse, I've been having plenty of fights here at work about this exact subject where the argument often seems to be : but LEFT OUTER JOIN ... WHERE <field> IS NULL looks more technical so it HAS to be better... )
But I digress =) Like I said, thx for a fine article! I was merely trying to complement it.
November 12, 2013 at 3:44 pm
I'm starting to feel bad here ...
Hey deroby, don't feel bad. I'm betting many readers would've made the same comment if you hadn't beat them to it! You're also right about the tendency for people to look no further once they've found something that works.
Yes, this is a beginner-level article describing Outer Join as promised by its title. The introduction however, sets it up as a report request that needs a solution rather than as a tutorial on joins. Given the introduction, I think it's fair to expect that other solutions might be provided and possibly contrasted and I wouldn't classify the inclusion of NOT EXISTS as "optimization" that would have made it more confusing. I think NOT EXISTS is less confusing (regardless of performance) and I hope beginners are learning it early on.
Also, as a beginner-level tutorial on Left Outer joins, I think the article would be improved if it pointed out that you can't have any other criteria referencing the table you're left joining to in the WHERE clause. For example, say you wanted a list of employees who have not completed required company courses in 2013. At first glance you might be tempted to add criteria to the WHERE clause (instead of the join where it belongs):
WHERECourseCompleted.CourseId IS NULL
AND CourseCompleted.DateComplete >= '1/1/2013' --Failed attempt to list all courses missed this year
and then be surprised when an empty set is returned.
This common mistake got me when I was learning and I think it is yet another good reason for teaching NOT EXISTS. With NOT EXISTS, it's more intuitive to put the additional criteria in the correct place, the sub-query's WHERE clause:
WHERE NOT EXISTS
(
SELECT 1
FROM CourseCompleted
WHERE CourseCompleted.EmployeeId = Employee.EmployeeId
AND CourseCompleted.CourseId = Course.CourseId
AND CourseCompleted.DateComplete >= '1/1/2013' --List all courses missed this year
)
I know I'm being an armchair quarterback, but I don't mean to pick on the author. I appreciate how hard it must be to decide how much detail to put in an article like this. Hopefully my comments will help someone avoid pitfalls that got me.
November 12, 2013 at 4:00 pm
When I run the two queries (outer join vs NOT EXISTS),
the outer join query runs in 4-9ms (69 logical IO n the temp tables and 0 scans and 0 logical IO on the work table)
and the not exists query runs in 28-32ms (109 logical IO on the temp tables and 50 scans and 5079 logical IO on the work table).
It looks like the #CourseCompleted table is scanned 50 times in the NOT EXISTS query, whereas only 30 times in the outer join query.
I wonder if these results hold over larger data sets.
November 12, 2013 at 4:17 pm
Andy DBA (11/12/2013)
WHERECourseCompleted.CourseId IS NULL
AND CourseCompleted.DateComplete >= '1/1/2013' --Failed attempt to list all courses missed this year
I remember learning that lesson too. Before I knew I could add it to the ON clause, I always solved this type of query with
WHERECourseCompleted.CourseId IS NULL
AND ISNULL(CourseCompleted.DateComplete,'12/31/2012') >= '1/1/2013'
I was so happy the day I discovered that I could utilize the ON clause. 😛
November 12, 2013 at 5:37 pm
thisisfutile (11/12/2013)
I remember learning that lesson too. Before I knew I could add it to the ON clause, I always solved this type of query with
WHERECourseCompleted.CourseId IS NULL
AND ISNULL(CourseCompleted.DateComplete,'12/31/2012') >= '1/1/2013'
I was so happy the day I discovered that I could utilize the ON clause. 😛
I hope you're joking! :unsure: When using the Left Outer Join technique to find "missing" records, YOU CAN'T USE THE WHERE CLAUSE TO REFERENCE ANY COLUMNS IN THE "MISSING RECORD" TABLE except for the column that you specify IS NULL. The second criteria in the code snippet above won't filter anything, because ALL COLUMNS returned from CourseCompleted will be NULL so the ISNULL function will always return '12/31/2012' which is always greater than '1/1/2013'.
Think of it this way:
The Left Join tells SQL Server to return all rows from the Left regardless of matches found on the Right and to only include data from the Right when the join matches. The "<joined right table column> IS NULL" criteria in the WHERE clause tells SQL Server to not return any result records that DO have a match on the Right. All that is left for it to return is rows from the Left that are missing matches on the Right with the Right "columns" containing nothing but NULL placeholders.
If this seems confusing it's because it is confusing and that's why many people feel NOT EXISTS is more intuitive.
Anyone who is still unsure might benefit from doing a "Select *" on the original article's solution and see what happens with and without the where clause.
November 12, 2013 at 6:57 pm
arnipetursson (11/12/2013)
When I run the two queries (outer join vs NOT EXISTS),the outer join query runs in 4-9ms (69 logical IO n the temp tables and 0 scans and 0 logical IO on the work table)
and the not exists query runs in 28-32ms (109 logical IO on the temp tables and 50 scans and 5079 logical IO on the work table).
It looks like the #CourseCompleted table is scanned 50 times in the NOT EXISTS query, whereas only 30 times in the outer join query.
I wonder if these results hold over larger data sets.
I usually use a LEFT JOIN for this reason.
deroby (11/12/2013)
2. I've seen it 'fail' when people use a <somefield> that turns out to be NULL-able
That is easily avoidable, particularly when you're joining on a foreign key, by choosing <somefield> as one of the columns in your JOIN criteria. I agree though that it is a possible newbie mistake.
Nice article Patrick!
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
Viewing 15 posts - 1 through 15 (of 25 total)
You must be logged in to reply to this topic. Login to reply