Outer Join – What am I missing (SQL Spackle)

  • arnipetursson (11/12/2013)


    When I run the two queries (outer join vs NOT EXISTS),

    the outer join query runs in 4-9ms (69 logical IO on the temp tables and 0 scans and 0 logical IO on the work table)

    and the not exists query runs in 28-32ms (109 logical IO on the temp tables and 50 scans and 5079 logical IO on the work table).

    It looks like the #CourseCompleted table is scanned 50 times in the NOT EXISTS query, whereas only 30 times in the outer join query.

    I wonder if these results hold over larger data sets.

    After adding UPDATE STATISTICS on the #tables, dumping the result into a temp table, and putting FREEPROCCACHE & DROPCLEANBUFFERS in front of each run, the results come back in roughly 14ms vs 13ms here, respectively.
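
    A minimal version of that cold-cache harness might look like this (a sketch only; the DBCC commands flush plans and buffers for the whole instance, so test boxes only):

    UPDATE STATISTICS #Employee;
    UPDATE STATISTICS #Course;
    UPDATE STATISTICS #CourseCompleted;
    DBCC FREEPROCCACHE;      -- throw away all cached plans
    DBCC DROPCLEANBUFFERS;   -- empty the buffer pool so reads start cold
    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;
    -- query under test goes here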

    Even if it were the other way around, in such close cases readability easily trumps any (small) performance hit IMHO.

    When upping the rowcount dramatically the difference seems to grow a little, but relatively speaking the two stay quite close together.

    DECLARE @NumberOfCourse INT = 3000;

    DECLARE @NumberOfEmployees INT = 50000;

    => (55181680 row(s) affected)

    [run 1] SQL Server Execution Times: CPU time = 106922 ms, elapsed time = 109345 ms.

    [run 2] SQL Server Execution Times: CPU time = 110672 ms, elapsed time = 113376 ms.

    vs

    [run 1] SQL Server Execution Times: CPU time = 101938 ms, elapsed time = 104593 ms.

    [run 2] SQL Server Execution Times: CPU time = 99031 ms, elapsed time = 101675 ms.

    Mind you, this is on a laptop with plenty of stuff running in the background, so don't take these numbers as hard evidence; they're a guideline at best.

    In my experience I've always found both approaches to be equivalent as far as execution times go.

  • Andy DBA (11/12/2013)

    The second criterion in the code snippet above won't filter anything, because ALL COLUMNS returned from CourseCompleted will be NULL

    Right you are. In my haste to explain how I learned the lesson not to reference a RIGHT table column in the WHERE clause of a LEFT join query (doing so effectively makes it an INNER join; see the sketch below), I added to your example and caused your confusion. What I was going for was to expand on your example by adding ISNULL so that it wouldn't produce an empty result set. I just didn't take into account that it wasn't actually doing anything.
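
    For anyone following along, a minimal sketch of the trap (the CourseId = 1 predicate is made up purely for illustration):

    -- Predicate on the RIGHT table in the WHERE clause: the NULL-extended
    -- rows fail the test and are discarded, so this behaves like an INNER join.
    SELECT e.EmployeeId, cc.CourseId
    FROM #Employee AS e
    LEFT JOIN #CourseCompleted AS cc ON cc.EmployeeId = e.EmployeeId
    WHERE cc.CourseId = 1;

    -- Moving the predicate into the ON clause preserves the outer join:
    SELECT e.EmployeeId, cc.CourseId
    FROM #Employee AS e
    LEFT JOIN #CourseCompleted AS cc ON cc.EmployeeId = e.EmployeeId
                                    AND cc.CourseId = 1;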

    Sorry for my public DUH moment. 😛

  • There is always more than one way to skin a cat. My preference is to optimize for simplicity and maintainability. Each part can be tested individually, and there are no WHERE clauses.

    SELECT -- get all employees and required courses
           e.EmployeeId,
           e.FirstName,
           e.LastName,
           c.CourseId,
           c.CourseName
    FROM #Employee e
    CROSS JOIN #Course c
    EXCEPT
    SELECT -- remove completed employee courses
           e.EmployeeId,
           e.FirstName,
           e.LastName,
           c.CourseId,
           c.CourseName
    FROM #Employee e
    JOIN #CourseCompleted cc ON cc.EmployeeId = e.EmployeeId
    JOIN #Course c ON cc.CourseId = c.CourseId
    ORDER BY e.EmployeeId;

  • Personally I find the EXISTS/NOT EXISTS statements more difficult to write, while the LEFT JOIN ... WHERE ... IS NULL is extremely intuitive to me. It's interesting to discover that others have the opposite view. I think there was an article just the other day on joins outperforming the use of subqueries.

  • @deroby,

    Don't feel bad. In fact, I think it's very important to explain to newbies that while this is an example of where LEFT OUTER JOIN can be used to accomplish the task at hand, and it may be easy to understand, it is generally not a good practice to rely on it, since it is notoriously inefficient and should be avoided, especially on large data sets. In today's world data sets are becoming very large very fast, and it is important to be conscious of inefficient queries from the very beginning. Since data generally grows over time, a query that works just fine when written often becomes a performance killer months or even weeks later.

    Another (more efficient) way to solve this problem without using a LEFT OUTER JOIN or even WHERE NOT EXISTS (which, by the way, is even less efficient) is to use the power of indexes. This method, while similar to the NOT EXISTS clause, explicitly collects the Cartesian set of indexes of employees and courses EXCEPT those where courses have been completed, joining on the #CourseCompleted table. Since we are only collecting the indexes and not the related data, we avoid returning unnecessary data and the looping inherent in the NOT EXISTS clause.

    To illustrate my point I simply used the original data creation script to increase the record count to 300 courses and 50000 employees.

    DECLARE @NumberOfCourse INT = 300;

    DECLARE @NumberOfEmployees INT = 50000;

    Using the Left Outer Join query:

    SELECT Employee.EmployeeId
          ,Employee.FirstName
          ,Employee.LastName
          ,Course.CourseId
          ,Course.CourseName
    FROM #Employee AS Employee
    CROSS JOIN #Course AS Course
    LEFT OUTER JOIN #CourseCompleted AS CourseCompleted
        ON CourseCompleted.EmployeeId = Employee.EmployeeId
        AND CourseCompleted.CourseId = Course.CourseId
    WHERE CourseCompleted.CourseId IS NULL
    ORDER BY Employee.EmployeeId
    -- Returns 5516909 records in 01:07

    -- --------------------------------------------------------------

    -- Using the NOT EXISTS query
    SELECT Employee.EmployeeId
          ,Employee.FirstName
          ,Employee.LastName
          ,Course.CourseId
          ,Course.CourseName
    FROM #Employee AS Employee
    CROSS JOIN #Course AS Course
    WHERE NOT EXISTS ( SELECT *
                       FROM #CourseCompleted AS CourseCompleted
                       WHERE CourseCompleted.EmployeeId = Employee.EmployeeId
                         AND CourseCompleted.CourseId = Course.CourseId )
    ORDER BY Employee.EmployeeId
    -- Returns 5516909 records in 03:33

    -- --------------------------------------------------------------

    While the EXCEPT clause is a great way to simplify things, it can be a real killer when returning a lot of data, since the EXCEPT data set has to match the top SELECT clause and the query engine is matching everything as if it were a compound key to see what matches exactly. When I ran the EXCEPT query posted by Bill Talada above I had to kill it after 7 minutes. But using it on just the indexes improves the query substantially.

    -- --------------------------------------------------------------

    Using the following script:

    IF OBJECT_ID('tempdb..#NotComplete','U') IS NOT NULL
        DROP TABLE #NotComplete;

    -- get all of the keys of a cartesian set into a temp table (15,000,000 rows)
    SELECT Employee.EmployeeId, Course.CourseId
    INTO #NotComplete
    FROM #Employee AS Employee
    CROSS JOIN #Course AS Course
    EXCEPT
    SELECT CourseCompleted.EmployeeId, CourseCompleted.CourseId
    FROM #CourseCompleted AS CourseCompleted;

    -- the remaining keys can be used to get the data for the report (5,516,909 rows)
    SELECT Employee.EmployeeId
          ,Employee.FirstName
          ,Employee.LastName
          ,Course.CourseId
          ,Course.CourseName
    FROM #NotComplete AS NotComplete
    INNER JOIN #Employee AS Employee ON Employee.EmployeeId = NotComplete.EmployeeId
    INNER JOIN #Course AS Course ON Course.CourseId = NotComplete.CourseId;
    -- Returns 5516909 records in 00:36 (-00:31, a 50% improvement)

    -- --------------------------------------------------------------

    ...and to your point, deroby, I did use the FREEPROCCACHE and DROPCLEANBUFFERS statements prior to execution of each. I'm using SS2014 on a laptop w/16GB RAM and an SSD drive. As you say, these are only indicators at best, but it is consistent with my experience. I'm not sure why your results should be so different. :-)

    Ron Cicotte
    ron.cicotte@gmail.com
    Data Transformations, LLC

  • Ron Cicotte (2/27/2015)


    Another (more efficient) way to solve this problem without using a LEFT OUTER JOIN or even WHERE NOT EXISTS (which, by the way, is even less efficient) is to use the power of indexes. This method, while similar to the NOT EXISTS clause, explicitly collects the Cartesian set of indexes of employees and courses and deletes those where courses have been completed, joining on the #CourseCompleted table. Since we are only collecting the indexes and not the related data, we avoid returning unnecessary data and the looping inherent in the NOT EXISTS clause.

    Could you please elaborate on how the WHERE NOT EXISTS approach is less efficient and on what you mean by "looping inherent in the NOT EXIST clause"? I think this is a good example of "it depends" on your data. If there are potentially many matches on the right-side table (CourseCompleted), the WHERE NOT EXISTS approach can eliminate the rows from the left-side table as soon as it finds the first match. I haven't studied the execution plan, but I believe the LEFT OUTER JOIN approach retrieves all matches and then filters rows. If that's the case, WHERE NOT EXISTS is clearly more efficient when there could be many matches per row.

    Also not sure what you mean by "use the power of indexes". Doesn't the query optimizer already do that whenever it can?

    It's a great idea to performance test queries whenever you're unsure, but I don't think the test you suggested is comparing apples to apples. The first two queries are sorting the result set by Employee.EmployeeId and the script is not.

    Also, as you pointed out, most of the elapsed time is spent returning the records, so it's harder to see the performance differences between the actual queries. For that reason, many benchmark tests will either just select a count or will select data into a table variable instead of returning it to the screen. Because other processes can compete for resources even on a "quiet" server, many benchmark tests will report CPU time as well as elapsed time to show how much CPU is actually used. (Don't forget, with parallel processing CPU time can exceed elapsed time.) There are also times when disk IO is the bottleneck, so it is also often reported.

    Out of curiosity I modified the test so it's comparing apples to apples and so the query differences have a greater overall impact on the results. I replaced the column selects with count(*) and removed the ORDER BY clauses.

    I then ran SET STATISTICS TIME ON (ignoring IO for now) and repeated the test.

    The LEFT OUTER JOIN took about 5.5 seconds of CPU and the same time elapsed,

    the WHERE NOT EXISTS took about 5 seconds of CPU and the same time elapsed, (Hmmm... maybe it's not "even less efficient".),

    and the script took about 34 seconds of CPU and 17 seconds elapsed.

    Finally, I tested the EXCEPT query suggested by Bill Talada by making it a subquery and selecting a count from it. (I also removed the name columns so the EXCEPT was only comparing IDs. In "real life", you'd probably do that anyway and then join the EXCEPT subquery to the base table to pull only needed data columns).
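
    For clarity, the count-only EXCEPT test I describe above would look roughly like this (IDs only, no ORDER BY; a sketch, not my exact test script):

    SELECT COUNT(*)
    FROM ( SELECT e.EmployeeId, c.CourseId
           FROM #Employee AS e
           CROSS JOIN #Course AS c
           EXCEPT
           SELECT cc.EmployeeId, cc.CourseId
           FROM #CourseCompleted AS cc ) AS NotCompleted;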

    It took about 16 seconds of CPU and only 4 seconds elapsed! A clear winner in terms of elapsed time, but at a higher CPU cost.

    All of my tests were run using Microsoft SQL Server Developer (64-bit) Version 11.0.5058.0 on a Windows Enterprise 3.2GHz Workstation with 12 Gigs RAM. The record count was 5,517,383.

    As always, YMMV, but to summarize my results:

    The EXCEPT technique ran the fastest,

    NOT EXISTS was 20% slower but used less CPU, and

    the script took the longest and used the most CPU.

    Good, objective testing is important, because

    Ron Cicotte (2/27/2015)


    In today's world data sets are becoming very large very fast, and it is important to be conscious of inefficient queries from the very beginning.

  • Hmmm! Good points. ...and thank you for a very good and rather thorough explanation of the issues around testing "apples to apples".

    My main point was that performance matters, and in the old days (in my case that goes back to when SQL Server was a Sybase clone) OUTER JOINs were to be avoided whenever possible. So we broke things down into steps using #temp tables rather than trying to return the results in a single query. My reference to NOT EXISTS looping is based on the understanding that the query engine uses the results from the main SELECT clause to loop through the result set being tested, stopping as soon as a record is found. It's similar to creating a #temp table of keys and using a WHILE EXISTS loop to process a table row by row, deleting the keys one by one from the #temp table until none are left (see the sketch below), except that in the case of the NOT EXISTS clause the loop moves on as soon as the first match is found.
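
    To show what I mean, here is a hypothetical version of that row-by-row pattern (for illustration only; the set-based queries above are what I'd actually use):

    DECLARE @EmployeeId INT, @CourseId INT;
    WHILE EXISTS (SELECT 1 FROM #NotComplete)
    BEGIN
        -- pick an arbitrary remaining key pair (no ORDER BY; any row will do)
        SELECT TOP (1) @EmployeeId = EmployeeId, @CourseId = CourseId
        FROM #NotComplete;

        -- ... do something with this employee/course pair ...

        -- remove the processed pair so the loop eventually terminates
        DELETE FROM #NotComplete
        WHERE EmployeeId = @EmployeeId AND CourseId = @CourseId;
    END;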

    At least that was my perception until running these tests and I'd appreciate any clarification you might offer.

    By "the power of indexes" I mean that indexes represent a much smaller data set than the full table; by isolating them and operating on them to filter the data prior to retrieving the final result set, we can minimize the size of the data sets being processed. This is a general practice in my work.

    My original script involved 3 steps.

    1. SELECT the Cartesian set of indexes of employees and courses into a temp table.

    2. Delete those records matching the completed courses.

    3. Use the remaining indexes to retrieve the final results.

    Then I noticed the EXCEPT statement used in Bill Talada's post and realized that combining steps 1 and 2 would be an improvement. But your observation that it can be implemented as a subquery is clearly an even better solution.

    I am curious about your results though. I modified the script to be a single statement using the EXCEPT clause in a subquery as you suggested, but I inserted the results into a #temp table. I also modified the other queries to select into temp tables (see the sketch below). My thinking was that it is important to include the selected data in the comparison, because the query plans presumably would be influenced by which tables the returned data comes from, and unless you're returning the data somewhere you're not including all of the relevant variables.
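
    For example, the LEFT OUTER JOIN version with the result sunk into a temp table looks like this (#Results is just a name I picked):

    SELECT Employee.EmployeeId
          ,Employee.FirstName
          ,Employee.LastName
          ,Course.CourseId
          ,Course.CourseName
    INTO #Results   -- data sink instead of returning rows to the client
    FROM #Employee AS Employee
    CROSS JOIN #Course AS Course
    LEFT OUTER JOIN #CourseCompleted AS CourseCompleted
        ON CourseCompleted.EmployeeId = Employee.EmployeeId
        AND CourseCompleted.CourseId = Course.CourseId
    WHERE CourseCompleted.CourseId IS NULL;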

    These Results differ from yours as follows:

    LEFT OUTER JOIN: -- CPU time = 9812 ms, elapsed time = 9816 ms.

    WHERE NOT EXISTS: -- CPU time = 38017 ms, elapsed time = 5910 ms.

    EXCEPT clause subquery: -- CPU time = 38188 ms, elapsed time = 5215 ms.

    So unlike your results, the CPU times and elapsed times were similar for the EXCEPT and NOT EXISTS versions of the query, while the LEFT OUTER JOIN was consistent with your results.

    I'm running SQL Server Developer version 12.0.2000.8 (X64) on a DELL XPS 15 2.3GHz with 16G of memory and an SSD.

    Clearly hardware matters too. :-)

    Ron Cicotte
    ron.cicotte@gmail.com
    Data Transformations, LLC

  • Another set of interesting stats. When I run the queries minus the ORDER BY clause and select only count(*), as with your conditions, I get the following:

    -- LEFT OUTER JOIN: CPU time = 27640 ms, elapsed time = 3694 ms.

    -- EXCEPT: CPU time = 27566 ms, elapsed time = 3533 ms.

    -- NOT EXISTS: CPU time = 25607 ms, elapsed time = 3280 ms.

    Almost no difference. Why the huge increase in CPU time and reduction in elapsed time for the LOJ???

    I've saved the query plans for all conditions and will look at them to see if anything jumps out.

    I know this is way off topic now but I would like to pursue the subject. I haven't been on this forum for a long time. What's the best way forward? Is there a specific area for query plan analysis / performance tuning?

    Ron Cicotte
    ron.cicotte@gmail.com
    Data Transformations, LLC

  • @Ron:

    I go pretty far back with SQL Server too, and would never have even considered using "where not exists" over a "left join" to retrieve orphans until I skeptically read an SSC article and then proved to myself how efficient that approach is.

    I also was convinced that the query engine would loop through the "not exists" query for each row in the "Left" table. But I did some performance testing and also compared estimated query plans between a good ol' left join and not exists and was amazed to see that often they had identical plans (This was probably on SQL 2000 or 2005.)

    I also had to begrudgingly admit that, for me anyway, the "where not exists" syntax is more intuitive even though I wasn't used to it.

    This is no substitute for testing but, as a side note, there's a quick and dirty trick you can do with query plan estimates to compare queries. If you put them in the same SSMS query pane and then display the estimated execution plan for the batch, the plan will show you what percentage of the total estimated cost each query represents "relative to the batch". Take it with a grain of salt, though! The engine may decide to use a different plan at run time.

    I understand now what you meant by power of indexes. I tend to think of them as internal "tables" that are available to the query optimizer. I'm far from being an expert on query plans, but I can get enough out of poking around plans for simple test queries like these to realize that the engine is definitely using the indexes.

    The next time you find yourself about to make a temp table containing just the columns you need, I suggest looking into included columns in indexes (apologies if you already use them), because that's exactly what they do; see the sketch below. Obviously indexes use disk space and can slow down writes, so there are tradeoffs, but they are always available and don't have to be built on the fly like temp tables. If you're always pulling columns A, B, and C from a superwide table using a non-clustered index on column D, it may make sense to put A, B, and C in your column D index as includes.
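
    A sketch of that kind of index (table and column names are made up for illustration):

    -- ColumnD is the search key; A, B, and C ride along at the leaf level
    -- as included columns, so qualifying queries never touch the wide base table.
    CREATE NONCLUSTERED INDEX IX_SuperWide_ColumnD
    ON dbo.SuperWideTable (ColumnD)
    INCLUDE (ColumnA, ColumnB, ColumnC);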

    I understand your reasoning that plans are influenced by which tables contain returned data, but in this case the queries are all retrieving from the same tables, so presumably any differences in response times would be caused by differences in time required to identify which rows are being returned. There are reasons why that may not always be the case though (parallelism, random fetches vs block fetches, etc), so it is a good idea to create a data "sink" for doing serious testing. I've just found that not only does it take forever to return 5M rows to your screen, there are too many other variable factors such as network traffic or competition for workstation resources that can cloud query testing results when returning data to the client.

    Yes, hardware most definitely matters! It's interesting to note that on your testbed, the left join is the only one that ran single threaded when pulling data. The other two used parallelism. (I'm guessing your SQL engine had at least 8 CPUs available versus the 4 available on the testbed I used.) It's also interesting that they all used parallelism when you tested using count(*).

    Don't have the time now, but it might be interesting to perform the tests using OPTION (MAXDOP 1) to see how much those other CPUs are earning their keep! It would also be interesting to see if MAXDOP 1 affects queries that pull data more drastically than it does simple count(*) queries.
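
    For anyone who wants to try it, the hint just goes on the tail of each test query, e.g. on the NOT EXISTS count (a sketch):

    SELECT COUNT(*)
    FROM #Employee AS e
    CROSS JOIN #Course AS c
    WHERE NOT EXISTS ( SELECT *
                       FROM #CourseCompleted AS cc
                       WHERE cc.EmployeeId = e.EmployeeId
                         AND cc.CourseId = c.CourseId )
    OPTION (MAXDOP 1);   -- force a serial plan for comparison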

    I tend to visit this forum sporadically also, so I don't know if there's an easy way to find articles grouped by topic. I tend to use the site's search function or sometimes bounce in from Google. Also, there are a number of reputable contributors here who I tend to search for when looking for answers.

    SSC is a great resource if you verify things for yourself and also if you consider how close your scenario is to what is being presented. Most of the smattering I know about performance tuning I learned here!

  • deroby (11/12/2013)


    Hi there,

    Thx for the article about what often turns out to be a brain-breaker for a lot of people starting out with SQL... but... (there's always a but =)

    Although the LEFT OUTER JOIN ... WHERE <somefield> IS NULL gets the job done, I tend to discourage its use for three reasons:

    1. It isn't very readable for people not into SQL

    2. I've seen it 'fail' when people use a <somefield> that turns out to be NULL-able

    3. Not applicable in your example, but the JOIN might cause doubles/triples/etc in the results depending on the data-structure... which they then 'fix' again by using DISTINCT etc... ( "ahh, the horror" =)

    The (IMHO) much easier to understand (and code) method is to use WHERE NOT EXISTS(), which pretty much circumvents all three issues. Ironically it usually ends up with the same query plan anyway. Or, as is the case here, it sometimes even comes up with a better one. But most of all, I think it is a lot easier to read, which (in the long run) makes it a more favourable option to me.

    /* List of employees and courses they have not completed */
    SELECT Employee.EmployeeId
          ,Employee.FirstName
          ,Employee.LastName
          ,Course.CourseId
          ,Course.CourseName
    FROM #Employee AS Employee
    CROSS JOIN #Course AS Course
    WHERE NOT EXISTS ( SELECT *
                       FROM #CourseCompleted AS CourseCompleted
                       WHERE CourseCompleted.EmployeeId = Employee.EmployeeId
                         AND CourseCompleted.CourseId = Course.CourseId )
    ORDER BY Employee.EmployeeId

    My 2 cents.

    Roby

    I agree, though I have used both, and when I want a quick test result I use either; it's just how the mood hits me. Sometimes I try different ways of coding something just to make sure I understand the column/table relationships, and in hopes of discovering some new great way to do something.

  • There were some comments about finding a good use for cross joins. Cross joins followed by a left outer join, just like this one, are very powerful when building cross-tab reports in SSRS. You do the cross join first to create a complete Cartesian product matrix, then use the left outer join to fill it (sometimes very sparsely). Of course, leave out the IS NULL in the WHERE clause because you want all of the answers, not just the missing ones. Then it just becomes an exercise in using that result set for your SSRS report. In this example, you would have all of the courses across the top, all of the employees on the left hand side, and the data (however you want to represent it: if or when the course was taken). If the employee could have taken the course multiple times, then you could use COUNT to get how many or MAX to get the last occurrence.
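
    A rough sketch of that matrix-building pattern against the article's sample tables, using COUNT for the cell value (IS NULL filter left out, as described):

    SELECT e.EmployeeId
          ,e.LastName
          ,c.CourseId
          ,c.CourseName
          ,COUNT(cc.CourseId) AS TimesTaken   -- 0 where the course was never completed
    FROM #Employee AS e
    CROSS JOIN #Course AS c
    LEFT OUTER JOIN #CourseCompleted AS cc
        ON cc.EmployeeId = e.EmployeeId
        AND cc.CourseId = c.CourseId
    GROUP BY e.EmployeeId, e.LastName, c.CourseId, c.CourseName;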
