Fun with Outer Joins

  • Ghesha (1/17/2014)


    In the provided example the filter "AND (Class.ClassYear >= 2011 OR Class.ClassYear IS NULL)" in WHERE clause works. But in general, it does not.

    First of all, I'd insist that in this case, ClassYear being NULL would be in error.

    Nevertheless, in the general case, what you say is true -- IF YOU USE a NULLABLE FIELD.

    However, you'll note that there is no obligation that both sides of that clause refer to the same column. The same effect can be achieved with

    AND (Class.ClassYear >= 2011 OR Class.ClassId IS NULL)

    (assuming ClassId is the primary key for table class)

  • Charles Kincaid (1/17/2014)


    Just always, ALWAYS remember the order of execution.

    I added the emphasis. Yes, understanding this execution order is why I have been able to write good performance filtered queries with no entries in the WHERE clause at all. I have shown these around our office and folks are calling "Where is your WHERE?" Further since the WHERE takes place after the JOINS (in the FROM) moving the predicates to the JOINS can reduce memory consumption.

    I have a rule that I keep in mind while designing queries: JOINS put stuff into the result set. WHERE takes stuff out of the result set.

    Well said.

  • Kenneth.Fisher (9/10/2012)


    (snip)

    My design is correct for the purpose. It's simple. I didn't need or want a complicated design to demonstrate a subject that can be complicated enough for a lot of people.

    (snip)

    Agreed. Your article demonstrates a point for non-guru SQL query writers. It does so quite well and without overwhelming them with "Mr. Spock" level jargon. Congratulations.

    John.

  • A rule of thumb I was shown was if you're adding something to the WHERE clause for the right hand part of a LEFT OUTER JOIN then it effectively changes to an INNER JOIN so push it up into FROM in order to build the set you need before you filter on what you want to keep.

  • ...

  • Thanks for this article. It brings to the fore my lack of understanding of the difference between the join and where clauses.

    I still don't understand why the clause 'AND Class.ClassYear >= 2011' produces a different result when used as part of the join.

    It seems to me that it can't contribute to the join because it refers only to the values in one of the tables in the join, so on the face of it, it acts as an ersatz where, cutting down the rows in the Classes table involved in the join.

    But, that being the case, why is the result different when it's transplanted to the real where clause? Why are the rows with NULL values in the ClassYear retained?

    Would greatly appreciate it if you could clarify that for me.

  • Look at my previous post for an explanation: it's a matter of order of execution.

    Basically, what happens (from a logical point), is this:

    1. A cartesian product of the two tables is produced (the query optimizer will probably use some other algorithm in practice, but you can ignore that to understand HOW the rows are included/excluded/calculated).

    2. Each pair of rows is evaluated by the ON clause. So, the ON eliminates unneeded rows from the join.

    3. The OUTER kicks in and gets all the rows that where excluded by the ON (from one table or both).

    4. The WHERE excludes rows from the new list.

    In case of INNER joins, step 3 is not executed, so conditionals can be switched from the ON clause to the WHERE freely.

    No magic, no secrets. Only execution order.

  • Try it like this:

    so on the face of it, it acts as an ersatz where, cutting down the rows in the Classes table involved in the join.

    Yes, you are right, this is what it does - but it does it before the OUTER restores the rows from the Professor table, which also adds the null parts on the right. Thus the nulls are being added in, as it were, after the WHERE clause equivalent.

    Conversely, using the actual WHERE clause means that it is applied after the nulls have been added, and so it removes them.

  • Thank you. That's what I wasn't getting. The difference between the ON and the OUTER. I was conflating the two (yes, even while reading the steps as described by chiesa.alberto).

    Thanks to both of you. I'm going to pin the execution order up somewhere prominent, so I take it into account from now on.

  • sneumersky (1/17/2014)


    Charles Kincaid (1/17/2014)


    Just always, ALWAYS remember the order of execution.

    I added the emphasis. Yes, understanding this execution order is why I have been able to write good performance filtered queries with no entries in the WHERE clause at all. I have shown these around our office and folks are calling "Where is your WHERE?" Further since the WHERE takes place after the JOINS (in the FROM) moving the predicates to the JOINS can reduce memory consumption.

    I have a rule that I keep in mind while designing queries: JOINS put stuff into the result set. WHERE takes stuff out of the result set.

    Well said.

    Honestly, while I agree that knowing the execution order is important, the idea that all queries can exclude entries in the WHERE clause makes me nervious as *&@#. If you are doing INNER JOINs then no problem, it will have the same effect. I just don't see how you can manage restrictions on a query with an OUTER or CROSS JOIN without a WHERE clause.

    Kenneth FisherI was once offered a wizards hat but it got in the way of my dunce cap.--------------------------------------------------------------------------------For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/[/url]Link to my Blog Post --> www.SQLStudies.com[/url]

  • Couldn't agree more.

  • chiesa.alberto (1/17/2014)


    It seems to me that a lot of people is not getting really HOW a query is executed.

    But that should be querying 101.

    I would just point out the T-SQL Querying book by Itzik Ben-Gan, but let me recap.

    A T-SQL query is composed by clauses, those are:

    SELECT TOP(x) ....

    FROM ...

    WHERE ...

    GROUP BY ....

    ORDER BY ...

    But this is a description of the query, not the real execution order.

    If we where writing queries with clauses in the exact execution order, it would be in this way:

    FROM ...

    WHERE ...

    GROUP BY ...

    SELECT ...

    ORDER BY ...

    TOP ...

    This is why you can write "Order by 1" and have the data sorted by the first column: because when the order by gets executed, the selected data is already there. However, you cannot use the expressions in the SELECT clause in the WHERE, FROM or GROUP BY clauses, because when those are executed, the select has still to be materialized.

    What seems to trouble a lot of people is failing to grasp that the ON clause and the WHERE clause are executed in different moments. And, in between them, the OUTER kicks in.

    So, the ON is used to match data for the join, and AFTER the join is evaluated, 1 of 4 things happens:

    - if the join is INNER, nothing happens and no data is appended. This is why an INNER join will be always at least as fast as an outer

    - if the join is LEFT, RIGHT or FULL OUTER, the non-joining data from the related table(s) will be added to the result set.

    After this result set given by the FROM clause is built, the WHERE kicks in. No magic or difficult considerations.

    Just always, ALWAYS remember the order of execution.

    Everything else in the article is, IMHO, unnecessary clutter.

    Couldn't agree more

  • Kenneth.Fisher (1/19/2014)


    Honestly, while I agree that knowing the execution order is important, the idea that all queries can exclude entries in the WHERE clause makes me nervious as *&@#. If you are doing INNER JOINs then no problem, it will have the same effect. I just don't see how you can manage restrictions on a query with an OUTER or CROSS JOIN without a WHERE clause.

    You should not be nervous. There is no magic in the query syntax. This is the problem I see with the article: it seems some sort of documentary on the strange habits of SQL queries.

    Talking about your example, you have three concepts:

    1. you are interested in professors with courses (a conditional on the year)

    2. but you want also the ones without courses (an Outer, because that is the only tool you have to "revive" a set of records previously excluded by a conditional)

    3. but all the results should have tenure (another conditional)

    1 and 2 are to be executed one after the other: you firstly select, and then pick up also the ones that where excluded. So 1 must go in the ON.

    Where should 3 go? Well, it depends on your requirement:

    If you want a list of ALL the professors, it must go before the OUTER, so in the ON clause.

    If, as is the case, you want only professors with Tenure, you must be sure that the clause is in the WHERE.

    Think it this way:

    1. ON is a tool to pick record pairs in a cartesian product. So it's a way to discard records.

    2. Outer is a way to "revive" discarded records from one or two tables.

    3. After the Outer, the result set will be composed by two sub sets: the paired rows produced by the ON clause, and the unpaired, revived rows picked up by the OUTER.

    4. The WHERE will be applied to everything, even the unpaired rows.

    So you NEED both tools, and you decide where a conditional has to be based on if it must be applied only to paired rows or also to unpaired ones.

    Only a problem of execution order, as I said.

    Edit: fixing typos

  • Surely the following is an easier way to accomplish this...

    SELECT PR.id, PR.ProfessorName, ClassName, ClassYear, ClassSemester

    FROM Professor PR

    LEFT JOIN

    (

    SELECT ProfessorID, ClassName, ClassYear, ClassSemester

    FROM Class

    WHERE ClassYear>=2011

    ) SQ ON SQ.ProfessorID = PR.Id

    WHERE PR.HasTenure = 1

  • rhydian (1/20/2014)


    Surely the following is an easier way to accomplish this...

    SELECT PR.id, PR.ProfessorName, ClassName, ClassYear, ClassSemester

    FROM Professor PR

    LEFT JOIN

    (

    SELECT ProfessorID, ClassName, ClassYear, ClassSemester

    FROM Class

    WHERE ClassYear>=2011

    ) SQ ON SQ.ProfessorID = PR.Id

    WHERE PR.HasTenure = 1

    It is equivalent (probably even for performance) to the proposed solution, but 2 SELECTs add visual clutter, IMO: you are repeating the list of fields 2 times. And, in complex queries, you could be tempted to use *, God forbid 😉

    For many people, your solution could probably be easier to understand, which has value.

    I would prefer to work with people that understands (and writes) the proposed solution, instead of this one.

Viewing 15 posts - 46 through 60 (of 63 total)

You must be logged in to reply to this topic. Login to reply