SQL & the JOIN Operator

Question

Post reply

SQL & the JOIN Operator

Viewing 15 posts - 76 through 90 (of 99 total)

You must be logged in to reply to this topic. Login to reply

sunny.daglia Newbie Points: 7 More actions · Answer 1

Hi All,

I have a question on joining more than 2 tables.

When we have more than 2 tables to join what is the order of execution :

For eg :

select ...

from table t1 inner join table t2

on t1.a = t2.a

inner join table t3

on t2.b = t3.b

I think it should join t1 & t2 first and then t2 & t3 to further filter the records from the join of t1 &t2.

And if the query is like

select ...

from table t1 inner join table t2

on t1.a = t2.a

inner join table t3

on t1.b = t3.b

Here how it goes? First join First, which i think but just want to confirm.

Thanks.

Sunny

Nils Gustav Stråbø SSChampion Points: 11259 More actions · Answer 2

Good article 🙂

The only thing I'm missing is a description of the other "join like" operators, like APPLY, INTERSECT and EXCEPT, but I know it's hard to cover everything in one article.

Guess there will be a follow up 😀

Cuppadan SSC Enthusiast Points: 109 More actions · Answer 3

Thanks for the informative article. As a newbie this helped me tremendously! 🙂

Cuppadan

wagner crivelini SSC Eights! Points: 925 More actions · Answer 4

Nils

Thanks for your comments. You know this 2nd article is in my list, hope I have time to work on it soon 🙂

wagner crivelini SSC Eights! Points: 925 More actions · Answer 5

Sunny

Usually the query optimizer will evaluate the different sequences when defining the execution plan. So, considering standard situations, your query's performance will be same in both cases.

gregg_dn SSCommitted Points: 1674 More actions · Answer 6

Enjoyed the article and found the diagrams easy to follow.

timothy bates SSC-Addicted Points: 471 More actions · Answer 7

arun55 (10/8/2009)
vliet (10/8/2009)
Since the 'or' operator has lower precedence as the 'and' operator, when you ommit the parentheses the join condition means something completely different. Without parentheses the predicate behind the 'or' is evaluated separately, which leaves out the 'a.id = b.id' part completely, thus returning all records from 'a' for each and every record where 'b.date2' is empty. This might blow up the size of the result set a lot when a has many rows and b has many rows without a date in this column. Never ommit any parenthenses whenever you use both 'and' and 'or' in a predicate. You can never use too many parenthenses in these cases, computers don't know what you mean, they just do what you tell them to, but that's generaly not what you want.
I've seen predicates with both the 'and' and 'or' operator being the cause of 'human' errors in SQL so often, it's almost worth an article.
Thanks for the information. I understood what you have explained.
But in above case, b.date field has no nulls and I am getting same results.
I think it was a precautionary statement. I will check the requirement and try to add parentheses.

If Table B can contain a start but no end date (assumption given inferred use) this is important.

Your 40 minute query is returning information based on two specific possibilities:

1) a.id = b.id and a.date1 between b.date1 and b.date2

2) b.date2 is null

This would return the same data as the following more complex and less efficient query:

select a.*,b.*

from table a

left join table b

on a.id = b.id

and a.date1 between b.date1 and b.date2

union

select a.*,b.*

from table a, table b

where b.date2 is null

Looking at the second select this is a cross join. After creating this cross join it then goes through each record Count(A)*Count(B) and select those where date2 is null (0 per your statements). So this entire cross join would be generated then ignored. You are paying the cost to create this cross join and then ignore it.

Your results are the same given your data set, but this doesn't mean your queries are equal. They could change after your next transaction.

You might consider a third query depending on your expectation.

This requires that a.date1 occur after b.date1 but b.date2 may be null

select a.*,b.*

from table a

left join table b

on a.id = b.id and a.date1 >= b.date1 and a.date1 <= isnull(b.date2,getdate())

This query allows a.date1 to predate b.date1. This may not be desired.

Arun55 modified.

select a.*,b.*

from table a

left join table b

on a.id = b.id

and (a.date1 between b.date1 and b.date2 or b.date2 is null)

SQLRNNR SSC Guru Points: 281344 More actions · Answer 8

Well done with this article.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

permap SSC Journeyman Points: 78 More actions · Answer 9

Good article.

Here's a possible correction. For the Self Join section, there is the output to the query. See the fourth record which is "7, Peter, 8, Harry". Isn't it supposed to be "7, Peter, 6, Mark"?

Thanks.

ksjayadev SSC Journeyman Points: 96 More actions · Answer 10

ksjayadev

SSC Journeyman

Points: 96

February 26, 2011 at 8:46 am

#1291913

Excellent article using Venn diagrams

James G SSC Enthusiast Points: 149 More actions · Answer 11

One question: I thought that when you used multiple JOINs that SQL does the first join between Table1 and Table2 and then joins the result set of that join to Table3. Is this correct? Your article seems to say that SQL joins Table1 to Table2 and then joins Table1 to Table3 and then I assume it joins the result set of those two joins. Could you clarify? Is there some "order of operations" when it comes to JOINs?

Thank you for the article! Really like the Venn diagram visual aid.

wagner crivelini SSC Eights! Points: 925 More actions · Answer 12

wagner crivelini

SSC Eights!

Points: 925

February 26, 2011 at 4:14 pm

#1291953

Permap

Good catch! I'm correcting it now.

Thx

Paul White SSC Guru Points: 150468 More actions · Answer 13

Very nicely presented article, but the content is not quite right in places.

The set intersection is useful so long as your are clear that you are talking about sets of keys (i.e. unique join column values). If you don't make that distinction, or omit to go on to describe how one-many and many-many joins looks, it seems misleading. An INNER JOIN is not the intersection of two sets (that would be INTERSECT). The example data only uses unique join key values in both tables, which could reinforce any misconception.

I appreciate the value of simplification to introduce a concept, but the set-intersect visual might promote unclear understanding. The same issue applies to the coverage of LEFT, RIGHT, and FULL OUTER JOIN. There's no mention of how duplicates are treated, or how rows are NULL-extended when a match does not occur. Again, I feel the Venn diagram approach leads us astray here, unless it is very clearly pointed out that you are visualizing sets (unique by definition) of join keys.

The article also describes the cross product of Table1 and Table2 as "...each record of Table1 will be duplicated for each record in Table2". This suggests that only Table1's rows will appear duplicated in the output, which is clearly not the case. It's not wrong exactly, but it is incomplete and imprecise.

The section 'Excluding the Intersection of Sets' recommends the least efficient method (LEFT JOIN). NOT EXISTS is generally preferred because it can stop searching as soon as it finds a match. The semantic is different, of course, because LEFT JOIN ... WHERE NOT NULL can produce duplicates whereas NOT EXISTS will not. 'Excluding the intersection' has an air on set theory about it, but it isn't a very good description for returning rows from Table1 where a matching row does not exist in Table2.

In the same section, you say: "When we do this kind of query, we have to pay attention to which field we pick for the WHERE clause. We must use a field that does not allow NULL values. Otherwise the result set may include unwanted records. That's why I suggested to use the second table's key. More specifically, its primary key. Since primary keys don't accept NULL values, they will assure our result set will be just what we needed." This is a common misconception, which you should have tested before publishing. There is no problem with performing an outer join on NULLable columns: NULLs never join (regardless of the ANSI_NULLS setting), so there is no possible confusion between a 'join matched' NULL and a NULL resulting from the NULL extension of a row by the outer join.

The section 'One Word About Execution Plans' starts with "These comments lead us to an important insight. We usually don't stop to think about this, but observe that the execution plan of SQL queries will first calculate the result set for the FROM clause and the JOIN operator (if any), and then the WHERE clause will be executed". This is logically true, but not physically. If it were, we would never see an index seek in a plan with joins! The SQL Server optimizer is free to reorder expressions, transform the plan, and push filters (predicates) down past joins if it results in a lower cost query. The only reason the filter in your LEFT OUTER JOIN example can't be pushed past the join is because it is testing NULL-extended rows resulting from the join.

In 'Joins and Indexes', you say: "On the other hand, Table1 had no index on field key2. Because of that, the query optimizer tried to be smart enough and improve the performance of querying key2 using the only available index." - This is nonsense. It isn't the optimizer trying to be smart - there is no other access method available aside from scanning the clustered index. There's no such thing as a table scan on a clustered table.

The ability to perform anything other than an equi-join is not a 'cosmetic feature' - such joins are common where we need to produce a running total or a match on a range of dates, for example. The problem with non-equal (inequality) joins is NOT that they usually duplicate records - it is that performance tends to be worse because more rows will match.

Under 'Multiple Joins', you write: "Just remember that joins are written in pairs. So first we will join Table1 to Table2. And then we will join Table1 and Table3". The query optimizer chooses an initial join order based on table cardinality and other heuristics - it does not normally follow the written query order (though you can force that by including the FORCE ORDER hint, setting FORCE_PLAN ON, or by hinting a specific physical join operator e.g. INNER HASH JOIN). In the main, the optimizer picks an initial order, and then applies transformations which may result in further changes to the physical join order.

Consider reading my optimizer series starting at http://sqlblog.com/blogs/paul_white/archive/2010/07/29/inside-the-optimiser-constructing-a-plan.aspx

Paul

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 14

Paul,

Isn't this a bit like the article you wrote on designing databases? While I agree with all the things you've stated, entire books have been written on how to join and all the caveates that occur when dupes are present, etc, ad infinitum. This article is a good introduction to joins... it's not the final word just like your introduction to designing databases isn't and wasn't intended to be the final word. Think about it... 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

wagner crivelini SSC Eights! Points: 925 More actions · Answer 15

James

The SQL statement was supposed to do it from the first to the last join. Actually I understand that's the way things worked out in old RDMBSs.

But SQL SERVER has a powerful Query Optimizer that evaluates the possibilities to choose the best combination of tables so to optimize query performance.

You can do yourself a simple test to check that. Try running the following queries. This is a simple query (running on ADVENTUREWORKSLT2008) written in two different forms, changing the sequence of the JOINs.

[font="Courier New"]USE AdventureWorksLT2008

GO

SELECT C.CompanyName, P.Name AS Product, SUM(SD.OrderQty) AS OrderQtyTotal

FROM SALESLT.CUSTOMER C

INNER JOIN SalesLT.SalesOrderHeader SH ON SH.CustomerID = C.CUSTOMERID

INNER JOIN SalesLT.SalesOrderDetail SD ON SD.SalesOrderID = SH.SalesOrderID

INNER JOIN SalesLT.Product P ON P.ProductID = SD.ProductID

GROUP BY C.CompanyName, P.Name

GO

SELECT C.CompanyName, P.Name AS Product, SUM(SD.OrderQty) AS OrderQtyTotal

FROM SALESLT.Product P

INNER JOIN SalesLT.SalesOrderDetail SD ON P.ProductID = SD.ProductID

INNER JOIN SalesLT.SalesOrderHeader SH ON SD.SalesOrderID = SH.SalesOrderID

INNER JOIN SalesLT.Customer C ON SH.CustomerID = C.CUSTOMERID

GROUP BY C.CompanyName, P.Name

GO[/font]

If you compare the execution plan of each query, you'll see they're exactly the same (due to the work of the QUERY OPTIMIZER).

(I looked for some reference to show you this idea and only found this pretty short explanation http://bytes.com/topic/sql-server/answers/142445-order-joins-performance)