The first thing we learn to do with SQL is writing a SELECT statement to get data from one table. This kind of statement seems to be straightforward and very close to the language we speak.
But real-world queries are often much more sophisticated than those simple SELECT statements.
First of all, usually the data we need is split into several different tables. This is a natural consequence of data normalization, which is an essential feature of any well designed database model. And SQL gives you the power to put that data together.
In the past, DBAs and developers used to put all necessary tables and/or views in the FROM clause and then use the WHERE clause to define how the records from each table would combine with the other records. (To make this text a bit more readable, from now on, I will simplify things and say "table" instead of "table and/or view").
But there's been a long time since we have a standard for bringing these data together. And this is done using the JOIN operator ( ANSI-SQL 92). Unfortunately, there are some details about JOIN operators that remain obscure for many people.
Below I will show different syntaxes of joins supported by T-SQL (that is SQL SERVER 2008). I will outline a few concepts I believe they shouldn't be forgot each time we combine the data from two tables or more.
Getting Started: 1 Table, no Join
When you have only one object to query, the syntax will be quite simple and no join will be used. The statement will be the good & old "SELECT fields FROM object" plus any other optional clause you might want to use (that is WHERE, GROUP BY, HAVING or ORDER BY).
One thing that end users don't know is that we DBAs usually hide lots of complex joins under one nice and easy-to-use view. This is done for several reasons, ranging from data security to database performance. For instance, DBAs can give permissions for the end users to access one single view instead of several production tables, obviously increasing data security. Or considering performance, DBAs can create a view using the right parameters to join the records from several tables, correctly using database indexes and thus boosting query performance.
All in all, joins might be there in the database even when the end users don't see them.
The Logic Behind Joining Tables
Many years ago, when I started working with SQL, I learned there were several types of joins. But it took me some time to understand what exactly I was doing when I brought those tables together. Maybe because people are so scared of mathematics, it is not frequently said that the whole idea behind joining tables is about Set Theory. Despite the fancy name, the concept is so simple we are taught it in elementary school.
The drawing in Figure 1 is quite similar to the ones found in my kids' books from First Grade. The idea is to find correspondent objects in the different sets. Well, this is precisely what we do with SQL JOINs!
Figure 1: Combining objects from different sets
Once you understand the analogy, things will start to make sense.
Consider that the 2 sets in Figure 1 are tables and the numbers we see are the keys we will use to join the tables. So in each set, instead of representing the whole records, we are only seeing the key fields from each table. The result set of this combination will be determined by the type of join we consider, and this is the topic I will show now. To illustrate our examples, consider we have 2 tables, shown below:
Table1
key1 | field1 | field2 | key2 | key3 |
3 | Erik | 8 | 1 | 6 |
4 | John | 3 | 4 | 4 |
6 | Mark | 3 | 7 | 1 |
7 | Peter | 6 | 8 | 5 |
8 | Harry | 0 | 9 | 2 |
Table2
key2 | field1 | field2 | field3 |
1 | New York | A | N |
2 | Sao Paulo | B | N |
4 | Paris | C | Y |
5 | London | C | Y |
6 | Rome | C | Y |
9 | Madrid | C | Y |
0 | Bangalore | D | N |
The script to create and populate those tables are available as one attached file (SQLServerCentral.com_JOIN.sql) in the Resources section below.
You will notice this script does not fully implement referential integrity. I intentionally left the tables without foreign keys to better explain the functionality of the different types of joins. But I did so for didactical purposes only. Foreign keys are extremely useful to guarantee data consistency and they should not be left out of any real-world database.
Well, now we are ready to go. Let's check the types of joins we can use in T-SQL, the correspondent syntax and the result set that each one will generate.
The Inner Join
This is the most common join we use in SQL. It returns the intersection of two sets. Or in terms of tables, it brings only the records from both tables that match a given criteria.
We can see in Figure 2 the Venn diagram illustrating the inner join of two tables. The result set of the operation is the area in red.
Figure 2: Representing the INNER JOIN
Now check the syntax to combine the data from Table1 and Table2 using a INNER JOIN.
SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.key2 = t2.key2;
The result set of this statement will be:
key1 | Name | T1Key | T2Key | City |
3 | Erik | 1 | 1 | New York |
4 | John | 4 | 4 | Paris |
8 | Harry | 9 | 9 | Madrid |
Notice it returned only the data from records which have the same value for key2 on both Table1 and Table2.
Opposed to the INNER JOIN, there is also the OUTER JOIN. There are 3 types of OUTER JOINs, named full, left and right. We will look at each one in detail below.
The FULL JOIN
This is also known as the FULL OUTER JOIN (the reserved word OUTER is optional). FULL JOINs work like the union of two sets. Now we have in Figure 3 the Venn diagram illustrating the FULL JOIN of two tables. The result set of the operation is again the area in red.
Figure 3: Representing the FULL JOIN
The syntax is almost exactly the same we saw before.
SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
FULL JOIN Table2 t2 ON t1.key2 = t2.key2 ;
The result set of this statement will be:
key1 | Name | T1Key | T2Key | City |
3 | Erik | 1 | 1 | New York |
4 | John | 4 | 4 | Paris |
6 | Mark | 7 | null | null |
7 | Peter | 8 | null | null |
8 | Harry | 9 | 9 | Madrid |
null | null | null | 2 | Sao Paulo |
null | null | null | 5 | London |
null | null | null | 6 | Rome |
null | null | null | 0 | Bangalore |
The FULL JOIN returns all records from Table1 and Table2, without duplicating data.
The LEFT JOIN
Also known as LEFT OUTER JOIN, this is a particular case of the FULL JOIN. It brings all requested data from the table that appears to the left of the JOIN operator, plus the data from the right table which intersects with the first one. Below we have a Venn diagram illustrating the LEFT JOIN of two tables in Figure 4.
Figure 4: Representing the LEFT JOIN
See the syntax below.
SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.key2 = t2.key2 ;
The result set of this statement will be:
key1 | Name | T1Key | T2Key | City |
3 | Erik | 1 | 1 | New York |
4 | John | 4 | 4 | Paris |
6 | Mark | 7 | null | null |
7 | Peter | 8 | null | null |
8 | Harry | 9 | 9 | Madrid |
The third and forth records (key1 equals to 6 and 7) show NULL values on the last fields because there is no information to be brought from the second table. This means we have a value in field key2 in Table1 with no correspondent value in Table2. We could have avoided this "data inconsistency" in case we had a foreign key on field key2 in Table1.
The RIGHT JOIN
Also known as RIGHT OUTER JOIN, this is another particular case of the FULL JOIN. It brings all requested data from the table that appears to the right of the JOIN operator, plus the data from the left table which intersects with the right one. The Venn diagram for the RIGHT JOIN of two tables is in Figure 5.
Figure 5: Representing the RIGHT JOIN
As you can see, syntax is very similar.
SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
RIGHT JOIN Table2 t2 ON t1.key2 = t2.key2 ;
The result set of this statement will be:
key1 | Name | T1Key | T2Key | City |
null | null | null | 0 | Bangalore |
3 | Erik | 1 | 1 | New York |
null | null | null | 2 | Sao Paulo |
4 | John | 4 | 4 | Paris |
null | null | null | 5 | London |
null | null | null | 6 | Rome |
8 | Harry | 9 | 9 | Madrid |
Observe now that records with key1 equal to 6 and 7 no longer appear in the result set. This is because they have no correspondent record in the right table. There are 4 records showing NULL values on the first fields, because they are not available in the left table.
The CROSS JOIN
A CROSS JOIN is in fact a Cartesian product. Using CROSS JOIN generates exactly the same output of calling two tables (separated by a comma) without any JOIN at all. This means we will get a huge result set, where each record of Table1 will be duplicated for each record in Table2. If Table1 has N1 records and Table2 has N2 records, the output will have N1 times N2 records.
I don't believe there is any way to represent this output in a Venn diagram. I guess it would be a three dimensional image. If this is really the case, the diagram would be more confusing than explanatory.
The syntax for a CROSS JOIN will be:
SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
CROSS JOIN Table2 t2 ;
As Table1 has 5 records and Table2 has another 7, the output for this query will have 35 records (5 x 7).
Please check the attached file (SQLServerCentral.com_JOIN_CrossJoin.rpt). |
Quite honestly, I don't remember at this very moment not a single real-life situation that I do need to generate a Cartesian product of two tables. But whenever you need, CROSS JOIN is there, anyway.
Besides, you should be concerned about performance. Say you accidentally run in your production server a query with a CROSS JOIN over two tables with 1 million records. This is surely something that will give you a headache. Probably your server will start showing performance problems, as your query might run for some time consuming a considerable amount of the server resources.
The SELF JOIN
The JOIN operator can be used to combine any pair of tables, including combining the table to itself. This is the "self join". Self joins can use any JOIN operator.
For instance, check this classical example of returning an employee's boss (based on Table1). In this example, we consider that the value in field2 is in fact the boss' code number, therefore related to key1.
SELECT t1.key1, t1.field1 as Name, t1.field2, mirror.field1 as Boss FROM Table1 t1
LEFT JOIN Table1 mirror ON t1.field2 = mirror.key1;
And this is the output to this query.
key1 | Name | field2 | Boss |
3 | Erik | 8 | Harry |
4 | John | 3 | Erik |
6 | Mark | 3 | Erik |
7 | Peter | 6 | Harry |
8 | Harry | 0 | null |
In this example, the last record shows that Harry has no boss, or in other words, he is the #1 in the company's hierarchy.
Excluding the Intersection of the Sets
Checking the previous Venn diagrams I just showed above, one may come to a simple question: what if I need to get all records from Table1 except for those that match with records in Table2. Well, this is pretty useful in day-to-day business, but obviously we don't need a special JOIN operator to do it.
Observe the result sets above and you will see you only need to add a WHERE clause to your SQL statement, looking for records that have a NULL value for Table2's key. So, the result set we are looking is the red area shown in the Venn diagram below (Figure 6).
Figure 6: Non-matching records from Table1.
We can write a LEFT JOIN for this query, for instance:
SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.key2 = t2.key2
WHERE t2.key2 IS NULL;
And, finally, the result set will be:
key1 | Name | T1Key | T2Key | City |
6 | Mark | 7 | null | null |
7 | Peter | 8 | null | null |
When we do this kind of query, we have to pay attention to which field we pick for the WHERE clause. We must use a field that does not allow NULL values. Otherwise the result set may include unwanted records. That's why I suggested to use the second table's key. More specifically, its primary key. Since primary keys don't accept NULL values, they will assure our result set will be just what we needed.
One Word about Execution Plans
These comments lead us to an important insight. We usually don't stop to think about this, but observe that the execution plan of SQL queries will first calculate the result set for the FROM clause and the JOIN operator (if any), and then the WHERE clause will be executed.
This is as true for SQL Server as any other RDBMS.
Having a basic understanding of how SQL works is important for any DBA or developer. This will help you to get things done. In fast and reliable way. If you are curious about this, just take a look in the execution plan for the query above, shown in Figure 7.
Figure 7: The execution plan to a query using LEFT JOIN
Joins and Indexes
Take a look again at the Execution Plan of that query. Notice it used the clustered indexes of both tables. Using indexes is the best way to make your query run faster. But you have to pay attention to some details.
When we write our queries, we expect the SQL Server Query Optimizer to use the table indexes to boost your query performance. We can also help the Query Optimizer choose the indexed fields to be part of your query.
For instance, when using the JOIN operator, the ideal approach is to base the join condition over indexed fields. Checking again the Execution Plan, we notice that the clustered index on Table2 was used. This index was automatically built on key2 when this table was created, as key2 is the primary key to that table.
On the other hand, Table1 had no index on field key2. Because of that, the query optimizer tried to be smart enough and improve the performance of querying key2 using the only available index. This was the table clustered index, based on key1, the primary key on Table1. You see the query optimizer is really a smart tool. But you would help it a lot creating a new index (a non-clustered one) on key2.
Remembering a bit about referential integrity, you see key2 should be a foreign key on Table1, because it is related to another field in other table (which is Table2.key2).
Personally I believe foreign keys should exist in all real-world database models. And it is a good idea to create non-clustered indexes on all foreign keys. You will always run lots of queries, and also use the JOIN operator, based on your primary and foreign keys.
(IMPORTANT: SQL Server will automatically create a clustered index on primary keys. But, by default, it does nothing with foreign keys. So make sure you have the proper settings on your database).
Non-equal Comparisons
When we write SQL statements using the JOIN operator, we usually compare if one field in one table is equal to another field in the other table. But this is not the mandatory syntax. We could use any logical operator, like different than (<>), greater than (>), less than (<) and so on.
Although this fancy stuff might give you the impression that SQL gives so much power, I feel this is more like a cosmetic feature. Consider this example. See Table 1 above , where we have 5 records. Now let's consider the following SQL statement.
SELECT t1.key1, t1.field1 as Name, t1.key2 as T1Key,
t2.key2 as T2Key, t2.field1 as City
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.key2 <= t2.key2
WHERE t1.key1 = 3 ;
Notice this uses an inner join and we are specifically picking one single record from Table1, the one where key1 is equal to 3. The only problem is that there are 6 records and Table2 that satisfy the join condition. Take a look in the output to this query.
key1 | Name | T1Key | T2Key | City |
3 | Erik | 1 | 1 | New York |
3 | Erik | 1 | 2 | Sao Paulo |
3 | Erik | 1 | 4 | Paris |
3 | Erik | 1 | 5 | London |
3 | Erik | 1 | 6 | Rome |
3 | Erik | 1 | 9 | Madrid |
The problem with non-equal joins is that they usually duplicate records. And this is not something you will need in a regular basis. Anyway, now you know you can do it.
Multiple JOINs
SQL JOINs are always about putting together a pair of tables and finding related objects that obey a given rule (usually, but not limited to, equal values). We can join multiple tables. For instance, to combine 3 tables, you will need 2 joins. And a new join will be necessary for each new table. If you use a join in each step, to combine N tables, you will use N-1 joins.
One important thing is that SQL allows you to use different types of joins in the same statement.
But DBAs and developers have to be careful on joining too many tables. Several times, I have seen situations where queries demanded 10, 20 tables or even more. For performance reasons, it is not a good idea to do a single query to put all data together. The query optimizer will do a better job if you break your query it into several smaller, simpler queries.
Now consider we have a third table, called Table3, shown bellow.
Table3
key3 | field1 |
1 | Engineer |
2 | Surgeon |
3 | DBA |
4 | Lawyer |
5 | Teacher |
6 | Actor |
Now let's write a statement to bring the employee's name, the city where he lives and what is his profession. This will demand us to join all 3 tables. Just remember that joins are written in pairs. So first we will join Table1 to Table2. And then we will join Table1 and Table3. The resulting script is shown below.
SELECT t1.key1, t1.field1 as Employee,
t2.key2, t2.field1 as City,
t3.key3, t3.field1 as Profession
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.key2 = t2.key2
INNER JOIN Table3 t3 ON t1.key3 = t3.key3;
As we are running only INNER JOINs, we will have only records that match the combination of the 3 tables. See the output below.
key1 | Name | key2 | City | key3 | Profession |
3 | Erik | 1 | New York | 6 | Actor |
4 | John | 4 | Paris | 4 | Lawyer |
6 | Harry | 9 | Madrid | 2 | Surgeon |
Beyond the SELECT statements
The use of JOIN operators is not restricted to SELECT statements. In T-SQL, you can use joins in INSERT, DELETE and UPDATE statements as well. But keep in mind that in most RDBMS's we have nowadays, joins are not supported on DELETE and UPDATE statements. So I would advise you to restrict the use of joins to SELECT and INSERT statements only, even in SQL Server code. This is important if you want to keep your scripts more easily portable to different platforms.
Conclusion
Combining data from several tables is something really important when you work with RDBMS's. JOIN operators give you a powerful, fast and flexible way to retrieve the data you need.
But be careful, because the wrong choices might do some harm to your application. Either consuming too much processing power on your server, or retrieving the wrong output to your application. With the necessary attention, SQL JOINs are a great feature. And it surely should be part of your personal toolbox.