Introduction
This article highlights the clever use of joins and how they can be used to solve common business problems, which may be otherwise difficult and slow to solve using other methods. The scope of this article does not cover the basics of joins, however this link provides an excellent resource if the reader is interested.
Most of the problems involving extracting data from multiple linked tables are usually solved using INNER and LEFT OUTER joins. However, many a times some developers are left wondering which JOIN should be implemented in which scenario, especially in case when more than 2 tables are involved.
Let's start the game
Even though it is recommended that one should avoid subqueries, most people don’t hesitate to use them without realizing the benefits that JOINS can provide over subqueries.
Consider the following scenario from the AdventureWorks 2008 database. Let’s say a query needs to extract all products name that have been bought from vendor Mountain Works.
Scenario 1: Sub-queries VS Joins
The traditional and slow way
select DISTINCT Name
from Production.Product
where ProductID IN (Select ProductID from Purchasing.ProductVendor
where VendorID=( select VendorID
from Purchasing.Vendor
where Name='Mountain Works'
)
)
Using Joins:
select DISTINCT P.Name from Production.Product P INNER JOIN Purchasing.ProductVendor PV ON P.ProductID = PV.ProductID INNER JOIN Purchasing.Vendor V ON V.VendorID = PV.VendorID where V.Name='Mountain Works'
For a table containing less data, both the queries may or less take same amount of time. However for a table that contains millions of rows the time difference would be noticeable. The reason being that the subquery has to run one query for each returned row whereas the join just has to run one query, and it utilizes the table indexes (if there are any). The same result can be achived using an inner join.
Scenario 2: NOT IN clause VS Joins
Now we need a query that needs to extract all products name that has been bought by all the vendors except Mountain Works, so this time we will involving the NOT IN clause.
Using Sub-Queries (Traditional and slow)
select DISTINCT Name from Production.Product where ProductID IN (Select ProductID from Purchasing.ProductVendor where VendorID IN(select VendorID from Purchasing.Vendor where Name<>'Mountain Works' ) )
Using Joins:
select DISTINCT P.Name from Production.Product P INNER JOIN Purchasing.ProductVendor PV ON P.ProductID = PV.ProductID LEFT JOIN Purchasing.Vendor V ON V.VendorID = PV.VendorID AND V.Name='Mountain Works' where V.Name IS NULL
Note that in the code snippet above, a LEFT JOIN has been used along with an AND condition. A Left Join with Purchasing.Vendor along with AND condition will give you all the vendors whose name is Mountain Works (all the rest of the vendors' names will be NULL). These are the rows we do not want, so to filter them out, we just add "V.Name is NULL".
Again, the reason for using a join instead of the subquery is that the latter has to run one query for each returned row whereas former just has to run one query in total that utilizes the table indexes.
Scenario 3: Optimizing aggregations
Aggregations are a resource intensive process. The more the number of columns used for a GROUP BY, the slower the query will execute. Hence, it is always advisable to create aggregations only on required columns.
Let’s take an example where it is required to find the quantity of product ordered along with product details. In the traditional way I have kept all the non aggregatable columns present in the select query, in the group by section as well.
Using all columns (Traditional and slow)
select P.Name, P.ProductNumber, P.Color, P.DaysToManufacture, P.MakeFlag, P.FinishedGoodsFlag, P.Color, P.SafetyStockLevel, P.ReorderPoint, P.StandardCost, P.ProductNumber, P.MakeFlag, P.FinishedGoodsFlag, P.Color, P.SafetyStockLevel, P.ReorderPoint, P.SizeUnitMeasureCode, P.WeightUnitMeasureCode, P.Weight, P.ProductLine, P.Class, P.Style, H.Freight, C.AccountNumber, S.SalesOrderID, SUM(S.OrderQty) AS OrderQty FROM Production.Product P INNER JOIN Sales.SalesOrderDetail S ON P.ProductID = S.ProductID INNER JOIN Sales.SalesOrderHeader H ON H.SalesOrderID = S.SalesOrderID INNER JOIN Sales.Customer C ON C.CustomerID = H.CustomerID GROUP BY P.Name, P.ProductNumber, P.Color, P.DaysToManufacture, P.MakeFlag, P.FinishedGoodsFlag, P.Color, P.SafetyStockLevel, P.ReorderPoint, P.StandardCost, P.ProductNumber, P.MakeFlag, P.FinishedGoodsFlag, P.Color, P.SafetyStockLevel, P.ReorderPoint, P.SizeUnitMeasureCode, P.WeightUnitMeasureCode, P.Weight, P.ProductLine, P.Class, P.Style, H.Freight, C.AccountNumber, S.SalesOrderID
Using joins, I am creating a alias table and using only on 2 columns not on 25 columns (shown in traditional method). Aggregation is being done on only the necessary columns, which is a better way.
select P.Name, P.ProductNumber, P.Color, P.DaysToManufacture, P.MakeFlag, P.FinishedGoodsFlag, P.Color, P.SafetyStockLevel, P.ReorderPoint, P.StandardCost, P.ProductNumber, P.MakeFlag, P.FinishedGoodsFlag, P.Color, P.SafetyStockLevel, P.ReorderPoint, P.SizeUnitMeasureCode, P.WeightUnitMeasureCode, P.Weight, P.ProductLine, P.Class, P.Style, H.Freight, C.AccountNumber, S.SalesOrderID, S.OrderQty FROM Production.Product P INNER JOIN (select ProductID,SalesOrderID,SUM(OrderQty) AS OrderQty FROM Sales.SalesOrderDetail GROUP BY ProductID,SalesOrderID) S ON P.ProductID = S.ProductID INNER JOIN Sales.SalesOrderHeader H ON H.SalesOrderID = S.SalesOrderID INNER JOIN Sales.Customer C ON C.CustomerID = H.CustomerID
In the former approach aggregation has been done on four columns whereas in the latter case if the same thing is done using joins, then aggregation needs to be placed only on one column.
End Game
Last but not least, the following situation is where joins have proven extremely helpful. In a call center based application it is common to have a table that records the call related details like the response, call date and customer name. Let's create the base table:
Select 'A' AS CustomerName,'03-12-10' AS CalledOn,'No Response' AS CustomerResponse INTO #TMP UNION Select 'A' AS CustomerName,'02-15-11' AS CalledOn,'Busy Call later' AS CustomerResponse UNION Select 'B' AS CustomerName,'03-12-10' AS CalledOn,'Interested Call later' AS CustomerResponse UNION Select 'A' AS CustomerName,'10-14-11' AS CalledOn,'No Response' AS CustomerResponse UNION Select 'C ' AS CustomerName,'03-12-10' AS CalledOn,'No Response' AS CustomerResponse UNION Select 'B' AS CustomerName,'04-29-11' AS CalledOn,'No Response' AS CustomerResponse UNION Select 'A' AS CustomerName,'02-14-12' AS CalledOn,'Interested Call later' AS CustomerResponse UNION Select 'B' AS CustomerName,'05-01-11' AS CalledOn,'No Response' AS CustomerResponse UNION Select 'C' AS CustomerName,'01-05-12' AS CalledOn,'Not Interested' AS CustomerResponse UNION Select 'B' AS CustomerName,'10-14-11' AS CalledOn,'Sale' AS CustomerResponse
The requirement is to find all the customers who have never showed disinterest, their last response, and the last time they were called.
SELECT T.CustomerName,T.CustomerResponse,T.CalledOn FROM #TMP T LEFT OUTER JOIN (SELECT CustomerName FROM #TMP WHERE CustomerResponse = 'Not Interested' ) Q ON Q.CustomerName = T.CustomerName INNER JOIN (SELECT CustomerName, MAX(CONVERT(DATE,CalledOn)) AS MaxCalledOn FROM #TMP GROUP BY CustomerName )Q2 ON Q2.CustomerName = T.CustomerName WHERE Q.CustomerName IS NULL AND T.CalledOn = Q2.MaxCalledOn
A brief analysis of the query:
- This query is a good example of the self join concept.
- Instead of using NOT IN, we have utilized LEFT OUTER JOIN and IS NULL.
- There are three columns in the output of the query, however the GROUP BY is applied only on a single column.
- One important concept of join that is being exploited here. The two subquery Q and Q2 give less number of rows in the output as compared to the main query, but their result gets repeated as the value in key column that is CustomerName is redundant.
Definitely, there are other ways to write an equivalent query, but it is left to the readers to try that.
Conclusion
We can easily avoid subqueries by using joins. It will make your query faster and less resource intensive. Joins have great power and their judicious use would not only optimize your queries but also make the SQL Server engine happy.