The Potential of Joins

Introduction

This article highlights the clever use of joins and how they can be used to solve common business problems, which may be otherwise difficult and slow to solve using other methods. The scope of this article does not cover the basics of joins, however this link provides an excellent resource if the reader is interested.

Most of the problems involving extracting data from multiple linked tables are usually solved using INNER and LEFT OUTER joins. However, many a times some developers are left wondering which JOIN should be implemented in which scenario, especially in case when more than 2 tables are involved.

Let's start the game

Even though it is recommended that one should avoid subqueries, most people don’t hesitate to use them without realizing the benefits that JOINS can provide over subqueries.

Consider the following scenario from the AdventureWorks 2008 database. Let’s say a query needs to extract all products name that have been bought from vendor Mountain Works.

Scenario 1: Sub-queries VS Joins

The traditional and slow way

select DISTINCT Name 
 from Production.Product
 where ProductID IN (Select ProductID from Purchasing.ProductVendor
                      where VendorID=( select VendorID 
                                        from Purchasing.Vendor
                                        where Name='Mountain Works'
                                      )
                     )

Using Joins:

select DISTINCT P.Name
 from Production.Product P
  INNER JOIN Purchasing.ProductVendor PV 
    ON P.ProductID = PV.ProductID
  INNER JOIN Purchasing.Vendor V 
    ON V.VendorID = PV.VendorID
 where V.Name='Mountain Works'

For a table containing less data, both the queries may or less take same amount of time. However for a table that contains millions of rows the time difference would be noticeable. The reason being that the subquery has to run one query for each returned row whereas the join just has to run one query, and it utilizes the table indexes (if there are any). The same result can be achived using an inner join.

Scenario 2: NOT IN clause VS Joins

Now we need a query that needs to extract all products name that has been bought by all the vendors except Mountain Works, so this time we will involving the NOT IN clause.

Using Sub-Queries (Traditional and slow)

select DISTINCT Name 
 from Production.Product 
 where ProductID IN (Select ProductID from Purchasing.ProductVendor 
                     where VendorID IN(select VendorID from Purchasing.Vendor 
                                       where Name<>'Mountain Works'
                                      )
                     )

Using Joins:

select DISTINCT P.Name 
 from Production.Product P
  INNER JOIN Purchasing.ProductVendor PV 
   ON P.ProductID = PV.ProductID
  LEFT JOIN Purchasing.Vendor V 
   ON V.VendorID = PV.VendorID 
   AND V.Name='Mountain Works'
 where V.Name IS NULL

Note that in the code snippet above, a LEFT JOIN has been used along with an AND condition. A Left Join with Purchasing.Vendor along with AND condition will give you all the vendors whose name is Mountain Works (all the rest of the vendors' names will be NULL). These are the rows we do not want, so to filter them out, we just add "V.Name is NULL".

Again, the reason for using a join instead of the subquery is that the latter has to run one query for each returned row whereas former just has to run one query in total that utilizes the table indexes.

Scenario 3: Optimizing aggregations

Aggregations are a resource intensive process. The more the number of columns used for a GROUP BY, the slower the query will execute. Hence, it is always advisable to create aggregations only on required columns.

Let’s take an example where it is required to find the quantity of product ordered along with product details. In the traditional way I have kept all the non aggregatable columns present in the select query, in the group by section as well.

Using all columns (Traditional and slow)

select P.Name,
       P.ProductNumber,
       P.Color,
       P.DaysToManufacture,
        P.MakeFlag,
        P.FinishedGoodsFlag,
        P.Color,
        P.SafetyStockLevel,
        P.ReorderPoint,
        P.StandardCost,
        P.ProductNumber,
        P.MakeFlag,
        P.FinishedGoodsFlag,
        P.Color,
        P.SafetyStockLevel,
        P.ReorderPoint,
        P.SizeUnitMeasureCode,
        P.WeightUnitMeasureCode,
        P.Weight,
        P.ProductLine,
        P.Class,
        P.Style,
        H.Freight,
        C.AccountNumber,
        S.SalesOrderID,
       SUM(S.OrderQty) AS OrderQty
 FROM Production.Product P
  INNER JOIN Sales.SalesOrderDetail S ON P.ProductID = S.ProductID
  INNER JOIN Sales.SalesOrderHeader H ON H.SalesOrderID = S.SalesOrderID
  INNER JOIN Sales.Customer C ON C.CustomerID = H.CustomerID
 GROUP BY P.Name,
            P.ProductNumber,
            P.Color,
            P.DaysToManufacture,
            P.MakeFlag,
            P.FinishedGoodsFlag,
            P.Color,
            P.SafetyStockLevel,
            P.ReorderPoint,
            P.StandardCost,
            P.ProductNumber,
            P.MakeFlag,
            P.FinishedGoodsFlag,
            P.Color,
            P.SafetyStockLevel,
            P.ReorderPoint,
            P.SizeUnitMeasureCode,
            P.WeightUnitMeasureCode,
            P.Weight,
            P.ProductLine,
            P.Class,
            P.Style,
            H.Freight,
            C.AccountNumber,
            S.SalesOrderID

Using joins, I am creating a alias table and using only on 2 columns not on 25 columns (shown in traditional method). Aggregation is being done on only the necessary columns, which is a better way.

select P.Name,
       P.ProductNumber,
       P.Color,
       P.DaysToManufacture,
        P.MakeFlag,
        P.FinishedGoodsFlag,
        P.Color,
        P.SafetyStockLevel,
        P.ReorderPoint,
        P.StandardCost,
        P.ProductNumber,
        P.MakeFlag,
        P.FinishedGoodsFlag,
        P.Color,
        P.SafetyStockLevel,
        P.ReorderPoint,
        P.SizeUnitMeasureCode,
        P.WeightUnitMeasureCode,
        P.Weight,
        P.ProductLine,
        P.Class,
        P.Style,
        H.Freight,
        C.AccountNumber,
        S.SalesOrderID,
        S.OrderQty
 FROM Production.Product P
  INNER JOIN (select ProductID,SalesOrderID,SUM(OrderQty) AS OrderQty FROM Sales.SalesOrderDetail GROUP BY ProductID,SalesOrderID) S ON P.ProductID = S.ProductID
  INNER JOIN Sales.SalesOrderHeader H ON H.SalesOrderID = S.SalesOrderID
  INNER JOIN Sales.Customer C ON C.CustomerID = H.CustomerID

In the former approach aggregation has been done on four columns whereas in the latter case if the same thing is done using joins, then aggregation needs to be placed only on one column.

End Game

Last but not least, the following situation is where joins have proven extremely helpful. In a call center based application it is common to have a table that records the call related details like the response, call date and customer name. Let's create the base table:

Select 'A' AS CustomerName,'03-12-10' AS CalledOn,'No Response' AS CustomerResponse INTO #TMP
UNION
Select 'A' AS CustomerName,'02-15-11' AS CalledOn,'Busy Call later' AS CustomerResponse
UNION
Select 'B' AS CustomerName,'03-12-10' AS CalledOn,'Interested Call later' AS CustomerResponse
UNION
Select 'A' AS CustomerName,'10-14-11' AS CalledOn,'No Response' AS CustomerResponse
UNION
Select 'C ' AS CustomerName,'03-12-10' AS CalledOn,'No Response' AS CustomerResponse
UNION
Select 'B' AS CustomerName,'04-29-11' AS CalledOn,'No Response' AS CustomerResponse
UNION
Select 'A' AS CustomerName,'02-14-12' AS CalledOn,'Interested Call later' AS CustomerResponse
UNION
Select 'B' AS CustomerName,'05-01-11' AS CalledOn,'No Response' AS CustomerResponse
UNION
Select 'C' AS CustomerName,'01-05-12' AS CalledOn,'Not Interested' AS CustomerResponse
UNION
Select 'B' AS CustomerName,'10-14-11' AS CalledOn,'Sale' AS CustomerResponse

The requirement is to find all the customers who have never showed disinterest, their last response, and the last time they were called.

SELECT T.CustomerName,T.CustomerResponse,T.CalledOn
 FROM #TMP T
  LEFT OUTER JOIN (SELECT CustomerName 
                    FROM #TMP 
                    WHERE CustomerResponse = 'Not Interested'
                  ) Q
   ON Q.CustomerName = T.CustomerName
 INNER JOIN (SELECT CustomerName,
                    MAX(CONVERT(DATE,CalledOn)) AS MaxCalledOn 
              FROM #TMP 
              GROUP BY CustomerName
            )Q2
  ON Q2.CustomerName = T.CustomerName
 WHERE Q.CustomerName IS NULL
 AND T.CalledOn = Q2.MaxCalledOn

A brief analysis of the query:

This query is a good example of the self join concept.
Instead of using NOT IN, we have utilized LEFT OUTER JOIN and IS NULL.
There are three columns in the output of the query, however the GROUP BY is applied only on a single column.
One important concept of join that is being exploited here. The two subquery Q and Q2 give less number of rows in the output as compared to the main query, but their result gets repeated as the value in key column that is CustomerName is redundant.

Definitely, there are other ways to write an equivalent query, but it is left to the readers to try that.

Conclusion

We can easily avoid subqueries by using joins. It will make your query faster and less resource intensive. Joins have great power and their judicious use would not only optimize your queries but also make the SQL Server engine happy.

Introduction

Let's start the game

Scenario 1: Sub-queries VS Joins

Scenario 2: NOT IN clause VS Joins

Scenario 3: Optimizing aggregations

End Game

Conclusion

Rate

Share

Categories

Share

Rate

The Potential of Joins

Introduction

Let's start the game

Scenario 1: Sub-queries VS Joins

Scenario 2: NOT IN clause VS Joins

Scenario 3: Optimizing aggregations

End Game

Conclusion

Rate

Share

Categories

Share

Rate

Related content

A Normalization Primer

Reaching the Outer Limits

Working with email addresses in SQL Server!

Querying Responses

Storing Checkbox Responses as Integers