In level 2 of this stairway I discussed how to use a subquery in a Transact-SQL statement. This stairway level will expand on the subquery topic by discussing a type of subquery known as a correlated subquery. I will be exploring what is a correlated subquery and how it is different from a normal subquery. Additionally I will provide you a number of examples of Transaction-SQL statements that go beyond the basics and use a correlated subquery to help identify the rows returned in the result set to meet complex business requirements.
What is a Correlated Subquery?
In level 2 of this stairway we learned that a normal subquery is just a SELECT statement within another Transact-SQL statement, where the subquery can return results if run independently of the outer query. The correlated subquery is a form of a subquery that cannot be run independently of the outer query, because it contains one or more columns from the outer query. The correlated subquery, just like a normal subquery, is sometimes referred to as the inner query. If the correlated subquery (inner query) is run independently of the outer query it will return an error. Because the execution of the inner query depends on values from the outer query, it is called a correlated subquery.
The correlated subquery may be executed many times. It will be run once for each candidate row selected in the outer query. The column values of each candidate row will be used to supply values for the outer query columns in the inner for each execution of the correlated subquery. The final results of a statement that contains a correlated subquery will be based on results of each execution of the correlated subquery.
Sample Data for Correlated Subquery Examples
In order to demonstrate how to use a correlated subquery I will need some test data. Rather than create my own test data, all of my examples will use the AdventureWorks2008R2 database. If you want to follow along and run my examples in your environment then you can download the AdventureWorks2008R2 database from here: http://msftdbprodsamples.codeplex.com/releases/view/93587
Example of a Correlated Subquery in WHERE Clause
To demonstrate the use of a correlated subquery in a WHERE clause suppose, I want to identify those CustomerID’s that have purchased more than 70 items in a single order. To accomplish this requirement I can run the code in Listing 1.
SELECT CustomerID FROM Sales.SalesOrderHeader OH WHERE (SELECT COUNT(*) FROM Sales.SalesOrderDetail WHERE SalesOrderID = OH.SalesOrderID) > 70;
Listing 1: Correlated Subquery in WHERE clause
When I run the code in Listing 1 I get the output in Report 1.
CustomerID ----------- 29712 29722 30048 30107
Report 1: Results returned when running the code in Listing 1
If you review the code in Listing 1 you will see I constrained my WHERE by using a correlated subquery. The subquery is the code within the parentheses I’ve extracted the correlated subquery code from Listing 1 and placed it in Listing 2.
SELECT COUNT(*) FROM Sales.SalesOrderDetail WHERE SalesOrderID = OH.SalesOrderID
Listing 2: Subquery code in Listing 1
If you I run the code in Listing 2 I will find I get an error which is displayed in Report 2.
Msg 4104, Level 16, State 1, Line 3 The multi-part identifier "OH.SalesOrderID " could not be bound.
Report 2: Error when running code in Listing 2
I get the error shown in Report 2 because my correlated subquery contains a reference to the column OH.SalesOrderID which is a column from the outer query. Since all correlated subqueries reference one or more columns from the outer query you cannot run them independently of the outer query with which it is associated. The fact that you can’t run the subquery independently of the whole Transact-SQL statement is what distinguishes a correlated subquery over a normal subquery.
The example presented here is a very trivial example of using a correlated subquery in a WHERE clause. Hopefully by having such a simple example it was easily understand the difference between a normal subquery and a correlated subquery. Typically a correlated subquery might be considerably more complex. Additionally, keep in mind there might be other methods to meet your business requirements without using a correlated subquery.
As you can see writing a correlated subquery is very similar to a normal subquery, but you just can’t run the correlated subquery independently of the outer query.
Example of Correlated Subquery in HAVING clause
There are times when you might want to constrain a HAVING clause by different values from outer query. This is when you can use a correlated subquery in your HAVING clause. Suppose you have to write a query that will calculate rebate amounts for those customer that have purchased more than $150000 worth of products before taxes in the year 2008. The code in Listing 3 calculates the rebate amount for those valued customers by using a correlated subquery in the HAVING clause.
SELECT Outer_H.[CustomerID] , SUM(Outer_H.[SubTotal]) AS TotalPurchase , SUM(Outer_H.[SubTotal]) * .10 AS Rebate FROM [Sales].[SalesOrderHeader] AS Outer_H WHERE YEAR(Outer_H.[OrderDate]) = '2008' GROUP BY Outer_H.[CustomerID] HAVING (SELECT SUM(Inner_H.[SubTotal]) FROM [Sales].[SalesOrderHeader] AS Inner_H WHERE Inner_H.[CustomerID] = Outer_H.[CustomerID] AND YEAR(Inner_H.[OrderDate]) = '2008') > 150000 ORDER BY Rebate DESC;
Listing 3: Correlated Subquery in HAVING Clause
When I run the code in Listing 5 I get the results in Report 3.
CustomerID TotalPurchase Rebate ----------- --------------------- --------------------------------------- 29923 220496.658 22049.665800 29641 210647.4929 21064.749290 29617 187964.844 18796.484400 29913 186387.5613 18638.756130 29818 179916.2877 17991.628770 29940 175358.3954 17535.839540 29987 172169.4612 17216.946120 29736 157700.6034 15770.060340 29995 156984.5148 15698.451480 29770 151824.9944 15182.499440
Report 3: Results of running Listing 3
The correlated subquery code in Listing 3 uses the CustomerID from the GROUP BY clause in the outer query within the correlated subquery. The correlated subquery will be executed once for each row returned from the GROUP BY clause. This allows the HAVING clause to calculated the total amount of products sold to each CustomerID from the outer query by summing the values of the SubTotal column on each SalesOrderHeader record where the record is associated with the CustomerID from the outer query. The Transact-SQL statement in Listing 3 only returns a row where the CustomerID in has purchased more than $150,000 worth of product.
Example of an UPDATE Statement that Contains a Correlated Subquery
Correlated subqueries can be used not only for returning a result set using a SELECT statement. You can also use them to update the data in a SQL Server table. To demonstrate this, I will first generate some test data in a tempdb table, by using the code in Listing 4.
USE tempdb; GO SET NOCOUNT ON; CREATE TABLE CarInventory ( ID int identity, CarName varchar(50), VIN varchar(50), StickerPrice decimal (7,2), InvoicePrice decimal (7,2)); GO INSERT INTO CarInventory VALUES ('Explorer','EXP2014123456A',46198.45,38201.87), ('Explorer','EXP2014123493A',47129.98, 38201.87), ('Grand Cherokee','JGC20141234345X',41678.45,36201.86), ('Grand Cherokee','JGC20141234556W',44518.31,36201.86), ('Pathfinder','NPF2014987365A',32587.73,28917.10), ('Pathfinder','NPF2014239657B',33577.54,28917.10), ('Pathfinder','NPF2014098587C',35876.12,28917.10), ('Tahoe','TAH201409674A',52001.08,46000.01);
Listing 4: Code to Create and Populate Test Table
The code in Listing 4 creates a CarInventory table and then populates it with eight rows representing the cars currently in inventory.
Periodically the Sales Manager likes to see his InvoicePriceRatio by running the query in Listing 5.
SELECT CarName, InvoicePrice/StickerPrice*100.0 AS InvoicePriceRatio FROM CarInventory;
Listing 5: InvoicePriceRatio query
When the manager runs this query she notices that there are a number of similar cars with same InvoicePrice amounts that have different InvoicePriceRatio values. To maximize her Invoice to Sticker price ratio she asks her IT support to write a query that will update the StickerPrice on all of her cars so each car with the same CarName value has the same InvoicePriceRatio. She wants the IT guy to set the StickerPrice to the same value as the maximum Sticker price for that CarName. This way all cars with the same CarName value will have the same StickerPrice value. To accomplish this update of the CarInventory table,the IT guy runs the Transact-SQL statement in Listing 6, which contains a correlated subquery.
UPDATE CarInventory SET StickerPrice = (SELECT MAX(StickerPrice) FROM CarInventory Inner_CI WHERE Inner_CI.CarName = Outer_CI.CarName) FROM CarInventory Outer_CI;
Listing 6: Correlated Subquery to update CarInventory to Maximum Sticker Price
The code in listing 8 uses the CarName from the outer query in the correlated subquery to identify the maximum StickerPrice for each unique CarName. This maximum StickerPrice value found in the correlated subquery is then used to update the StickerPrice value for each CarInventory record that has the same CarName.
Performance Considerations for Correlated Subqueries
There are some performance considerations you should be aware of when writing Transact-SQL statements that contain correlated subqueries. The performance isn’t bad when the outer query contains a small number of rows. But when the outer query contains a large number of rows it doesn’t scale well from a performance perspective. This is because the correlated subquery needs to be executed for every candidate row in the outer query. Therefore when the outer query contains more and more candidate rows a correlated subquery has to be executed many times, and therefore the Transact-SQL statement will take longer to run. If you find performance of your correlated subquery Transact-SQL statements is not meeting your requirements, then you should look for alternatives solutions, such as queries that use an INNER or OUTER JOIN operation, or ones that return a smaller number of candidate rows from the outer query.
Summary
A correlated subquery is an inner query that includes one or more columns from the outer query. The correlated subquery is executed once for every candidate row of the outer query. Because a correlated subquery contains a column from the outer query it cannot be run independently of the outer query. Correlated subqueries have their place, although do not scale well from a performance perspective when there are a large number of candidate rows identified in the outer query.
Question and Answer
In this section you can review how well you understand the concept of correlated subqueries by answering the following questions.
Question 1:
When writing a correlated subquery you need to have___________________ . (fill in the blank)
- One or more columns from the inner query that are used to constrain the results of the correlated subquery.
- One or more column from the inner query that are used in the select list of the correlated subquery.
- One or more columns from the outer query that are used to constrain the results of the correlated subquery.
- One or more columns from the outer query that are used in the selection list of the correlated subquery.
Question 2:
Select all of the following statements that are true about correlated subqueries.
- As the number of candidate rows increases the performance of the Transact-SQL statement that contains a correlated subquery improves.
- The correlated subquery will be executed once for every candidate row from the outer query.
- The correlated subquery will reference one or more columns from the inner query.
- When using a correlated subquery in a HAVING clause the inner query will be executed once for each candidate row returned by the GROUP BY clause.
Question 3:
The correlated subquery is just like a normal subquery, and the correlated subquery can be run independently of the entire Transact-SQL statement (True or False).
- True
- False
Answers:
Question 1:
The correct answer is c. A correlated subquery requires one or more columns from the outer query to be using in the correlated subquery statement. These outer column references are replaced with the values of every candidate row when executing the correlated subquery.
Question 2:
The correct answers are b and d. a is incorrect because as the number of candidate rows increases the correlated subquery gets executed more times, and the Transact-SQL statement performance gets worse. c is incorrect because a correlated subquery must contain one or more rows from the outer query, not the inner query.
Question 3:
The correct answer is b. If you try to run the correlated subquery independently of the complete Transact-SQL statement, the correlated subquery statement will fail.