Correlated Joins Using 'Apply'
One of my cohorts came to me a while back with a SQL Query challenge. He was creating a report for our Live Voice call center application in which he needed to list all Live Voice Transactions within a given date range (easy stuff so far).
The business rules, however, dictate that a Live Voice transaction can result in 1 to many Live Voice calls actually being made. The call center rep will continue making phone calls until they have either reached the customer in question or until it is determined that the call cannot be completed (bad phone number, etc).
This report was required to only show the 3 most recent calls made for each transaction and for each of the calls, it was required to display the call's disposition (busy, voice mail, no answer, etc). Hmmm.....that's a little tougher.
Take the following:
--INNER JOIN SELECT lvt.ID, lvt.PhoneNumber, lvt.CreateDate, lvc.CallWindowStart, lvc.CallWindowEnd, lvc.LVCallDispositionID FROM LVTransaction lvt INNER JOIN LVCall lvc on lvc.LVTransactionID = lvt.ID Order By Lvt.CreateDate,lvc.CreateDate DESC
This example gives me All LVCall records for each LVTransaction. Some LVTransactions have up to a dozen calls made before a customer is reached or before it's determined that the call cannot be successfully completed. So, this query is close, but not yet what we need.
Any attempts to limit the joined records with correlated logic results in errors due to the fact that the outer query columns are not available to the inner query (out of scope).
Correlated Join?:
--Correlated INNER JOIN Attempt SELECT lvt.ID, lvt.PhoneNumber, lvt.CreateDate, lvc.CallWindowStart, lvc.CallWindowEnd, lvc.LVCallDispositionID FROM LVTransaction lvt INNER JOIN (SELECT TOP 3 * FROM LVCall WHERE LVTransactionID = lvt.ID ORDER BY CreateDate DESC) lvc on lvc.LVTransactionID = lvt.ID
This frugal attempt yields the following...
Msg 4104, Level 16, State 1, Line 2 The multi-part identifier "lvt.ID" could not be bound.
The plot thickens...
So I muddled around with a few ideas, attempting various flavors of correlated subqueries, using temp tables, using User Defined functions, etc. and I was not able to come up with a solution that I felt adequately met the requirements.
This seemingly simple problem was turning into a fairly painful ordeal.....welcome to my world.
After searching the web for a bit, I finally came up with a workable solution. The APPLY SQL extension introduced with the release of SQL Server 2005. The APPLY statement can be used in your SQL FROM Clause and allows you to call a table returning function for each row of the outer query. Furthermore (and more importantly for our example), it allows you to pass in outer query columns as arguments to the called function.
The apply statement comes in 2 flavors: CROSS APPLY and OUTER APPLY. CROSS APPLY will return All records in the outer query that have a matching record returned by the inner function\query (Similar to an Inner Join). OUTER APPLY will return all records in the outer query whether they have a matching record in the inner function\query or not (Similar to an Outer Join).
In researching the APPLY statement for this article, I came across the following article by Bill Graziano. using CROSS APPLY in SQL Server 2005. This excellent article discusses a very similar scenario to mine and presents the following solution (Using CROSS APPLY):
CROSS APPLY On A UDF:
SELECT C.CustomerID, O.SalesOrderID, O.TotalDue FROM AdventureWorks.Sales.Customer AS C CROSS APPLY AdventureWorks.dbo.fn_GetTopOrders(C.CustomerID, 3) AS O ORDER BY CustomerID ASC, TotalDue DESC
Notice that Bill is passing in a "3" to the function to indicate that he wants the 3 TOP orders for each customer (Nearly identical to what we need).
The function used above is defined as follows:
CREATE FUNCTION dbo.fn_GetTopOrders(@custid AS int, @n AS INT) RETURNS TABLE AS RETURN SELECT TOP(@n) * FROM Sales.SalesOrderHeader WHERE CustomerID = @custid ORDER BY TotalDue DESC GO
So, although this example gives a great example of using CROSS APPLY, it relies on the use of a UDF, which will work great, but I'd like to avoid it if possible. When devising my solution, I realized that a subquery behaves very much like a function in that it essentially returns a set of records to the caller (outer query) so I thought to myself that we should be able to use the OUTER APPLY statement on a subquery......like this:
OUTER APPLY against a SubQuery:
--OUTER APPLY SELECT lvt.ID, lvt.PhoneNumber, lvt.CreateDate, lvc.CallWindowStart, lvc.CallWindowEnd, lvc.LVCallDispositionID FROM LVTransaction lvt OUTER APPLY ( SELECT top 3 * FROM LVCall WHERE lvtransactionID = lvt.ID Order By CreateDate DESC ) as lvc
VOILA !!!!
Notice, my solution does the same thing that Bill's solution does but it doesn't require a UDF so it's a little easier to implement and to read, etc.
This is the pattern we've now been using at my work to implement these types of queries. It's been easy to teach to other engineers on the team, it works well and it doesn't require creating separate UDF's for implementation.
In reading Bill's article, I also noticed a footnote showing another solution that someone else had sent to him that uses ROW_NUMBER() and PARTITIONING. This was actually very interesting to me. I have to admit that I'd never even heard of these features until coming across the article (I guess we learn something every day).
Here's the solution to my problem using this technique:
ROW_Number() and Partitioning: --ROW_NUMBER SELECT ID AS LVTransactionID, PhoneNumber, CreateDate, CallWindowStart, CallWindowEnd, LVCallDispositionID FROM ( SELECT lvt.ID, lvt.PhoneNumber, lvt.CreateDate, lvc.CallWindowStart, lvc.CallWindowEnd, lvc.LVCallDispositionID FROM LVTransaction lvt INNER JOIN ( SELECT *, ROW_Num = ROW_NUMBER() OVER (PARTITION BY LVCall.LVTransactionID ORDER BY LVCall.CreateDate Desc) FROM LVCall ) lvc on lvc.LVTransactionID = lvt.ID AND lvc.ROW_NUM <= 3 ) as result
This interesting solution provides the exact same resultset as Bill's original example and as my eventual solution. It seems a little more complicated to me, but I have to admit, the subtree cost displayed in the Estimated Query Plan suggests that it would perform better than my solution. I plan on investigating this approach further when I have time.
I'm certain that there are other solutions to this problem. My intent however, with this article is to introduce some T-SQL extensions newly introduced in SQL Server 2005 that you may not have yet come across and to provide a real world example on how their use can solve a real business problem. As other solutions to this same problem come up, I would love to see them. I am concerned with performance as well as ease of application due to the fact that the techniques used must be dispersed throughout our organization so that we are consistent in our implementation with these sorts of problems.