Blog Post

T-SQL Window Functions – Part 4: Analytic Functions

,

This is a reprint with some revisions of a series I originally published on LessThanDot. You can find the links to the original blogs on my Series page.

TSQL-WIndow-Functions_thumb1_thumb_tIn the final installment of my series on SQL window functions, we will explore using analytic functions. Analytic functions were introduced in SQL Server 2012 with the expansion of the OVER clause capabilities. Analytic functions fall in to two primary categories: values at a position and percentiles. Four of the functions, LAG, LEAD, FIRST_VALUE and LAST_VALUE find a row in the partition and returns the desired value from that row. CUME_DIST and PERCENT_RANK break the partition into percentiles and return a rank value for each row. PERCENTILE_CONT and PERCENTILE_DISC a value at the requested percentile in the function for each row. All of the functions and examples in this blog will only work with SQL Server 2012.

Once again, the following CTE will be used as the query in all examples throughout the post:

with CTEOrders as

(select cast(1 as int) as OrderID, cast(‘3/1/2012′ as date) as OrderDate, cast(10.00 as money) as OrderAmt, ‘Joe’ as CustomerName

union select 2, ‘3/1/2012′, 11.00, ‘Sam’

union select 3, ‘3/2/2012′, 10.00, ‘Beth’

union select 4, ‘3/2/2012′, 15.00, ‘Joe’

union select 5, ‘3/2/2012′, 17.00, ‘Sam’

union select 6, ‘3/3/2012′, 12.00, ‘Joe’

union select 7, ‘3/4/2012′, 10.00, ‘Beth’

union select 8, ‘3/4/2012′, 18.00, ‘Sam’

union select 9, ‘3/4/2012′, 12.00, ‘Joe’

union select 10, ‘3/4/2012′, 11.00, ‘Beth’

union select 11, ‘3/5/2012′, 14.00, ‘Sam’

union select 12, ‘3/6/2012′, 17.00, ‘Beth’

union select 13, ‘3/6/2012′, 19.00, ‘Joe’

union select 14, ‘3/7/2012′, 13.00, ‘Beth’

union select 15, ‘3/7/2012′, 16.00, ‘Sam’

)

select OrderID

,OrderDate

,OrderAmt

,CustomerName

from CTEOrders;

Position Value Functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE

Who has not needed to use values from other rows in the current row for a report or other query? A prime example is needing to know what the last order value was to calculate growth or just show the difference in the results. This has never been easy in SQL Server until now. All of these functions require the use of the OVER clause and the ORDER BY clause. They all use the current row within the partition to find the row at the desired position.

The LAG and LEAD functions allow you to specify the offset or how many rows to look forward or backward and they support a default value in cases where the value returned would be null. These functions do not support the use of ROWS or RANGE in the OVER clause. The FIRST_VALUE and LAST_VALUE allow you to further define the partition using ROWS or RANGE if desired.

The following example illustrates all of the functions with various variations on the parameters and settings.

select OrderID

,OrderDate

,OrderAmt

,CustomerName

,LAG(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderID) as PrevOrdAmt

,LEAD(OrderAmt, 2) OVER (PARTITION BY CustomerName ORDER BY OrderID) as NextTwoOrdAmt

,LEAD(OrderDate, 2, ‘9999-12-31’) OVER (PARTITION BY CustomerName ORDER BY OrderID) as NextTwoOrdDtNoNull

,FIRST_VALUE(OrderDate) OVER (ORDER BY OrderID) as FirstOrdDt

,LAST_VALUE(CustomerName) OVER (PARTITION BY OrderDate ORDER BY OrderID) as LastCustToOrdByDay
from CTEOrders

Results (with shortened column names):
IDOrderDateAmtCustPrevOrdAmtNextTwoAmtNextTwoDtFirstOrdLastCust
13/1/201210JoeNULL123/3/20123/1/2012Joe
23/1/201211SamNULL183/4/20123/1/2012Sam
33/2/201210BethNULL113/4/20123/1/2012Beth
43/2/201215Joe10123/4/20123/1/2012Joe
53/2/201217Sam11143/5/20123/1/2012Sam
63/3/201212Joe15193/6/20123/1/2012Joe
73/4/201210Beth10173/6/20123/1/2012Beth
83/4/201218Sam17163/7/20123/1/2012Sam
93/4/201212Joe12NULL12/31/99993/1/2012Joe
103/4/201211Beth10133/7/20123/1/2012Beth
113/5/201214Sam18NULL12/31/99993/1/2012Sam
123/6/201217Beth11NULL12/31/99993/1/2012Beth
133/6/201219Joe12NULL12/31/99993/1/2012Joe
143/7/201213Beth17NULL12/31/99993/1/2012Beth
153/7/201216Sam14NULL12/31/99993/1/2012Sam

If you really like subselects, you can also mix in some subselects and have a very creative SQL statement. The following statement uses LAG and a subselect to find the first value in a partition. I am showing this to illustrate some more of the capabilities of the function in case you have a solution that requires this level of complexity.

select OrderID

,OrderDate

,OrderAmt

,CustomerName

,LAG(OrderAmt, (

select count(*)-1

from CTEOrders c

where c.CustomerName = CTEOrders.CustomerName

and c.OrderID <= CTEOrders.OrderID), 0)

OVER (PARTITION BY CustomerName ORDER BY OrderDate, OrderID) as FirstOrderAmt

FROM CTEOrders

Percentile Functions: CUME_DIST, PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC

As I wrap up my discussion on window functions, the percentile based functions were the functions I knew the least about. While I have already used the position value functions above, I have not yet needed to use the percentiles. So, that meant I had to work with them for a while so I could share their usage and have some samples for you to use.

The key differences in the four function have to do with ranks and values. CUME_DIST and PERCENT_RANK return a ranking value while PERCENTILE_CONT and PERCENTILE_DISC return data values.

CUME_DIST returns a value that is greater than zero and lest than or equal to one (>0 and <=1) and represents the percentage group that the value falls into based on the order specified. PERCENT_RANK returns a value between zero and one as well (>= 0 and <=1). However, in PERCENT_RANK the first group is always represented as 0 whereas in CUME_DIST it represents the percentage of the group. Both functions return the last percent group as 1. In both cases, as the ranking percentages move from lowest to highest, each group’s percent value includes all of the earlier values in the calculation as well.

The following statement shows both of the functions using the default partition to determine the rankings of order amounts within our dataset.

select OrderID

,OrderDate

,OrderAmt

,CustomerName

,CUME_DIST() OVER (ORDER BY OrderAmt) CumDist

,PERCENT_RANK() OVER (ORDER BY OrderAmt) PctRank

FROM CTEOrders

Results:
OrderIDOrderDateOrderAmtCustomerNameCumDistPctRank
13/1/201210Joe0.20
33/2/201210Beth0.20
73/4/201210Beth0.20
23/1/201211Sam0.333333330.214285714
103/4/201211Beth0.333333330.214285714
63/3/201212Joe0.466666670.357142857
93/4/201212Joe0.466666670.357142857
143/7/201213Beth0.533333330.5
113/5/201214Sam0.60.571428571
43/2/201215Joe0.666666670.642857143
153/7/201216Sam0.733333330.714285714
53/2/201217Sam0.866666670.785714286
123/6/201217Beth0.866666670.785714286
83/4/201218Sam0.933333330.928571429
133/6/201219Joe11

The last two functions, PERCENTILE_CONT and PERCENTILE_DISC, return the value at the percentile requested. PERCENTILE_CONT will return the true percentile value whether it exists in the data or not. For instance, if the percentile group has the values 10 and 20, it will return 15. If PERCENTILE_DISC, is applied to the same group it will return 10. It will return the smallest value in the percentile group, which in this case is 10. Both functions ignore NULL values and do not use the ORDER BY, ROWS, or RANGE clauses with the PARTITION BY clause. Instead, WITHIN GROUP is introduced which must contain a numeric data type and ORDER BY clause. Only one column can be specified here. Both functions need a percentile value which can be between 0.0 and 1.0.

The following script illustrates a couple of variations. The first two functions return the median of the default partition. Then next two return the median value for each day. Finally, the last two functions return the low and high values within the partition. The values segmented by the date partition highlight the key difference between the two functions.

select OrderID as ID

,OrderDate as ODt

,OrderAmt as OAmt

,CustomerName as CName

,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY OrderAmt) OVER() PerCont05

,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY OrderAmt) OVER() PerDisc05

,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY OrderAmt) OVER(PARTITION BY OrderDate) PerContDt

,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY OrderAmt) OVER(PARTITION BY OrderDate) PerDiscDt

,PERCENTILE_CONT(0) WITHIN GROUP (ORDER BY OrderAmt) OVER() PerCont0

FROM CTEOrders

Results
IDODtOAmtCNamePerCont05PerDisc05PerContDtPerDiscDtPerCont0
13/1/201210Joe1313.0010.510.0010
23/1/201211Sam1313.0010.510.0010
33/2/201210Beth1313.0015.015.0010
43/2/201215Joe1313.0015.015.0010
53/2/201217Sam1313.0015.015.0010
63/3/201212Joe1313.0012.012.0010
73/4/201210Beth1313.0011.511.0010
103/4/201211Beth1313.0011.511.0010
93/4/201212Joe1313.0011.511.0010
83/4/201218Sam1313.0011.511.0010
113/5/201214Sam1313.0014.014.0010
123/6/201217Beth1313.0018.017.0010
133/6/201219Joe1313.0018.017.0010
143/7/201213Beth1313.0014.513.0010
153/7/201216Sam1313.0014.513.0010

As I wrap up this post, I have to give a shout out to my daughter, Kristy, who is an honors math student. She helped me get my head around this last group of functions. Her honors math work and some statistical work she had done in science helped provide additional insight into the math behind the functions. (Kristy – you rock!)

Series Wrap Up

I hope this series helps everyone understand the power and flexibility in the window functions made available in SQL Server 2012. If you happen to use Oracle, I know that many of these functions or there equivalent are also available in 11g and they also appear to be in 10g. I have to admit my first real production usage was with Oracle 11g but has since used them with SQL Server 2012. The expanded functionality in SQL Server 2012 is just one more reason to upgrade to the latest version.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating