August 21, 2014 at 1:03 pm
Hi SSC,
I've got a problem I'm trying to solve as efficiently as possible. I've attached simplified data, but to help conceptualize it: for each day in the life of a security, I want to figure out what the highest value was over the past 52 weeks.
In SQL Server 2012+, this is a piece of cake using max() over() with a row frame, but alas, we don't have 2012+ on any prod boxes.
Two Questions:
1) Does anyone know an efficient way to do this? (p.s. A quirky update came to mind, but I couldn't figure out how to apply it to this problem.)
2) Does anyone know roughly how max() over() with a row frame works under the covers? Maybe I can reverse engineer it in an efficient manner.
use master
go
set nocount on
go
/**********************
** BUILD SAMPLE DATA **
**********************/
if object_id('tempdb.dbo.#dat') is not null drop table #dat
create table #dat
(
RID int identity(1,1) primary key clustered,
Value float,
MaxRollingValue float
)
insert into #dat(Value)
select top 10000 (abs(checksum(newid())) % 1000) * rand()
from sys.objects a, sys.objects b
/******************
** SET MAX VALUE **
******************/
--Super efficient, but 2012+ only
--1 Scan, 38 LR
select
RID,
MaxRollingValue = max(Value) over (order by RID -- order by RID: the frame must follow the sequence
rows between 10 preceding and current row)
from #dat
--Simple to understand, but horrible performance.
--10001 scans, 31691 LR
select
d1.RID,
MaxRollingValue = max(d2.Value)
from #dat d1
inner join #dat d2
on d1.RID >= d2.RID
and d1.RID - 10 <= d2.RID
and d1.Value <= d2.Value
group by d1.RID
August 21, 2014 at 3:11 pm
The Quirky Update is easy to get, assuming that you already know the rules to follow.
Here's an example with your sample data.
DECLARE @Value float = 0,
@id int
UPDATE t SET
@Value = MaxRollingValue = CASE WHEN Value > @Value THEN Value ELSE @Value END,
@id = RID
FROM #dat t WITH (TABLOCKX)
OPTION (MAXDOP 1)
August 21, 2014 at 3:18 pm
Thanks, Luis. The only problem there is that it doesn't work for a rolling window.
If the highest value is at the tail end of the window and you advance one row, the high must change to the second-highest value. Then, if that remained the high until IT fell out of scope, you'd have to switch to the third-highest, and so on.
August 21, 2014 at 3:25 pm
Can you give me an example of what you just said? Are you trying to do a ranking? Or am I misunderstanding you?
Partitioning the results (using windows) is possible, but I'm trying to figure out what you need.
August 21, 2014 at 3:33 pm
Sure. So turning the data set on its side, you could think of the sequence of values like this:
5,3,2,3,1,4,6....
The update you described works if you just want to know the highest value at any point along the whole series, but I want to know the highest value for only the past (let's say) 3 records. So let's read from left to right (with the current frame indicated by brackets):
1) [5],3,2,3,1,4,6. 5 is the highest (by default)
2) [5,3],2,3,1,4,6. 5 is still the highest
3) [5,3,2],3,1,4,6. 5 is still the highest
4) 5,[3,2,3],1,4,6. 3 is now the highest, but you have no way of knowing this since the highest value, against which we were previously comparing to determine the next in the series, is now out of scope.
If you're aware of a way to use an over clause to partition over a range without the use of ROWS or RANGE, I'd be interested to know.
August 21, 2014 at 4:00 pm
I totally missed that part and understand your pain.
For now, the logic indicates that the "wide" join you're using seems the best option, with one slight modification: you can remove the last condition (d1.Value <= d2.Value) to cut the logical reads by about a third. At least that's been consistent in my tests.
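In other words, something like this (the same join with the Value comparison dropped; worth verifying the reads against your own data):
select
d1.RID,
MaxRollingValue = max(d2.Value)
from #dat d1
inner join #dat d2
on d1.RID >= d2.RID
and d1.RID - 10 <= d2.RID
group by d1.RID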
I'll try to think further.
August 21, 2014 at 4:34 pm
Hokay, so I think I have an algorithm to do this with a quirky update. There may be some bugs in it because there are so many combinations of how the elements in the window can be arranged, but I think, more or less, this does the trick.
Early on I had the idea to store off the second-highest value, so that if the first-highest value rolled off, the second-highest would become the first. Originally I was storing the second-highest value in the window whether it was before or after the first-highest value. But after I jotted it down on paper, it became clear that you only need to store the second-highest value if it's IN FRONT of the first-highest. No value smaller than the first-highest value, behind it in the sequence, would ever supplant the first-highest.
Consider the following sequence and assume a rolling window of 3 (F = first highest, S = second highest):
6,2,5,3,7,1,1,3
• [6],2,5,3,7,1,1,3
  o F = 6
  o S = null
• [6,2],5,3,7,1,1,3
  o F = 6
  o S = 2
• [6,2,5],3,7,1,1,3
  o F = 6
  o S = 5
• 6,[2,5,3],7,1,1,3
  o 6 falls off
  o F = 5
  o S = 3
• 6,2,[5,3,7],1,1,3
  o F = 7
  o S = null (again, we're only interested in the second-highest value IN FRONT of the first-highest)
• 6,2,5,[3,7,1],1,3
  o F = 7
  o S = 1
• 6,2,5,3,[7,1,1],3
  o F = 7
  o S = 1
• 6,2,5,3,7,[1,1,3]
  o 7 falls off
  o F = 3
  o S = null
All these rules can be applied on successive rows, making a quirky update possible. This works for the randomized data I originally asked about, but to show it in SQL, I've hard-coded this exact sequence into the script below.
Note the use of @temp<name> variables. The operations need to take place on the variables as they were when they entered the update. If you were to use the actual variables, then as soon as the first clause modified @first, it would throw off the subsequent variable assignments.
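To illustrate the hazard with a stripped-down, hypothetical example:
-- hypothetical demo: the assignments in a multi-assignment UPDATE run left to right,
-- so once @first is reassigned, the @second clause sees the NEW @first
declare @first float = -1, @second float = -1;
update d set
@first = case when Value > @first then Value else @first end,
-- this CASE compares against the value @first was just given above,
-- not the value it held when the row entered the update
@second = case when Value < @first and Value > @second then Value else @second end
from #dat d with (tablockx)
option (maxdop 1);
Buffering into the @temp variables and copying them back at the end sidesteps this.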
I think the windowed function in 2012+ still beats mine, but this is considerably better than any of the RBAR solutions in terms of performance.
/**********************
** BUILD SAMPLE DATA **
**********************/
if object_id('tempdb.dbo.#dat') is not null drop table #dat
create table #dat
(
RID int identity(1,1) primary key clustered,
Value float,
MaxRollingValue float,
Debug varchar(max),
FirstRID int
)
go
insert into #dat(Value)
values (6),(2),(5),(3),(7),(1),(1),(3)
--select top 10000 (abs(checksum(newid())) % 1000) * rand()
--from sys.objects a, sys.objects b
declare
@tempFirst float,
@first float = -1,
@tempFirstRID int,
@firstRID int = 1,
@tempSecond float,
@second float = -1,
@tempSecondRID int,
@secondRID int = null,
@interval int = 3,
@Anchor int
update #dat with (tablockx)
set @Anchor = RID,
@tempFirst = case when Value > @first then Value
when (RID - @interval) = @firstRID and Value < @second then @second
when (RID - @interval) = @firstRID and Value > @second then Value
else @first
end,
@tempFirstRID = case when Value > @first then RID
when (RID - @interval) = @firstRID and Value < @second then @secondRID
when (RID - @interval) = @firstRID and Value > @second then RID
else @firstRID
end,
@tempSecond = case when Value > @first then -1
when Value < @first and Value > @second then Value
when (RID - @interval) = @firstRID then -1
else @second
end,
@tempSecondRID = case when (RID - @interval) = @firstRID then null
when Value > @first then null
when Value < @first and Value > @second then RID
else @secondRID
end,
@first = @tempFirst,
@firstRID = @tempFirstRID,
@second = @tempSecond,
@secondRID = @tempSecondRID,
MaxRollingValue = @first,
FirstRID = @firstRID,
Debug = concat('@first:', @first, ' @firstRID:', @firstRID, ' @second:', @second, ' @secondRID:', @secondRID)
option (maxdop 1)
select *
from #dat
order by RID
August 21, 2014 at 10:31 pm
Here is a little "first coffee in the morning" solution.
It uses two instances of a Tally CTE to generate a set {N+0, ..., N+x} for framing the x rows preceding and the current row (kind of).
😎
DECLARE @SET_SIZE INT = 10;
--SET STATISTICS IO ON;
;WITH T(N) AS (SELECT N FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N)) -- 10 rows
,NUMS(N) AS (SELECT TOP(SELECT COUNT(*) FROM #dat) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM T T1,T T2,T T3,T T4,T T5) -- tally: 1..row count of #dat
,NM10(N) AS (SELECT TOP (@SET_SIZE) N - 1 FROM NUMS) -- frame offsets: 0..@SET_SIZE-1
,CALC_SET AS
(
SELECT
NM.N
,NM.N + N10.N AS NX -- each N paired with @SET_SIZE consecutive row numbers
FROM NUMS NM
OUTER APPLY NM10 N10
)
SELECT
CS.N
,MAX(DT.Value) AS MAX_VAL
FROM CALC_SET CS
INNER JOIN #dat DT
ON CS.NX = DT.RID
GROUP BY CS.N
--ORDER BY CS.N
--SET STATISTICS IO OFF;
IO statistics:
Table 'Worktable'. Scan count 1, logical reads 20438, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#dat________________________________________________________________________________________________________________000000000023'. Scan count 3, logical reads 114, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
August 21, 2014 at 10:46 pm
After the second coffee I realised that the solution was in fact "current row and X following"; here is an "X preceding and current row" improvement.
😎
use master
go
set nocount on
go
/**********************
** BUILD SAMPLE DATA **
**********************/
if object_id('tempdb.dbo.#dat') is not null drop table #dat
create table #dat
(
RID int identity(1,1) primary key clustered,
Value float,
MaxRollingValue float
)
insert into #dat(Value)
select top 10000 (abs(checksum(newid())) % 1000) * rand()
from sys.objects a, sys.objects b
DECLARE @SET_SIZE INT = 10;
SET STATISTICS IO ON;
;WITH T(N) AS (SELECT N FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N))
,NUMS(N) AS (SELECT TOP(SELECT COUNT(*) FROM #dat) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM T T1,T T2,T T3,T T4,T T5)
,NM10(N) AS (SELECT TOP (@SET_SIZE) N - 1 FROM NUMS)
,PRENUM(RID,Value) AS (SELECT -N AS RID,0 AS Value FROM NM10) -- padding rows (RID <= 0, Value 0) so the leading frames are full
,CALC_SET AS
(
SELECT
NM.N
,NM.N - N10.N AS NX -- subtract the offsets to frame preceding rows; leading frames land on the PRENUM padding
FROM NUMS NM
OUTER APPLY NM10 N10
)
,EXTENDED_DAT AS
(
SELECT
RID
,Value
FROM PRENUM
UNION ALL
SELECT
RID
,Value
FROM #dat
)
SELECT
CS.N
,MAX(DT.Value) AS MAX_VAL
FROM CALC_SET CS
INNER JOIN EXTENDED_DAT DT
ON CS.NX = DT.RID
GROUP BY CS.N
ORDER BY CS.N
SET STATISTICS IO OFF;
August 22, 2014 at 5:16 pm
Wow, that works quite well. Given that I discovered some flaws in my post where I thought I had it with the quirky update, I'll try using this against my actual data set and see if everything holds water. Thanks!
August 23, 2014 at 12:03 am
JeeTee (8/22/2014)
Wow, that works quite well. Given that I discovered some flaws in my post where I thought I had it with the quirky update, I'll try using this against my actual data set and see if everything holds water. Thanks!
Quick thought: I don't think the solution is perfect, but it should be easy to improve. I haven't had time to look at it in more detail, but I suspect that the leading-in part, where N < @SET_SIZE, might need improving.
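If anyone wants to verify that part, comparing the leading rows against a brute-force join like the one in the opening post should show it. The sketch below uses a 10-row frame to match @SET_SIZE (note that the opening post's join actually spans 11 rows):
-- brute-force reference for the leading rows only; compare with the CTE output
select
d1.RID,
MaxRollingValue = max(d2.Value)
from #dat d1
inner join #dat d2
on d2.RID between d1.RID - 9 and d1.RID
where d1.RID <= 10
group by d1.RID
order by d1.RID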
😎
August 24, 2014 at 6:29 pm
Stick with a quirky update:
use master
go
set nocount on
go
/**********************
** BUILD SAMPLE DATA **
**********************/
if object_id('tempdb.dbo.#dat') is not null drop table #dat
create table #dat
(
RID int identity(1,1) primary key clustered,
Value float,
MaxRollingValue float
)
insert into #dat(Value)
select top 10000 (abs(checksum(newid())) % 1000) * rand()
from sys.objects a, sys.objects b
/******************
** SET MAX VALUE **
******************/
--Super efficient, but 2012+ only
--1 Scan, 38 LR
--select
-- RID,
-- MaxRollingValue = max(Value) over (order by (select null)
-- rows between 10 preceding and current row)
--from #dat
--SELECT *
--FROM #dat;
--Simple to understand, but horrible performance.
--10001 scans, 31691 LR
select
d1.RID,
MaxRollingValue = max(d2.Value)
from #dat d1
inner join #dat d2
on d1.RID >= d2.RID
and d1.RID - 10 <= d2.RID
and d1.Value <= d2.Value
group by d1.RID
DECLARE @Lag1 FLOAT = NULL -- FLOAT, not INT: Value is float, and INT lags would silently truncate it
,@Lag2 FLOAT = NULL
,@Lag3 FLOAT = NULL
,@Lag4 FLOAT = NULL
,@Lag5 FLOAT = NULL
,@Lag6 FLOAT = NULL
,@Lag7 FLOAT = NULL
,@Lag8 FLOAT = NULL
,@Lag9 FLOAT = NULL; -- nine lags plus the current row give a 10-row window; add a @Lag10 to mirror ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
UPDATE #dat WITH(TABLOCKX)
SET MaxRollingValue =
(
SELECT MAX(lag)
FROM (VALUES(@Lag1),(@Lag2),(@Lag3),(@Lag4),(@Lag5),(@Lag6),(@Lag7),(@Lag8),(@Lag9),(Value)) a(lag)
)
,@Lag9 = @Lag8
,@Lag8 = @Lag7
,@Lag7 = @Lag6
,@Lag6 = @Lag5
,@Lag5 = @Lag4
,@Lag4 = @Lag3
,@Lag3 = @Lag2
,@Lag2 = @Lag1
,@Lag1 = Value
OPTION (MAXDOP 1);
SELECT *
FROM #dat;
This is a knock-off of what I did in this article: https://www.simple-talk.com/sql/t-sql-programming/calculating-values-within-a-rolling-window-in-transact-sql/
August 24, 2014 at 6:44 pm
Thanks, Dwain. I'll look this solution over as well and see how it fits the data set I'm working with.
Fwiw (and this doesn't quite work in pure SQL), I did some research on different algorithms to achieve this and I think the best one is illustrated in this post:
http://softwarelearner.blogspot.com/2011/04/minima-in-sliding-window.html
Admittedly, this is not TSQL, but I thought I'd share the fruits of my research.
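To give a flavour of the approach, here is a rough T-SQL sketch of the idea (inverted for maxima), with the deque simulated by a table variable. It's purely illustrative RBAR, not something tuned for production, and the names are mine:
-- monotonic-deque rolling max, per the linked "sliding window minima" post:
-- keep only candidates that could still become the max of some future window
declare @deque table (RID int primary key, Value float); -- candidate maxima; values decrease as RID increases
declare @interval int = 10; -- "10 preceding and current row"
declare @rid int = 1, @val float, @maxRID int;
select @maxRID = max(RID) from #dat;
while @rid <= @maxRID
begin
select @val = Value from #dat where RID = @rid;
delete from @deque where RID < @rid - @interval; -- expire candidates that left the window
delete from @deque where Value <= @val; -- smaller candidates can never be the max again
insert into @deque (RID, Value) values (@rid, @val);
update #dat set MaxRollingValue = (select max(Value) from @deque) where RID = @rid; -- the surviving front is the rolling max
set @rid += 1;
end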
It's my guess that the function Microsoft built into 2012 probably does something along those lines. Either way, there have been a few ingenious approaches to solving this issue in SQL posted thus far. Thanks!
August 24, 2014 at 7:01 pm
I looked at your link, and I would caution you to be careful when you try to translate procedural solutions like that "ascending minima" thingy into a set-based (declarative) one. Often they don't translate well. For example, in that case you'd end up sorting each subgroup of 10 elements, which isn't needed.
You were on the right track initially thinking Quirky Update (which is what mine does). The issue is that it becomes a bit of a curiosity when you need to create the sliding window.
BTW. Like your job title!