September 17, 2009 at 4:55 am
I had an odd requirement recently which involved filling in gaps in sequences with values based on the last value before the gap and the first value after the gap. Basically, we needed to fill in missing yields for a date/service combination by assuming a straight line between the last known value and the next known value. Probably easier to explain with data:
create table mytable (id int identity(1,1) PRIMARY KEY NONCLUSTERED, DATE datetime NOT NULL, service varchar(5) NOT NULL, yield decimal(10,5))
GO
CREATE CLUSTERED INDEX IX_myTable_service_date
ON MyTable (service, date);
GO
Insert into mytable (Date, service,yield)
Select '2009-09-17','A0001',.1
UNION ALL
Select '2009-09-18','A0001',NULL
UNION ALL
Select '2009-09-19','A0001',NULL
UNION ALL
Select '2009-09-20','A0001',.3
UNION ALL
Select '2009-09-21','A0001',.2
UNION ALL
Select '2009-09-22','A0001',.1
UNION ALL
Select '2009-09-23','A0001',NULL
UNION ALL
Select '2009-09-24','A0001',NULL
UNION ALL
Select '2009-09-25','A0001',NULL
UNION ALL
Select '2009-09-26','A0001',.7
UNION ALL
Select '2009-09-27','A0001',.6
UNION ALL
Select '2009-09-28','A0001',.6
UNION ALL
Select '2009-09-29','A0001',.6
UNION ALL
Select '2009-09-30','A0001',.4
UNION ALL
Select '2009-09-17','A0002',.1
UNION ALL
Select '2009-09-18','A0002',NULL
UNION ALL
Select '2009-09-19','A0002',NULL
UNION ALL
Select '2009-09-20','A0002',.3
UNION ALL
Select '2009-09-21','A0002',.2
UNION ALL
Select '2009-09-22','A0002',.1
UNION ALL
Select '2009-09-23','A0002',NULL
UNION ALL
Select '2009-09-24','A0002',NULL
UNION ALL
Select '2009-09-25','A0002',NULL
UNION ALL
Select '2009-09-26','A0002',.7
UNION ALL
Select '2009-09-27','A0002',.6
UNION ALL
Select '2009-09-28','A0002',.6
UNION ALL
Select '2009-09-29','A0002',.6
UNION ALL
Select '2009-09-30','A0002',.4
The table should look as follows after the update(don't worry about rounding differences):
12009-09-17 00:00:00.000A00010.10000
22009-09-18 00:00:00.000A00010.16667
32009-09-19 00:00:00.000A00010.23333
42009-09-20 00:00:00.000A00010.30000
52009-09-21 00:00:00.000A00010.20000
62009-09-22 00:00:00.000A00010.10000
72009-09-23 00:00:00.000A00010.25000
82009-09-24 00:00:00.000A00010.40000
92009-09-25 00:00:00.000A00010.55000
102009-09-26 00:00:00.000A00010.70000
112009-09-27 00:00:00.000A00010.60000
122009-09-28 00:00:00.000A00010.60000
132009-09-29 00:00:00.000A00010.60000
142009-09-30 00:00:00.000A00010.40000
152009-09-17 00:00:00.000A00020.10000
162009-09-18 00:00:00.000A00020.16667
172009-09-19 00:00:00.000A00020.23333
182009-09-20 00:00:00.000A00020.30000
192009-09-21 00:00:00.000A00020.20000
202009-09-22 00:00:00.000A00020.10000
212009-09-23 00:00:00.000A00020.25000
222009-09-24 00:00:00.000A00020.40000
232009-09-25 00:00:00.000A00020.55000
242009-09-26 00:00:00.000A00020.70000
252009-09-27 00:00:00.000A00020.60000
262009-09-28 00:00:00.000A00020.60000
272009-09-29 00:00:00.000A00020.60000
282009-09-30 00:00:00.000A00020.40000
Obviously there a several RBAR solutions that could accomplish this but in reality, this data is approx. 250,000+ rows as it includes all dates from today to today+365 and there are 700+ services.
There will also be many more services in the future, so I wanted to future proof it as much as possible.
After much banging my head against a wall, I did come up with a set-based solution as follows:
update m
set m.yield=LowerYield.Yield + (((UpperYield.Yield-LowerYield.Yield)/(Datediff(Day,LowerYield.Date1,UpperYield.Date1))) * Datediff(Day,LowerYield.Date1,m.date))
from mytable m
inner join (
select m.date,m.service, m1.date as date1, m1.yield as yield, row_number() over (partition by m.service,m.date order by m1.date desc) as sequence
from mytable m
inner join mytable m1 on m1.service=m.service
where m.yield is null and m1.yield is not null
and m.date > m1.date
) LowerYield on LowerYield.date=m.date and Loweryield.service=m.service and LowerYield.Sequence=1
inner join (
select m.date,m.service, m1.date as date1, m1.yield as yield, row_number() over (partition by m.service,m.date order by m1.date) as sequence
from mytable m
inner join mytable m1 on m1.service=m.service
where m.yield is null and m1.yield is not null
and m.date < m1.date
) UpperYield on UpperYield.date=m.date and UpperYield.service=m.service and UpperYield.Sequence=1
select * from mytable
drop table mytable[/code]
I'm happy with the performance from this (approx. 44 secs on the full data set on quad core 2.2Ghz, 16gb RAM, RAID 10 SCSI disks), but would like to see if anyone can come up with a more elegant solution!
This is a real requirement (with table/column names changed and fake data). If anyone's interested, it's to be used to order a list of web results according to their profitability (and other factors). As the data's relatively static, it makes most sense to do this as a once a day batch job from the sql server rather than attempt this real-time where performance is the key driver.
September 17, 2009 at 6:56 am
Had I had the full data, maybe I cuold come up with something else, but all in all it looks OK..
One thing though: Wouldn't it be better if you added an extra WHERE to the last row?
WHERE m.yield IS NULL
Because I guess you only want to update empty rows. It might make a big difference when it comes to large sets...
September 17, 2009 at 7:04 am
Well the Inner Joins restrict the data set to only NULL yields anyway, but I'll give it a spin and see if the execution plan/performance changes after adding that.
I'll also see if I can provide some sql to give a representative number of rows/spread of data to the real thing.
Nothing urgent, I'm happy with the solution, I've just spent quite a while looking at this and feel I must have missed a simpler solution.
It's kind of similar to a running totals problem, but you also have to consider what's 'in front' of the row as well as 'behind' it.
September 17, 2009 at 7:16 am
UPDATE t SET yield = b.yield + DATEDIFF(d,b.date,t.date)*1. / DATEDIFF(d,b.date,a.date) * (a.yield - b.yield)
FROM
dbo.mytable t JOIN
(SELECT id,
(SELECT TOP 1 id
FROM dbo.mytable
WHERE service = m.service AND date m.date AND yield IS NOT NULL
ORDER BY date) above
FROM dbo.mytable m
WHERE yield IS NULL)n ON t.id = n.id
JOIN dbo.mytable a ON n.above = a.id
JOIN dbo.mytable b ON n.below = b.id;
September 17, 2009 at 7:49 am
Nice one. Hadn't thought of doing that!
Well, with my data, the original method and with Benyos's clause gave consistently the same time (give or take .5 secs), but the extra clause adds more 'logical reads'. Hans method was very slightly slower in all runs and was more expensive in CPU, but there's not much in it.
Results below - I dropped and recreated the source table on each run and ran each one 3 times to make sure the results were consistent:
--Hans
Table 'mytable'. Scan count 561052, logical reads 2395243, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 4, logical reads 140278, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 61969 ms, elapsed time = 16616 ms.
--Benyos
Table 'mytable'. Scan count 144951, logical reads 673414, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 4, logical reads 140286, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 26499 ms, elapsed time = 13803 ms.
--Mine
Table 'mytable'. Scan count 82612, logical reads 474057, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 4, logical reads 140286, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 26065 ms, elapsed time = 13586 ms.
September 17, 2009 at 8:41 am
I love a CTE ... 🙂
with cteGaps(Date,Service,yield,Rown)
as
(
Select Date,Service,yield,row_number() over (partition by service ,case when yield is null then 1 else 0 end order by date desc)
from #mytable
)
,
cteRanges(Service,StartDate,EndDate)
as
(
select service,min(date),max(date)
from cteGaps where yield is null
group by service,date+rown
),
CteStepMaths(Service,StartDate,EndDate,StepYieldAmt,PrevYield)
as
(
Select cteRanges.Service,
Startdate,
EndDate,
StepYieldAmt = convert(decimal(10,5),convert(decimal(10,5),post.yield - Prev.Yield) / convert(decimal(10,5),(datediff(dd,StartDate,EndDate)+2.00000000))),
Prev.Yield
from cteRanges join #mytable Prev on Prev.service = cteRanges.Service
and Prev.Date = cteRanges.StartDate - 1
join #mytable Post on Post.service = cteRanges.Service
and Post.Date = cteRanges.EndDate +1
)
,
cteStepsSoln(Service,Date,Yield)
as
(
Select #MyTable.Service,
#MyTable.Date,
PrevYield+(StepYieldAmt * (DateDiff(dd,StartDate,#MyTable.date)+1))
from CteStepMaths
join #MyTable
on #MyTable.Service = CteStepMaths.Service
and #MyTable.Date between CteStepMaths.StartDate and CteStepMaths.EndDate
)
Select #myTable.id,#myTable.DATE ,#myTable.service ,coalesce(#myTable.Yield,cteStepsSoln.Yield)
from #myTable
left outer join cteStepsSoln
on #myTable.Service = cteStepsSoln.Service
and #myTable.Date = cteStepsSoln.Date
September 17, 2009 at 9:03 am
This makes my head hurt, but trounces my solution 😀 (I've converted yours to update the source table as per the query instead of outputing to select). Results below:
Table 'mytable'. Scan count 5590, logical reads 865294, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 4, logical reads 574664, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 10059 ms, elapsed time = 5385 ms.
September 17, 2009 at 9:20 am
Well , the challenging part is sorting out the missing data ranges.
Its an extension on this technique.
http://sqlblogcasts.com/blogs/sqlandthelike/archive/2009/08/27/sql-and-contiguous-data-ranges.aspx
September 17, 2009 at 9:52 am
Ah, yes, I understand now. It was mainly the +2.000000 in the calculation that was confusing me, but I get it now, it's just that you're picking the dates inside the gap rather than outside.
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply