January 6, 2016 at 11:49 am
I have a table with gaps in the data. The requirement is to fill in the gaps with the data from the last month that had data.
Here is a sample of the source data:
SOURCEUSER_IDPHASE PHASE_MONTH
CMN625150Open 1/1/2014
CMN625150In Work 3/1/2014
CMN625150In Work 9/1/2014
CMN625150Released 10/1/2014
The result should be:
CMN625150Open 1/1/2014
CMN625150Open 2/1/2014
CMN625150In Work 3/1/2014
CMN625150In Work 4/1/2014
CMN625150In Work 5/1/2014
CMN625150In Work 6/1/2014
CMN625150In Work 7/1/2014
CMN625150In Work 8/1/2014
CMN625150In Work 9/1/2014
CMN625150Released 10/1/2014
CMN625150Released 11/1/2014
etc...
I created a CTE but it is throwing a overflow error. Here is some sample code:
CREATE TABLE #SAMPLETABLE
(
[SOURCE] [varchar](50) NOT NULL,
[USER_ID] [varchar](50) NOT NULL,
[PHASE] [varchar](100) NOT NULL,
[PHASE_MONTH] [datetime] NOT NULL
)
insert into #SAMPLETABLE
Select 'CMN','625150','Open','1/1/2014' Union all
Select 'CMN','625150','In Work','3/1/2014' Union all
Select 'CMN','625150','In Work','9/1/2014' Union all
Select 'CMN','625150','Released','10/1/2014'
;with cte as
(
select SOURCE,USER_ID,PHASE,PHASE_MONTH
from #SAMPLETABLE
union all
select SOURCE,USER_ID,PHASE,dateadd(mm, 1, PHASE_MONTH)
from cte
where not exists
--the resultant YearMonth value should not lie in the original table
( select 1 from #SAMPLETABLE s
where s.PHASE_MONTH = dateadd(mm, 1, cte.PHASE_MONTH)
--stop at current month
and dateadd(mm, 1, cte.PHASE_MONTH) < GETDATE()
)
)
select SOURCE,USER_ID,PHASE,PHASE_MONTH from cte
order by PHASE_MONTH
option (maxrecursion 0)
drop table #SAMPLETABLE
Can someone please tell me what I need to do to get this to work correctly?
The source data has about 100K rows.
January 6, 2016 at 1:04 pm
The immediate problem is the location of your clause that restricts the results to dates before today.
Right now it's in the NOT EXISTS subquery, which means that once rows start getting generated with dates greater than today, the subquery won't return rows for those dates. When the subquery doesn't return rows, the NOT EXISTS evaluates to true, so the recursion continues until you've exceeded the maximum value for datetime.
To fix that problem, you just need to move this
AND DATEADD(mm, 1, cte.PHASE_MONTH) < GETDATE()
outside the NOT EXISTS subquery.
Additionally, there's likely a much more efficient way to solve this problem than the rCTE. I can't test alternate solutions right now, but when I can I'll follow up.
Cheers!
EDIT: Got bitten by the greater-than/less-than transformations in code tags again, so removed them.
January 6, 2016 at 1:17 pm
So Simple - Thanks!
I tried doing this using nested loops and it kept crashing my system after hours of processing. I just made the change to the CTE and it took 9 seconds.
January 6, 2016 at 1:46 pm
Here's an alternative to generate the rows using cascade CTEs.
I'm including a SQL 2012+ version which you might want to use if at all possible.
WITH
E(n) AS(
SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
),
E2(n) AS(
SELECT a.n FROM E a, E b
),
cteTally(n) AS(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n
FROM E2
),
Phases AS(
SELECT SOURCE,
USER_ID,
PHASE,
PHASE_MONTH,
ROW_NUMBER() OVER( PARTITION BY SOURCE, USER_ID ORDER BY PHASE_MONTH) rn
FROM #SAMPLETABLE
),
PhasesRanges AS(
SELECT s.SOURCE,
s.USER_ID,
s.PHASE,
s.PHASE_MONTH,
ISNULL( e.PHASE_MONTH, DATEADD(MM, DATEDIFF(MM, 0, GETDATE()) + 1, 0)) AS end_range
FROM Phases s
LEFT
JOIN Phases e ON s.source = e.source
AND s.user_id = e.user_id
AND s.rn = e.rn - 1
)
SELECT p.SOURCE,
p.USER_ID,
p.PHASE,
DATEADD(MM, t.n - 1, p.PHASE_MONTH) PHASE_MONTH
FROM cteTally t
JOIN PhasesRanges p ON p.PHASE_MONTH <= DATEADD(MM, -t.n, p.end_range)
ORDER BY PHASE_MONTH;
-- SQL Server 2012+ version using LEAD()
WITH
E(n) AS(
SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
),
E2(n) AS(
SELECT a.n FROM E a, E b
),
cteTally(n) AS(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n
FROM E2
),
Phases AS(
SELECT SOURCE,
USER_ID,
PHASE,
PHASE_MONTH,
ISNULL( LEAD(PHASE_MONTH) OVER( PARTITION BY SOURCE, USER_ID ORDER BY PHASE_MONTH) , DATEADD(MM, DATEDIFF(MM, 0, GETDATE()) + 1, 0)) AS end_range
FROM #SAMPLETABLE
)
SELECT p.SOURCE,
p.USER_ID,
p.PHASE,
DATEADD(MM, t.n - 1, p.PHASE_MONTH) PHASE_MONTH
FROM cteTally t
JOIN Phases p ON p.PHASE_MONTH <= DATEADD(MM, -t.n, p.end_range)
ORDER BY PHASE_MONTH;
Note: The PARTITION BY might be removed if it's not needed and it's causing incorrect results.
January 10, 2016 at 5:07 am
Further on Luis's fine answer, make certain that the proper POC (partition, order, covering) index is in place as it will remove the need for a very expensive sort operator in the execution plan.
😎
POC index for the #SAMPLETABLE
CREATE NONCLUSTERED INDEX NCLIDX_TMP_SAMPLETABLE_SOURCE_USER_ID_PHASE_MONTH_INCL_PASE ON #SAMPLETABLE
(
[SOURCE] ASC
,[USER_ID] ASC
,[PHASE_MONTH] ASC
)
INCLUDE ([PHASE]);
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply