July 21, 2016 at 1:42 pm
I have a table keyed by an id and entry date that has a lot of data nulled out before the last row. I would like to get the last id/date with the last non-null value from each column. Here is an example:
[font="Courier New"]Example Data
id date col1 col2 col3 col4 col5 col6
1 7/17/2016 aaaa NULL NULL 2222 1111 NULL
1 7/18/2016 bbbb xxxx NULL NULL 2222 NULL
1 7/19/2016 NULL NULL jjjj 1111 3333 NULL
1 7/20/2016 NULL NULL NULL NULL 4444 8888
1 7/21/2016 NULL NULL NULL NULL 5555 NULL
2 7/17/2015 RRRR NULL NULL 9999 2222 NULL
2 7/18/2015 LLLL qqqq NULL NULL 5555 NULL
2 7/19/2015 NULL NULL NULL pppp 3737 NULL
Desired Result
id date col1 col2 col3 col4 col5 col6
1 7/21/2016 bbbb xxxx jjjj 1111 5555 8888
2 7/19/2015 LLLL qqqq pppp 3737 3333 NULL[/font]
I can get the last non-null row for a column for each id with the following:
select id, c1
from (select id, max(date) dte from tbl a where col1 is not null group by id) a
join tbl b on a.id = b.id and a.dte = b.date
and I can repeat this in common table expressions and join them all together but am hitting performance and memory issues. I only need to do this once or twice. The real table has about 4 million rows and 30 columns of interest with 360000 ids. I can have hundreds of rows per ID. I've tried using LAG() but can only figure out how to go back a specific number of rows. Doing COALESCE() against a lot of lag values is not practical or probably even supported for several hundred LAG() values.
I've also tried storing the results in temporary tables and then joining them together but still run in to space issues like the following error:
Msg 1105, Level 17, State 2, Line 1
Could not allocate space for object '<temporary system object: 422355172524032>' in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
Next thing I'll try is creating a temporary table and then updating each column to it as it is calculated. Is there are better approach?
July 21, 2016 at 3:04 pm
Here is a two-pass approach to get you the values you want. The first pass gets the maximum date within ID for which each column has a (non-null) value. The second pass then retrieves that value. I don't know what to tell you about your disk allocation, but if you are running into memory contention, I would suggest running this against ranges of IDs, rather than attempting the entire table at one time. For example, run IDs 1 - 99,999 followed by 100,000 to 199,999 etc.
Hope this helps.
declare @sampledata Table (ID int, [Date] Date , Column1 char(4), Column2 char(4), Column3 char(4), Column4 char(4), Column5 char(4), Column6 char(4), primary key (ID, [Date]))
insert into @sampledata
SELECT '1', '7/17/2016', 'aaaa', NULL, NULL, '2222 ', '1111 ', NULL union all
SELECT '1', '7/18/2016', 'bbbb', 'xxxx ', NULL, NULL, '2222 ', NULL union all
SELECT '1', '7/19/2016', NULL, NULL, 'jjjj ', '1111 ', '3333 ', NULL union all
SELECT '1', '7/20/2016', NULL, NULL, NULL, NULL, '4444 ', '8888' union all
SELECT '1', '7/21/2016', NULL, NULL, NULL, NULL, '5555 ', NULL union all
SELECT '2', '7/17/2015', 'RRRR', NULL, NULL, '9999 ', '2222 ', NULL union all
SELECT '2', '7/18/2015', 'LLLL', 'qqqq ', NULL, NULL, '5555 ', NULL union all
SELECT '2', '7/19/2015', NULL, NULL, NULL, 'pppp ', '3737 ', NULL
select * from @sampledata
-- first pass, identify dates
select id, max(case when column1 is not null then [Date] end) as Date1
, max(case when Column2 is not null then [Date] end) as Date2
, max(case when Column3 is not null then [Date] end) as Date3
, max(case when Column4 is not null then [Date] end) as Date4
, max(case when Column5 is not null then [Date] end) as Date5
, max(case when Column6 is not null then [Date] end) as Date6
, max([Date]) as MaxDate
into #temp
from @sampledata
group by ID
create unique clustered index PK_#temp on #temp(ID)
-- second pass, retrieve values
select s.ID, max(MaxDate) as [Date], max(case when [Date1] = [Date] then Column1 end) as Column1
,max(case when [Date2] = [Date] then Column2 end) as Column2
,max(case when [Date3] = [Date] then Column3 end) as Column3
,max(case when [Date4] = [Date] then Column4 end) as Column4
,max(case when [Date5] = [Date] then Column5 end) as Column5
,max(case when [Date6] = [Date] then Column6 end) as Column6
from @sampleData s
join #temp on #temp.ID = s.ID
group by s.ID
Edited to add: I think your sample data may be a little confused, you are showing 3737 for an expected result in column4, but that value only appears in column5.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
July 21, 2016 at 3:05 pm
That's because LAG is for a relative position. FIRST_VALUE and LAST_VALUE are for absolute positions. You do have to get a little tricky, because you need to ensure that all the NULL values group on one end of your ORDER BY clause. You should also be aware that FIRST_VALUE and LAST_VALUE require a frame. If you don't specify one, it will use the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which will not give you the desired results for LAST_VALUE.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
July 21, 2016 at 3:12 pm
July 22, 2016 at 7:33 am
THANK YOU! That worked very well. Instead of using #temp I put it in a CTE and it ran in 11 seconds which is acceptable performance for my needs.
July 22, 2016 at 4:54 pm
Try this. I get the results that you want, but you should try it in your system to check the performance.
with ct1
as
(
select t.ID,t.Date, case when t.Column1 is null then t.Column1 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column1 end as 'Column1',
case when t.Column2 is null then t.Column2 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+'-'+ t.Column2 end as 'Column2',
case when t.Column3 is null then t.Column3 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column3 end as 'Column3',
case when t.Column4 is null then t.Column4 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column4 end as 'Column4',
case when t.Column5 is null then t.Column5 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column5 end as 'Column5',
case when t.Column6 is null then t.Column6 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column6 end as 'Column6'
from #t as t
)
,
cte2
as
(
select c1.Id, max(c1.date) as Date, Max(column1) as column1, max(column2) as column2, max(column3) as column3,
max(column4) as column4,max(column5) as column5, max(column6) as column6
from ct1 as c1
group by c1.id
)
select c2.Id,
c2.Date,
Right(c2.column1,len(c2.column1)-2) as column1,
Right(c2.column2,len(c2.column2)-2) as column2,
Right(c2.column3,len(c2.column3)-2) as column3,
Right(c2.column4,len(c2.column4)-2) as column4,
Right(c2.column5,len(c2.column5)-2) as column5,
Right(c2.column6,len(c2.column6)-2) as column6
from cte2 as c2
July 23, 2016 at 1:24 am
Quick suggestion, slightly different but easy to scale on large number of columns
😎
USE TEEST;
GO
SET NOCOUNT ON;
--
declare @sampledata Table
(
ID int
, [Date] Date
, Column1 char(4)
, Column2 char(4)
, Column3 char(4)
, Column4 char(4)
, Column5 char(4)
, Column6 char(4)
, primary key (ID, [Date])
);
insert into @sampledata
SELECT '1', '7/17/2016', 'aaaa', NULL, NULL, '2222 ', '1111 ', NULL union all
SELECT '1', '7/18/2016', 'bbbb', 'xxxx ', NULL, NULL, '2222 ', NULL union all
SELECT '1', '7/19/2016', NULL, NULL, 'jjjj ', '1111 ', '3333 ', NULL union all
SELECT '1', '7/20/2016', NULL, NULL, NULL, NULL, '4444 ', '8888' union all
SELECT '1', '7/21/2016', NULL, NULL, NULL, NULL, '5555 ', NULL union all
SELECT '1', '7/22/2016', NULL, NULL, NULL, NULL, NULL, NULL union all
SELECT '2', '7/17/2015', 'RRRR', NULL, NULL, '9999 ', '2222 ', NULL union all
SELECT '2', '7/18/2015', 'LLLL', 'qqqq ', NULL, NULL, '5555 ', NULL union all
SELECT '2', '7/19/2015', NULL, NULL, 'abcd ', NULL, NULL, NULL union all
SELECT '2', '7/20/2015', NULL, NULL, NULL, 'pppp ', '3737 ', NULL;
;WITH BASE_DATA AS
(
SELECT
SD.ID
,SD.Date
,X.COLID
,X.COLVAL
FROM @sampledata SD
CROSS APPLY
(
SELECT 1, SD.Column1 UNION ALL
SELECT 2, SD.Column2 UNION ALL
SELECT 3, SD.Column3 UNION ALL
SELECT 4, SD.Column4 UNION ALL
SELECT 5, SD.Column5 UNION ALL
SELECT 6, SD.Column6
) X(COLID,COLVAL)
)
,LAST_NOT_NULL_ID_DATE AS
(
SELECT
BD.ID
,BD.COLID
,MAX(BD.Date) AS LAST_DATE
,MAX(BD.COLVAL) AS LAST_VALUE
FROM BASE_DATA BD
WHERE BD.COLVAL IS NOT NULL
GROUP BY BD.ID
,BD.COLID
)
SELECT
LD.ID
,MAX(LD.LAST_DATE ) AS LAST_DATE
,MAX(CASE WHEN LD.COLID = 1 THEN LD.LAST_VALUE END) AS COL1
,MAX(CASE WHEN LD.COLID = 2 THEN LD.LAST_VALUE END) AS COL2
,MAX(CASE WHEN LD.COLID = 3 THEN LD.LAST_VALUE END) AS COL3
,MAX(CASE WHEN LD.COLID = 4 THEN LD.LAST_VALUE END) AS COL4
,MAX(CASE WHEN LD.COLID = 5 THEN LD.LAST_VALUE END) AS COL5
,MAX(CASE WHEN LD.COLID = 6 THEN LD.LAST_VALUE END) AS COL6
FROM LAST_NOT_NULL_ID_DATE LD
GROUP BY LD.ID
;
Output
ID LAST_DATE COL1 COL2 COL3 COL4 COL5 COL6
----------- ---------- ---- ---- ---- ---- ---- ----
1 2016-07-21 bbbb xxxx jjjj 2222 5555 8888
2 2015-07-20 RRRR qqqq abcd pppp 5555 NULL
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply