Getting last good data for each id from each column

Question

Getting last good data for each id from each column

tom.w.brannon

Hall of Fame

Points: 3976
More actions
July 21, 2016 at 1:42 pm

#400449

I have a table keyed by an id and entry date that has a lot of data nulled out before the last row. I would like to get the last id/date with the last non-null value from each column. Here is an example:
[font="Courier New"]Example Data
id date col1 col2 col3 col4 col5 col6
1 7/17/2016 aaaa NULL NULL 2222 1111 NULL
1 7/18/2016 bbbb xxxx NULL NULL 2222 NULL
1 7/19/2016 NULL NULL jjjj 1111 3333 NULL
1 7/20/2016 NULL NULL NULL NULL 4444 8888
1 7/21/2016 NULL NULL NULL NULL 5555 NULL
2 7/17/2015 RRRR NULL NULL 9999 2222 NULL
2 7/18/2015 LLLL qqqq NULL NULL 5555 NULL
2 7/19/2015 NULL NULL NULL pppp 3737 NULL
Desired Result
id date col1 col2 col3 col4 col5 col6
1 7/21/2016 bbbb xxxx jjjj 1111 5555 8888
2 7/19/2015 LLLL qqqq pppp 3737 3333 NULL[/font]
I can get the last non-null row for a column for each id with the following:
select id, c1
from (select id, max(date) dte from tbl a where col1 is not null group by id) a
join tbl b on a.id = b.id and a.dte = b.date
and I can repeat this in common table expressions and join them all together but am hitting performance and memory issues. I only need to do this once or twice. The real table has about 4 million rows and 30 columns of interest with 360000 ids. I can have hundreds of rows per ID. I've tried using LAG() but can only figure out how to go back a specific number of rows. Doing COALESCE() against a lot of lag values is not practical or probably even supported for several hundred LAG() values.
I've also tried storing the results in temporary tables and then joining them together but still run in to space issues like the following error:
Msg 1105, Level 17, State 2, Line 1
Could not allocate space for object '<temporary system object: 422355172524032>' in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
Next thing I'll try is creating a temporary table and then updating each column to it as it is calculated. Is there are better approach?

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 1

Here is a two-pass approach to get you the values you want. The first pass gets the maximum date within ID for which each column has a (non-null) value. The second pass then retrieves that value. I don't know what to tell you about your disk allocation, but if you are running into memory contention, I would suggest running this against ranges of IDs, rather than attempting the entire table at one time. For example, run IDs 1 - 99,999 followed by 100,000 to 199,999 etc.

Hope this helps.

declare @sampledata Table (ID int, [Date] Date , Column1 char(4), Column2 char(4), Column3 char(4), Column4 char(4), Column5 char(4), Column6 char(4), primary key (ID, [Date]))

insert into @sampledata

SELECT '1', '7/17/2016', 'aaaa', NULL, NULL, '2222 ', '1111 ', NULL union all

SELECT '1', '7/18/2016', 'bbbb', 'xxxx ', NULL, NULL, '2222 ', NULL union all

SELECT '1', '7/19/2016', NULL, NULL, 'jjjj ', '1111 ', '3333 ', NULL union all

SELECT '1', '7/20/2016', NULL, NULL, NULL, NULL, '4444 ', '8888' union all

SELECT '1', '7/21/2016', NULL, NULL, NULL, NULL, '5555 ', NULL union all

SELECT '2', '7/17/2015', 'RRRR', NULL, NULL, '9999 ', '2222 ', NULL union all

SELECT '2', '7/18/2015', 'LLLL', 'qqqq ', NULL, NULL, '5555 ', NULL union all

SELECT '2', '7/19/2015', NULL, NULL, NULL, 'pppp ', '3737 ', NULL

select * from @sampledata

-- first pass, identify dates

select id, max(case when column1 is not null then [Date] end) as Date1

, max(case when Column2 is not null then [Date] end) as Date2

, max(case when Column3 is not null then [Date] end) as Date3

, max(case when Column4 is not null then [Date] end) as Date4

, max(case when Column5 is not null then [Date] end) as Date5

, max(case when Column6 is not null then [Date] end) as Date6

, max([Date]) as MaxDate

into #temp

from @sampledata

group by ID

create unique clustered index PK_#temp on #temp(ID)

-- second pass, retrieve values

select s.ID, max(MaxDate) as [Date], max(case when [Date1] = [Date] then Column1 end) as Column1

,max(case when [Date2] = [Date] then Column2 end) as Column2

,max(case when [Date3] = [Date] then Column3 end) as Column3

,max(case when [Date4] = [Date] then Column4 end) as Column4

,max(case when [Date5] = [Date] then Column5 end) as Column5

,max(case when [Date6] = [Date] then Column6 end) as Column6

from @sampleData s

join #temp on #temp.ID = s.ID

group by s.ID

Edited to add: I think your sample data may be a little confused, you are showing 3737 for an expected result in column4, but that value only appears in column5.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

drew.allen SSC Guru Points: 77014 More actions · Answer 2

That's because LAG is for a relative position. FIRST_VALUE and LAST_VALUE are for absolute positions. You do have to get a little tricky, because you need to ensure that all the NULL values group on one end of your ORDER BY clause. You should also be aware that FIRST_VALUE and LAST_VALUE require a frame. If you don't specify one, it will use the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which will not give you the desired results for LAST_VALUE.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

drew.allen SSC Guru Points: 77014 More actions · Answer 3

Actually, I think that the Quirky Update[/url] method might be the fastest approach to doing this. You have to be very careful to follow ALL of the rules.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

tom.w.brannon Hall of Fame Points: 3976 More actions · Answer 4

THANK YOU! That worked very well. Instead of using #temp I put it in a CTE and it ran in 11 seconds which is acceptable performance for my needs.

santiagoc93 Valued Member Points: 66 More actions · Answer 5

Try this. I get the results that you want, but you should try it in your system to check the performance.

with ct1

as

(

select t.ID,t.Date, case when t.Column1 is null then t.Column1 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column1 end as 'Column1',

case when t.Column2 is null then t.Column2 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+'-'+ t.Column2 end as 'Column2',

case when t.Column3 is null then t.Column3 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column3 end as 'Column3',

case when t.Column4 is null then t.Column4 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column4 end as 'Column4',

case when t.Column5 is null then t.Column5 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column5 end as 'Column5',

case when t.Column6 is null then t.Column6 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column6 end as 'Column6'

from #t as t

)

,

cte2

as

(

select c1.Id, max(c1.date) as Date, Max(column1) as column1, max(column2) as column2, max(column3) as column3,

max(column4) as column4,max(column5) as column5, max(column6) as column6

from ct1 as c1

group by c1.id

)

select c2.Id,

c2.Date,

Right(c2.column1,len(c2.column1)-2) as column1,

Right(c2.column2,len(c2.column2)-2) as column2,

Right(c2.column3,len(c2.column3)-2) as column3,

Right(c2.column4,len(c2.column4)-2) as column4,

Right(c2.column5,len(c2.column5)-2) as column5,

Right(c2.column6,len(c2.column6)-2) as column6

from cte2 as c2

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 6

Quick suggestion, slightly different but easy to scale on large number of columns

😎

USE TEEST;

GO

SET NOCOUNT ON;

--

declare @sampledata Table

(

ID int

, [Date] Date

, Column1 char(4)

, Column2 char(4)

, Column3 char(4)

, Column4 char(4)

, Column5 char(4)

, Column6 char(4)

, primary key (ID, [Date])

);

insert into @sampledata

SELECT '1', '7/17/2016', 'aaaa', NULL, NULL, '2222 ', '1111 ', NULL union all

SELECT '1', '7/18/2016', 'bbbb', 'xxxx ', NULL, NULL, '2222 ', NULL union all

SELECT '1', '7/19/2016', NULL, NULL, 'jjjj ', '1111 ', '3333 ', NULL union all

SELECT '1', '7/20/2016', NULL, NULL, NULL, NULL, '4444 ', '8888' union all

SELECT '1', '7/21/2016', NULL, NULL, NULL, NULL, '5555 ', NULL union all

SELECT '1', '7/22/2016', NULL, NULL, NULL, NULL, NULL, NULL union all

SELECT '2', '7/17/2015', 'RRRR', NULL, NULL, '9999 ', '2222 ', NULL union all

SELECT '2', '7/18/2015', 'LLLL', 'qqqq ', NULL, NULL, '5555 ', NULL union all

SELECT '2', '7/19/2015', NULL, NULL, 'abcd ', NULL, NULL, NULL union all

SELECT '2', '7/20/2015', NULL, NULL, NULL, 'pppp ', '3737 ', NULL;

;WITH BASE_DATA AS

(

SELECT

SD.ID

,SD.Date

,X.COLID

,X.COLVAL

FROM @sampledata SD

CROSS APPLY

(

SELECT 1, SD.Column1 UNION ALL

SELECT 2, SD.Column2 UNION ALL

SELECT 3, SD.Column3 UNION ALL

SELECT 4, SD.Column4 UNION ALL

SELECT 5, SD.Column5 UNION ALL

SELECT 6, SD.Column6

) X(COLID,COLVAL)

)

,LAST_NOT_NULL_ID_DATE AS

(

SELECT

BD.ID

,BD.COLID

,MAX(BD.Date) AS LAST_DATE

,MAX(BD.COLVAL) AS LAST_VALUE

FROM BASE_DATA BD

WHERE BD.COLVAL IS NOT NULL

GROUP BY BD.ID

,BD.COLID

)

SELECT

LD.ID

,MAX(LD.LAST_DATE ) AS LAST_DATE

,MAX(CASE WHEN LD.COLID = 1 THEN LD.LAST_VALUE END) AS COL1

,MAX(CASE WHEN LD.COLID = 2 THEN LD.LAST_VALUE END) AS COL2

,MAX(CASE WHEN LD.COLID = 3 THEN LD.LAST_VALUE END) AS COL3

,MAX(CASE WHEN LD.COLID = 4 THEN LD.LAST_VALUE END) AS COL4

,MAX(CASE WHEN LD.COLID = 5 THEN LD.LAST_VALUE END) AS COL5

,MAX(CASE WHEN LD.COLID = 6 THEN LD.LAST_VALUE END) AS COL6

FROM LAST_NOT_NULL_ID_DATE LD

GROUP BY LD.ID

;

Output

ID LAST_DATE COL1 COL2 COL3 COL4 COL5 COL6

----------- ---------- ---- ---- ---- ---- ---- ----

1 2016-07-21 bbbb xxxx jjjj 2222 5555 8888

2 2015-07-20 RRRR qqqq abcd pppp 5555 NULL