December 4, 2010 at 8:50 am
Ok this will take some explaining.
If I have a table, columns MYKEY, DT, A, B, C, D, E with the following data:
MYKEY DT A B C D E
------------------------------------------------
1 1Jan 1 2 3 4 5
1 2Jan 6 7 8 9 10
1 3Jan null null 3 null 6
1 4Jan 7 null null null 1
2 5Jan 4 3 4 2 3
2 6Jan null null 2 3 1
What I want is to create a new table where MYKEY is unique and I take the most recent non-null value for each column. So in the above case I would end up with:
1 4Jan 7 7 3 9 1
2 6Jan 4 3 2 3 1
I will be doing this for very large data sets so the performance of the query is important.
What I have done to date is add a column called COPY, and use ROWNUMBER() partitioned by MYKEY to assign a COPY number to each row, and then in a loop, set each value to the one from the previous row where not null.
But it's slow slow slow when you work with large data sets, no matter how I try to index it.
Any ideas?
December 4, 2010 at 6:27 pm
Brian McGee-355525 (12/4/2010)
Ok this will take some explaining.If I have a table, columns MYKEY, DT, A, B, C, D, E with the following data:
MYKEY DT A B C D E
------------------------------------------------
1 1Jan 1 2 3 4 5
1 2Jan 6 7 8 9 10
1 3Jan null null 3 null 6
1 4Jan 7 null null null 1
2 5Jan 4 3 4 2 3
2 6Jan null null 2 3 1
What I want is to create a new table where MYKEY is unique and I take the most recent non-null value for each column. So in the above case I would end up with:
1 4Jan 7 7 3 9 1
2 6Jan 4 3 2 3 1
I will be doing this for very large data sets so the performance of the query is important.
What I have done to date is add a column called COPY, and use ROWNUMBER() partitioned by MYKEY to assign a COPY number to each row, and then in a loop, set each value to the one from the previous row where not null.
But it's slow slow slow when you work with large data sets, no matter how I try to index it.
Any ideas?
Yes... I have some ideas... but I need to know, is your real data limited to just 5 columns?
I also need to know what PK is on the table and which columns the clustered index is on (and, yes, they can be different).
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2010 at 6:31 am
I have done similar through what (at first sight) looks horrendous code.
Basically you can use an update statement that updates the required "master rows" and each column is derived from a subquery something like:
Update x
set
col1=(select top 1 t2.col1 from table as t2 where t2.col1 is not null and t2.k = x.k order by t2.datecol desc),
col2=(select top 1 t3.col2 from table as t3 where t3.col2 is not null and t3.k = x.k order by t3.datecol desc)
where .... "whatever determins the masters"
where:
datecol is the column in the table that tells you which is the "most recent" ie something like date_updated
and k is a column that is common across all the rows you want to merge.
These can look very complicated but careful layout and ensuring consistent use of correleation names (aliases) can make it readable.
If you get the indexes right on the underlying tables this can actually be stunningly fast, even though at first look it appears it will not be.
If you post the exact table structures you will probably get nmore help.
Something like this should be orders of magnitude faster than cursors etc...
Mike John
December 6, 2010 at 6:42 am
Not sure if you've tried this already.
DECLARE @MyTable TABLE(MYKEY INT,DT VARCHAR(10),A INT,B INT,C INT,D INT,E INT);
INSERT INTO @MyTable(MYKEY,DT,A,B,C,D,E)
SELECT 1 ,'1Jan', 1, 2, 3, 4, 5 UNION ALL
SELECT 1 ,'2Jan', 6, 7, 8, 9, 10 UNION ALL
SELECT 1 ,'3Jan', null, null, 3, null, 6 UNION ALL
SELECT 1 ,'4Jan', 7, null, null, null, 1 UNION ALL
SELECT 2 ,'5Jan', 4, 3, 4, 2, 3 UNION ALL
SELECT 2 ,'6Jan', null, null, 2, 3, 1;
WITH CTE AS (
SELECT MYKEY,DT,
A,
ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN A IS NULL THEN 1 ELSE 0 END,DT DESC) AS A_rn,
B,
ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN B IS NULL THEN 1 ELSE 0 END,DT DESC) AS B_rn,
C,
ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN C IS NULL THEN 1 ELSE 0 END,DT DESC) AS C_rn,
D,
ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN D IS NULL THEN 1 ELSE 0 END,DT DESC) AS D_rn,
E,
ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN E IS NULL THEN 1 ELSE 0 END,DT DESC) AS E_rn
FROM @MyTable)
SELECT MYKEY,
MAX(DT) AS DT,
MAX(CASE WHEN A_rn=1 THEN A END) AS A,
MAX(CASE WHEN B_rn=1 THEN B END) AS B,
MAX(CASE WHEN C_rn=1 THEN C END) AS C,
MAX(CASE WHEN D_rn=1 THEN D END) AS D,
MAX(CASE WHEN E_rn=1 THEN E END) AS E
FROM CTE
GROUP BY MYKEY
ORDER BY MYKEY;
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537December 6, 2010 at 8:24 pm
Mike John (12/6/2010)
where .... "whatever determins the masters"
Can you quantify that in code using the example because that is the crux of this problem.
Also... you're using an UPDATE which modifies the original data which may not be allowed.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 8, 2010 at 5:18 am
Thanks to everyone for the suggestions, I will certainly give them a go. As things go, I've been dragged off on to several other items since starting this one, but again huge thanks for the suggestions - they will all be tried 🙂
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply