Aggregating to a best record.

Question

Post reply

Aggregating to a best record.

David McKinney

SSChampion

Points: 10472
More actions
June 26, 2009 at 1:20 am

#135747

Hi all,
There's a script below which may speak louder than my explanation of the problem. Hopefully they're complementary.
I have a table with around 100,000 records and 150 fields.
A process is run which flags records as being duplicates. By duplicates I mean that it has been established that they actually refer to same 'real world' entity. It doesn't mean that the duplicate rows all contain the same data. Some rows may be more complete than others, and there may also be conflicting data between rows.
The goal is to eliminate the duplicates, leaving one 'master' row each time, with the 'best' data from each of the duplicates.
Sets of duplicates have been flagged with the same KEYID. i.e. all rows with KEYID=1 are considered as duplicates. The other fields are ID - Primary Key, VALUE_STR (varchar field), and VALUE_INT (int field). The goal is to finish with ONE record with the lowest ID, and then taking the first non null value for each of the other fields (going in ascending order of ID) - a sort of vertical coalesce. (In fact the full requirement is to take the first value with is not any of [null, empty string, -1] but null would be a good start.)
A colleague came up with a clever proposition which works (see script below), but both he and I think that it's not entirely pleasing 😉 hence my post.
Any ideas how I can improve on this? Or on a completely different approach?
Thanks,
David McKinney.
/*Set up table with sample data*/
IF OBJECT_ID('tempdb..#AGGREGATE_FIRST') is not null
BEGIN
DROP TABLE #AGGREGATE_FIRST
END
CREATE TABLE #AGGREGATE_FIRST
(
ID INT IDENTITY,
KEYID INT,
VALUE_STR VARCHAR(100),
VALUE_INT INTEGER,
PRIMARY KEY ( ID )
)
GO
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, NULL, NULL )
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, 'AA', -1 )
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, 'BB', NULL )
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, NULL, 1 )
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, NULL, 2 )
SELECT
*
FROM
[#AGGREGATE_FIRST] AS af
SELECT
af.[KEYID],
substring(MIN(RIGHT('000' + CONVERT(VARCHAR(3), ID), 3) + VALUE_STR), 4,
100),
substring(MIN(RIGHT('000' + CONVERT(VARCHAR(3), ID), 3)
+ cast(nullif(VALUE_INT,-1) as varchar)), 4, 100)
FROM
[#AGGREGATE_FIRST] AS af
GROUP BY
af.[KEYID]

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply

Mark Cowne One Orange Chip Points: 26952 More actions · Answer 1

Another way...

WITH OrderedCTE AS (

SELECT ID,

KEYID,

VALUE_STR,

ROW_NUMBER()OVER(PARTITION BY KEYID ORDER BY CASE WHEN VALUE_STR IS NULL THEN 1 ELSE 0 END,ID) AS VALUE_STR_RN,

VALUE_INT,

ROW_NUMBER()OVER(PARTITION BY KEYID ORDER BY CASE WHEN NULLIF(VALUE_INT,-1) IS NULL THEN 1 ELSE 0 END,ID) AS VALUE_INT_RN

FROM #AGGREGATE_FIRST)

SELECT KEYID,

MAX(CASE WHEN VALUE_STR_RN=1 THEN VALUE_STR END) AS VALUE_STR,

MAX(CASE WHEN VALUE_INT_RN=1 THEN VALUE_INT END) AS VALUE_INT

FROM OrderedCTE

GROUP BY KEYID

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

David McKinney SSChampion Points: 10472 More actions · Answer 2

Thanks Mark,

That's an interesting approach certainly with merit. It got me thinking about an alternative which would "bubble" non nulls to the top, by linking to the previous row[/url]

Do you think that there's any mileage in such an approach?

Regards,

David.