June 26, 2009 at 1:20 am
Hi all,
There's a script below which may speak louder than my explanation of the problem. Hopefully they're complementary.
I have a table with around 100,000 records and 150 fields.
A process is run which flags records as being duplicates. By duplicates I mean that it has been established that they actually refer to same 'real world' entity. It doesn't mean that the duplicate rows all contain the same data. Some rows may be more complete than others, and there may also be conflicting data between rows.
The goal is to eliminate the duplicates, leaving one 'master' row each time, with the 'best' data from each of the duplicates.
Sets of duplicates have been flagged with the same KEYID. i.e. all rows with KEYID=1 are considered as duplicates. The other fields are ID - Primary Key, VALUE_STR (varchar field), and VALUE_INT (int field). The goal is to finish with ONE record with the lowest ID, and then taking the first non null value for each of the other fields (going in ascending order of ID) - a sort of vertical coalesce. (In fact the full requirement is to take the first value with is not any of [null, empty string, -1] but null would be a good start.)
A colleague came up with a clever proposition which works (see script below), but both he and I think that it's not entirely pleasing 😉 hence my post.
Any ideas how I can improve on this? Or on a completely different approach?
Thanks,
David McKinney.
/*Set up table with sample data*/
IF OBJECT_ID('tempdb..#AGGREGATE_FIRST') is not null
BEGIN
DROP TABLE #AGGREGATE_FIRST
END
CREATE TABLE #AGGREGATE_FIRST
(
ID INT IDENTITY,
KEYID INT,
VALUE_STR VARCHAR(100),
VALUE_INT INTEGER,
PRIMARY KEY ( ID )
)
GO
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, NULL, NULL )
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, 'AA', -1 )
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, 'BB', NULL )
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, NULL, 1 )
INSERT INTO
[#AGGREGATE_FIRST]
(
[KEYID],
[VALUE_STR],
[VALUE_INT]
)
VALUES
( 1, NULL, 2 )
SELECT
*
FROM
[#AGGREGATE_FIRST] AS af
SELECT
af.[KEYID],
substring(MIN(RIGHT('000' + CONVERT(VARCHAR(3), ID), 3) + VALUE_STR), 4,
100),
substring(MIN(RIGHT('000' + CONVERT(VARCHAR(3), ID), 3)
+ cast(nullif(VALUE_INT,-1) as varchar)), 4, 100)
FROM
[#AGGREGATE_FIRST] AS af
GROUP BY
af.[KEYID]
June 26, 2009 at 2:24 am
Another way...
WITH OrderedCTE AS (
SELECT ID,
KEYID,
VALUE_STR,
ROW_NUMBER()OVER(PARTITION BY KEYID ORDER BY CASE WHEN VALUE_STR IS NULL THEN 1 ELSE 0 END,ID) AS VALUE_STR_RN,
VALUE_INT,
ROW_NUMBER()OVER(PARTITION BY KEYID ORDER BY CASE WHEN NULLIF(VALUE_INT,-1) IS NULL THEN 1 ELSE 0 END,ID) AS VALUE_INT_RN
FROM #AGGREGATE_FIRST)
SELECT KEYID,
MAX(CASE WHEN VALUE_STR_RN=1 THEN VALUE_STR END) AS VALUE_STR,
MAX(CASE WHEN VALUE_INT_RN=1 THEN VALUE_INT END) AS VALUE_INT
FROM OrderedCTE
GROUP BY KEYID
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537June 26, 2009 at 2:45 am
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply