Aggregating to a best record.

  • Hi all,

    There's a script below which may speak louder than my explanation of the problem. Hopefully they're complementary.

    I have a table with around 100,000 records and 150 fields.

    A process is run which flags records as being duplicates. By duplicates I mean that it has been established that they actually refer to same 'real world' entity. It doesn't mean that the duplicate rows all contain the same data. Some rows may be more complete than others, and there may also be conflicting data between rows.

    The goal is to eliminate the duplicates, leaving one 'master' row each time, with the 'best' data from each of the duplicates.

    Sets of duplicates have been flagged with the same KEYID. i.e. all rows with KEYID=1 are considered as duplicates. The other fields are ID - Primary Key, VALUE_STR (varchar field), and VALUE_INT (int field). The goal is to finish with ONE record with the lowest ID, and then taking the first non null value for each of the other fields (going in ascending order of ID) - a sort of vertical coalesce. (In fact the full requirement is to take the first value with is not any of [null, empty string, -1] but null would be a good start.)

    A colleague came up with a clever proposition which works (see script below), but both he and I think that it's not entirely pleasing 😉 hence my post.

    Any ideas how I can improve on this? Or on a completely different approach?

    Thanks,

    David McKinney.

    /*Set up table with sample data*/

    IF OBJECT_ID('tempdb..#AGGREGATE_FIRST') is not null

    BEGIN

    DROP TABLE #AGGREGATE_FIRST

    END

    CREATE TABLE #AGGREGATE_FIRST

    (

    ID INT IDENTITY,

    KEYID INT,

    VALUE_STR VARCHAR(100),

    VALUE_INT INTEGER,

    PRIMARY KEY ( ID )

    )

    GO

    INSERT INTO

    [#AGGREGATE_FIRST]

    (

    [KEYID],

    [VALUE_STR],

    [VALUE_INT]

    )

    VALUES

    ( 1, NULL, NULL )

    INSERT INTO

    [#AGGREGATE_FIRST]

    (

    [KEYID],

    [VALUE_STR],

    [VALUE_INT]

    )

    VALUES

    ( 1, 'AA', -1 )

    INSERT INTO

    [#AGGREGATE_FIRST]

    (

    [KEYID],

    [VALUE_STR],

    [VALUE_INT]

    )

    VALUES

    ( 1, 'BB', NULL )

    INSERT INTO

    [#AGGREGATE_FIRST]

    (

    [KEYID],

    [VALUE_STR],

    [VALUE_INT]

    )

    VALUES

    ( 1, NULL, 1 )

    INSERT INTO

    [#AGGREGATE_FIRST]

    (

    [KEYID],

    [VALUE_STR],

    [VALUE_INT]

    )

    VALUES

    ( 1, NULL, 2 )

    SELECT

    *

    FROM

    [#AGGREGATE_FIRST] AS af

    SELECT

    af.[KEYID],

    substring(MIN(RIGHT('000' + CONVERT(VARCHAR(3), ID), 3) + VALUE_STR), 4,

    100),

    substring(MIN(RIGHT('000' + CONVERT(VARCHAR(3), ID), 3)

    + cast(nullif(VALUE_INT,-1) as varchar)), 4, 100)

    FROM

    [#AGGREGATE_FIRST] AS af

    GROUP BY

    af.[KEYID]

  • Another way...

    WITH OrderedCTE AS (

    SELECT ID,

    KEYID,

    VALUE_STR,

    ROW_NUMBER()OVER(PARTITION BY KEYID ORDER BY CASE WHEN VALUE_STR IS NULL THEN 1 ELSE 0 END,ID) AS VALUE_STR_RN,

    VALUE_INT,

    ROW_NUMBER()OVER(PARTITION BY KEYID ORDER BY CASE WHEN NULLIF(VALUE_INT,-1) IS NULL THEN 1 ELSE 0 END,ID) AS VALUE_INT_RN

    FROM #AGGREGATE_FIRST)

    SELECT KEYID,

    MAX(CASE WHEN VALUE_STR_RN=1 THEN VALUE_STR END) AS VALUE_STR,

    MAX(CASE WHEN VALUE_INT_RN=1 THEN VALUE_INT END) AS VALUE_INT

    FROM OrderedCTE

    GROUP BY KEYID

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Thanks Mark,

    That's an interesting approach certainly with merit. It got me thinking about an alternative which would "bubble" non nulls to the top, by linking to the previous row[/url]

    Do you think that there's any mileage in such an approach?

    Regards,

    David.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply