How to find duplicate sets of record?

  • Here is my scenario (I signifficantly simplified from real problem).

    I have a table with following record:

    set_id col1 col2 col3 col4

    ------ ---- ---- ---- ----

    X01 a b c d

    X01 e f g h

    X02 a b c d

    X03 a b c d

    X03 e f g h

    X04 a b c d

    X04 e f g h

    X04 k l m n

    I need to find duplicate SETS of record. In this table set is defined by set_id. Each set_id can have just one record as well as multiple records, but records within each given set are always unique.

    In the example above only sets X01 and X03 are duplicate. They both must be present in my result-set. Set X02, even though it has same values as 1st record of X01, is not duplicate. Likewise, X04 that's having 1st two records the same as X01, is not duplicate either, because it also has 3rd record.

    In the final result-set it does not matter whether duplicate records listed vertically or side-by-side.

    Thanks

  • This is my stab at it but it's not quite right. The problem is making sure the set's are ordered correctly when they are together.

    i.e. Making sure you don't get "e|f|g|h|a|b|c|d" instead of "a|b|c|de|f|g|h".

    DECLARE @test-2 TABLE (set_id VARCHAR(3), col1 CHAR(1), col2 CHAR(1), col3 CHAR(1), col4 CHAR(1))

    INSERT INTO @test-2

    VALUES

    ('X01', 'a', 'b', 'c', 'd'),

    ('X01', 'e', 'f', 'g', 'h'),

    ('X02', 'a', 'b', 'c', 'd'),

    ('X03', 'a', 'b', 'c', 'd'),

    ('X03', 'e', 'f', 'g', 'h'),

    ('X04', 'a', 'b', 'c', 'd'),

    ('X04', 'e', 'f', 'g', 'h'),

    ('X04', 'k', 'l', 'm', 'n')

    SELECT

    set_id,

    (SELECT '' + col1 + '|' + col2 + '|' + col3 + '|' + col4 FROM @test-2 t2 WHERE t2.set_id = t1.set_id ORDER BY t2.set_id FOR XML PATH('')) AS mySet

    INTO #combine

    FROM @test-2 t1

    GROUP BY

    set_id

    SELECT c.*

    FROM #combine c

    JOIN (

    SELECT mySet, COUNT(mySet) AS SetCount FROM #combine GROUP BY myset HAVING COUNT(mySet) > 1

    ) x ON x.mySet = c.mySet

    DROP TABLE #combine

    My brain is fried and I'm about to go home. At the very least someone else might make use of the sample data I created. Please add that to your post next time.

    Cheers


    SELECT quote FROM brain WHERE original = 1
    0 rows returned

  • Thanks. This works. I just added another pipe next to col4.

  • You can alternatively use HASHBYTES to find the duplicates.

    ;WITH hashSet AS

    (

    SELECT

    set_id,

    SUM(CAST(HASHBYTES('SHA1',col1+col2+col3+col4) AS BIGINT)) AS uniqueId

    FROM @test-2

    GROUP BY set_id

    )

    SELECT * FROM hashSet a

    WHERE EXISTS (

    SELECT 1 FROM hashSet b

    WHERE a.set_id != b.set_id AND a.uniqueId = b.uniqueId

    )

    Recommend you to read this blog before you decide to use CHECKSUM or HASHBYTES

    http://www.sqlservercentral.com/blogs/steve_jones/2009/06/01/sql-server-encryption-hashing-collisions/

    [font="Calibri"]Raj[/font]
  • For fun, here is an alternative method

    😎

    USE tempdb;

    GO

    SET NOCOUNT ON;

    IF OBJECT_ID(N'dbo.TBL_TESTDUPE_05') IS NOT NULL DROP TABLE dbo.TBL_TESTDUPE_05;

    CREATE TABLE dbo.TBL_TESTDUPE_05

    (

    set_id CHAR(3)

    ,col1 CHAR(1)

    ,col2 CHAR(1)

    ,col3 CHAR(1)

    ,col4 CHAR(1)

    );

    INSERT INTO dbo.TBL_TESTDUPE_05

    (set_id,col1,col2,col3,col4)

    VALUES ('X01' ,'a' ,'b' ,'c' ,'d' )

    ,('X01' ,'e' ,'f' ,'g' ,'h' )

    ,('X02' ,'a' ,'b' ,'c' ,'d' )

    ,('X03' ,'a' ,'b' ,'c' ,'d' )

    ,('X03' ,'e' ,'f' ,'g' ,'h' )

    ,('X04' ,'a' ,'b' ,'c' ,'d' )

    ,('X04' ,'e' ,'f' ,'g' ,'h' )

    ,('X04' ,'k' ,'l' ,'m' ,'n' )

    ;

    ;WITH BASE_DATA AS

    (

    SELECT

    TD.set_id

    ,X.COL_VAL

    FROM dbo.TBL_TESTDUPE_05 TD

    CROSS APPLY

    (

    SELECT COL1 UNION ALL

    SELECT COL2 UNION ALL

    SELECT COL3 UNION ALL

    SELECT COL4

    ) AS X(COL_VAL)

    )

    ,COMBINED_SET_ROWS AS

    (

    SELECT DISTINCT

    BD.set_id

    ,(SELECT

    SBD.COL_VAL

    FROM BASE_DATA SBD

    WHERE SBD.set_id = BD.set_id

    FOR XML PATH(''),TYPE

    ).value('.[1]','VARCHAR(100)') AS SET_VAL

    FROM BASE_DATA BD

    )

    SELECT

    CSR.set_id

    ,DENSE_RANK() OVER

    (

    ORDER BY CSR.SET_VAL

    ) AS SET_RNK

    ,CSR.SET_VAL

    FROM COMBINED_SET_ROWS CSR;

    Results

    set_id SET_RNK SET_VAL

    ------ -------------------- ---------------

    X02 1 abcd

    X01 2 abcdefgh

    X03 2 abcdefgh

    X04 3 abcdefghklmn

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply