How to find duplicate sets of record?

Question

How to find duplicate sets of record?

SQL Guy 1

SSCoach

Points: 15979
More actions
December 22, 2015 at 12:36 pm

#323220

Here is my scenario (I signifficantly simplified from real problem).
I have a table with following record:
set_id col1 col2 col3 col4
------ ---- ---- ---- ----
X01 a b c d
X01 e f g h
X02 a b c d
X03 a b c d
X03 e f g h
X04 a b c d
X04 e f g h
X04 k l m n
I need to find duplicate SETS of record. In this table set is defined by set_id. Each set_id can have just one record as well as multiple records, but records within each given set are always unique.
In the example above only sets X01 and X03 are duplicate. They both must be present in my result-set. Set X02, even though it has same values as 1st record of X01, is not duplicate. Likewise, X04 that's having 1st two records the same as X01, is not duplicate either, because it also has 3rd record.
In the final result-set it does not matter whether duplicate records listed vertically or side-by-side.
Thanks

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Y.B. SSChampion Points: 11649 More actions · Answer 1

This is my stab at it but it's not quite right. The problem is making sure the set's are ordered correctly when they are together.

i.e. Making sure you don't get "e|f|g|h|a|b|c|d" instead of "a|b|c|de|f|g|h".

DECLARE @test-2 TABLE (set_id VARCHAR(3), col1 CHAR(1), col2 CHAR(1), col3 CHAR(1), col4 CHAR(1))

INSERT INTO @test-2

VALUES

('X01', 'a', 'b', 'c', 'd'),

('X01', 'e', 'f', 'g', 'h'),

('X02', 'a', 'b', 'c', 'd'),

('X03', 'a', 'b', 'c', 'd'),

('X03', 'e', 'f', 'g', 'h'),

('X04', 'a', 'b', 'c', 'd'),

('X04', 'e', 'f', 'g', 'h'),

('X04', 'k', 'l', 'm', 'n')

SELECT

set_id,

(SELECT '' + col1 + '|' + col2 + '|' + col3 + '|' + col4 FROM @test-2 t2 WHERE t2.set_id = t1.set_id ORDER BY t2.set_id FOR XML PATH('')) AS mySet

INTO #combine

FROM @test-2 t1

GROUP BY

set_id

SELECT c.*

FROM #combine c

JOIN (

SELECT mySet, COUNT(mySet) AS SetCount FROM #combine GROUP BY myset HAVING COUNT(mySet) > 1

) x ON x.mySet = c.mySet

DROP TABLE #combine

My brain is fried and I'm about to go home. At the very least someone else might make use of the sample data I created. Please add that to your post next time.

Cheers

SELECT quote FROM brain WHERE original = 1
0 rows returned

SQL Guy 1 SSCoach Points: 15979 More actions · Answer 2

Thanks. This works. I just added another pipe next to col4.

dfine Ten Centuries Points: 1251 More actions · Answer 3

You can alternatively use HASHBYTES to find the duplicates.

;WITH hashSet AS

(

SELECT

set_id,

SUM(CAST(HASHBYTES('SHA1',col1+col2+col3+col4) AS BIGINT)) AS uniqueId

FROM @test-2

GROUP BY set_id

)

SELECT * FROM hashSet a

WHERE EXISTS (

SELECT 1 FROM hashSet b

WHERE a.set_id != b.set_id AND a.uniqueId = b.uniqueId

)

Recommend you to read this blog before you decide to use CHECKSUM or HASHBYTES

http://www.sqlservercentral.com/blogs/steve_jones/2009/06/01/sql-server-encryption-hashing-collisions/

[font="Calibri"]Raj[/font]

Eirikur Eiriksson SSC Guru Points: 182915 More actions · Answer 4

For fun, here is an alternative method

😎

USE tempdb;

GO

SET NOCOUNT ON;

IF OBJECT_ID(N'dbo.TBL_TESTDUPE_05') IS NOT NULL DROP TABLE dbo.TBL_TESTDUPE_05;

CREATE TABLE dbo.TBL_TESTDUPE_05

(

set_id CHAR(3)

,col1 CHAR(1)

,col2 CHAR(1)

,col3 CHAR(1)

,col4 CHAR(1)

);

INSERT INTO dbo.TBL_TESTDUPE_05

(set_id,col1,col2,col3,col4)

VALUES ('X01' ,'a' ,'b' ,'c' ,'d' )

,('X01' ,'e' ,'f' ,'g' ,'h' )

,('X02' ,'a' ,'b' ,'c' ,'d' )

,('X03' ,'a' ,'b' ,'c' ,'d' )

,('X03' ,'e' ,'f' ,'g' ,'h' )

,('X04' ,'a' ,'b' ,'c' ,'d' )

,('X04' ,'e' ,'f' ,'g' ,'h' )

,('X04' ,'k' ,'l' ,'m' ,'n' )

;

;WITH BASE_DATA AS

(

SELECT

TD.set_id

,X.COL_VAL

FROM dbo.TBL_TESTDUPE_05 TD

CROSS APPLY

(

SELECT COL1 UNION ALL

SELECT COL2 UNION ALL

SELECT COL3 UNION ALL

SELECT COL4

) AS X(COL_VAL)

)

,COMBINED_SET_ROWS AS

(

SELECT DISTINCT

BD.set_id

,(SELECT

SBD.COL_VAL

FROM BASE_DATA SBD

WHERE SBD.set_id = BD.set_id

FOR XML PATH(''),TYPE

).value('.[1]','VARCHAR(100)') AS SET_VAL

FROM BASE_DATA BD

)

SELECT

CSR.set_id

,DENSE_RANK() OVER

(

ORDER BY CSR.SET_VAL

) AS SET_RNK

,CSR.SET_VAL

FROM COMBINED_SET_ROWS CSR;

Results

set_id SET_RNK SET_VAL

------ -------------------- ---------------

X02 1 abcd

X01 2 abcdefgh

X03 2 abcdefgh

X04 3 abcdefghklmn