December 22, 2015 at 12:36 pm
Here is my scenario (I signifficantly simplified from real problem).
I have a table with following record:
set_id col1 col2 col3 col4
------ ---- ---- ---- ----
X01 a b c d
X01 e f g h
X02 a b c d
X03 a b c d
X03 e f g h
X04 a b c d
X04 e f g h
X04 k l m n
I need to find duplicate SETS of record. In this table set is defined by set_id. Each set_id can have just one record as well as multiple records, but records within each given set are always unique.
In the example above only sets X01 and X03 are duplicate. They both must be present in my result-set. Set X02, even though it has same values as 1st record of X01, is not duplicate. Likewise, X04 that's having 1st two records the same as X01, is not duplicate either, because it also has 3rd record.
In the final result-set it does not matter whether duplicate records listed vertically or side-by-side.
Thanks
December 22, 2015 at 1:32 pm
This is my stab at it but it's not quite right. The problem is making sure the set's are ordered correctly when they are together.
i.e. Making sure you don't get "e|f|g|h|a|b|c|d" instead of "a|b|c|de|f|g|h".
DECLARE @test-2 TABLE (set_id VARCHAR(3), col1 CHAR(1), col2 CHAR(1), col3 CHAR(1), col4 CHAR(1))
INSERT INTO @test-2
VALUES
('X01', 'a', 'b', 'c', 'd'),
('X01', 'e', 'f', 'g', 'h'),
('X02', 'a', 'b', 'c', 'd'),
('X03', 'a', 'b', 'c', 'd'),
('X03', 'e', 'f', 'g', 'h'),
('X04', 'a', 'b', 'c', 'd'),
('X04', 'e', 'f', 'g', 'h'),
('X04', 'k', 'l', 'm', 'n')
SELECT
set_id,
(SELECT '' + col1 + '|' + col2 + '|' + col3 + '|' + col4 FROM @test-2 t2 WHERE t2.set_id = t1.set_id ORDER BY t2.set_id FOR XML PATH('')) AS mySet
INTO #combine
FROM @test-2 t1
GROUP BY
set_id
SELECT c.*
FROM #combine c
JOIN (
SELECT mySet, COUNT(mySet) AS SetCount FROM #combine GROUP BY myset HAVING COUNT(mySet) > 1
) x ON x.mySet = c.mySet
DROP TABLE #combine
My brain is fried and I'm about to go home. At the very least someone else might make use of the sample data I created. Please add that to your post next time.
Cheers
December 22, 2015 at 1:55 pm
Thanks. This works. I just added another pipe next to col4.
December 24, 2015 at 12:34 am
You can alternatively use HASHBYTES to find the duplicates.
;WITH hashSet AS
(
SELECT
set_id,
SUM(CAST(HASHBYTES('SHA1',col1+col2+col3+col4) AS BIGINT)) AS uniqueId
FROM @test-2
GROUP BY set_id
)
SELECT * FROM hashSet a
WHERE EXISTS (
SELECT 1 FROM hashSet b
WHERE a.set_id != b.set_id AND a.uniqueId = b.uniqueId
)
Recommend you to read this blog before you decide to use CHECKSUM or HASHBYTES
December 24, 2015 at 7:28 am
For fun, here is an alternative method
😎
USE tempdb;
GO
SET NOCOUNT ON;
IF OBJECT_ID(N'dbo.TBL_TESTDUPE_05') IS NOT NULL DROP TABLE dbo.TBL_TESTDUPE_05;
CREATE TABLE dbo.TBL_TESTDUPE_05
(
set_id CHAR(3)
,col1 CHAR(1)
,col2 CHAR(1)
,col3 CHAR(1)
,col4 CHAR(1)
);
INSERT INTO dbo.TBL_TESTDUPE_05
(set_id,col1,col2,col3,col4)
VALUES ('X01' ,'a' ,'b' ,'c' ,'d' )
,('X01' ,'e' ,'f' ,'g' ,'h' )
,('X02' ,'a' ,'b' ,'c' ,'d' )
,('X03' ,'a' ,'b' ,'c' ,'d' )
,('X03' ,'e' ,'f' ,'g' ,'h' )
,('X04' ,'a' ,'b' ,'c' ,'d' )
,('X04' ,'e' ,'f' ,'g' ,'h' )
,('X04' ,'k' ,'l' ,'m' ,'n' )
;
;WITH BASE_DATA AS
(
SELECT
TD.set_id
,X.COL_VAL
FROM dbo.TBL_TESTDUPE_05 TD
CROSS APPLY
(
SELECT COL1 UNION ALL
SELECT COL2 UNION ALL
SELECT COL3 UNION ALL
SELECT COL4
) AS X(COL_VAL)
)
,COMBINED_SET_ROWS AS
(
SELECT DISTINCT
BD.set_id
,(SELECT
SBD.COL_VAL
FROM BASE_DATA SBD
WHERE SBD.set_id = BD.set_id
FOR XML PATH(''),TYPE
).value('.[1]','VARCHAR(100)') AS SET_VAL
FROM BASE_DATA BD
)
SELECT
CSR.set_id
,DENSE_RANK() OVER
(
ORDER BY CSR.SET_VAL
) AS SET_RNK
,CSR.SET_VAL
FROM COMBINED_SET_ROWS CSR;
Results
set_id SET_RNK SET_VAL
------ -------------------- ---------------
X02 1 abcd
X01 2 abcdefgh
X03 2 abcdefgh
X04 3 abcdefghklmn
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply