T-SQL Puzzler - identify sets that have the same items (where set ID and items in same table)

Question

Post reply

T-SQL Puzzler - identify sets that have the same items (where set ID and items in same table)

Laurence Neville

SSC Eights!

Points: 889
More actions
February 25, 2015 at 11:22 am

#298044

I am struggling to come up with a set-based solution for this problem (i.e. that doesn't involve loops/cursors) .... any help gratefully appreciated!
A table contains items (identified by an ItemCode) and the set they belong to (identified by a SetId). Here is some sample data:
SetIdItemCode
1A
1B
24
28
26
310
312
410
418
58
612
716
816
916
10A
10B
114
116
118
You can see that there are some sets that have the same members:
- 1 and 10
- 2 and 11
- 7, 8 & 9
What I want to do is identify the sets that have the same members, by giving them the same ID in another column called UniqueSetId.
Any ideas?

Viewing 9 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Luis Cazares SSC Guru Points: 183706 More actions · Answer 1

This is a possible option.

CREATE TABLE SetsItems(

SetId int,

ItemCode varchar(10))

INSERT INTO SetsItems

SELECT 1,'A' UNION ALL

SELECT 1,'B' UNION ALL

SELECT 2,'4' UNION ALL

SELECT 2,'8' UNION ALL

SELECT 2,'6' UNION ALL

SELECT 3,'10' UNION ALL

SELECT 3,'12' UNION ALL

SELECT 4,'10' UNION ALL

SELECT 4,'18' UNION ALL

SELECT 5,'8' UNION ALL

SELECT 6,'12' UNION ALL

SELECT 7,'16' UNION ALL

SELECT 8,'16' UNION ALL

SELECT 9,'16' UNION ALL

SELECT 10,'A' UNION ALL

SELECT 10,'B' UNION ALL

SELECT 11,'4' UNION ALL

SELECT 11,'6' UNION ALL

SELECT 11,'8';

WITH cteSets AS(

SELECT DISTINCT SetID,

CAST( (SELECT ',' + ItemCode

FROM SetsItems i

WHERE s.SetId = i.SetId

ORDER BY ItemCode

FOR XML PATH('')) AS varchar(max)) Items

FROM SetsItems s

)

SELECT *,

RANK() OVER( ORDER BY Items) Option1,

DENSE_RANK() OVER( ORDER BY Items) Option2

FROM cteSets

ORDER BY Option1

GO

DROP TABLE SetsItems

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Mark Cowne One Orange Chip Points: 26990 More actions · Answer 2

Here's another (using Luis's setup)

WITH Results(SetId1,SetId2) AS (

SELECT a.SetId,b.SetId

FROM SetsItems a

INNER JOIN SetsItems b ON b.SetId > a.SetId

AND b.ItemCode = a.ItemCode

GROUP BY a.SetId,b.SetId

HAVING COUNT(*) = (SELECT COUNT(*) FROM SetsItems c WHERE c.SetId = a.SetId)

AND COUNT(*) = (SELECT COUNT(*) FROM SetsItems d WHERE d.SetId = b.SetId))

SELECT DISTINCT ca.SetId,

DENSE_RANK() OVER(ORDER BY a.SetId1) AS UniqueSetId

FROM Results a

CROSS APPLY(VALUES(a.SetId1),(a.SetId2)) ca(SetId)

WHERE NOT EXISTS(SELECT * FROM Results b WHERE b.SetId2 = a.SetId1)

ORDER BY UniqueSetId,SetId;

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

Laurence Neville SSC Eights! Points: 889 More actions · Answer 3

Both solutions worked, thanks very much. For some reason Luis's is much faster on my large data set (millions of rows).

siggemannen SSCrazy Points: 2436 More actions · Answer 4

I usually do something like this:

SELECTSetId, CHECKSUM_AGG(CHECKSUM(ItemCode)) ch

FROMSetsItems

GROUP BY SetId

ORDER BY ch

If you're **really** unlucky you might get a false positive because CHECKSUM isn't perfect, but i doubt it. You can do a HASHBYTES on the ItemCode to pretty much avoid any collisions.

SwePeso SSC-Dedicated Points: 39758 More actions · Answer 5

This is a simple case of Relational Division, of which I am particular fond of.

The result isSetIDSetID

110

211

78

79

89By using this piece of codeSELECTkc.SetID,

nc.SetID

FROM(

SELECTSetID,

COUNT(*) AS cnt

FROMdbo.SetsItems

GROUP BYSetID

) AS kc

INNER JOIN(

SELECTSetID,

COUNT(*) AS cnt

FROMdbo.SetsItems

GROUP BYSetID

) AS nc ON nc.cnt = kc.cnt

AND nc.SetID > kc.SetID

INNER JOINdbo.SetsItems AS t ON t.SetID = kc.SetID

INNER JOINdbo.SetsItems AS n ON n.SetID = nc.SetID

WHEREt.ItemCode = n.ItemCode

GROUP BYkc.SetID,

nc.SetID

HAVINGCOUNT(*) = MIN(nc.cnt);

N 56°04'39.16"
E 12°55'05.25"

SwePeso SSC-Dedicated Points: 39758 More actions · Answer 6

-- SwePeso - Preaggregation/filtering step

SELECTSetID,

COUNT(*) AS cnt,

MIN(ItemCode) AS mn,

MAX(ItemCode) AS mx,

CHECKSUM_AGG(CHECKSUM(ItemCode)) AS chk

INTO#Temp

FROMdbo.SetsItems WITH (NOLOCK)

GROUP BYSetID;

-- SwePeso - projection step

SELECTROW_NUMBER() OVER (ORDER BY w.chk) AS UniqueSetID,

STUFF(f.Data, 1, 2, N'') AS SetIDs

FROM(

SELECTcnt,

mn,

mx,

chk

FROM#Temp

GROUP BYcnt,

mn,

mx,

chk

HAVINGCOUNT(*) >= 2

) AS w

CROSS APPLY(

SELECT', ' + CAST(x.SetID AS NVARCHAR(12))

FROM#Temp AS x

WHEREx.cnt = w.cnt

AND x.mn = w.mn

AND x.mx = w.mx

AND x.chk = w.chk

FOR XMLPATH('')

) AS f(Data);

N 56°04'39.16"
E 12°55'05.25"

Laurence Neville SSC Eights! Points: 889 More actions · Answer 7

Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server

Dwain Camps SSC Guru Points: 86978 More actions · Answer 8

Laurence Neville (3/1/2015)
Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server

:blush:

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St