February 25, 2015 at 11:22 am
I am struggling to come up with a set-based solution for this problem (i.e. that doesn't involve loops/cursors) .... any help gratefully appreciated!
A table contains items (identified by an ItemCode) and the set they belong to (identified by a SetId). Here is some sample data:
SetIdItemCode
1A
1B
24
28
26
310
312
410
418
58
612
716
816
916
10A
10B
114
116
118
You can see that there are some sets that have the same members:
- 1 and 10
- 2 and 11
- 7, 8 & 9
What I want to do is identify the sets that have the same members, by giving them the same ID in another column called UniqueSetId.
Any ideas?
February 25, 2015 at 12:14 pm
This is a possible option.
CREATE TABLE SetsItems(
SetId int,
ItemCode varchar(10))
INSERT INTO SetsItems
SELECT 1,'A' UNION ALL
SELECT 1,'B' UNION ALL
SELECT 2,'4' UNION ALL
SELECT 2,'8' UNION ALL
SELECT 2,'6' UNION ALL
SELECT 3,'10' UNION ALL
SELECT 3,'12' UNION ALL
SELECT 4,'10' UNION ALL
SELECT 4,'18' UNION ALL
SELECT 5,'8' UNION ALL
SELECT 6,'12' UNION ALL
SELECT 7,'16' UNION ALL
SELECT 8,'16' UNION ALL
SELECT 9,'16' UNION ALL
SELECT 10,'A' UNION ALL
SELECT 10,'B' UNION ALL
SELECT 11,'4' UNION ALL
SELECT 11,'6' UNION ALL
SELECT 11,'8';
WITH cteSets AS(
SELECT DISTINCT SetID,
CAST( (SELECT ',' + ItemCode
FROM SetsItems i
WHERE s.SetId = i.SetId
ORDER BY ItemCode
FOR XML PATH('')) AS varchar(max)) Items
FROM SetsItems s
)
SELECT *,
RANK() OVER( ORDER BY Items) Option1,
DENSE_RANK() OVER( ORDER BY Items) Option2
FROM cteSets
ORDER BY Option1
GO
DROP TABLE SetsItems
February 25, 2015 at 3:12 pm
Here's another (using Luis's setup)
WITH Results(SetId1,SetId2) AS (
SELECT a.SetId,b.SetId
FROM SetsItems a
INNER JOIN SetsItems b ON b.SetId > a.SetId
AND b.ItemCode = a.ItemCode
GROUP BY a.SetId,b.SetId
HAVING COUNT(*) = (SELECT COUNT(*) FROM SetsItems c WHERE c.SetId = a.SetId)
AND COUNT(*) = (SELECT COUNT(*) FROM SetsItems d WHERE d.SetId = b.SetId))
SELECT DISTINCT ca.SetId,
DENSE_RANK() OVER(ORDER BY a.SetId1) AS UniqueSetId
FROM Results a
CROSS APPLY(VALUES(a.SetId1),(a.SetId2)) ca(SetId)
WHERE NOT EXISTS(SELECT * FROM Results b WHERE b.SetId2 = a.SetId1)
ORDER BY UniqueSetId,SetId;
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537February 26, 2015 at 12:20 pm
Both solutions worked, thanks very much. For some reason Luis's is much faster on my large data set (millions of rows).
February 28, 2015 at 4:15 am
I usually do something like this:
SELECTSetId, CHECKSUM_AGG(CHECKSUM(ItemCode)) ch
FROMSetsItems
GROUP BY SetId
ORDER BY ch
If you're **really** unlucky you might get a false positive because CHECKSUM isn't perfect, but i doubt it. You can do a HASHBYTES on the ItemCode to pretty much avoid any collisions.
March 1, 2015 at 2:18 am
This is a simple case of Relational Division, of which I am particular fond of.
The result isSetIDSetID
110
211
78
79
89By using this piece of codeSELECTkc.SetID,
nc.SetID
FROM(
SELECTSetID,
COUNT(*) AS cnt
FROMdbo.SetsItems
GROUP BYSetID
) AS kc
INNER JOIN(
SELECTSetID,
COUNT(*) AS cnt
FROMdbo.SetsItems
GROUP BYSetID
) AS nc ON nc.cnt = kc.cnt
AND nc.SetID > kc.SetID
INNER JOINdbo.SetsItems AS t ON t.SetID = kc.SetID
INNER JOINdbo.SetsItems AS n ON n.SetID = nc.SetID
WHEREt.ItemCode = n.ItemCode
GROUP BYkc.SetID,
nc.SetID
HAVINGCOUNT(*) = MIN(nc.cnt);
N 56°04'39.16"
E 12°55'05.25"
March 1, 2015 at 2:27 am
-- SwePeso - Preaggregation/filtering step
SELECTSetID,
COUNT(*) AS cnt,
MIN(ItemCode) AS mn,
MAX(ItemCode) AS mx,
CHECKSUM_AGG(CHECKSUM(ItemCode)) AS chk
INTO#Temp
FROMdbo.SetsItems WITH (NOLOCK)
GROUP BYSetID;
-- SwePeso - projection step
SELECTROW_NUMBER() OVER (ORDER BY w.chk) AS UniqueSetID,
STUFF(f.Data, 1, 2, N'') AS SetIDs
FROM(
SELECTcnt,
mn,
mx,
chk
FROM#Temp
GROUP BYcnt,
mn,
mx,
chk
HAVINGCOUNT(*) >= 2
) AS w
CROSS APPLY(
SELECT', ' + CAST(x.SetID AS NVARCHAR(12))
FROM#Temp AS x
WHEREx.cnt = w.cnt
AND x.mn = w.mn
AND x.mx = w.mx
AND x.chk = w.chk
FOR XMLPATH('')
) AS f(Data);
N 56°04'39.16"
E 12°55'05.25"
March 1, 2015 at 7:49 am
Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server
March 2, 2015 at 5:03 pm
Laurence Neville (3/1/2015)
Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server
:blush:
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply