May 12, 2011 at 9:33 am
I am trying to figure out how to compare lists in a table.
I want to find if any two Lists have the exact same members.
Here is some sample code. The real data is proprietary.
DECLARE @Color TABLE
(
IDINT IDENTITY(1,1)
,NameVARCHAR(50)
)
INSERT INTO @Color
SELECT 'White'
UNION ALL SELECT 'Black'
UNION ALL SELECT 'Purple'
UNION ALL SELECT 'Pink'
UNION ALL SELECT 'Red'
UNION ALL SELECT 'Blue'
UNION ALL SELECT 'Green'
UNION ALL SELECT 'Yellow'
UNION ALL SELECT 'Orange'
UNION ALL SELECT 'Brown'
DECLARE @List TABLE
(
IDINT IDENTITY(1,1)
,CodeCHAR(3)
)
INSERT INTO @List
SELECT 'ABC'
UNION ALL SELECT 'XYZ'
UNION ALL SELECT 'JKL'
DECLARE @ColorList TABLE
(
ColorIDINT
,ListIDINT
)
INSERT INTO @ColorList(ListID, ColorID)
SELECT 1,1
UNION ALL SELECT 1,5
UNION ALL SELECT 1,8
UNION ALL SELECT 1,3
UNION ALL SELECT 1,4
UNION ALL SELECT 2,4
UNION ALL SELECT 2,6
UNION ALL SELECT 2,7
UNION ALL SELECT 2,8
UNION ALL SELECT 2,10
UNION ALL SELECT 3,4
UNION ALL SELECT 3,6
UNION ALL SELECT 3,7
UNION ALL SELECT 3,8
UNION ALL SELECT 3,10
SELECT *
FROM @List L
JOIN @ColorList CL ON L.ID = CL.ListID
JOIN @Color C ON CL.ColorID = C.ID
-- expected results: Lists 2 & 3 are identical
May 12, 2011 at 12:19 pm
Here's the same issue with some answers
http://www.sqlservercentral.com/Forums/Topic1101072-392-1.aspx
For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
May 13, 2011 at 12:27 pm
Thanks for pointing me in the right direction, Mike!
SwePeso's solution worked for me.
Posting the solution here in case this helps others:
--10 times the record, 7 times the duration.
/* Total time
10k 1 sec
100k 7 sec
1000k 50 sec
*/
-- 0 sec (10k), 4 sec (100k), 39 sec (1000k)
SELECT x.ListID AS LowID,
y.ListID AS HighID,
MIN(x.ProductItems) AS ProductItems
INTO #Stage
FROM (
SELECT ListID,
COUNT(*) AS ProductItems
FROM @ColorList
GROUP BY ListID
) AS x
INNER JOIN (
SELECT ListID,
COUNT(*) AS ProductItems
FROM @ColorList
GROUP BY ListID
) AS y ON y.ProductItems = x.ProductItems
INNER JOIN @ColorList AS x1 ON x1.ListID = x.ListID
INNER JOIN @ColorList AS y1 ON y1.ListID = y.ListID
WHERE x1.ColorID = y1.ColorID
AND x.ListID < y.ListID
GROUP BY x.ListID,
y.ListID
HAVING COUNT(*) = MIN(x.ProductItems)
-- 0 sec (10k), 1 sec (100k), 2 sec (1000k)
CREATE INDEX IX_Low ON #Stage (LowID) INCLUDE (HighID, ProductItems)
CREATE INDEX IX_High ON #Stage (HighID) INCLUDE (LowID, ProductItems)
-- 0 sec (10k), 1 sec (100k), 8 sec (1000k)
;WITH cteDuplicate
AS (
SELECT s.LowID AS theGrp,
s.LowID AS ListID,
MIN(s.ProductItems) AS ProductItems,
1 AS SeqID
FROM #Stage AS s
WHERE NOT EXISTS (SELECT * FROM #Stage AS x WHERE x.HighID = s.LowID)
GROUP BY s.LowID
UNION ALL
SELECT d.theGrp,
f.ID AS ListID,
d.ProductItems,
d.SeqID + 1 AS SeqID
FROM cteDuplicate AS d
CROSS APPLY (
SELECT s.HighID,
ROW_NUMBER() OVER (ORDER BY s.HighID) AS SeqID
FROM #Stage AS s
WHERE s.LowID = d.ListID
) AS f(ID, SeqID)
WHERE f.SeqID = 1
)
SELECT DENSE_RANK() OVER (ORDER BY theGrp) AS theGroup,
SeqID,
ListID
INTO #Temp
FROM cteDuplicate
OPTION (MAXRECURSION 0)
SELECT *
FROM #Temp
-- 0 sec
DROP TABLE #Stage
DROP TABLE#Temp
May 14, 2011 at 1:16 pm
Hi Goldie,
Before we start on a solution, let's make more than just a handful of rows of data to test with so we can see what happens when scale increases.
--===== Conditionally drop the test tables to make reruns easier in SSMS
IF OBJECT_ID('tempdb..#Color' ,'U') IS NOT NULL DROP TABLE #Color;
IF OBJECT_ID('tempdb..#List' ,'U') IS NOT NULL DROP TABLE #List;
IF OBJECT_ID('tempdb..#ColorList' ,'U') IS NOT NULL DROP TABLE #ColorList;
--===== Create and populate the color table on the fly
SELECT ID = IDENTITY(INT,1,1),
Color = d.Color
INTO #Color
FROM (
SELECT 'White' UNION ALL
SELECT 'Black' UNION ALL
SELECT 'Purple' UNION ALL
SELECT 'Pink' UNION ALL
SELECT 'Red' UNION ALL
SELECT 'Blue' UNION ALL
SELECT 'Green' UNION ALL
SELECT 'Yellow' UNION ALL
SELECT 'Orange' UNION ALL
SELECT 'Brown'
) d (Color)
;
--===== Create and populate the list of codes on the fly
-- This creates all 17,576 possibilities
WITH
cteTally AS
(
SELECT TOP (POWER(26,3)) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
)
SELECT ID = IDENTITY(INT,1,1),
Code = CHAR((N/26/26)%26+65) --26^2
+ CHAR((N/26) %26+65) --26^1
+ CHAR((N) %26+65) --26^0
INTO #List
FROM cteTally
;
--===== Create and populate the ColorList junction table on the fly
-- This creates approximately 20,000
SELECT TOP 100000
ColorID = ABS(CHECKSUM(NEWID()))%10+1,
ListID = ABS(CHECKSUM(NEWID()))%POWER(26,3)+1
INTO #ColorList
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
We don't actually need the Color table or the List table at this point. What we're really interested in is figuring out the answer to your question of "find if any two Lists have the exact same members". Now, there are a whole lot of ways to answer that single question and some of them can be pretty darned slow. It turns out that one of the easiest ways to find out is also one of the fastest ways. The easy way is to simply concatenate each color into an ordered CSV list for each ListID and then number the lists using DENSE_RANK.
The following code only takes 5 seconds on my 9 year old single CPU desktop...
--===== Conditionally drop the test tables to make reruns easier in SSMS
IF OBJECT_ID('tempdb..#ColorGroup','U') IS NOT NULL DROP TABLE #ColorGroup;
--===== Group lists that have precisely the same colors
WITH
cteAggColors AS
(
SELECT cl2.ListID,
ItemCount = COUNT(*),
Colors =
(--==== Create an ordered CSV list of colors
-- for the current cl2.ListID.
SELECT ','+CAST(cl1.ColorID AS VARCHAR(10))
FROM #ColorList cl1
WHERE cl1.ListID = cl2.ListID
ORDER BY cl1.ColorID
FOR XML PATH('')
)
+ ','
FROM #ColorList cl2
GROUP BY cl2.ListID
)
SELECT ColorGroup = DENSE_RANK() OVER (ORDER BY Colors),
ColorGroupCount = CAST(NULL AS INT),
ListID,
ItemCount,
Colors
INTO #ColorGroup
FROM cteAggColors
ORDER BY ColorGroup, ListID
;
--===== Count the members of each "color group" and store that
-- in the table, as well.
WITH
cteCountMembers AS
(
SELECT ColorGroup, ColorGroupCount = COUNT(*)
FROM #ColorGroup
GROUP BY ColorGroup
)
UPDATE cg
SET ColorGroupCount = cm.ColorGroupCount
FROM #ColorGroup cg
LEFT JOIN cteCountMembers cm
ON cg.ColorGroup = cm.ColorGroup
;
Of course, you can then easily find ListIDs that have precisely the same colors very easily...
SELECT * FROM #ColorGroup WHERE ColorGroupCount > 1
Then, unlike many of the other methods, we can now do some pretty fancy stuff. For example, we can now easily answer the question of "Which ListID's have unique color combinations compared to all other lists"?
SELECT * FROM #ColorGroup WHERE ColorGroupCount = 1
Because of the large amount of random data we used, it turns out that the answer to the question above is "None" but now we have the proof.
If the "lists" were purchases of something like "flowers" and each list represented a purchase of a small group of flowers, we can also easily answer questions like "What are the most popular color combinations of 4 flowers and which lists have those combinations"?
WITH
cteFindMostPopular AS
(
SELECT Popularity = MAX(ColorGroupCount) FROM #ColorGroup WHERE ItemCount = 4
)
SELECT color.*
FROM #ColorGroup color
INNER JOIN cteFindMostPopular popular
ON color.ColorGroupCount = popular.Popularity
WHERE color.ItemCount = 4
;
If you did, indeed, own a flower shop, let's say you're getting ready to make up flower-sets for the 4th of July. What are the 2 most popular combinations of ONLY Red, White, and Blue flowers that you sold for the previous 4th of July and which lists are they on?
WITH
cteFindMostPopular AS
(
SELECT DISTINCT TOP 2 Popularity = ColorGroupCount
FROM #ColorGroup
WHERE DistinctColors = ',1,5,6,'
ORDER BY ColorGroupCount DESC
)
SELECT color.*
FROM #ColorGroup color
INNER JOIN cteFindMostPopular popular
ON color.ColorGroupCount = popular.Popularity
WHERE color.DistinctColors = ',1,5,6,'
;
How about this. You got a huge deal on Pink flowers and bought a lot of them. What are the top selling color combinations that have at least 2 Pink flowers in them?
WITH
cteFindMostPopular AS
(
SELECT DISTINCT TOP 3 Popularity = ColorGroupCount
FROM #ColorGroup
WHERE Colors LIKE '%,4,4,%'
ORDER BY ColorGroupCount DESC
)
SELECT color.*
FROM #ColorGroup color
INNER JOIN cteFindMostPopular popular
ON color.ColorGroupCount = popular.Popularity
WHERE color.Colors LIKE '%,4,4,%'
ORDER BY ColorGroupCount DESC, ColorGroup, ListID
;
With a little imagination, the combinations are just about endless.
The other thing to note on this method is the performance. For example, the code you posted as the solution you used takes more than 25 MINUTES to run on my machine where the method I used above takes only seconds to run and leaves behind a table where many, many different and "fairly odd" questions asked of the data can be easily and quickly answered.
Of course, using the "1960's style pointers" to do it all made it a whole lot easier to demonstrate and use. 😛 You could, however, use the color name and the list name to do the same thing.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 15, 2011 at 8:32 am
CELKO (5/13/2011)
I know this is a skeleton, but you got so much so wrong in so little code. Colors would be encoded with an industry standard (Land numbers, Pantone, etc), but never a count of physical insertion attempts like you did. What you are doing is faking 1960's pointer chains and not yet writing RDBMS at allIt would also help if you knew ISO-11179 naming rules and did not use vague names
T-SQL has row constructors so stop doing dialect. It is both ugly and slow.
[...]
As I mentioned, these tables are just samples to try to figure out how to solve an issue.
The real table structure data is proprietary and has nothing to do with colors. I have two tables with primary keys, and another "intersection" table that joins them together. Primary & foreign keys are in place, I just did not include them in this small example.
I was trying to figure out how to find out if any two "lists" were identical.
This is a kind of relational division. You can do it with proprietary APPLY and other fancy stuff. Let's make it easy. If two sets are equal, then their cardinality is the same. That is a great filter.
I'm not sure what you are saying here. Just because the counts are the same, does not mean that the lists are identical.
May 15, 2011 at 8:50 am
...
We don't actually need the Color table or the List table at this point. What we're really interested in is figuring out the answer to your question of "find if any two Lists have the exact same members". Now, there are a whole lot of ways to answer that single question and some of them can be pretty darned slow. It turns out that one of the easiest ways to find out is also one of the fastest ways. The easy way is to simply concatenate each color into an ordered CSV list for each ListID and then number the lists using DENSE_RANK.
...
Thanks for your help, Jeff.
This is a really straightforward approach. A lot easier to understand than the solution I posted.
I am surprised that it is so fast. I would have thought that all of that string concatenation would end up slow.
(It's a shame I don't run a flower shop :hehe: Sounds interesting!)
May 15, 2011 at 10:30 am
Thanks for the feedback, Goldie. I needed to do such a thing with 300,000 lists that had 1 to 12 items in each list. I tried other folks methods and they just didn't cut it performance-wise. I even tried Celko's method of building 300,000 "micro" Triangular Joins which wasn't bad but was still too slow, IMHO. To be honest, I was surprised at how fast the concatenate and group-counting actually was. Having it be simple to understand was definitely a bonus.
Then I starting playing with scenarios such as "People who bought x also bought y" and was amazed at how simple it was to do so (used an inner join back to the original table for a different take on that) and I was hooked.
Since I knew your data was a made up scenario, I went to the "flower" example just to show you that, in the absence of my knowledge of your data, that other things were possible with your data if you needed to do something different.
Again, thank you very much for the feedback. I'm glad I could help.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply