December 9, 2016 at 4:20 am
Hmmm... OK, I can do this but only with some really expensive RBAR table operations, the script I've written takes 10 minutes to run (longer with cursors) and I can't help thinking I'm being a bit thick.
So I have a simple paired list which contains relationships between two products, column A is paired with Column B.
Column A ColumnB
ProdA --> ProdB
ProdB --> ProdA
ProdB --> ProdC
ProdC --> ProdB
ProdD --> ProdE
ProdE --> ProdD
I don't know how many items are in a group (could be up to 20), and I also can't guarantee that every relationship is expressed explicitly which means chaining with self joins doesn't necessarily work, so for example in the above set of records there is no explicit relationship between ProdC and ProdA.
Say what you like about the modelling of this, web programmers, don't get me started.
Anyway, I have to group these records together - the above data creates two groups :
Group 1 --> ProdA, ProdB, ProdC
Group 2 --> ProdD, ProdE
The only way I've got it working is:
SELECT TOP 1 from column A and it's corresponding value from column B into variables
SELECT columns A, B where variable A in column A or column B or variable B in column A or column B
Dump results into a temp table
Union both columns to create a single list, select distinct from there,
Increment a counter, dump it into a grouping table with the counter value to create a group number,
Delete everything from the product table where the value in A or B is in either column of the grouping table.
Repeat until the product table is empty.
But like I say, I think my solution is not at all smart.
Any thoughts appreciated.
December 9, 2016 at 5:11 am
Recursion would usually be used for this Richard.
Try the following:
IF 0 = 1 BEGIN; -- Knock up some sample data
WITH MyData (A, B) AS (
SELECT 'ProdA', 'ProdB' UNION ALL
SELECT 'ProdB', 'ProdA' UNION ALL
SELECT 'ProdB', 'ProdC' UNION ALL
SELECT 'ProdC', 'ProdB' UNION ALL
SELECT 'ProdD', 'ProdE' UNION ALL
SELECT 'ProdE', 'ProdD')
SELECT * INTO #MyData FROM MyData; END;
-- Get rid of dupe pairs
WITH DedupedSet AS (
SELECT DISTINCT x.*
FROM #MyData
CROSS APPLY (
SELECT
A = CASE WHEN A < B THEN A ELSE B END,
B = CASE WHEN A < B THEN B ELSE A END
) x
-- use recursion to resolve the hierarchy
), rCTE AS (
SELECT d.A, d.B, Seq = CAST(A+'>'+B AS VARCHAR(50)), [Level] = 0
FROM DedupedSet d
WHERE NOT EXISTS (SELECT 1 FROM DedupedSet di WHERE di.B = d.A)
UNION ALL
SELECT nl.A, nl.B, Seq = CAST(ll.Seq +'>'+ nl.B AS VARCHAR(50)), [Level] = ll.[Level]+1
FROM rCTE ll
INNER JOIN DedupedSet nl
ON nl.A = ll.B
)
SELECT Seq
FROM rCTE ro
WHERE NOT EXISTS (SELECT 1 FROM rCTE ri WHERE ri.[Level] > ro.[Level] AND ri.Seq LIKE ro.Seq+'%')
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 9, 2016 at 6:50 am
Thanks for that Chris....
Right, so now I understand how a recursive CTE works. Doh! There's a certain point when I'm reading into things that my brain just gives up and I just hack and slash until stuff works :-D.
Logically this is more or less what I'm already doing, but surprisingly it actually takes longer than my existing query, I think because I'm identifying and hiving off smaller record sets and deleting from the source table then doing a select top 1 to prime the next pass, rather than using 2 Exists checks (I started with a similar construct to your "WHERE not EXISTS" and dropped it).
So, technically superior but the idiot seems to have the edge on performance.....
December 9, 2016 at 7:18 am
You could accelerate it by saving off DedupedSet as a temp table and indexing it - a clustered index on column A, possibly.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 14, 2016 at 1:29 pm
Does this help?
It sounds like the same root problem. If you can solve it efficiently, I am sure that Facebook would love to hear from you 🙂
December 15, 2016 at 8:47 am
Just curious if this is any easier?
WITH MyData (A, B) AS (
SELECT 'ProdA', 'ProdB' UNION ALL
SELECT 'ProdB', 'ProdA' UNION ALL
SELECT 'ProdB', 'ProdC' UNION ALL
SELECT 'ProdC', 'ProdB' UNION ALL
SELECT 'ProdD', 'ProdE' UNION ALL
SELECT 'ProdE', 'ProdD'
)
SELECT *
INTO #MyData
FROM MyData
WHERE A <> B;
CREATE TABLE #TEMP_DATA (
RN int IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
A varchar(10),
B varchar(10)
);
CREATE NONCLUSTERED INDEX IX_TEMP_DATA_A_B ON #TEMP_DATA (
A ASC,
B ASC
);
CREATE NONCLUSTERED INDEX IX_TEMP_DATA_B_A ON #TEMP_DATA (
B ASC,
A ASC
);
INSERT INTO #TEMP_DATA (A, B)
SELECT DISTINCT
CASE WHEN A < B THEN A ELSE B END AS A,
CASE WHEN B > A THEN B ELSE A END AS B
FROM #MyData
ORDER BY A, B;
/*
SELECT *
FROM #TEMP_DATA
ORDER BY A, B;
*/
WITH CONNECTED_SETS AS (
SELECT T.A, T.B,
CASE
WHEN T.RN = 1 THEN 1
WHEN LAG(T.B, 1, NULL) OVER(ORDER BY T.RN) <> T.A THEN 1
ELSE 0
END AS GROUP_START,
T.RN
FROM #TEMP_DATA AS T
),
GROUPS AS (
SELECT CS.A, CS.B, CS.GROUP_START, CS.RN,
SUM(CS.GROUP_START) OVER(ORDER BY CS.RN ROWS UNBOUNDED PRECEDING) AS GROUP_NUMBER
FROM CONNECTED_SETS AS CS
)
SELECT GRP.GROUP_NUMBER, X.VALUE
FROM (
SELECT G.GROUP_NUMBER
FROM GROUPS AS G
GROUP BY G.GROUP_NUMBER
) AS GRP
CROSS APPLY (
SELECT DISTINCT G2.A AS VALUE
FROM GROUPS AS G2
WHERE G2.GROUP_NUMBER = GRP.GROUP_NUMBER
UNION
SELECT DISTINCT G3.B AS VALUE
FROM GROUPS AS G3
WHERE G3.GROUP_NUMBER = GRP.GROUP_NUMBER
) AS X
ORDER BY GRP.GROUP_NUMBER, X.VALUE;
DROP TABLE #MyData;
DROP TABLE #TEMP_DATA;
See the attached sqlplan file for performance comparison purposes.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
December 18, 2016 at 5:10 pm
richard.gardner 6009 (12/9/2016)
Hmmm... OK, I can do this but only with some really expensive RBAR table operations, the script I've written takes 10 minutes to run (longer with cursors) and I can't help thinking I'm being a bit thick.So I have a simple paired list which contains relationships between two products, column A is paired with Column B.
Column A ColumnB
ProdA --> ProdB
ProdB --> ProdA
ProdB --> ProdC
ProdC --> ProdB
ProdD --> ProdE
ProdE --> ProdD
I don't know how many items are in a group (could be up to 20), and I also can't guarantee that every relationship is expressed explicitly which means chaining with self joins doesn't necessarily work, so for example in the above set of records there is no explicit relationship between ProdC and ProdA.
Say what you like about the modelling of this, web programmers, don't get me started.
Anyway, I have to group these records together - the above data creates two groups :
Group 1 --> ProdA, ProdB, ProdC
Group 2 --> ProdD, ProdE
The only way I've got it working is:
SELECT TOP 1 from column A and it's corresponding value from column B into variables
SELECT columns A, B where variable A in column A or column B or variable B in column A or column B
Dump results into a temp table
Union both columns to create a single list, select distinct from there,
Increment a counter, dump it into a grouping table with the counter value to create a group number,
Delete everything from the product table where the value in A or B is in either column of the grouping table.
Repeat until the product table is empty.
But like I say, I think my solution is not at all smart.
Any thoughts appreciated.
How many rows in this table? Also, are products unique to their individual chain or can they appear in more than one chain?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 21, 2016 at 6:28 am
Hey, Jeff Moden, you've taught me a lot over the years (you might not believe it looking at my code :-))...
OK, so we're only looking at about 89,000 records, so I'm a little confused as to why we can't improve on my bludgeoning, Steve's solution ran in similar time to mine, but it only produces pairs of records. Because there didn't seem to be a time advantage I didn't trouble shoot it further. Chris's solution does work, I can see it dumping records out, but I think there's something askance in the recursion which is hitting performance, I cut the query after 15 minutes.
By the nature of things if there is a link between two numbers both of those numbers are in the same group, so all groups are by definition mutually exclusive.
So here's the dirty version which works (it takes 10 - 12 minutes), I've commented my second pass, it was the only way I could figure out how to resolve incomplete links. Note I've moved through CTES --> Temp tables --> actual tables to try and squeeze some performance out of it, but I can't polish it, I'm just rolling it in glitter 😉
DECLARE @a int
DECLARE @b-2 int
DECLARE @grouper INT
SET @a = NULL
SET @b-2 = NULL
SET @grouper = 1
TRUNCATE TABLE HoldLinks
TRUNCATE TABLE Holding
WHILE EXISTS
(
SELECT TOP 1 product_id as A, linked_product_id as B
FROM [ProductLinks]
)
BEGIN
--Prime the procedure with the first two records from the recordset
SELECT TOP 1 product_id as A, linked_product_id as B
INTO #Primer
FROM [ProductLinks]
SET @a = (SELECT A FROM #Primer)
SET @b-2 = (SELECT B FROM #Primer)
--Insert all relevant records into holding table
INSERT INTO Holding
SELECT product_id as A, linked_product_id as B
FROM ProductLinks
WHERE product_id IN (@A, @b-2) OR linked_product_id IN (@A, @b-2)
;
--Do a second pass with all the records currently in Holding to resolve situations where links are not complete
--i.e. Record A is connected to B and C, C is connected to D but D is not connected to A or B, therefore single pass will not pick up D
WITH Uniques
as
(
SELECT DISTINCT A FROM Holding
UNION
SELECT DISTINCT B FROM Holding
)
INSERT INTO Holding
SELECT U.A, P.linked_product_id as B
FROM Uniques U
LEFT JOIN ProductLinks P ON U.A = P.product_id
INSERT INTO HoldLinks
SELECT DISTINCT A, @grouper as [SwatchGroup] FROM Holding
UNION
SELECT DISTINCT B, @grouper as [SwatchGroup] FROM Holding
DELETE FROM [ProductLinks] WHERE product_id IN (SELECT A FROM HoldLinks)
DROP TABLE #Primer
TRUNCATE TABLE Holding
END
December 21, 2016 at 6:57 am
Hey Richard, if you post up your rCTE version, I'll figure out an index or two for you. rCTE's can be blazingly fast if the indexing is correct.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 21, 2016 at 11:22 am
richard.gardner 6009 (12/21/2016)
Hey, Jeff Moden, you've taught me a lot over the years (you might not believe it looking at my code :-))...
Thank you for the kind feedback, Richard.
OK, so we're only looking at about 89,000 records, so I'm a little confused as to why we can't improve on my bludgeoning, Steve's solution ran in similar time to mine, but it only produces pairs of records. Because there didn't seem to be a time advantage I didn't trouble shoot it further. Chris's solution does work, I can see it dumping records out, but I think there's something askance in the recursion which is hitting performance, I cut the query after 15 minutes.
By the nature of things if there is a link between two numbers both of those numbers are in the same group, so all groups are by definition mutually exclusive.
Excellent. That's the info I needed to setup for some tests... unless, with only 89K rows, you wanted to attach a ZIP file that I could import. Of course, no proprietary or PII info, please.
So here's the dirty version which works (it takes 10 - 12 minutes), I've commented my second pass, it was the only way I could figure out how to resolve incomplete links. Note I've moved through CTES --> Temp tables --> actual tables to try and squeeze some performance out of it, but I can't polish it, I'm just rolling it in glitter 😉
That will help as well... thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 22, 2016 at 3:55 am
Hi Jeff,
Here's a sample dataset, my bad, 89,000 is a different number, it's just 50,000 pairs of internal keys.
@chris-2 - Thanks for the offer, appreciated, I'll have another look at it, I did run it a couple of times and the first time I got an error >100 levels of recursion, second time it ran until I killed it.
Regards
Richard
December 22, 2016 at 9:25 am
Hi Richard, how many rows are you expecting back from this query?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 22, 2016 at 9:33 am
Well I can't be precise because I reloaded the source data this morning and I haven't run it today, but yesterday it was 15,760 with maximum group sizes of between 2 and 20 (this is with the data organised as below):
ID Group
3998 1
1409121
7972 2
263742
263772
263802
266863
266903
266943
266983
9769 3
270033
270113
December 22, 2016 at 9:35 am
I also started with a recursive CTE but it never came to an end. A cursor seems to be reasonably fast in this case. On my laptop your script runs in 4:30 min, the script below took 2 seconds with your test dataset. I almost never use cursors so I'm not sure if LOCAL FAST_FORWARD is the fastest possible cursor in this case.
CREATE TABLE #ProductSets
(
set_id INT NOT NULL,
product_id INT NOT NULL
);
CREATE INDEX IX_ProductSet_product_id ON #ProductSets(product_id);
DECLARE PL CURSOR LOCAL FAST_FORWARD FOR
SELECT DISTINCT
IIF(product_id < linked_product_id, product_id, linked_product_id) AS product_id,
IIF(product_id < linked_product_id, linked_product_id, product_id) AS linked_product_id
FROM
dbo.ProductLinks
ORDER BY
product_id, linked_product_id;
DECLARE @product_id INT;
DECLARE @linked_product_id INT;
DECLARE @set_counter INT = 0;
DECLARE @set_id INT;
OPEN PL
FETCH NEXT FROM PL
INTO @product_id, @linked_product_id;
WHILE @@FETCH_STATUS = 0
BEGIN
SET @set_id = NULL;
SELECT @set_id = set_id FROM #ProductSets WHERE product_id = @product_id
IF (@set_id IS NULL)
BEGIN
SET @set_counter += 1;
SET @set_id = @set_counter;
INSERT INTO
#ProductSets(set_id, product_id)
VALUES
(@set_id, @product_id),
(@set_id, @linked_product_id)
END
ELSE
BEGIN
INSERT INTO
#ProductSets(set_id, product_id)
VALUES
(@set_id, @linked_product_id)
END
FETCH NEXT FROM PL
INTO @product_id, @linked_product_id;
END;
CLOSE PL;
DEALLOCATE PL;
PRINT CAST(GETDATE() AS VARCHAR(100));
SELECT DISTINCT
set_id,
STUFF((
SELECT
', ' + CAST(Q.product_id AS VARCHAR(10))
FROM
(
SELECT DISTINCT
PS2.product_id
FROM
#ProductSets PS2
WHERE
PS2.set_id = PS1.set_id
) Q
FOR XML PATH('')
), 1, 2, '') product_set
FROM
#ProductSets PS1
ORDER BY
set_id
DROP TABLE #ProductSets;
December 22, 2016 at 9:47 am
The sample data set has a row where both columns have the same value which messes up rCTE's unless you remove it. Even then, some of the parent values have at least 21 values linked in the sequence, for instance
179440>179479>179481>179488>179490>179492>179494>179496>179498>179500>179502>179504>179530>179556>179566>179570>179572>179576>179578>179580>179582>179784
Using longhand code to get a decent handle on the shape of the data, I get over 50 million rows output at 11 levels of linkage...
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 15 posts - 1 through 15 (of 25 total)
You must be logged in to reply to this topic. Login to reply