Parsing paired relationships.

Question

Post reply

Parsing paired relationships.

richard.gardner 6009

SSC-Addicted

Points: 484
More actions
December 9, 2016 at 4:20 am

#338148

Hmmm... OK, I can do this but only with some really expensive RBAR table operations, the script I've written takes 10 minutes to run (longer with cursors) and I can't help thinking I'm being a bit thick.
So I have a simple paired list which contains relationships between two products, column A is paired with Column B.
Column A ColumnB
ProdA --> ProdB
ProdB --> ProdA
ProdB --> ProdC
ProdC --> ProdB
ProdD --> ProdE
ProdE --> ProdD
I don't know how many items are in a group (could be up to 20), and I also can't guarantee that every relationship is expressed explicitly which means chaining with self joins doesn't necessarily work, so for example in the above set of records there is no explicit relationship between ProdC and ProdA.
Say what you like about the modelling of this, web programmers, don't get me started.
Anyway, I have to group these records together - the above data creates two groups :
Group 1 --> ProdA, ProdB, ProdC
Group 2 --> ProdD, ProdE
The only way I've got it working is:
SELECT TOP 1 from column A and it's corresponding value from column B into variables
SELECT columns A, B where variable A in column A or column B or variable B in column A or column B
Dump results into a temp table
Union both columns to create a single list, select distinct from there,
Increment a counter, dump it into a grouping table with the counter value to create a group number,
Delete everything from the product table where the value in A or B is in either column of the grouping table.
Repeat until the product table is empty.
But like I say, I think my solution is not at all smart.
Any thoughts appreciated.

Viewing 15 posts - 1 through 15 (of 26 total)

You must be logged in to reply to this topic. Login to reply

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 1

Recursion would usually be used for this Richard.

Try the following:

IF 0 = 1 BEGIN; -- Knock up some sample data

WITH MyData (A, B) AS (

SELECT 'ProdA', 'ProdB' UNION ALL

SELECT 'ProdB', 'ProdA' UNION ALL

SELECT 'ProdB', 'ProdC' UNION ALL

SELECT 'ProdC', 'ProdB' UNION ALL

SELECT 'ProdD', 'ProdE' UNION ALL

SELECT 'ProdE', 'ProdD')

SELECT * INTO #MyData FROM MyData; END;

-- Get rid of dupe pairs

WITH DedupedSet AS (

SELECT DISTINCT x.*

FROM #MyData

CROSS APPLY (

SELECT

A = CASE WHEN A < B THEN A ELSE B END,

B = CASE WHEN A < B THEN B ELSE A END

) x

-- use recursion to resolve the hierarchy

), rCTE AS (

SELECT d.A, d.B, Seq = CAST(A+'>'+B AS VARCHAR(50)), [Level] = 0

FROM DedupedSet d

WHERE NOT EXISTS (SELECT 1 FROM DedupedSet di WHERE di.B = d.A)

UNION ALL

SELECT nl.A, nl.B, Seq = CAST(ll.Seq +'>'+ nl.B AS VARCHAR(50)), [Level] = ll.[Level]+1

FROM rCTE ll

INNER JOIN DedupedSet nl

ON nl.A = ll.B

)

SELECT Seq

FROM rCTE ro

WHERE NOT EXISTS (SELECT 1 FROM rCTE ri WHERE ri.[Level] > ro.[Level] AND ri.Seq LIKE ro.Seq+'%')

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

richard.gardner 6009 SSC-Addicted Points: 484 More actions · Answer 2

Thanks for that Chris....

Right, so now I understand how a recursive CTE works. Doh! There's a certain point when I'm reading into things that my brain just gives up and I just hack and slash until stuff works :-D.

Logically this is more or less what I'm already doing, but surprisingly it actually takes longer than my existing query, I think because I'm identifying and hiving off smaller record sets and deleting from the source table then doing a select top 1 to prime the next pass, rather than using 2 Exists checks (I started with a similar construct to your "WHERE not EXISTS" and dropped it).

So, technically superior but the idiot seems to have the edge on performance.....

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

You could accelerate it by saving off DedupedSet as a temp table and indexing it - a clustered index on column A, possibly.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

aaron.reese SSChampion Points: 13498 More actions · Answer 4

Does this help?

http://stackoverflow.com/questions/2076715/challenge-how-to-implement-an-algorithm-for-six-degree-of-separation

It sounds like the same root problem. If you can solve it efficiently, I am sure that Facebook would love to hear from you 🙂

sgmunson SSC Guru Points: 110651 More actions · Answer 5

Just curious if this is any easier?

WITH MyData (A, B) AS (

SELECT 'ProdA', 'ProdB' UNION ALL

SELECT 'ProdB', 'ProdA' UNION ALL

SELECT 'ProdB', 'ProdC' UNION ALL

SELECT 'ProdC', 'ProdB' UNION ALL

SELECT 'ProdD', 'ProdE' UNION ALL

SELECT 'ProdE', 'ProdD'

)

SELECT *

INTO #MyData

FROM MyData

WHERE A <> B;

CREATE TABLE #TEMP_DATA (

RN int IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,

A varchar(10),

B varchar(10)

);

CREATE NONCLUSTERED INDEX IX_TEMP_DATA_A_B ON #TEMP_DATA (

A ASC,

B ASC

);

CREATE NONCLUSTERED INDEX IX_TEMP_DATA_B_A ON #TEMP_DATA (

B ASC,

A ASC

);

INSERT INTO #TEMP_DATA (A, B)

SELECT DISTINCT

CASE WHEN A < B THEN A ELSE B END AS A,

CASE WHEN B > A THEN B ELSE A END AS B

FROM #MyData

ORDER BY A, B;

/*

SELECT *

FROM #TEMP_DATA

ORDER BY A, B;

*/

WITH CONNECTED_SETS AS (

SELECT T.A, T.B,

CASE

WHEN T.RN = 1 THEN 1

WHEN LAG(T.B, 1, NULL) OVER(ORDER BY T.RN) <> T.A THEN 1

ELSE 0

END AS GROUP_START,

T.RN

FROM #TEMP_DATA AS T

),

GROUPS AS (

SELECT CS.A, CS.B, CS.GROUP_START, CS.RN,

SUM(CS.GROUP_START) OVER(ORDER BY CS.RN ROWS UNBOUNDED PRECEDING) AS GROUP_NUMBER

FROM CONNECTED_SETS AS CS

)

SELECT GRP.GROUP_NUMBER, X.VALUE

FROM (

SELECT G.GROUP_NUMBER

FROM GROUPS AS G

GROUP BY G.GROUP_NUMBER

) AS GRP

CROSS APPLY (

SELECT DISTINCT G2.A AS VALUE

FROM GROUPS AS G2

WHERE G2.GROUP_NUMBER = GRP.GROUP_NUMBER

UNION

SELECT DISTINCT G3.B AS VALUE

FROM GROUPS AS G3

WHERE G3.GROUP_NUMBER = GRP.GROUP_NUMBER

) AS X

ORDER BY GRP.GROUP_NUMBER, X.VALUE;

DROP TABLE #MyData;

DROP TABLE #TEMP_DATA;

See the attached sqlplan file for performance comparison purposes.

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 6

richard.gardner 6009 (12/9/2016)
Hmmm... OK, I can do this but only with some really expensive RBAR table operations, the script I've written takes 10 minutes to run (longer with cursors) and I can't help thinking I'm being a bit thick.
So I have a simple paired list which contains relationships between two products, column A is paired with Column B.
Column A ColumnB
ProdA --> ProdB
ProdB --> ProdA
ProdB --> ProdC
ProdC --> ProdB
ProdD --> ProdE
ProdE --> ProdD
I don't know how many items are in a group (could be up to 20), and I also can't guarantee that every relationship is expressed explicitly which means chaining with self joins doesn't necessarily work, so for example in the above set of records there is no explicit relationship between ProdC and ProdA.
Say what you like about the modelling of this, web programmers, don't get me started.
Anyway, I have to group these records together - the above data creates two groups :
Group 1 --> ProdA, ProdB, ProdC
Group 2 --> ProdD, ProdE
The only way I've got it working is:
SELECT TOP 1 from column A and it's corresponding value from column B into variables
SELECT columns A, B where variable A in column A or column B or variable B in column A or column B
Dump results into a temp table
Union both columns to create a single list, select distinct from there,
Increment a counter, dump it into a grouping table with the counter value to create a group number,
Delete everything from the product table where the value in A or B is in either column of the grouping table.
Repeat until the product table is empty.
But like I say, I think my solution is not at all smart.
Any thoughts appreciated.

How many rows in this table? Also, are products unique to their individual chain or can they appear in more than one chain?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

richard.gardner 6009 SSC-Addicted Points: 484 More actions · Answer 7

Hey, Jeff Moden, you've taught me a lot over the years (you might not believe it looking at my code :-))...

OK, so we're only looking at about 89,000 records, so I'm a little confused as to why we can't improve on my bludgeoning, Steve's solution ran in similar time to mine, but it only produces pairs of records. Because there didn't seem to be a time advantage I didn't trouble shoot it further. Chris's solution does work, I can see it dumping records out, but I think there's something askance in the recursion which is hitting performance, I cut the query after 15 minutes.

By the nature of things if there is a link between two numbers both of those numbers are in the same group, so all groups are by definition mutually exclusive.

So here's the dirty version which works (it takes 10 - 12 minutes), I've commented my second pass, it was the only way I could figure out how to resolve incomplete links. Note I've moved through CTES --> Temp tables --> actual tables to try and squeeze some performance out of it, but I can't polish it, I'm just rolling it in glitter 😉

DECLARE @A int

DECLARE @b-2 int

DECLARE @grouper INT

SET @A = NULL

SET @b-2 = NULL

SET @grouper = 1

TRUNCATE TABLE HoldLinks

TRUNCATE TABLE Holding

WHILE EXISTS

(

SELECT TOP 1 product_id as A, linked_product_id as B

FROM [ProductLinks]

)

BEGIN

--Prime the procedure with the first two records from the recordset

SELECT TOP 1 product_id as A, linked_product_id as B

INTO #Primer

FROM [ProductLinks]

SET @A = (SELECT A FROM #Primer)

SET @b-2 = (SELECT B FROM #Primer)

--Insert all relevant records into holding table

INSERT INTO Holding

SELECT product_id as A, linked_product_id as B

FROM ProductLinks

WHERE product_id IN (@A, @b-2) OR linked_product_id IN (@A, @b-2)

;

--Do a second pass with all the records currently in Holding to resolve situations where links are not complete

--i.e. Record A is connected to B and C, C is connected to D but D is not connected to A or B, therefore single pass will not pick up D

WITH Uniques

as

(

SELECT DISTINCT A FROM Holding

UNION

SELECT DISTINCT B FROM Holding

)

INSERT INTO Holding

SELECT U.A, P.linked_product_id as B

FROM Uniques U

LEFT JOIN ProductLinks P ON U.A = P.product_id

INSERT INTO HoldLinks

SELECT DISTINCT A, @grouper as [SwatchGroup] FROM Holding

UNION

SELECT DISTINCT B, @grouper as [SwatchGroup] FROM Holding

DELETE FROM [ProductLinks] WHERE product_id IN (SELECT A FROM HoldLinks)

DROP TABLE #Primer

TRUNCATE TABLE Holding

SET @grouper = @grouper + 1

END

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 8

Hey Richard, if you post up your rCTE version, I'll figure out an index or two for you. rCTE's can be blazingly fast if the indexing is correct.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 9

richard.gardner 6009 (12/21/2016)
Hey, Jeff Moden, you've taught me a lot over the years (you might not believe it looking at my code :-))...

Thank you for the kind feedback, Richard.

OK, so we're only looking at about 89,000 records, so I'm a little confused as to why we can't improve on my bludgeoning, Steve's solution ran in similar time to mine, but it only produces pairs of records. Because there didn't seem to be a time advantage I didn't trouble shoot it further. Chris's solution does work, I can see it dumping records out, but I think there's something askance in the recursion which is hitting performance, I cut the query after 15 minutes.
By the nature of things if there is a link between two numbers both of those numbers are in the same group, so all groups are by definition mutually exclusive.

Excellent. That's the info I needed to setup for some tests... unless, with only 89K rows, you wanted to attach a ZIP file that I could import. Of course, no proprietary or PII info, please.

So here's the dirty version which works (it takes 10 - 12 minutes), I've commented my second pass, it was the only way I could figure out how to resolve incomplete links. Note I've moved through CTES --> Temp tables --> actual tables to try and squeeze some performance out of it, but I can't polish it, I'm just rolling it in glitter 😉

That will help as well... thanks.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

richard.gardner 6009 SSC-Addicted Points: 484 More actions · Answer 10

Hi Jeff,

Here's a sample dataset, my bad, 89,000 is a different number, it's just 50,000 pairs of internal keys.

@chris-2 - Thanks for the offer, appreciated, I'll have another look at it, I did run it a couple of times and the first time I got an error >100 levels of recursion, second time it ran until I killed it.

Regards

Richard

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 11

Hi Richard, how many rows are you expecting back from this query?

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

richard.gardner 6009 SSC-Addicted Points: 484 More actions · Answer 12

Well I can't be precise because I reloaded the source data this morning and I haven't run it today, but yesterday it was 15,760 with maximum group sizes of between 2 and 20 (this is with the data organised as below):

ID Group

3998 1

1409121

7972 2

263742

263772

263802

266863

266903

266943

266983

9769 3

270033

270113

matzer SSCrazy Eights Points: 9727 More actions · Answer 13

I also started with a recursive CTE but it never came to an end. A cursor seems to be reasonably fast in this case. On my laptop your script runs in 4:30 min, the script below took 2 seconds with your test dataset. I almost never use cursors so I'm not sure if LOCAL FAST_FORWARD is the fastest possible cursor in this case.

CREATE TABLE #ProductSets

(

set_id INT NOT NULL,

product_id INT NOT NULL

);

CREATE INDEX IX_ProductSet_product_id ON #ProductSets(product_id);

DECLARE PL CURSOR LOCAL FAST_FORWARD FOR

SELECT DISTINCT

IIF(product_id < linked_product_id, product_id, linked_product_id) AS product_id,

IIF(product_id < linked_product_id, linked_product_id, product_id) AS linked_product_id

FROM

dbo.ProductLinks

ORDER BY

product_id, linked_product_id;

DECLARE @product_id INT;

DECLARE @linked_product_id INT;

DECLARE @set_counter INT = 0;

DECLARE @set_id INT;

OPEN PL

FETCH NEXT FROM PL

INTO @product_id, @linked_product_id;

WHILE @@FETCH_STATUS = 0

BEGIN

SET @set_id = NULL;

SELECT @set_id = set_id FROM #ProductSets WHERE product_id = @product_id

IF (@set_id IS NULL)

BEGIN

SET @set_counter += 1;

SET @set_id = @set_counter;

INSERT INTO

#ProductSets(set_id, product_id)

VALUES

(@set_id, @product_id),

(@set_id, @linked_product_id)

END

ELSE

BEGIN

INSERT INTO

#ProductSets(set_id, product_id)

VALUES

(@set_id, @linked_product_id)

END

FETCH NEXT FROM PL

INTO @product_id, @linked_product_id;

END;

CLOSE PL;

DEALLOCATE PL;

PRINT CAST(GETDATE() AS VARCHAR(100));

SELECT DISTINCT

set_id,

STUFF((

SELECT

', ' + CAST(Q.product_id AS VARCHAR(10))

FROM

(

SELECT DISTINCT

PS2.product_id

FROM

#ProductSets PS2

WHERE

PS2.set_id = PS1.set_id

) Q

FOR XML PATH('')

), 1, 2, '') product_set

FROM

#ProductSets PS1

ORDER BY

set_id

DROP TABLE #ProductSets;

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 14

The sample data set has a row where both columns have the same value which messes up rCTE's unless you remove it. Even then, some of the parent values have at least 21 values linked in the sequence, for instance

179440>179479>179481>179488>179490>179492>179494>179496>179498>179500>179502>179504>179530>179556>179566>179570>179572>179576>179578>179580>179582>179784

Using longhand code to get a decent handle on the shape of the data, I get over 50 million rows output at 11 levels of linkage...

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden