Recursive query problem - find ALL related records for a given record

Question

Post reply

Recursive query problem - find ALL related records for a given record

CanuckBuck

Hall of Fame

Points: 3915
More actions
June 5, 2021 at 10:37 pm

#3892450
I'm stumped but confident that there's a solution. I'm hoping someone can help.
I need a way to find all of the related records in a set of relationships given any record ID in the set. The nature of the relationships is variable. The 'happy path' is that the relationships represent a simple 'daisy-chain' where a child is related only to its immediately preceding record - simple recursion. Unfortunately there are other types of relationships expressed in the data;
- A 'child' record can be related to more than one 'parent' record (like in real human life, although in this scenario there could potentially be more than two...)
- multiple 'children' can be related to a single 'parent' record
- some records may skip a 'generation' in the daisy-chain, being related to potentially any 'parent' up the chain
- a given set of relationships may include some of all of the scenarios expressed above.
Data rules governing the relationships in the source data;
- The first column can not be null
- The 2nd column can be null
- The value of the 2nd column for a row will always be less than the value if the first column
- The value in the first column can be repeated i.e. it is related to more than one record
Below is where I've gotten so far. I've tried to keep my test set of data small and simple so I can easily visually validate the result. The query solves the first three scenarios but not the last one.
I've been trying to solve this as a query-based solution but if the only way to solve it is with some procedure-based approach I think that would be fine. The final result must be a two-column table that will return all of the Ids of a relationship set when searching for any given member of the set i.e. given the 'Daisy-chain' scenario searching for any value in the Submission_Id column should return same set of values from the Related_Submission_Id column.
Thanks in advance for your assistance.
```
DROP TABLE IF EXISTS #tmp

CREATE TABLE [#tmp] (
       [Submission_ID] INT NOT NULL
       ,[Related_Submission_Id] INT
       )

INSERT INTO [#tmp] ([Submission_ID], [Related_Submission_Id])
VALUES
-- No related submissions scenario
--(1, NULL)
-- Daisy-chain scenario
(2, NULL)
,(85, 2)
,(123, 85)
,(246, 123)
-- Fan Scenario
--,(3, NULL)
--,(99, 3)
--,(150, 3)
--,(300, 3)
-- Combo Scenario
    --,(4, NULL)
    --,(5, 4)
    --,(1000, 4)
    --,(1000, 5)
    --,(1200, 4)
    --,(1500, 1000)

;WITH [Daisy_CTE] AS (
SELECT
[Submission_ID]
,[Related_Submission_Id]
FROM [#tmp]
UNION ALL
SELECT
T2.[Submission_ID]
,T1.[Related_Submission_Id]
FROM [Daisy_CTE] T1
JOIN [#tmp] T2
ON T2.[Related_Submission_Id] = T1.[Submission_ID]
)

,[Fan_CTE] AS (
SELECT
[Submission_ID]
,[Related_Submission_Id]
FROM [#tmp]
WHERE [Related_Submission_Id] IN (
SELECT
[Related_Submission_Id]
FROM [#tmp]
GROUP BY [Related_Submission_Id]
HAVING COUNT(*) > 1
)
)

SELECT DISTINCT
[Submission_ID]
,[Related_Submission_Id]
FROM (
SELECT -- Include Self in the list
[Submission_ID]
,[Submission_ID] AS [Related_Submission_Id]
FROM [Daisy_CTE]
UNION ALL
SELECT -- Include the 'as captured' relationship
[Submission_ID]
,[Related_Submission_Id]
FROM [Daisy_CTE]
UNION ALL
SELECT -- Include the reciprocal of the 'as captured' relationship
[Related_Submission_Id] AS [Submission_ID]
,[Submission_ID] AS [Related_Submission_Id]
FROM [Daisy_CTE]
UNION
SELECT -- Include relationships to a single parent
T1.[Submission_ID]
,T2.[Submission_ID]
FROM [Fan_CTE] T1
CROSS APPLY (SELECT * FROM [Fan_CTE]) T2
WHERE
T2.[Related_Submission_Id] = T1.[Related_Submission_Id]
) A
WHERE
[Submission_ID] IS NOT NULL
AND [Related_Submission_Id] IS NOT NULL
ORDER BY
[Submission_ID]
,[Related_Submission_Id]
```
- This topic was modified 3 years, 6 months ago by CanuckBuck.

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply

Site Owners SSC Guru Points: 80325 More actions · Answer 1

Thanks for posting your issue and hopefully someone will answer soon.

This is an automated bump to increase visibility of your question.

CanuckBuck Hall of Fame Points: 3915 More actions · Answer 2

Perhaps this diagram will help illustrate the situation. In any of the given scenarios I need to find ALL related nodes given the ID of any node in the scenario.

Capture

frederico_fonseca SSCoach Points: 16164 More actions · Answer 3

If this is more than finding descendants and ascendants then a simple query will not be enough.

if you also need siblings (and/or descendants of ascendants / ascendants of descendants) then that becomes a lot more complex.

imagine your scenario 3 - assuming it can, if 1200 is also related to 3 do you need to report this 3?

but based on your input data and your query why is your output incorrect and what were you expecting it to be? please post both actual output and desired output and reasons why it is incorrect.

and is this something that will be executed for a single setid or for more than 1? for most cases this "selection" would work as a anchor and would always be on the output as a "top level" item.

Jeff Moden SSC Guru Points: 1004401 More actions · Answer 4

CanuckBuck wrote:

Perhaps this diagram will help illustrate the situation. In any of the given scenarios I need to find ALL related nodes given the ID of any node in the scenario.

Ok... So for each of those scenarios, what is the "starting node" that starts the search?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

CanuckBuck Hall of Fame Points: 3915 More actions · Answer 5

Frederico;

I do need to find siblings. In the scenarios I've illustrated above the code I've posted solves for scenarios 1 and 2. If you run the code above with the data from scenario 4, There should be six rows for each Submission_ID. The rows that are missing are;

5 : 1500

1000 : 1100

1100 : 1000

1100 : 1200

1100 : 1500

1200 : 1100

1200 : 1500

1500 : 5

1500 : 1100

1500 : 1200

Effectively, If I had a cross apply of the result where Submission_ID = 4 then I'd have what I need.

With respect to your question regarding 1200 being related to 3, if that were the case then yes, all of those relationships would need to be included as well.

Perhaps a little background will help;

In the data capture scenario that produces the data I'm dealing with, an external party submits information. A submission_id is generated when it is submitted. Periodically they must submit an update. The information capture solution permits them to look up previous submissions and choose one of them to relate the updated submission to. Typically the submitter will choose the most recent submission, effectively creating a 'daisy-chain' set of relationships. However, the information capture solution does not require selection of the most recent previous submission all previous submissions are displayed and the system permits them to select any (or even multiple) of them.

Users, reviewing the submissions, want to find all of the submissions related to any given submission.

Users are viewing these submissions using a Tableau workbook connected to the table containing the submissions. They need to be able to use a filter to select any given submission_ID and find all related submissions.

CanuckBuck Hall of Fame Points: 3915 More actions · Answer 6

Jeff;

For any given scenario the starting node could be any one in a defined set.

My thought was to have a simple two-column table where every node is related to every other node in the set. Users would be able to query the Submission_ID column for a given node and list all of the related nodes.

frederico_fonseca SSCoach Points: 16164 More actions · Answer 7

so if I got it correctly if a user "searches" for submission id 1500 then 4, 5, 1000, 1100, 1200 should be returned as being related - and on this case do you need to know that 1100 is child of 5 and 5 a child of 4 or it does not matter?

and would it be important to know that 4 was the original submission id?

if the above is correct then you are on a bit of a pickle as it means finding all children/parent and for each one do the same (and repeat until no more relations found)

this looks like its more a case for a graph style db.

CanuckBuck Hall of Fame Points: 3915 More actions · Answer 8

Frederico;

You are correct that ...if a user "searches" for submission id 1500 then 4, 5, 1000, 1100, 1200 should be returned as being related. Knowing that 11oo is a child of 5 and that 5 is a child of 4 is not required. Just that they all are related. It is not necessary to know that 4 was the original submission, although, this would be evident in the fact that each submission is dated, and also that lowest number submission_Id in a set would be the original.

frederico_fonseca SSCoach Points: 16164 More actions · Answer 9

there may be a possible way to simplify this on your case - but depends on your data.

if there is only one and never more than 1 possible top level (original submission) for the different chains then it is possible to traverse first for the highest parent and then traverse again for all children of that parent.

CanuckBuck Hall of Fame Points: 3915 More actions · Answer 10

In discussing this requirement with the SMEs for the product I'm building and also delving into the details of what the information capture solution permits submitters to do in terms of relating one submission to another, it appears that pretty much any relationship scenario is permitted. I'm not certain that anyone thought about the implications of what that would mean for data consumption when the information capture solution was developed.

Upon explaining the potential difficulties, SMEs have decided (at least for now) that the solution shown above will meet their need.

I'm not convinced but, the customer is never wrong...

Thanks for responding to my post.

Jonathan AC Roberts SSCoach Points: 18361 More actions · Answer 11

Not sure I quite understand what results you want but I think this might do:

DROP TABLE IF EXISTS #tmp2
CREATE TABLE [#tmp2] (
       [Submission_ID] INT NOT NULL
       ,[Related_Submission_Id] INT
       )

declare @id int = 85;
declare @rowcount int = -1
while @rowcount <> 0 begin

    insert into #tmp2
    (
        [Submission_ID], 
        [Related_Submission_Id]
    )
    select [Submission_ID], 
           [Related_Submission_Id]
      from #tmp t
     where (@id in (t.Submission_ID, t.Related_Submission_Id)
            or exists(select *
                        from #tmp2 t2
                       where t2.Related_Submission_Id in (t.Submission_ID, t.Related_Submission_Id)
                          or t2.Submission_ID in (t.Submission_ID, t.Related_Submission_Id))
           )
       and not exists(select *
                        from #tmp2 t2
                       where t2.Submission_ID = t.Submission_ID
                         and (t2.Related_Submission_Id = t.Related_Submission_Id
                                or (t2.Related_Submission_Id is null 
                                     and t.Related_Submission_Id is null))
                     )
    set @rowcount = @@ROWCOUNT
end

select *
from #tmp2

CanuckBuck Hall of Fame Points: 3915 More actions · Answer 12

Jonathan;

As far as output I my mind has been focused on creating a complete list of all relationships for all nodes in the input data set. However, I see that the solution you've provided does indeed work for every relationship scenario I've been able to create sample data for. My struggle would be incorporating this into my solution.

If you're wanting to see what the output I'm going for would look like you can run the query I provided above for scenarios 1, 2, and 3. My code doesn't work correctly for scenario 4 or a new scenario I've created based on Frederico's mention of a node set where there is more than one top level node. Your code works for my scenario 4 and the new scenario I've created.

Jonathan AC Roberts SSCoach Points: 18361 More actions · Answer 13

Why is it difficult to incorporate it into your solution?

frederico_fonseca SSCoach Points: 16164 More actions · Answer 14

does that code work when searching for submission ID 1000

output I get (based on the supplied input) is not quite what was mentioned above as being required