June 5, 2021 at 10:37 pm
I'm stumped but confident that there's a solution. I'm hoping someone can help.
I need a way to find all of the related records in a set of relationships given any record ID in the set. The nature of the relationships is variable. The 'happy path' is that the relationships represent a simple 'daisy-chain' where a child is related only to its immediately preceding record - simple recursion. Unfortunately there are other types of relationships expressed in the data;
Data rules governing the relationships in the source data;
Below is where I've gotten so far. I've tried to keep my test set of data small and simple so I can easily visually validate the result. The query solves the first three scenarios but not the last one.
I've been trying to solve this as a query-based solution but if the only way to solve it is with some procedure-based approach I think that would be fine. The final result must be a two-column table that will return all of the Ids of a relationship set when searching for any given member of the set i.e. given the 'Daisy-chain' scenario searching for any value in the Submission_Id column should return same set of values from the Related_Submission_Id column.
Thanks in advance for your assistance.
DROP TABLE IF EXISTS #tmp
CREATE TABLE [#tmp] (
[Submission_ID] INT NOT NULL
,[Related_Submission_Id] INT
)
INSERT INTO [#tmp] ([Submission_ID], [Related_Submission_Id])
VALUES
-- No related submissions scenario
--(1, NULL)
-- Daisy-chain scenario
(2, NULL)
,(85, 2)
,(123, 85)
,(246, 123)
-- Fan Scenario
--,(3, NULL)
--,(99, 3)
--,(150, 3)
--,(300, 3)
-- Combo Scenario
--,(4, NULL)
--,(5, 4)
--,(1000, 4)
--,(1000, 5)
--,(1200, 4)
--,(1500, 1000)
;WITH [Daisy_CTE] AS (
SELECT
[Submission_ID]
,[Related_Submission_Id]
FROM [#tmp]
UNION ALL
SELECT
T2.[Submission_ID]
,T1.[Related_Submission_Id]
FROM [Daisy_CTE] T1
JOIN [#tmp] T2
ON T2.[Related_Submission_Id] = T1.[Submission_ID]
)
,[Fan_CTE] AS (
SELECT
[Submission_ID]
,[Related_Submission_Id]
FROM [#tmp]
WHERE [Related_Submission_Id] IN (
SELECT
[Related_Submission_Id]
FROM [#tmp]
GROUP BY [Related_Submission_Id]
HAVING COUNT(*) > 1
)
)
SELECT DISTINCT
[Submission_ID]
,[Related_Submission_Id]
FROM (
SELECT -- Include Self in the list
[Submission_ID]
,[Submission_ID] AS [Related_Submission_Id]
FROM [Daisy_CTE]
UNION ALL
SELECT -- Include the 'as captured' relationship
[Submission_ID]
,[Related_Submission_Id]
FROM [Daisy_CTE]
UNION ALL
SELECT -- Include the reciprocal of the 'as captured' relationship
[Related_Submission_Id] AS [Submission_ID]
,[Submission_ID] AS [Related_Submission_Id]
FROM [Daisy_CTE]
UNION
SELECT -- Include relationships to a single parent
T1.[Submission_ID]
,T2.[Submission_ID]
FROM [Fan_CTE] T1
CROSS APPLY (SELECT * FROM [Fan_CTE]) T2
WHERE
T2.[Related_Submission_Id] = T1.[Related_Submission_Id]
) A
WHERE
[Submission_ID] IS NOT NULL
AND [Related_Submission_Id] IS NOT NULL
ORDER BY
[Submission_ID]
,[Related_Submission_Id]
June 6, 2021 at 11:10 pm
Thanks for posting your issue and hopefully someone will answer soon.
This is an automated bump to increase visibility of your question.
June 7, 2021 at 4:02 pm
Perhaps this diagram will help illustrate the situation. In any of the given scenarios I need to find ALL related nodes given the ID of any node in the scenario.
June 7, 2021 at 5:02 pm
If this is more than finding descendants and ascendants then a simple query will not be enough.
if you also need siblings (and/or descendants of ascendants / ascendants of descendants) then that becomes a lot more complex.
imagine your scenario 3 - assuming it can, if 1200 is also related to 3 do you need to report this 3?
but based on your input data and your query why is your output incorrect and what were you expecting it to be? please post both actual output and desired output and reasons why it is incorrect.
and is this something that will be executed for a single setid or for more than 1? for most cases this "selection" would work as a anchor and would always be on the output as a "top level" item.
June 7, 2021 at 5:10 pm
Perhaps this diagram will help illustrate the situation. In any of the given scenarios I need to find ALL related nodes given the ID of any node in the scenario.
Ok... So for each of those scenarios, what is the "starting node" that starts the search?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 7, 2021 at 5:53 pm
Frederico;
I do need to find siblings. In the scenarios I've illustrated above the code I've posted solves for scenarios 1 and 2. If you run the code above with the data from scenario 4, There should be six rows for each Submission_ID. The rows that are missing are;
5 : 1500
1000 : 1100
1100 : 1000
1100 : 1200
1100 : 1500
1200 : 1100
1200 : 1500
1500 : 5
1500 : 1100
1500 : 1200
Effectively, If I had a cross apply of the result where Submission_ID = 4 then I'd have what I need.
With respect to your question regarding 1200 being related to 3, if that were the case then yes, all of those relationships would need to be included as well.
Perhaps a little background will help;
In the data capture scenario that produces the data I'm dealing with, an external party submits information. A submission_id is generated when it is submitted. Periodically they must submit an update. The information capture solution permits them to look up previous submissions and choose one of them to relate the updated submission to. Typically the submitter will choose the most recent submission, effectively creating a 'daisy-chain' set of relationships. However, the information capture solution does not require selection of the most recent previous submission all previous submissions are displayed and the system permits them to select any (or even multiple) of them.
Users, reviewing the submissions, want to find all of the submissions related to any given submission.
Users are viewing these submissions using a Tableau workbook connected to the table containing the submissions. They need to be able to use a filter to select any given submission_ID and find all related submissions.
June 7, 2021 at 5:56 pm
Jeff;
For any given scenario the starting node could be any one in a defined set.
My thought was to have a simple two-column table where every node is related to every other node in the set. Users would be able to query the Submission_ID column for a given node and list all of the related nodes.
June 7, 2021 at 9:32 pm
so if I got it correctly if a user "searches" for submission id 1500 then 4, 5, 1000, 1100, 1200 should be returned as being related - and on this case do you need to know that 1100 is child of 5 and 5 a child of 4 or it does not matter?
and would it be important to know that 4 was the original submission id?
if the above is correct then you are on a bit of a pickle as it means finding all children/parent and for each one do the same (and repeat until no more relations found)
this looks like its more a case for a graph style db.
June 7, 2021 at 10:23 pm
Frederico;
You are correct that ...if a user "searches" for submission id 1500 then 4, 5, 1000, 1100, 1200 should be returned as being related. Knowing that 11oo is a child of 5 and that 5 is a child of 4 is not required. Just that they all are related. It is not necessary to know that 4 was the original submission, although, this would be evident in the fact that each submission is dated, and also that lowest number submission_Id in a set would be the original.
June 7, 2021 at 11:05 pm
there may be a possible way to simplify this on your case - but depends on your data.
if there is only one and never more than 1 possible top level (original submission) for the different chains then it is possible to traverse first for the highest parent and then traverse again for all children of that parent.
June 8, 2021 at 4:43 pm
In discussing this requirement with the SMEs for the product I'm building and also delving into the details of what the information capture solution permits submitters to do in terms of relating one submission to another, it appears that pretty much any relationship scenario is permitted. I'm not certain that anyone thought about the implications of what that would mean for data consumption when the information capture solution was developed.
Upon explaining the potential difficulties, SMEs have decided (at least for now) that the solution shown above will meet their need.
I'm not convinced but, the customer is never wrong...
Thanks for responding to my post.
June 8, 2021 at 5:22 pm
Not sure I quite understand what results you want but I think this might do:
DROP TABLE IF EXISTS #tmp2
CREATE TABLE [#tmp2] (
[Submission_ID] INT NOT NULL
,[Related_Submission_Id] INT
)
declare @id int = 85;
declare @rowcount int = -1
while @rowcount <> 0 begin
insert into #tmp2
(
[Submission_ID],
[Related_Submission_Id]
)
select [Submission_ID],
[Related_Submission_Id]
from #tmp t
where (@id in (t.Submission_ID, t.Related_Submission_Id)
or exists(select *
from #tmp2 t2
where t2.Related_Submission_Id in (t.Submission_ID, t.Related_Submission_Id)
or t2.Submission_ID in (t.Submission_ID, t.Related_Submission_Id))
)
and not exists(select *
from #tmp2 t2
where t2.Submission_ID = t.Submission_ID
and (t2.Related_Submission_Id = t.Related_Submission_Id
or (t2.Related_Submission_Id is null
and t.Related_Submission_Id is null))
)
set @rowcount = @@ROWCOUNT
end
select *
from #tmp2
June 8, 2021 at 9:04 pm
Jonathan;
As far as output I my mind has been focused on creating a complete list of all relationships for all nodes in the input data set. However, I see that the solution you've provided does indeed work for every relationship scenario I've been able to create sample data for. My struggle would be incorporating this into my solution.
If you're wanting to see what the output I'm going for would look like you can run the query I provided above for scenarios 1, 2, and 3. My code doesn't work correctly for scenario 4 or a new scenario I've created based on Frederico's mention of a node set where there is more than one top level node. Your code works for my scenario 4 and the new scenario I've created.
June 8, 2021 at 9:31 pm
Why is it difficult to incorporate it into your solution?
June 8, 2021 at 9:43 pm
does that code work when searching for submission ID 1000
output I get (based on the supplied input) is not quite what was mentioned above as being required
10004
10005
15001000
4NULL
54
12004
Viewing 15 posts - 1 through 15 (of 23 total)
You must be logged in to reply to this topic. Login to reply