April 15, 2012 at 12:10 am
Hello,
I´m trying to use CTE to find the very top parent within a table without any NULLs
My test table seems to work but when applying the logic on a larger table I´m running into very poor performance due to the recursion.
create table #t (childid int,name nvarchar(50),parentid int)
insert into #t values (1,'a',10)
insert into #t values (2,'b',1)
insert into #t values (3,'c',1)
insert into #t values (4,'d',3)
insert into #t values (5,'e',2)
insert into #t values (6,'f',11)
insert into #t values (7,'g',6)
insert into #t values (8,'h',5)
GO
with cte1
as
(
select childid,name,parentid, 0 AS level from #t where childid =8
union all
select #t.childid,#t.name,#t.parentid, level +1 from #t inner join cte1 on #t.childid=cte1.parentid
)
select * from cte1 D1
where D1.level = (select MAX(level) from cte1)
--example:
--childid 4 = top parent: 10
--childid 8 = top parent: 10
--childid 7 = top parent: 11
drop table #t
Is there a better approach to achieve the same goal?
Many thanks
April 15, 2012 at 7:18 am
In the sample data given the recursive part ends when the selected parent (either 10 or 11) is not found. The chosen sample data does not have a loop either. This is a danger of not having out nodes, nodes where the parent is null.
If you had the sample data so 4-> 6 -> 7 -> 4 where 4 has a parent 6 that has a parent 7 that has parent 4 then you would have a closed loop that would never stop, hence the recursion problem.
You could think of this as a problem with the data, or with some network types links and points might make loops.
One answer is to keep a varchar of the nodes (ids) already visited and not to return to the same point if the parent is already part of the history, cutting the loop.
Demo as below:
----------------------------------- Extra row added to create a loop
insert into #t values (10,'h',5);
select * from #t;
-- Original CTE fails (8 -> 5 -> 2 -> 1 -> 10 -> 5 -> 2 -> 1 -> 10 etc)
with cte1
as
(
select childid,name,parentid, 0 AS level from #t where childid =8
union all
select #t.childid,#t.name,#t.parentid, level +1 from #t inner join cte1 on #t.childid=cte1.parentid
)
select * from cte1 D1
where D1.level = (select MAX(level) from cte1) ;
-- New CTE with loop check
select * from #t
go
-- New CTE succeeds (8 -> 5 -> 2 -> 1 -> 10 [10 stop point as would go back to 5])
with cte1
as
(
select childid,name,parentid, 0 AS level , convert(varchar(100), '<'+convert(varchar(3),childid)+'>') as Visited
from #t
where childid =8
union all
select #t.childid,#t.name,#t.parentid, level +1 ,
convert(varchar(100), Visited + '<'+convert(varchar(3),cte1.parentid)+'>')
from #t inner join cte1
on #t.childid=cte1.parentid
where Visited NOT LIKE ('%<'+convert(varchar(3),#t.ParentID)+'>%')
)
select * from cte1 D1
where D1.level = (select MAX(level) from cte1) ;
go
Fitz
April 15, 2012 at 9:13 am
Hi Mark and thank you for your answer and explanation.
Yes, I understand that having out-nodes would be preferable but unfortunately this is not possible in the table.
The solution with keeping track of the nodes visited sounds like an interesting approach.
If you are willing to share your example, this would be very much appreciated.
EDIT: Mark was too fast for me, example provided 😀
EDIT2: I´ll try this approach on the real table tomorrow and report back!
April 15, 2012 at 1:47 pm
Mark Fitzgerald-331224 (4/15/2012)
In the sample data given the recursive part ends when the selected parent (either 10 or 11) is not found. The chosen sample data does not have a loop either. This is a danger of not having out nodes, nodes where the parent is null.If you had the sample data so 4-> 6 -> 7 -> 4 where 4 has a parent 6 that has a parent 7 that has parent 4 then you would have a closed loop that would never stop, hence the recursion problem.
You could think of this as a problem with the data, or with some network types links and points might make loops.
One answer is to keep a varchar of the nodes (ids) already visited and not to return to the same point if the parent is already part of the history, cutting the loop.
Demo as below:
----------------------------------- Extra row added to create a loop
insert into #t values (10,'h',5);
select * from #t;
-- Original CTE fails (8 -> 5 -> 2 -> 1 -> 10 -> 5 -> 2 -> 1 -> 10 etc)
with cte1
as
(
select childid,name,parentid, 0 AS level from #t where childid =8
union all
select #t.childid,#t.name,#t.parentid, level +1 from #t inner join cte1 on #t.childid=cte1.parentid
)
select * from cte1 D1
where D1.level = (select MAX(level) from cte1) ;
-- New CTE with loop check
select * from #t
go
-- New CTE succeeds (8 -> 5 -> 2 -> 1 -> 10 [10 stop point as would go back to 5])
with cte1
as
(
select childid,name,parentid, 0 AS level , convert(varchar(100), '<'+convert(varchar(3),childid)+'>') as Visited
from #t
where childid =8
union all
select #t.childid,#t.name,#t.parentid, level +1 ,
convert(varchar(100), Visited + '<'+convert(varchar(3),cte1.parentid)+'>')
from #t inner join cte1
on #t.childid=cte1.parentid
where Visited NOT LIKE ('%<'+convert(varchar(3),#t.ParentID)+'>%')
)
select * from cte1 D1
where D1.level = (select MAX(level) from cte1) ;
go
Fitz
Oh... be really, really careful here. Checking for loops like that may look fine on paper but they give absolutely no indication that a loop actually exists never mind that one needs to be repaired. Bad data could go on for years before anyone discovers they have bad data with this type of "fix". At least the original code will give a max recursion error in a loop.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 15, 2012 at 2:14 pm
F.L (4/15/2012)
Hello,I´m trying to use CTE to find the very top parent within a table without any NULLs
My test table seems to work but when applying the logic on a larger table I´m running into very poor performance due to the recursion.
... {snip}...
Is there a better approach to achieve the same goal?
The problem with CTEs is that they work just exactly the same way as a VIEW works. Every time you call it, it get's re-executed. Since you're calling the CTE twice, it get's executed twice. Since your example has no indexes, the code you posted does 4 full table scans.
Any chance of you posting the DDL (i.e. CREATE TABLE) statement for your real table along with any indexes/keys/constraints you may have? Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 15, 2012 at 3:42 pm
Sorry Jeff, this was only a demonstration of the closed loop idea that may occur.;-)
"Oh... be really, really careful here. Checking for loops like that may look fine on paper but they give absolutely no indication that a loop actually exists never mind that one needs to be repaired. Bad data could go on for years before anyone discovers they have bad data with this type of "fix". At least the original code will give a max recursion error in a loop."
I have produced other code to examine tree structures for loops that do indicate the existence of closed loops.
Fitz
April 15, 2012 at 3:48 pm
Never mind. I don't need the DDL or the indexes or anything. I built a million row hierarchical test table to match what was originally posted. Performance of the original code was between 7 and 8 seconds. As you said, it was a performance problem especially to return 1 row out of a list of 9 for any given lookup ID.
Even for the problem of calling the CTE more than once, adding the following index causs the performance to drop to 1 milli-second... which is right about where it should be.
ALTER TABLE #t
ADD PRIMARY KEY CLUSTERED (ChildID)
;
And, yes... all other things being equal, a "child ID" in an adjacency list should be unique. If it's not, you may have "cycles" in your list and you need to handle thing by "position" instead of by "ID".
You can also nearly double the performance even of that by getting rid of one of the calls to the CTE...
WITH
cteSearchUpline as
(
SELECT ChildID, Name, ParentID, 0 AS Level
FROM #t
WHERE ChildID = 1000000
UNION ALL
SELECT t.childid, t.name, t.parentid, Level = Level + 1
FROM #t t
INNER JOIN cteSearchUpline cte ON t.ChildID = cte.ParentID
)
SELECT TOP 1 *
FROM cteSearchUpline d
ORDER BY Level DESC
;
--Jeff Moden
Change is inevitable... Change for the better is not.
April 15, 2012 at 3:57 pm
Apologies. I almost forgot. Here's what I used to build the test data.
--===== Conditionaly drop the test table so we can do reruns more easily
IF OBJECT_ID('TempDB..#t','U') IS NOT NULL
DROP TABLE TempDB..#t
;
--===== Build the test table and populate it on the fly.
-- Everything except ParentID is populated here.
SELECT TOP (1000000)
ChildID = ISNULL(ROW_NUMBER() OVER (ORDER BY (SELECT 1)),0),
ParentID = CAST(0 AS INT),
Name = CAST(NEWID() AS VARCHAR(36))
INTO #t
FROM Master.sys.All_Columns ac1
CROSS JOIN Master.sys.All_Columns ac2
;
--===== Update the test table with ParentIDs. The ParentID is some random
-- value which is always less than the current ChildID to keep the
-- hierarchy "clean" and free from "loop backs".
UPDATE #t
SET ParentID = CASE
WHEN ChildID > 1
THEN ABS(CHECKSUM(NEWID())) % (ChildID-1) +1
ELSE NULL
END
;
--===== Drop the first 100 "children" so that we have top level "parents"
-- "without nulls" like the original problem.
DELETE FROM #t
WHERE ChildID <= 100
;
--===== Add the quintessential index for this test. If you already have a Clustered Index,
-- you should consider changing it.
ALTER TABLE #t
ADD PRIMARY KEY CLUSTERED (ChildID)
;
--Jeff Moden
Change is inevitable... Change for the better is not.
April 15, 2012 at 4:40 pm
Mark Fitzgerald-331224 (4/15/2012)
I have produced other code to examine tree structures for loops that do indicate the existence of closed loops.
Any chance of us seeing such code as it would apply to the original table in this thread?
--Jeff Moden
Change is inevitable... Change for the better is not.
April 16, 2012 at 12:56 am
Thank you Jeff for your example.
The original table is very simple, only two columns (childid, parentid) and with PK on the child id column.
Since this table only has about 100k+ records this indicates that there must be some bad data breaking the loop right?
Mark, I tried your approach and indeed it seems to work! However, I would gladly try your solution for finding closed loops if you are willing to share it.
April 16, 2012 at 7:56 am
As requested by Jeff and FL here is a script that can find a closed loop in a hierarchy.
It would not be too efficient with a large data set. The issues is that using the top down approach may not get a loop.
The steps are:
1) Add every row as a start point (child).. add the child into the history, set loopfound and loopbackto both to 0
2) Find the next parent for each child. If the parent to be visited next is in the history then loopfound = 1 and loopbackto = parentid. If loopfound = 1 then ignore the row on subsequent recursions.
3) Add the parent to the history.
4) At the end show the initial child, the history to get into the loop and the loopback to for each row that returns a loop.
-- From initial OP post:
create table #t (childid int,name nvarchar(50),parentid int)
go
insert into #t values (1,'a',10)
insert into #t values (2,'b',1)
insert into #t values (3,'c',1)
insert into #t values (4,'d',3)
insert into #t values (5,'e',2)
insert into #t values (6,'f',11)
insert into #t values (7,'g',6)
insert into #t values (8,'h',5)
-- Row to cause loop (1->10->5->2->1->10 etc)
insert into #t values (10,'X',5);
go
with CTE1 as
(
select childid as StartingChildID, childid,name,parentid, 0 AS level , convert(varchar(100), '<'+convert(varchar(3),childid)+'>') as Visited,
0 as LoopFound, 0 as LoopBackTo
from #t
union all
select cte1.StartingChildID, #t.childid,#t.name,#t.parentid, level +1 ,
convert(varchar(100), Visited + '<'+convert(varchar(3),cte1.parentid)+'>'),
case when Visited LIKE ('%<'+convert(varchar(3),#t.ParentID)+'>%') then 1 else 0 end,
#t.ParentID
from #t inner join cte1
on #t.childid=cte1.parentid
where LoopFound = 0
)
select StartingChildID, Visited, LoopBackTo
from CTE1
where LoopFound = 1
order by StartingChildID
Notes that rows 1,2,3,4,5,8,10 all form part of the loop (or children of the loop) as 3,4,5,8 are children of members of the loop (1,2,5,10)
Fitz
April 16, 2012 at 10:22 am
Are we saying "the" top level parent is the one with the longest lineage? I wouldn't have defined it that way (to me, top level parents = records where noone else has you as a child). I also don't see anything requiring there to be only *one* top level parent, in which case there could be any number of top levels with same depth.
Just curious. If so that's a very odd way to describe it.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 17, 2012 at 4:31 am
Matt,
This was not so much a problem of not having an out node (null parent) but more of creating closed loops within the structure.
Example:
Node 1-> Node 2-> Node 3-> Node 1
...where this would go on and on using a recursive CTE.
This may occur in a hierarchy by mistake where one or more top parents include most of the structure, but a substructure looks as shown in the example. One of the nodes should point out to a node in the structure.
An example of a network that should have loops is a road network.
Node A -> Node B -> Node C -> Node D
and
Node D -> Node B
Node C -> Node A
So trying to find a route from A to D would include a loop that :
A->B->C->A.. and on and on
The could posted a few days ago would ignore the C-A choice as A has already been visited in the history of the route.
Fitz
Fitz
April 17, 2012 at 7:53 am
Thanks Mark, and I did get that part. It's just that depending how you define top-most parent, you might not need recursion at all . Having to put something this intensive together (with closed loop checking) will prompt me to confirm what we are really after. Is it:
- node(s) which have the longest lineage
- nodes which have the MOST descendant nodes
- nodes which aren't children of anyone else (possibly with a second requirement of having children themselves).
Also - unless the data is very specifically set up, the presumption that there will only be one top parent is a dangerous one.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 17, 2012 at 10:01 am
Matt, agreed there should be no presumption of a single ultimate parent in a hierarchy.
Answering you questions
Is it:
- node(s) which have the longest lineage -> in this case it does not matter whose has the longest lineage as long as a closed loop is not involved
- nodes which have the MOST descendant nodes -> not in this case, each node may have 1 or more children but only 1 parent, the amount of decendents does not matter
- nodes which aren't children of anyone else (possibly with a second requirement of having children themselves).-> not in this case either, the data may have a loop within a single node by a node having itself as a parent.
The request was to stop a long running CTE for parent and child originally and the OP had selected a set of data that could not loop and would succeed quickly. To get the recursion to bug out there must have been something else (over 100 levels intentionally or unintensionally).
I have seen this in lots of situations where the child node is edited to have a parent node regardless of whether the parent node exists (OP data node 10 did not exist originally), parent node = child node or a 2+ closed loop was created as discussed earlier.
Fitz
Viewing 15 posts - 1 through 15 (of 16 total)
You must be logged in to reply to this topic. Login to reply