Finding top parent using CTE

Question

Finding top parent using CTE

F.L

Right there with Babe

Points: 734
More actions
April 15, 2012 at 12:10 am

#254094

Hello,
I´m trying to use CTE to find the very top parent within a table without any NULLs
My test table seems to work but when applying the logic on a larger table I´m running into very poor performance due to the recursion.
create table #t (childid int,name nvarchar(50),parentid int)
insert into #t values (1,'a',10)
insert into #t values (2,'b',1)
insert into #t values (3,'c',1)
insert into #t values (4,'d',3)
insert into #t values (5,'e',2)
insert into #t values (6,'f',11)
insert into #t values (7,'g',6)
insert into #t values (8,'h',5)
GO
with cte1
as
(
select childid,name,parentid, 0 AS level from #t where childid =8
union all
select #t.childid,#t.name,#t.parentid, level +1 from #t inner join cte1 on #t.childid=cte1.parentid
)
select * from cte1 D1
where D1.level = (select MAX(level) from cte1)
--example:
--childid 4 = top parent: 10
--childid 8 = top parent: 10
--childid 7 = top parent: 11
drop table #t
Is there a better approach to achieve the same goal?
Many thanks

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

Mark Fitzgerald-331224 SSCrazy Eights Points: 8489 More actions · Answer 1

In the sample data given the recursive part ends when the selected parent (either 10 or 11) is not found. The chosen sample data does not have a loop either. This is a danger of not having out nodes, nodes where the parent is null.

If you had the sample data so 4-> 6 -> 7 -> 4 where 4 has a parent 6 that has a parent 7 that has parent 4 then you would have a closed loop that would never stop, hence the recursion problem.

You could think of this as a problem with the data, or with some network types links and points might make loops.

One answer is to keep a varchar of the nodes (ids) already visited and not to return to the same point if the parent is already part of the history, cutting the loop.

Demo as below:

----------------------------------- Extra row added to create a loop

insert into #t values (10,'h',5);

select * from #t;

-- Original CTE fails (8 -> 5 -> 2 -> 1 -> 10 -> 5 -> 2 -> 1 -> 10 etc)

with cte1

as

(

select childid,name,parentid, 0 AS level from #t where childid =8

union all

select #t.childid,#t.name,#t.parentid, level +1 from #t inner join cte1 on #t.childid=cte1.parentid

)

select * from cte1 D1

where D1.level = (select MAX(level) from cte1) ;

-- New CTE with loop check

select * from #t

go

-- New CTE succeeds (8 -> 5 -> 2 -> 1 -> 10 [10 stop point as would go back to 5])

with cte1

as

(

select childid,name,parentid, 0 AS level , convert(varchar(100), '<'+convert(varchar(3),childid)+'>') as Visited

from #t

where childid =8

union all

select #t.childid,#t.name,#t.parentid, level +1 ,

convert(varchar(100), Visited + '<'+convert(varchar(3),cte1.parentid)+'>')

from #t inner join cte1

on #t.childid=cte1.parentid

where Visited NOT LIKE ('%<'+convert(varchar(3),#t.ParentID)+'>%')

)

select * from cte1 D1

where D1.level = (select MAX(level) from cte1) ;

go

Fitz

F.L Right there with Babe Points: 734 More actions · Answer 2

Hi Mark and thank you for your answer and explanation.

Yes, I understand that having out-nodes would be preferable but unfortunately this is not possible in the table.

The solution with keeping track of the nodes visited sounds like an interesting approach.

If you are willing to share your example, this would be very much appreciated.

EDIT: Mark was too fast for me, example provided 😀

EDIT2: I´ll try this approach on the real table tomorrow and report back!

Jeff Moden SSC Guru Points: 1004363 More actions · Answer 3

Mark Fitzgerald-331224 (4/15/2012)
In the sample data given the recursive part ends when the selected parent (either 10 or 11) is not found. The chosen sample data does not have a loop either. This is a danger of not having out nodes, nodes where the parent is null.
If you had the sample data so 4-> 6 -> 7 -> 4 where 4 has a parent 6 that has a parent 7 that has parent 4 then you would have a closed loop that would never stop, hence the recursion problem.
You could think of this as a problem with the data, or with some network types links and points might make loops.
One answer is to keep a varchar of the nodes (ids) already visited and not to return to the same point if the parent is already part of the history, cutting the loop.
Demo as below:
----------------------------------- Extra row added to create a loop
insert into #t values (10,'h',5);
select * from #t;
-- Original CTE fails (8 -> 5 -> 2 -> 1 -> 10 -> 5 -> 2 -> 1 -> 10 etc)
with cte1
as
(
select childid,name,parentid, 0 AS level from #t where childid =8
union all
select #t.childid,#t.name,#t.parentid, level +1 from #t inner join cte1 on #t.childid=cte1.parentid
)
select * from cte1 D1
where D1.level = (select MAX(level) from cte1) ;
-- New CTE with loop check
select * from #t
go
-- New CTE succeeds (8 -> 5 -> 2 -> 1 -> 10 [10 stop point as would go back to 5])
with cte1
as
(
select childid,name,parentid, 0 AS level , convert(varchar(100), '<'+convert(varchar(3),childid)+'>') as Visited
from #t
where childid =8
union all
select #t.childid,#t.name,#t.parentid, level +1 ,
convert(varchar(100), Visited + '<'+convert(varchar(3),cte1.parentid)+'>')
from #t inner join cte1
on #t.childid=cte1.parentid
where Visited NOT LIKE ('%<'+convert(varchar(3),#t.ParentID)+'>%')
)
select * from cte1 D1
where D1.level = (select MAX(level) from cte1) ;
go
Fitz

Oh... be really, really careful here. Checking for loops like that may look fine on paper but they give absolutely no indication that a loop actually exists never mind that one needs to be repaired. Bad data could go on for years before anyone discovers they have bad data with this type of "fix". At least the original code will give a max recursion error in a loop.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004363 More actions · Answer 4

F.L (4/15/2012)
Hello,
I´m trying to use CTE to find the very top parent within a table without any NULLs
My test table seems to work but when applying the logic on a larger table I´m running into very poor performance due to the recursion.
... {snip}...
Is there a better approach to achieve the same goal?

The problem with CTEs is that they work just exactly the same way as a VIEW works. Every time you call it, it get's re-executed. Since you're calling the CTE twice, it get's executed twice. Since your example has no indexes, the code you posted does 4 full table scans.

Any chance of you posting the DDL (i.e. CREATE TABLE) statement for your real table along with any indexes/keys/constraints you may have? Thanks.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Mark Fitzgerald-331224 SSCrazy Eights Points: 8489 More actions · Answer 5

Sorry Jeff, this was only a demonstration of the closed loop idea that may occur.;-)

"Oh... be really, really careful here. Checking for loops like that may look fine on paper but they give absolutely no indication that a loop actually exists never mind that one needs to be repaired. Bad data could go on for years before anyone discovers they have bad data with this type of "fix". At least the original code will give a max recursion error in a loop."

I have produced other code to examine tree structures for loops that do indicate the existence of closed loops.

Fitz

Jeff Moden SSC Guru Points: 1004363 More actions · Answer 6

Never mind. I don't need the DDL or the indexes or anything. I built a million row hierarchical test table to match what was originally posted. Performance of the original code was between 7 and 8 seconds. As you said, it was a performance problem especially to return 1 row out of a list of 9 for any given lookup ID.

Even for the problem of calling the CTE more than once, adding the following index causs the performance to drop to 1 milli-second... which is right about where it should be.

ALTER TABLE #t

ADD PRIMARY KEY CLUSTERED (ChildID)

;

And, yes... all other things being equal, a "child ID" in an adjacency list should be unique. If it's not, you may have "cycles" in your list and you need to handle thing by "position" instead of by "ID".

You can also nearly double the performance even of that by getting rid of one of the calls to the CTE...

WITH

cteSearchUpline as

(

SELECT ChildID, Name, ParentID, 0 AS Level

FROM #t

WHERE ChildID = 1000000

UNION ALL

SELECT t.childid, t.name, t.parentid, Level = Level + 1

FROM #t t

INNER JOIN cteSearchUpline cte ON t.ChildID = cte.ParentID

)

SELECT TOP 1 *

FROM cteSearchUpline d

ORDER BY Level DESC

;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004363 More actions · Answer 7

Apologies. I almost forgot. Here's what I used to build the test data.

--===== Conditionaly drop the test table so we can do reruns more easily

IF OBJECT_ID('TempDB..#t','U') IS NOT NULL

DROP TABLE TempDB..#t

;

--===== Build the test table and populate it on the fly.

-- Everything except ParentID is populated here.

SELECT TOP (1000000)

ChildID = ISNULL(ROW_NUMBER() OVER (ORDER BY (SELECT 1)),0),

ParentID = CAST(0 AS INT),

Name = CAST(NEWID() AS VARCHAR(36))

INTO #t

FROM Master.sys.All_Columns ac1

CROSS JOIN Master.sys.All_Columns ac2

;

--===== Update the test table with ParentIDs. The ParentID is some random

-- value which is always less than the current ChildID to keep the

-- hierarchy "clean" and free from "loop backs".

UPDATE #t

SET ParentID = CASE

WHEN ChildID > 1

THEN ABS(CHECKSUM(NEWID())) % (ChildID-1) +1

ELSE NULL

END

;

--===== Drop the first 100 "children" so that we have top level "parents"

-- "without nulls" like the original problem.

DELETE FROM #t

WHERE ChildID <= 100

;

--===== Add the quintessential index for this test. If you already have a Clustered Index,

-- you should consider changing it.

ALTER TABLE #t

ADD PRIMARY KEY CLUSTERED (ChildID)

;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004363 More actions · Answer 8

Mark Fitzgerald-331224 (4/15/2012)
I have produced other code to examine tree structures for loops that do indicate the existence of closed loops.

Any chance of us seeing such code as it would apply to the original table in this thread?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

F.L Right there with Babe Points: 734 More actions · Answer 9

Thank you Jeff for your example.

The original table is very simple, only two columns (childid, parentid) and with PK on the child id column.

Since this table only has about 100k+ records this indicates that there must be some bad data breaking the loop right?

Mark, I tried your approach and indeed it seems to work! However, I would gladly try your solution for finding closed loops if you are willing to share it.

Mark Fitzgerald-331224 SSCrazy Eights Points: 8489 More actions · Answer 10

As requested by Jeff and FL here is a script that can find a closed loop in a hierarchy.

It would not be too efficient with a large data set. The issues is that using the top down approach may not get a loop.

The steps are:

1) Add every row as a start point (child).. add the child into the history, set loopfound and loopbackto both to 0

2) Find the next parent for each child. If the parent to be visited next is in the history then loopfound = 1 and loopbackto = parentid. If loopfound = 1 then ignore the row on subsequent recursions.

3) Add the parent to the history.

4) At the end show the initial child, the history to get into the loop and the loopback to for each row that returns a loop.

-- From initial OP post:

create table #t (childid int,name nvarchar(50),parentid int)

go

insert into #t values (1,'a',10)

insert into #t values (2,'b',1)

insert into #t values (3,'c',1)

insert into #t values (4,'d',3)

insert into #t values (5,'e',2)

insert into #t values (6,'f',11)

insert into #t values (7,'g',6)

insert into #t values (8,'h',5)

-- Row to cause loop (1->10->5->2->1->10 etc)

insert into #t values (10,'X',5);

go

with CTE1 as

(

select childid as StartingChildID, childid,name,parentid, 0 AS level , convert(varchar(100), '<'+convert(varchar(3),childid)+'>') as Visited,

0 as LoopFound, 0 as LoopBackTo

from #t

union all

select cte1.StartingChildID, #t.childid,#t.name,#t.parentid, level +1 ,

convert(varchar(100), Visited + '<'+convert(varchar(3),cte1.parentid)+'>'),

case when Visited LIKE ('%<'+convert(varchar(3),#t.ParentID)+'>%') then 1 else 0 end,

#t.ParentID

from #t inner join cte1

on #t.childid=cte1.parentid

where LoopFound = 0

)

select StartingChildID, Visited, LoopBackTo

from CTE1

where LoopFound = 1

order by StartingChildID

Notes that rows 1,2,3,4,5,8,10 all form part of the loop (or children of the loop) as 3,4,5,8 are children of members of the loop (1,2,5,10)

Fitz

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 11

Are we saying "the" top level parent is the one with the longest lineage? I wouldn't have defined it that way (to me, top level parents = records where noone else has you as a child). I also don't see anything requiring there to be only *one* top level parent, in which case there could be any number of top levels with same depth.

Just curious. If so that's a very odd way to describe it.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Mark Fitzgerald-331224 SSCrazy Eights Points: 8489 More actions · Answer 12

Matt,

This was not so much a problem of not having an out node (null parent) but more of creating closed loops within the structure.

Example:

Node 1-> Node 2-> Node 3-> Node 1

...where this would go on and on using a recursive CTE.

This may occur in a hierarchy by mistake where one or more top parents include most of the structure, but a substructure looks as shown in the example. One of the nodes should point out to a node in the structure.

An example of a network that should have loops is a road network.

Node A -> Node B -> Node C -> Node D

and

Node D -> Node B

Node C -> Node A

So trying to find a route from A to D would include a loop that :

A->B->C->A.. and on and on

The could posted a few days ago would ignore the C-A choice as A has already been visited in the history of the route.

Fitz

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 13

Thanks Mark, and I did get that part. It's just that depending how you define top-most parent, you might not need recursion at all . Having to put something this intensive together (with closed loop checking) will prompt me to confirm what we are really after. Is it:

- node(s) which have the longest lineage

- nodes which have the MOST descendant nodes

- nodes which aren't children of anyone else (possibly with a second requirement of having children themselves).

Also - unless the data is very specifically set up, the presumption that there will only be one top parent is a dangerous one.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Mark Fitzgerald-331224 SSCrazy Eights Points: 8489 More actions · Answer 14

Matt, agreed there should be no presumption of a single ultimate parent in a hierarchy.

Answering you questions

Is it:

- node(s) which have the longest lineage -> in this case it does not matter whose has the longest lineage as long as a closed loop is not involved

- nodes which have the MOST descendant nodes -> not in this case, each node may have 1 or more children but only 1 parent, the amount of decendents does not matter

- nodes which aren't children of anyone else (possibly with a second requirement of having children themselves).-> not in this case either, the data may have a loop within a single node by a node having itself as a parent.

The request was to stop a long running CTE for parent and child originally and the OP had selected a set of data that could not loop and would succeed quickly. To get the recursion to bug out there must have been something else (over 100 levels intentionally or unintensionally).

I have seen this in lots of situations where the child node is edited to have a parent node regardless of whether the parent node exists (OP data node 10 did not exist originally), parent node = child node or a 2+ closed loop was created as discussed earlier.

Fitz