Finding top parent using CTE

  • Hello,

    I´m trying to use CTE to find the very top parent within a table without any NULLs

    My test table seems to work but when applying the logic on a larger table I´m running into very poor performance due to the recursion.

    create table #t (childid int,name nvarchar(50),parentid int)

    insert into #t values (1,'a',10)

    insert into #t values (2,'b',1)

    insert into #t values (3,'c',1)

    insert into #t values (4,'d',3)

    insert into #t values (5,'e',2)

    insert into #t values (6,'f',11)

    insert into #t values (7,'g',6)

    insert into #t values (8,'h',5)

    GO

    with cte1

    as

    (

    select childid,name,parentid, 0 AS level from #t where childid =8

    union all

    select #t.childid,#t.name,#t.parentid, level +1 from #t inner join cte1 on #t.childid=cte1.parentid

    )

    select * from cte1 D1

    where D1.level = (select MAX(level) from cte1)

    --example:

    --childid 4 = top parent: 10

    --childid 8 = top parent: 10

    --childid 7 = top parent: 11

    drop table #t

    Is there a better approach to achieve the same goal?

    Many thanks

  • In the sample data given the recursive part ends when the selected parent (either 10 or 11) is not found. The chosen sample data does not have a loop either. This is a danger of not having out nodes, nodes where the parent is null.

    If you had the sample data so 4-> 6 -> 7 -> 4 where 4 has a parent 6 that has a parent 7 that has parent 4 then you would have a closed loop that would never stop, hence the recursion problem.

    You could think of this as a problem with the data, or with some network types links and points might make loops.

    One answer is to keep a varchar of the nodes (ids) already visited and not to return to the same point if the parent is already part of the history, cutting the loop.

    Demo as below:

    ----------------------------------- Extra row added to create a loop

    insert into #t values (10,'h',5);

    select * from #t;

    -- Original CTE fails (8 -> 5 -> 2 -> 1 -> 10 -> 5 -> 2 -> 1 -> 10 etc)

    with cte1

    as

    (

    select childid,name,parentid, 0 AS level from #t where childid =8

    union all

    select #t.childid,#t.name,#t.parentid, level +1 from #t inner join cte1 on #t.childid=cte1.parentid

    )

    select * from cte1 D1

    where D1.level = (select MAX(level) from cte1) ;

    -- New CTE with loop check

    select * from #t

    go

    -- New CTE succeeds (8 -> 5 -> 2 -> 1 -> 10 [10 stop point as would go back to 5])

    with cte1

    as

    (

    select childid,name,parentid, 0 AS level , convert(varchar(100), '<'+convert(varchar(3),childid)+'>') as Visited

    from #t

    where childid =8

    union all

    select #t.childid,#t.name,#t.parentid, level +1 ,

    convert(varchar(100), Visited + '<'+convert(varchar(3),cte1.parentid)+'>')

    from #t inner join cte1

    on #t.childid=cte1.parentid

    where Visited NOT LIKE ('%<'+convert(varchar(3),#t.ParentID)+'>%')

    )

    select * from cte1 D1

    where D1.level = (select MAX(level) from cte1) ;

    go

    Fitz

  • Hi Mark and thank you for your answer and explanation.

    Yes, I understand that having out-nodes would be preferable but unfortunately this is not possible in the table.

    The solution with keeping track of the nodes visited sounds like an interesting approach.

    If you are willing to share your example, this would be very much appreciated.

    EDIT: Mark was too fast for me, example provided 😀

    EDIT2: I´ll try this approach on the real table tomorrow and report back!

  • Mark Fitzgerald-331224 (4/15/2012)


    In the sample data given the recursive part ends when the selected parent (either 10 or 11) is not found. The chosen sample data does not have a loop either. This is a danger of not having out nodes, nodes where the parent is null.

    If you had the sample data so 4-> 6 -> 7 -> 4 where 4 has a parent 6 that has a parent 7 that has parent 4 then you would have a closed loop that would never stop, hence the recursion problem.

    You could think of this as a problem with the data, or with some network types links and points might make loops.

    One answer is to keep a varchar of the nodes (ids) already visited and not to return to the same point if the parent is already part of the history, cutting the loop.

    Demo as below:

    ----------------------------------- Extra row added to create a loop

    insert into #t values (10,'h',5);

    select * from #t;

    -- Original CTE fails (8 -> 5 -> 2 -> 1 -> 10 -> 5 -> 2 -> 1 -> 10 etc)

    with cte1

    as

    (

    select childid,name,parentid, 0 AS level from #t where childid =8

    union all

    select #t.childid,#t.name,#t.parentid, level +1 from #t inner join cte1 on #t.childid=cte1.parentid

    )

    select * from cte1 D1

    where D1.level = (select MAX(level) from cte1) ;

    -- New CTE with loop check

    select * from #t

    go

    -- New CTE succeeds (8 -> 5 -> 2 -> 1 -> 10 [10 stop point as would go back to 5])

    with cte1

    as

    (

    select childid,name,parentid, 0 AS level , convert(varchar(100), '<'+convert(varchar(3),childid)+'>') as Visited

    from #t

    where childid =8

    union all

    select #t.childid,#t.name,#t.parentid, level +1 ,

    convert(varchar(100), Visited + '<'+convert(varchar(3),cte1.parentid)+'>')

    from #t inner join cte1

    on #t.childid=cte1.parentid

    where Visited NOT LIKE ('%<'+convert(varchar(3),#t.ParentID)+'>%')

    )

    select * from cte1 D1

    where D1.level = (select MAX(level) from cte1) ;

    go

    Fitz

    Oh... be really, really careful here. Checking for loops like that may look fine on paper but they give absolutely no indication that a loop actually exists never mind that one needs to be repaired. Bad data could go on for years before anyone discovers they have bad data with this type of "fix". At least the original code will give a max recursion error in a loop.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • F.L (4/15/2012)


    Hello,

    I´m trying to use CTE to find the very top parent within a table without any NULLs

    My test table seems to work but when applying the logic on a larger table I´m running into very poor performance due to the recursion.

    ... {snip}...

    Is there a better approach to achieve the same goal?

    The problem with CTEs is that they work just exactly the same way as a VIEW works. Every time you call it, it get's re-executed. Since you're calling the CTE twice, it get's executed twice. Since your example has no indexes, the code you posted does 4 full table scans.

    Any chance of you posting the DDL (i.e. CREATE TABLE) statement for your real table along with any indexes/keys/constraints you may have? Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sorry Jeff, this was only a demonstration of the closed loop idea that may occur.;-)

    "Oh... be really, really careful here. Checking for loops like that may look fine on paper but they give absolutely no indication that a loop actually exists never mind that one needs to be repaired. Bad data could go on for years before anyone discovers they have bad data with this type of "fix". At least the original code will give a max recursion error in a loop."

    I have produced other code to examine tree structures for loops that do indicate the existence of closed loops.

    Fitz

  • Never mind. I don't need the DDL or the indexes or anything. I built a million row hierarchical test table to match what was originally posted. Performance of the original code was between 7 and 8 seconds. As you said, it was a performance problem especially to return 1 row out of a list of 9 for any given lookup ID.

    Even for the problem of calling the CTE more than once, adding the following index causs the performance to drop to 1 milli-second... which is right about where it should be.

    ALTER TABLE #t

    ADD PRIMARY KEY CLUSTERED (ChildID)

    ;

    And, yes... all other things being equal, a "child ID" in an adjacency list should be unique. If it's not, you may have "cycles" in your list and you need to handle thing by "position" instead of by "ID".

    You can also nearly double the performance even of that by getting rid of one of the calls to the CTE...

    WITH

    cteSearchUpline as

    (

    SELECT ChildID, Name, ParentID, 0 AS Level

    FROM #t

    WHERE ChildID = 1000000

    UNION ALL

    SELECT t.childid, t.name, t.parentid, Level = Level + 1

    FROM #t t

    INNER JOIN cteSearchUpline cte ON t.ChildID = cte.ParentID

    )

    SELECT TOP 1 *

    FROM cteSearchUpline d

    ORDER BY Level DESC

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Apologies. I almost forgot. Here's what I used to build the test data.

    --===== Conditionaly drop the test table so we can do reruns more easily

    IF OBJECT_ID('TempDB..#t','U') IS NOT NULL

    DROP TABLE TempDB..#t

    ;

    --===== Build the test table and populate it on the fly.

    -- Everything except ParentID is populated here.

    SELECT TOP (1000000)

    ChildID = ISNULL(ROW_NUMBER() OVER (ORDER BY (SELECT 1)),0),

    ParentID = CAST(0 AS INT),

    Name = CAST(NEWID() AS VARCHAR(36))

    INTO #t

    FROM Master.sys.All_Columns ac1

    CROSS JOIN Master.sys.All_Columns ac2

    ;

    --===== Update the test table with ParentIDs. The ParentID is some random

    -- value which is always less than the current ChildID to keep the

    -- hierarchy "clean" and free from "loop backs".

    UPDATE #t

    SET ParentID = CASE

    WHEN ChildID > 1

    THEN ABS(CHECKSUM(NEWID())) % (ChildID-1) +1

    ELSE NULL

    END

    ;

    --===== Drop the first 100 "children" so that we have top level "parents"

    -- "without nulls" like the original problem.

    DELETE FROM #t

    WHERE ChildID <= 100

    ;

    --===== Add the quintessential index for this test. If you already have a Clustered Index,

    -- you should consider changing it.

    ALTER TABLE #t

    ADD PRIMARY KEY CLUSTERED (ChildID)

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Mark Fitzgerald-331224 (4/15/2012)


    I have produced other code to examine tree structures for loops that do indicate the existence of closed loops.

    Any chance of us seeing such code as it would apply to the original table in this thread?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you Jeff for your example.

    The original table is very simple, only two columns (childid, parentid) and with PK on the child id column.

    Since this table only has about 100k+ records this indicates that there must be some bad data breaking the loop right?

    Mark, I tried your approach and indeed it seems to work! However, I would gladly try your solution for finding closed loops if you are willing to share it.

  • As requested by Jeff and FL here is a script that can find a closed loop in a hierarchy.

    It would not be too efficient with a large data set. The issues is that using the top down approach may not get a loop.

    The steps are:

    1) Add every row as a start point (child).. add the child into the history, set loopfound and loopbackto both to 0

    2) Find the next parent for each child. If the parent to be visited next is in the history then loopfound = 1 and loopbackto = parentid. If loopfound = 1 then ignore the row on subsequent recursions.

    3) Add the parent to the history.

    4) At the end show the initial child, the history to get into the loop and the loopback to for each row that returns a loop.

    -- From initial OP post:

    create table #t (childid int,name nvarchar(50),parentid int)

    go

    insert into #t values (1,'a',10)

    insert into #t values (2,'b',1)

    insert into #t values (3,'c',1)

    insert into #t values (4,'d',3)

    insert into #t values (5,'e',2)

    insert into #t values (6,'f',11)

    insert into #t values (7,'g',6)

    insert into #t values (8,'h',5)

    -- Row to cause loop (1->10->5->2->1->10 etc)

    insert into #t values (10,'X',5);

    go

    with CTE1 as

    (

    select childid as StartingChildID, childid,name,parentid, 0 AS level , convert(varchar(100), '<'+convert(varchar(3),childid)+'>') as Visited,

    0 as LoopFound, 0 as LoopBackTo

    from #t

    union all

    select cte1.StartingChildID, #t.childid,#t.name,#t.parentid, level +1 ,

    convert(varchar(100), Visited + '<'+convert(varchar(3),cte1.parentid)+'>'),

    case when Visited LIKE ('%<'+convert(varchar(3),#t.ParentID)+'>%') then 1 else 0 end,

    #t.ParentID

    from #t inner join cte1

    on #t.childid=cte1.parentid

    where LoopFound = 0

    )

    select StartingChildID, Visited, LoopBackTo

    from CTE1

    where LoopFound = 1

    order by StartingChildID

    Notes that rows 1,2,3,4,5,8,10 all form part of the loop (or children of the loop) as 3,4,5,8 are children of members of the loop (1,2,5,10)

    Fitz

  • Are we saying "the" top level parent is the one with the longest lineage? I wouldn't have defined it that way (to me, top level parents = records where noone else has you as a child). I also don't see anything requiring there to be only *one* top level parent, in which case there could be any number of top levels with same depth.

    Just curious. If so that's a very odd way to describe it.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt,

    This was not so much a problem of not having an out node (null parent) but more of creating closed loops within the structure.

    Example:

    Node 1-> Node 2-> Node 3-> Node 1

    ...where this would go on and on using a recursive CTE.

    This may occur in a hierarchy by mistake where one or more top parents include most of the structure, but a substructure looks as shown in the example. One of the nodes should point out to a node in the structure.

    An example of a network that should have loops is a road network.

    Node A -> Node B -> Node C -> Node D

    and

    Node D -> Node B

    Node C -> Node A

    So trying to find a route from A to D would include a loop that :

    A->B->C->A.. and on and on

    The could posted a few days ago would ignore the C-A choice as A has already been visited in the history of the route.

    Fitz

    Fitz

  • Thanks Mark, and I did get that part. It's just that depending how you define top-most parent, you might not need recursion at all . Having to put something this intensive together (with closed loop checking) will prompt me to confirm what we are really after. Is it:

    - node(s) which have the longest lineage

    - nodes which have the MOST descendant nodes

    - nodes which aren't children of anyone else (possibly with a second requirement of having children themselves).

    Also - unless the data is very specifically set up, the presumption that there will only be one top parent is a dangerous one.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt, agreed there should be no presumption of a single ultimate parent in a hierarchy.

    Answering you questions

    Is it:

    - node(s) which have the longest lineage -> in this case it does not matter whose has the longest lineage as long as a closed loop is not involved

    - nodes which have the MOST descendant nodes -> not in this case, each node may have 1 or more children but only 1 parent, the amount of decendents does not matter

    - nodes which aren't children of anyone else (possibly with a second requirement of having children themselves).-> not in this case either, the data may have a loop within a single node by a node having itself as a parent.

    The request was to stop a long running CTE for parent and child originally and the OP had selected a set of data that could not loop and would succeed quickly. To get the recursion to bug out there must have been something else (over 100 levels intentionally or unintensionally).

    I have seen this in lots of situations where the child node is edited to have a parent node regardless of whether the parent node exists (OP data node 10 did not exist originally), parent node = child node or a 2+ closed loop was created as discussed earlier.

    Fitz

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply