May 3, 2012 at 8:22 am
Thanks for the book reference.
May 3, 2012 at 8:49 am
novak 32871 (5/3/2012)
I know, but there is already SQL 2012 out. It is a good exercise, but 2000/2003 is showing its age and most clients have already moved. But I understand your point.
Hardly. Some people have moved, but the last surveys I saw earlier this year had the majority (>50%) of databases on SQL 2000/2005.
May 3, 2012 at 11:17 am
The recursive CTE versions of these procedures are much faster on large data sets. Easier to grok, too, IMO:
CREATE PROCEDURE [dbo].[usp_Employee_Select_Down_Param_EmpID]
    @selectEmpID INT = NULL, -- NULL = all
    @maxSupLevel INT = 10    -- NULL = no limit
AS
SET @selectEmpID = ISNULL(@selectEmpID, 0);
-- A NULL @maxSupLevel would otherwise make the filter below reject every
-- row, so treat NULL as "no limit" explicitly.
SET @maxSupLevel = ISNULL(@maxSupLevel, 2147483647);
WITH [workTbl]
AS
(
    -- Anchor: the selected employee at level 0
    SELECT
        [e1].[EmpID]
      , [e1].[EmpName]
      , [e1].[SupID]
      , 0 AS [SupLevel]
    FROM [dbo].[Employee] AS [e1]
    WHERE [e1].[EmpID] = @selectEmpID
    UNION ALL
    -- Recursive member: everyone who reports to a row already found
    SELECT
        [e2].[EmpID]
      , [e2].[EmpName]
      , [e2].[SupID]
      , [w].[SupLevel] + 1 AS [SupLevel]
    FROM [workTbl] AS [w]
    JOIN [dbo].[Employee] AS [e2]
        ON [e2].[SupID] = [w].[EmpID]
)
SELECT
    [EmpID]
  , [EmpName]
  , [SupID]
  , [SupLevel]
FROM [workTbl]
WHERE [SupLevel] <= @maxSupLevel
ORDER BY
    [SupLevel]
  , [SupID]
  , [EmpID]
OPTION (MAXRECURSION 0); -- lift the default 100-level recursion cap
GO
CREATE PROCEDURE [dbo].[usp_Employee_Select_Up_Param_EmpID]
    @selectEmpID INT,     -- required
    @maxSupLevel INT = 10 -- NULL = no limit
AS
SET @maxSupLevel = ISNULL(@maxSupLevel, 2147483647); -- same NULL = "no limit" fix
DECLARE @workTbl TABLE
(
    [EmpID] INT NOT NULL
  , [EmpName] VARCHAR(50) NOT NULL
  , [SupID] INT NULL
  , [SupLevel] INT NOT NULL
);
WITH [workTbl]
AS
(
    -- Anchor: the selected employee at level 0
    SELECT
        [e1].[EmpID]
      , [e1].[EmpName]
      , [e1].[SupID]
      , 0 AS [SupLevel]
    FROM [dbo].[Employee] AS [e1]
    WHERE [e1].[EmpID] = @selectEmpID
    UNION ALL
    -- Recursive member: walk up the chain; levels count down into negatives
    SELECT
        [e2].[EmpID]
      , [e2].[EmpName]
      , [e2].[SupID]
      , [w].[SupLevel] - 1 AS [SupLevel]
    FROM [workTbl] AS [w]
    JOIN [dbo].[Employee] AS [e2]
        ON [w].[SupID] = [e2].[EmpID]
)
INSERT @workTbl
SELECT
    [EmpID]
  , [EmpName]
  , [SupID]
  , [SupLevel]
FROM [workTbl]
WHERE [SupLevel] >= -@maxSupLevel
OPTION (MAXRECURSION 0); -- lift the default 100-level recursion cap
-- Re-base the levels so the highest supervisor found comes out as level 0
DECLARE @minFoundSupLevel INT;
SELECT @minFoundSupLevel = MIN([SupLevel]) FROM @workTbl;
SELECT
    [EmpID]
  , [EmpName]
  , [SupID]
  , [SupLevel] - @minFoundSupLevel AS [SupLevel]
FROM @workTbl
ORDER BY
    [SupLevel]
  , [SupID]
  , [EmpID];
GO
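For anyone who wants to kick the tires, here's a minimal smoke test. It assumes nothing beyond the [dbo].[Employee] shape (EmpID, EmpName, SupID) that the procedures already reference; the table guard, names, and sample rows are made up purely for illustration:
-- Hypothetical smoke test: build a tiny org chart if one doesn't exist yet.
IF OBJECT_ID('dbo.Employee', 'U') IS NULL
BEGIN
    CREATE TABLE dbo.Employee
    (
        EmpID   INT         NOT NULL PRIMARY KEY
      , EmpName VARCHAR(50) NOT NULL
      , SupID   INT         NULL
    );
    INSERT dbo.Employee (EmpID, EmpName, SupID) VALUES (1, 'Alice', NULL); -- root
    INSERT dbo.Employee (EmpID, EmpName, SupID) VALUES (2, 'Bob',   1);
    INSERT dbo.Employee (EmpID, EmpName, SupID) VALUES (3, 'Carol', 1);
    INSERT dbo.Employee (EmpID, EmpName, SupID) VALUES (4, 'Dave',  2);
END;
GO
EXEC dbo.usp_Employee_Select_Down_Param_EmpID @selectEmpID = 1; -- Alice and everyone below her
EXEC dbo.usp_Employee_Select_Up_Param_EmpID   @selectEmpID = 4; -- Dave's chain of command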
May 3, 2012 at 11:58 am
novak 32871 (5/3/2012)
I know, but there is already SQL 2012 out. It is a good exercise, but 2000/2003 is showing its age and most clients have already moved. But I understand your point.
Oooh, 2003, that was an excellent edition of SQL Server!
But very exclusive, hard to find.
:-)
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
May 3, 2012 at 5:53 pm
W. Kevin Hazzard (5/3/2012)
The recursive CTE versions of these procedures are much faster on large data sets. Easier to grok, too, IMO:
[code snipped; see the procedures posted in full above]
Another one with yet another unfounded, unproven claim of performance. This is how rumors get started, folks. Whether it's true or not, without proof, claims of performance should remain highly dubious until proof is actually offered.
At the personal level, I think it's much easier to "grok" things that don't have recursion. :-)
There are now 3 people on this thread who have made unsubstantiated claims of performance. Any of you willing to prove it with code? :-)
--Jeff Moden
Change is inevitable... Change for the better is not.
May 3, 2012 at 6:04 pm
imarran (5/3/2012)
I also have a solution in the form of a view that you can join to that will display upline and downline structure depending on how it is joined to. It's quite technical, but if you guys are keen I will submit it through an article.
I don't know about anyone else, but I'd love to see it. Write that bad boy up! :-)
--Jeff Moden
Change is inevitable... Change for the better is not.
May 3, 2012 at 6:07 pm
Koen Verbeeck (5/3/2012)
novak 32871 (5/3/2012)
I know, but there is already SQL 2012 out. It is a good exercise, but 2000/2003 is showing its age and most clients have already moved. But I understand your point.
Oooh, 2003, that was an excellent edition of SQL Server!
But very exclusive, hard to find.
:-)
I assume 2003 refers to Windows Server 2003 as a platform upon which many people still run SQL Server 2000.
May 3, 2012 at 8:13 pm
Jeff, the WHILE loop-based approach works well if you have a few thousand records or less. And for employee data, that design is likely to provide sufficient performance. If you find a large enough set of hierarchical data in the public domain, I'd put recursive CTEs up against the iterative approach any day. On the comprehension issue, I find recursive CTEs to be very expressive, much easier to understand than row-by-row designs that accomplish the same thing. Not everyone agrees with me, but that's OK. To each his own.
May 3, 2012 at 9:48 pm
Just for the record, I *LIKE* recursive code. I've written a lot of recursive code in various languages over the years, as well as reusable and re-entrant code.
I do want to keep in mind that when working with, and for, people with less experience in those areas, it can be a big jump for them.
I just haven't used the rCTE construct (yet). This blog has given me the confidence to go ahead and do it. For that I thank you all.
Chuck Hoffman
------------
I'm not sure why we're here, but I am sure that while
we're here we're supposed to help each other
------------
May 4, 2012 at 12:20 am
Microsoft's own documentation doesn't claim rCTEs to be faster than any other option.
So that would leave the main use of CTEs to be writing "simpler", more concise SQL (which of course can come down to personal taste, as single-statement scripts can be a nightmare too).
So I'm not sure where people are getting the idea that it is faster. However! What I asked in the first comment in this thread was whether rCTEs are "any worse" than a manual looping structure, i.e. is there any hidden overhead? Essentially they are both looping over the same set of data and joining more rows to it through recursion, correct?
If the performance is equivalent, then it comes down to style, taking into account the following factors: readability, compatibility, and reusability.
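For concreteness, the "manual looping structure" in question would look roughly like this (a hypothetical sketch against the dbo.Employee table used earlier: seed a work table with the starting row, then join one level per pass):
-- Hypothetical iterative equivalent of the "down" rCTE, one level per pass.
DECLARE @work TABLE (EmpID INT, EmpName VARCHAR(50), SupID INT, SupLevel INT);
DECLARE @lvl INT, @rc INT;
SET @lvl = 0;

INSERT @work (EmpID, EmpName, SupID, SupLevel)
SELECT EmpID, EmpName, SupID, 0
FROM dbo.Employee
WHERE EmpID = 1;                        -- starting employee
SET @rc = @@ROWCOUNT;

WHILE @rc > 0
BEGIN
    SET @lvl = @lvl + 1;
    INSERT @work (EmpID, EmpName, SupID, SupLevel)
    SELECT e.EmpID, e.EmpName, e.SupID, @lvl
    FROM dbo.Employee AS e
    JOIN @work AS w ON e.SupID = w.EmpID
    WHERE w.SupLevel = @lvl - 1;        -- only extend from the previous level
    SET @rc = @@ROWCOUNT;
END;

SELECT EmpID, EmpName, SupID, SupLevel
FROM @work
ORDER BY SupLevel, EmpID;
Both versions do a level-at-a-time join; the rCTE just lets the engine manage the work table (an internal spool) instead of an explicit one, so any performance gap comes from how the optimizer handles that spool rather than from a fundamentally different algorithm.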
May 4, 2012 at 6:52 am
Great points, Miika. In my experience with terabytes of hierarchical data in my system, recursive CTEs are faster, especially when going down the tree. Going up creates a single "thread" of records, so the iterative approach tends to be about as fast as recursive CTEs in that case. Going down the tree creates many more records due to the one-to-many relationships that exist.
If Jeff Moden (or anyone) aims me at some public domain data that's hierarchical in nature, I'll load it, test both approaches for traversing up and down, then I'll publish the results. I'd need a million rows or so, I think, to show big differences. The other thing that will show dramatic results is having lots of one-to-many relationships as you move down the tree. That's where the query optimizer in SQL injects most of the performance advantages that recursive CTEs provide, in my experience.
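In the meantime, for anyone who wants to measure rather than guess, a bare-bones harness is enough to get comparable numbers. This is only a sketch; it assumes the procedures posted earlier in the thread, and you should run each test several times and discard the first, cold-cache run:
-- Simple elapsed-time measurement for one approach; repeat for the other.
SET STATISTICS IO ON;   -- logical reads per table
SET STATISTICS TIME ON; -- CPU and elapsed ms per statement

DECLARE @t0 DATETIME;
SET @t0 = GETDATE();

EXEC dbo.usp_Employee_Select_Down_Param_EmpID @selectEmpID = 1, @maxSupLevel = NULL;

PRINT 'Elapsed: ' + CONVERT(VARCHAR(12), DATEDIFF(ms, @t0, GETDATE())) + ' ms';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;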
May 4, 2012 at 7:55 am
W. Kevin Hazzard (5/4/2012)
Great points, Miika. In my experience with terabytes of hierarchical data in my system, recursive CTEs are faster, especially when going down the tree. Going up creates a single "thread" of records, so the iterative approach tends to be about as fast as recursive CTEs in that case. Going down the tree creates many more records due to the one-to-many relationships that exist.
If Jeff Moden (or anyone) aims me at some public domain data that's hierarchical in nature, I'll load it, test both approaches for traversing up and down, then I'll publish the results. I'd need a million rows or so, I think, to show big differences. The other thing that will show dramatic results is having lots of one-to-many relationships as you move down the tree. That's where the query optimizer in SQL injects most of the performance advantages that recursive CTEs provide, in my experience.
How big a hierarchy have you actually worked with? I ask because million-node hierarchies don't exactly grow on trees (no pun intended) on the internet, and I'd like to know what you're basing your claims of performance on.
With that thought in mind, you'll have to use some made-up data. The following stored proc will allow you to do some very quick and flexible testing for small numbers of nodes in a hierarchy and takes just 22 seconds (or less on a good box) to build a "pure" DAG with a million nodes.
--===== Do this in a nice safe place that everyone has.
USE tempdb
;
-- DROP PROCEDURE dbo.BuildTestHierarchy
;
GO
CREATE PROCEDURE dbo.BuildTestHierarchy
/******************************************************************************
 Create a randomized "clean" hierarchy. Each EmployeeID (except the first one,
 of course) is assigned a random ManagerID number which is always less than the
 current EmployeeID. This code runs nasty fast and is great for testing
 hierarchical processing code.

 Usage: (both examples build a million row Adjacency List Hierarchy)
 EXEC dbo.BuildTestHierarchy 1000000
 EXEC dbo.BuildTestHierarchy @pRowsToBuild = 1000000

 Revision History:
 Rev 00 - 28 Apr 2010 - Jeff Moden - Initial creation and test.
 Rev 01 - 15 May 2010 - Jeff Moden - Abort if current DB isn't "tempdb" to
                                     protect users that want to "play".
******************************************************************************/
--===== Declare the I/O parameters
        @pRowsToBuild INT
AS
--===== Make sure that we're in a safe place to run this...
IF DB_NAME() <> N'tempdb'
BEGIN
    RAISERROR('Current DB is NOT tempdb. Run aborted.', 11, 1);
    RETURN;
END;

--===== Conditionally drop the test table so we can do reruns more easily
IF OBJECT_ID('TempDB.dbo.Employee', 'U') IS NOT NULL
    DROP TABLE TempDB.dbo.Employee;

--===== Build the test table and populate it on the fly.
     -- Everything except ManagerID is populated here.
SELECT TOP (@pRowsToBuild)
       ISNULL(ROW_NUMBER() OVER (ORDER BY (SELECT 1)), 0) AS EmployeeID,
       CAST(0 AS INT)                                     AS ManagerID,
       CAST(NEWID() AS VARCHAR(36))                       AS EmployeeName,
       (ABS(CHECKSUM(NEWID())) % 12 + 1) * 1000           AS Sales
  INTO TempDB.dbo.Employee
  FROM Master.sys.All_Columns ac1
 CROSS JOIN Master.sys.All_Columns ac2;

--===== Update the test table with ManagerID's. The ManagerID is some random
     -- value which is always less than the current EmployeeID to keep the
     -- hierarchy "clean" and free from "loop backs".
UPDATE TempDB.dbo.Employee
   SET ManagerID = CASE
                      WHEN EmployeeID > 1
                      THEN ABS(CHECKSUM(NEWID())) % (EmployeeID - 1) + 1
                      ELSE NULL
                   END;

--===== Add some indexes that most folks would like to have on such a table
ALTER TABLE TempDB.dbo.Employee
  ADD CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED (EmployeeID);

CREATE INDEX IX_Employee_ManagerID
    ON TempDB.dbo.Employee (ManagerID);
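To exercise it, end the CREATE PROCEDURE batch and run a quick build-and-verify; the expected counts below assume the million-row example from the header:
GO
--===== Build a million-row hierarchy, then run a couple of sanity checks
EXEC dbo.BuildTestHierarchy @pRowsToBuild = 1000000;

SELECT COUNT(*) AS TotalEmployees    -- expect 1,000,000
  FROM TempDB.dbo.Employee;

SELECT COUNT(*) AS RootNodes         -- expect exactly 1 (ManagerID IS NULL)
  FROM TempDB.dbo.Employee
 WHERE ManagerID IS NULL;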
I'd need a million rows or so, I think, to show big differences
See? That's what I'm talking about. A lot of people "think" the rCTE will be faster but I'm not so sure that anyone making that claim on this thread has actually tested to find out for sure. :-)
--Jeff Moden
Change is inevitable... Change for the better is not.
May 5, 2012 at 8:49 pm
Proving yet again that there is always more than one way to skin a cat. This may not be the best way, but it is clearly better than the author's first solution. The rCTE is the most efficient solution, and having co-workers with less experience is not a good enough reason to skip implementing it. The less experienced should challenge themselves, break out of their comfort zones, and take the time to learn.
May 6, 2012 at 7:55 am
@tmarkham1 - I taught programming languages in college for 11 years and loved every minute of it. It was hard work but I found that I learned as much as the students did, not about programming languages per se, but about human nature and the power of exploration. Once in a while, though, a student would teach me something entirely new. And I loved that most of all.
The difference between Chuck's iterative approach and the recursive approach highlights potentially the largest split between developers in the world today. And I don't mean just software development using languages like Java and C#. This applies to database developers, too. We write a lot of code, don't we?
Chuck used the so-called imperative approach in his design: ordered, often iterative steps designed to mutate data in an external structure (i.e. his @workTbl variable). My response was functional and set-based, using recursion and limiting state changes to the inside of the recursive "thread" (i.e. my [workTbl] common table expression). The key point I want to make here is that neither of these approaches is better than the other unless one offers some concrete advantage (to Jeff Moden's point).
I think the recursive approach is better because it's in keeping with the set-based mindset of T-SQL. It's not just WHILE loops that I avoid; I shun cursors for the same reason. The whole idea of set theory (which is what T-SQL is ultimately derived from) is to get out of the mindset of thinking about instances of data and to start thinking of data more categorically. WHILE loops, like cursors, get your brain into "instance mode", making you think at a lower level of abstraction. Recursive CTEs, on the other hand, force your brain into "category mode", making you think at a higher level of abstraction.
When database developers cringe at recursion, most of the time it's because they aren't allowing themselves the luxury of higher abstraction. But this is our greatest advantage in the database realm. We have tools like JOIN, UNION, INTERSECT, etc. that allow us to escape from "instance brain". That escape hatch allows us to handle immense amounts of information with ease. Now, I just need to prove to Jeff that recursive CTEs are better by being faster, too, and we'll have some real fun. :-)
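To put the two mindsets side by side with a trivial (and entirely hypothetical) example, using the test table built earlier in the thread:
-- "Instance mode": a cursor mutates one row per iteration.
DECLARE @id INT;
DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT EmployeeID FROM TempDB.dbo.Employee;
OPEN c;
FETCH NEXT FROM c INTO @id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE TempDB.dbo.Employee
       SET Sales = Sales + 100
     WHERE EmployeeID = @id;
    FETCH NEXT FROM c INTO @id;
END;
CLOSE c;
DEALLOCATE c;

-- "Category mode": state the transformation once, over the whole set.
UPDATE TempDB.dbo.Employee
   SET Sales = Sales + 100;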
Kudos to Chuck for his article. I loved it which is why I responded in the first place. I hope that he's emboldened by the experience and shares more great ideas in the future.
May 6, 2012 at 8:23 am
You can add my vote to seeing imarran's hierarchy solution using views.
Chuck has stated, in an earlier posting in this thread, that he is supporting SQL Server 2000 databases (amongst others), so CTEs are not available in that environment, irrespective of any performance improvements they may or may not provide. They just do not work.
Dave