Department management hierarchies. Bills of materials. Sales regions and offices. Data access levels. All of these and more are things that require storing hierarchical data in a database.
First, any mention of this subject needs to carry with it a mention of Joe Celko and his work on "nested sets" hierarchies. He has an excellent article on the subject at http://www.intelligententerprise.com/001020/celko.jhtml.
He’s completely correct that nested sets hierarchies are fast to query. The problem is, they are complex to add to or update. For relatively static hierarchies, like most bills of materials, use his method. Some organization charts or chains of command can be done that way as well.
The other general method of doing hierarchies in relational databases is the adjacency model, in which each row has the potential to have a parent in the same table. This might be built as:
create table dbo.Hierarchy ( ID int identity primary key, ParentID int null references dbo.Hierarchy(ID), Name varchar(100));
That’s a very simple example, but it illustrates the main point, with an ID and ParentID. This can be given a natural key, and the Parent column would need to be the same data type, but the structure would still be essentially the same.
This can easily be modified to allow a single level to have more than one parent as well as more than one child. (The structure above allows multiple children, but only one parent.) I found that particularly useful when I was working for a marketing company, and many of the people we marketed for had affiliations with multiple companies. In these cases, when a customer placed an order, different sets of people would have access to that order data, and the results of the marketing campaign, depending on how each order was placed, but the individual customer needed access to all of his campaigns in one place. That kind of multi-level access is traditionally handled by hierarchies of data access, but in this case, it needed to allow multiple parents.
I accomplished this with a structure like the below:
create table dbo.HierarchyNodes ( NodeID int identity primary key, Name varchar(100)); go create table dbo.HierarchyRelations ( NodeID int not nullreferences dbo.HierarchyNodes (NodeID), ParentID int not nullreferences dbo.HierarchyNodes (NodeID), constraintPK_HierarchyRelations primary key (NodeID, ParentID), constraint CK_NodeNotParent check (NodeID != ParentID));
I had to add moderately complex triggers to this to make sure no hierarchy was created where a level was its own parent or child, regardless of how many levels apart, but they performed quite well on such a simple structure.
Common Table Expressions in SQL 2005/8 make it much easier to resolve such a hierarchy, whether it’s a one-parent-per child or a many-parent-per-child model.
Books Online has a great example on how to build these and how they work. Here’s one for the simple hierarchy.
;withHierarchyCTE (NID, PID)as (select NodeID, ParentID from dbo.Hierarchy where NodeID = @NodeID_in union all select NodeID, ParentID from dbo.Hierarchy innerjoin HierarchyCTE on NID = ParentID) select * from HierarchyCTE
This assumes an input parameter of @NodeID_in, and will find all children of that level of the hierarchy. (I’m using the term "node" for each level of the hierarchy, since that’s how they are often referenced in tree-views and such in applications. I hope that doesn’t confuse anyone.)
The CTE can be reversed, and run up the hierarchy, simply by changing the relationship in the second part of it, to PID = NodeID, from NID = ParentID. Two such CTEs, one up, one down, can be hooked together into a single view of the whole structure above and below a level.
An interesting modification can be made to the CTE to add the level of relationship, by adding a Level column to it.
;withHierarchyCTE (NID, PID, Lvl) as (select NodeID, ParentID, 0 from dbo.Hierarchy where NodeID = @NodeID_in union all select NodeID, ParentID, Lvl + 1 from dbo.Hierarchy innerjoin HierarchyCTE on PID = NodeID) select * from HierarchyCTE
This will give a number that increments by one for each level away from the root ID (the one that matches the input parameter), which can be used in Order By clauses and such. For an upwards CTE, I’d change the increment to -1 from +1.
The disadvantage of these hierarchies, using the adjacency model (ID and ParentID), is that they take longer to query than nested sets hierarchies. The advantage is that they are very, very fast to add to or update. They are a little more complex to delete from in some ways, potentially simpler in others.
If you want to add a level in an adjacency hierarchy, you just insert it and give it the right parent ID, or no parent at all (for a top-level node). If you need to move a whole branch of the hierarchy, you just change the parent ID at the right point. It’s quite simple.
Say, for example, your company has a Proofreading Department, and in a re-organization it is being moved from the Marketing Division to the Quality Control Division. In a nested sets hierarchy, you have to rebuild the whole department, with new ranges for every row. In an adjacency hierarchy, you change the parent ID for the department row in the table, and you’re done.
To illustrate this a bit more, here’s an example of a nested sets hierarchy table:
create table dbo.HierarchySets ( RangeStart int not null, RangeEnd int not null, constraintPK_HierarchySets primary key (RangeStart, RangeEnd), constraintCK_RangeValid check (RangeStart < RangeEnd), Name varchar(100));
For the whole company, you might have a RangeStart of 0 and a RangeEnd of 1-million. For the Marketing Division (to use the example I mentioned above), you might have a RangeStart of 1000 and a RangeEnd of 2000. Proofreading could start at 1600 and end at 1650, and there might be ten or twenty rows in the table with ranges inside that. (The whole idea is that the parent has a wider range than the child, thus, because Marketing starts its range below that of Proofreading and ends its range above, it is the parent of Proofreading.) If Quality Control is 2001 to 2100, then Proofreading has to have it’s Start and End changed to some range inside the Quality Control range, and each level inside of Proofreading also has to be changed. If Quality Control doesn’t have enough free space in its range to fit all the ranges for Proofreading, then it also has to be moved.
That makes moves in such a model much more difficult. They can still be done, but it involves much more IO and many more transactions, and has to be wrapped in a much larger overall transaction so any error can cause the whole thing to roll back. That means longer locks on larger parts of the table. Given enough rows being moved, it may even result in a relatively long lock on the whole table, blocking other transactions and reads.
To make up for that, selecting from an adjacency involves a number of reads at least equal to the number of levels in the hierarchy being queried, and more often closer to the number of rows being queried, which can be quite IO intensive on complex, large hierarchies, but selecting from a nested sets table requires a single-pass, single read, usually of a small range straight from the primary key.
The speed difference and IO difference can be significant on the two. For example, I have a hierarchy in one of my databases with 2,700 nodes in it, going up to six levels deep. If someone at the top of that hierarchy signs in, it takes 11 seconds for my server to resolve the whole hierarchy and determine what data that person has access to (this is a security access hierarchy that controls much of what is displayed to customers on a web site). That’s using the adjacency model. Using a nested sets table, that same hierarchy takes less than 1 millisecond. (This isn’t the one with multiple parents. Different company.)
If this same database didn’t have a lot of updates to the hierarchies, I’d definitely use a straight-up nested sets hierarchy, and have much faster web pages. But it does have a lot of updates, sometimes several per minute, sometimes several at the same time. Each nested sets rebuild takes about 20-30 seconds to finish, and locks the table pretty aggressively while it’s running.
So, I came up with what I’m calling a hybrid hierarchy. The table looks something like this:
create table dbo.HierarchyHybrid ( ID int identity primary key, ParentID int null references dbo.HierarchyHybrid(ID), TopParentID int null references dbo.HierarchyHybrid(ID), RangeStart int null, RangeEnd int null, TempRangeStart int null, TempRangeEnd int null, constraintCK_RangeValid check (RangeStart < RangeEnd), Name varchar(100));
When a row is inserted, it has an ID and (if not a top level) a ParentID, but no Range data. When a row with Range data is updated (ParentID or TopParentID changed), the RangeStart and RangeEnd columns are set to null.
Additionally, I have the insert statement figure out what the top level of the new row would be. If the row has no parent, it puts in its own ID as the TopParentID, otherwise, it resolves what level above it has a null ParentID, regardless of how many levels that is removed from it, and puts that in there. With the right indexes on TopParentID, I’ve sped up the resolution of adjacency queries by up to 80%, since the recursive portion of the CTE can reference that column and greatly narrow down the number of rows it has to scan to resolve the query. (Your mileage may vary. Some queries are sped up a lot, some barely at all.)
Then, every few minutes, I have a job that looks to see if any rows exist where RangeStart is null. If it finds any, it goes through the hierarchy and set the values in TempRangeStart and TempRangeEnd. Once those are all set, it updates RangeStart and RangeEnd. That way, I don’t have partial sets and all of it gets done at once, which has resulted in better average performance.
When a hierarchy needs to be resolved, the code checks if any levels exist in that hierarchy that have null RangeStart (using the TopParentID), and uses the adjacency method of resolution if it finds any. If not, it uses the nested sets method. The code for that looks something like this:
create function [dbo].[udf_Hierarchy] (@NodeID_in int) returns table as return (with TopSet (SS, SE) as -- Get the range for the requested node (select RangeStart, RangeEnd from dbo.HierarchyHybrid where ID = @NodeID_in), Sets (RangeStart, RangeEnd, NodeID) as-- Nested Sets Query (select RangeStart, RangeEnd, ID from dbo.HierarchyHybrid innerjoin TopSet on RangeStart between ss and se and RangeEnd between ss and se), Adjacency (NodeID, ParentID) as -- Adjacency Query (select 0, ID, ParentID from dbo.HierarchyHybrid where ID = @NodeID_in andexists (select* from dbo.HierarchyHybrid h2 where h2.TopParentID = HierarchyHybrid.TopParentID and RangeStart is null) union all select h3.ID, h3.ParentID from dbo.HierarchyHybrid h3 innerjoin Adjacency on h3.ParentID = Adjacency.NodeID) select NodeID from Sets union select NodeID from Adjacency);
Using this method, I get a compromise. It takes a little longer to resolve a hierarchy than a pure nested sets method, up to about 8 milliseconds on some of the bigger ones, but updates and moves are fast, and can show on the web page immediately.
After any changes, a hierarchy takes a few seconds to resolve until the range data on it gets rebuilt, of course, but customers and users have been okay with that. It’s not usual for any given customer to change their hierarchy more than a couple of times per week, and changing one hierarchy doesn’t negatively impact any other hierarchy.
I don’t know if this will be of use to anyone else, but for the situation I have, it’s been very, very valuable, so I thought I’d share it. Without Joe Celko’s work on nested sets, this would never have been built, so do read the article I linked to at the beginning. I’m definitely grateful to him for his work.