Add a character, or expand to non-printable?

  • I am currently using a varchar field to capture a hierarchy tree. I'm using a 2 character delimited value store the data. For example: '/01/A5/H2' has a "parent" of '/01/A5' and a "grandparent" of '/01', both of which are also values in the hierarchy with their own attributes. The values range from 00 to ZZ, which gives me 36*36=1296 values for each "branch" of the tree. Quite a bit, I thought... of course, looking at the data we will be storing, I need just a little more.

    In your opinion (yes you!), would I be better off keeping a two character code (so it takes up less space, better indexes, etc.) and going to 256*256 = 65536 values, or going to a three character code with 36*36*36=46656 values? Keeping the values from 000-ZZZ makes nice, readable, easy to cut-n-paste and understand values at the expense of an extra digit at each branch. My 'longest' branch is 14 deep (so it's 43 characters long), but they average 6 deep (19 characters).

    This is on 2K5, which is why we haven't considered moving it into the HierarchyID datatype, but I think we've configured it so that it will be easy to migrate if/when the time comes.

    I'm thinking that for the extra 6 bytes (on average), it's worth keeping it readable. What do you think?

    Thanks,

    Chad

  • Alright, if there are no opinions on the issue, how about suggestions on the question? Too long? Not enough detail? Boring title? Too complex? Too dumb? Pick any of the above and cut-n-paste into a reply. 🙂

    Chad

  • I would probably agree with you: keeping it readable is worthwhile.

    You could also consider storing it in VARBINARY() and use a fnBinToB36() udf to display it and a corrsponding fnB36ToBin() udf to encode it an do index lookups. Maybe throw a calculated column on for display.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Chad Crawford (12/17/2008)


    Alright, if there are no opinions on the issue, how about suggestions on the question? Too long? Not enough detail? Boring title? Too complex? Too dumb? Pick any of the above and cut-n-paste into a reply. 🙂

    Chad

    Heh... no... just not enough time in the day to view all the posts in a given day and I guess you just got the short end of the stick. 😉 That and you've pretty much answered your own question.

    My question would be that you've already (understandably) underestimated the number of positions you needed once... you sure that 46K is going to be enough for future scalability? If you used integers converted to varchar, you'd only need 153 characters for what you currently have and would be sure not to run out.

    Also, have you given any thought to using a hybrid of the Nested Set Model instead of the Adjacency Model you're currently using?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/17/2008)


    Also, have you given any thought to using a hybrid of the Nested Set Model instead of the Adjacency Model you're currently using?

    Actually, Jeff, I believe that this is the rarely seen "Path" or "Natural" model, which has long been my favorite.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Thanks for the reply!

    If you don't mind, help me understand the advantage of having it stored in a binary format, but converted for readability. I'm going to use the same number of bytes unless udf aggregates the string in 7 character chunks, right (36 goes into 256 7 times)? Ahhh... wait a second... I think I see where you're going. I could use two byte binary value pairs and then have a UDF that converts the bytes into a readable character string, maybe going from 2 binary bytes to something like 4 readable characters, breaking it up by bits - for the first binary byte, the 1st 5 bits is the first character, 2nd 5 bits is the second, etc... hmmm, that is very interesting. The only downside would be that the next person to come along would have a hard time maintaining it, but it does sound very cool. Wow, I'll have to think about that one, it's something I hadn't considered.

    Thanks again Barry for taking the time to scribble a note.

    Chad

  • My question would be that you've already (understandably) underestimated the number of positions you needed once... you sure that 46K is going to be enough for future scalability? If you used integers converted to varchar, you'd only need 153 characters for what you currently have and would be sure not to run out.

    Very fair question. The tree I have models business relationships, and I didn't think people would physically have enough time to maintain that many relationships. What I didn't count on were people passing on (or abandoning) their relationships, and tracking inactive/dead relationships. Those two together increased the scope of what I was trying to record. I haven't outgrown the tree, but it is close enough to be unprofessional not to account for the possibility. I went with base36 since I get more data per byte than I would with just a base10 number. I wanted to keep it as short as I could so my queries and indexes would be fast - I'm replacing an arbitrary depth self-joining table, and selling the change based on increased flexibility and speed. I did consider keeping the current structure and using CTEs, but I think this is still just a little faster.

    Also, have you given any thought to using a hybrid of the Nested Set Model instead of the Adjacency Model you're currently using?

    Well, there some research I'm going to have to do. Back when I originally designed it I went through several options and these sound familiar, but I don't remember now specifically why I rejected them.

    Wow. Both Jeff and Barry on my question. I feel special. I'm going to print this page and put it in my Autograph book. Thanks guys!

    Chad

  • Chad Crawford (12/17/2008)


    If you don't mind, help me understand the advantage of having it stored in a binary format, but converted for readability.

    I think that you've got it later on here, but just to be clear: Binary is a compact as you can get and the only real disadvantage of the Path model IMHO, is the size of the key. So anything that makes it smaller (but still directly usable) is generally good.

    Varbinary() is a still a string (just binary instead of characters) so it can still be compared with other binary strings (and substrings). It only really needs to be converted to and from human-readable form at the input & output boundaries.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (12/17/2008)


    Jeff Moden (12/17/2008)


    Also, have you given any thought to using a hybrid of the Nested Set Model instead of the Adjacency Model you're currently using?

    Actually, Jeff, I believe that this is the rarely seen "Path" or "Natural" model, which has long been my favorite.

    Actually, you're right... kind of... each row will still have a parent and child "id" which keeps it in the realm of the Adjacency Model. The hybrid Nested Set Model I'm talking about would do the same thing just to make rebuilding the Nested Set Model easier like this "Path" Model does.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The Path model is the original unadorned natural model of a Tree or a tree-based hierarchy. To use the analogy of our C: drive's filesystem The path mode keys consist of one very long string that contains the complete folder path and file name or each file and are what you would get if you executed the DOS command: "DIR/B/S | SORT". Something like this:

    C:\byoung\DataIn\Devel\FM7\AssemblyInfo.vb

    C:\byoung\DataIn\Devel\FM7\bin

    C:\byoung\DataIn\Devel\FM7\bin\FM7.exe

    C:\byoung\DataIn\Devel\FM7\bin\FM7.pdb

    C:\byoung\DataIn\Devel\FM7\Board.cls

    C:\byoung\DataIn\Devel\FM7\Card.vb

    C:\byoung\DataIn\Devel\FM7\Deck.vb

    C:\byoung\DataIn\Devel\FM7\FM7.resx

    C:\byoung\DataIn\Devel\FM7\FM7.sln

    C:\byoung\DataIn\Devel\FM7\FM7.vb

    C:\byoung\DataIn\Devel\FM7\FM7.vbproj

    C:\byoung\DataIn\Devel\FM7\FM7.vbproj.user

    C:\byoung\DataIn\Devel\FM7\frmBoard.resx

    C:\byoung\DataIn\Devel\FM7\frmBoard.vb

    C:\byoung\DataIn\Devel\FM7\Game.vb

    C:\byoung\DataIn\Devel\FM7\GameSet.vb

    C:\byoung\DataIn\Devel\FM7\GameSetMgr.vb

    C:\byoung\DataIn\Devel\FM7\GenManager.vb

    C:\byoung\DataIn\Devel\FM7\GenNetPlayer.vb

    C:\byoung\DataIn\Devel\FM7\H7

    C:\byoung\DataIn\Devel\FM7\H7\Form1.frm

    C:\byoung\DataIn\Devel\FM7\H7\H7.vbp

    C:\byoung\DataIn\Devel\FM7\H7\H7.vbw

    C:\byoung\DataIn\Devel\FM7\H7\H7_Stats.csv

    C:\byoung\DataIn\Devel\FM7\H7\H7_Stats.xls

    C:\byoung\DataIn\Devel\FM7\H7\H7_Stats_c.xls

    C:\byoung\DataIn\Devel\FM7\H7\HMain.bas

    C:\byoung\DataIn\Devel\FM7\H7\MSSCCPRJ.SCC

    C:\byoung\DataIn\Devel\FM7\H7\S7_StatQ.xls

    C:\byoung\DataIn\Devel\FM7\H7\S7_Stats.csv

    C:\byoung\DataIn\Devel\FM7\H7\S7_Stats.xls

    C:\byoung\DataIn\Devel\FM7\H7\S7Stat_p.xls

    C:\byoung\DataIn\Devel\FM7\H7\S7Stat1M.xls

    C:\byoung\DataIn\Devel\FM7\obj

    C:\byoung\DataIn\Devel\FM7\obj\Debug

    C:\byoung\DataIn\Devel\FM7\obj\Debug\FM7.exe

    C:\byoung\DataIn\Devel\FM7\obj\Debug\FM7.FM7.resources

    C:\byoung\DataIn\Devel\FM7\obj\Debug\FM7.frmBoard.resources

    C:\byoung\DataIn\Devel\FM7\obj\Debug\FM7.pdb

    C:\byoung\DataIn\Devel\FM7\Player.vb

    C:\byoung\DataIn\Devel\FM7\PPSTimeUtil.vb

    C:\byoung\DataIn\Devel\FM7\S7

    C:\byoung\DataIn\Devel\FM7\S7.zip

    C:\byoung\DataIn\Devel\FM7\S7\Form1.frm

    C:\byoung\DataIn\Devel\FM7\S7\H7_Stats.csv

    C:\byoung\DataIn\Devel\FM7\S7\MSSCCPRJ.SCC

    C:\byoung\DataIn\Devel\FM7\S7\S7_StatQ.xls

    C:\byoung\DataIn\Devel\FM7\S7\S7_Stats.csv

    C:\byoung\DataIn\Devel\FM7\S7\S7_Stats.xls

    C:\byoung\DataIn\Devel\FM7\S7\S7Stat_p.xls

    C:\byoung\DataIn\Devel\FM7\S7\S7Stat1M.xls

    C:\byoung\DataIn\Devel\FM7\S7\STUD7.exe

    C:\byoung\DataIn\Devel\FM7\S7\Stud7.vbp

    C:\byoung\DataIn\Devel\FM7\S7\Stud7.vbw

    C:\byoung\DataIn\Devel\FM7\S7\StudMain.bas

    On the other hand, the most commonly implemented model in Relational databases is the path model. The path model requires another (fixed-size) unique key to be used as the "Linking Field", like a FIleID. Then the Adjacency model keys consist of two fields, the File name (without the path) and a "ParentID" field that contains the FileID of the folder that contains the file. I do not know a DOS command to list file names with their File IDs, but the TREE/F gives a good idea of how it organized: here the indenting and lines represent FileIDs that point back to their containing directories:

    C:.

    ¦ AssemblyInfo.vb

    ¦ Board.cls

    ¦ Card.vb

    ¦ Deck.vb

    ¦ FM7.resx

    ¦ FM7.sln

    ¦ FM7.vb

    ¦ FM7.vbproj

    ¦ FM7.vbproj.user

    ¦ frmBoard.resx

    ¦ frmBoard.vb

    ¦ Game.vb

    ¦ GameSet.vb

    ¦ GameSetMgr.vb

    ¦ GenManager.vb

    ¦ GenNetPlayer.vb

    ¦ Player.vb

    ¦ PPSTimeUtil.vb

    ¦ S7.zip

    ¦

    +---bin

    ¦ FM7.exe

    ¦ FM7.pdb

    ¦

    +---H7

    ¦ Form1.frm

    ¦ H7.vbp

    ¦ H7.vbw

    ¦ H7_Stats.csv

    ¦ H7_Stats.xls

    ¦ H7_Stats_c.xls

    ¦ HMain.bas

    ¦ MSSCCPRJ.SCC

    ¦ S7Stat1M.xls

    ¦ S7Stat_p.xls

    ¦ S7_StatQ.xls

    ¦ S7_Stats.csv

    ¦ S7_Stats.xls

    ¦

    +---obj

    ¦ +---Debug

    ¦ FM7.exe

    ¦ FM7.FM7.resources

    ¦ FM7.frmBoard.resources

    ¦ FM7.pdb

    ¦

    +---S7

    Form1.frm

    H7_Stats.csv

    MSSCCPRJ.SCC

    S7Stat1M.xls

    S7Stat_p.xls

    S7_StatQ.xls

    S7_Stats.csv

    S7_Stats.xls

    STUD7.exe

    Stud7.vbp

    Stud7.vbw

    StudMain.bas

    Key distinctions:

    1) The path model always contains the full name, including its path. The Adjacency model on contains keeps the "local" or "leaf" name, plus a pointer to the parent, next up the hierarchy. The full name is constructed by incrementally traversing up the tree, adding each local name onto the left side of the name until the top is reached.

    2) The Adjacency model requires that extra key: the fixed size unique identifier. If the data does not naturally have one (like FileID or EmployeeID) then one can be easily added with an Identity type field in SQL.

    3) The Path model's key is a long (or potentially long) fixed or variable sized string. The Adjacency model's key is the fixed size ParentID plus the much smaller string containing the local name only.

    The Nested Set model is much more complicated and I will not try to explain it right now.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • This article by Frédéric BROUARD gives the best explanation that I know of for the Nested Sets model: http://www.sqlservercentral.com/articles/SQL+Server+2005+-+TSQL/recursivequeriesinsql1999andsqlserver2005/1846/

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Alright, I dug up my old research and refreshed all the terms in memory. Yes, this is kind of an adaptation of the adjacency model with the advantage of only needing one self-join for any depth of query, and indexing the hierarchy makes the queries lightning fast. I'm replacing a pure adjacency implementation that was too cumbersome for live queries.

    I did consider the nested sets model and it would work fine, but I didn't like having to update a large portion of the tree every time I added or removed a node. In particular, it didn't sit well that I would have to update records in non-impacted branches when a change was made somewhere else. Sure, the query isn't too complex and it would be safe, but I didn't feel comfortable having to update records that were not directly related to the change. It is also easier to see your depth in the tree with the Path/Natural model (to borrow Jeff's term - is this its formal name?) and drill to a particular depth, 1st level children for example. You could overcome this by adding a depth integer to the nested sets model, and I actually did that with my implementation of the Path model too to eke out just a little more performance since stopping at a particular depth will be part of almost every query. I also have to explain how the model works to DB and Code developers that will be working with the data and the Path is much easier to understand and there is less chance that they will mess it up. I thought the Path model was faster, but I reran the experiment today and they came out almost exactly the same when indexed (within .25%).

    Jeff - I really appreciate your time and if you think the nested set is superior, I would appreciate some additional thoughts to help steer me in that direction. So is the hybrid model you were suggesting a combination of both the adjacency and nested set models? The adjacency model used to double check and completely overhaul the nested set model if it ever gets out of sync?

    Barry - I'm glad to know this is your favorite, it gives me a warm feeling inside (and it's snowing) to think I might have got it right. You've won me over to use the binary datatype - I figure I can use a two byte value and let it convert naturally to 4 characters when it needs to be displayed. This will give me 65K values at each node for the same space I'm getting 1200 from now. I just have to decide if I want to use some kind of visual separator to easily see each 4 character set.

    Chad

    PS - just saw the posts that came in while I was writing this - that article is awsome Barry, thanks for sharing. I see the advantage of nested sets over adjacency, but didn't see performance gains over the path model. But maybe I didn't run enough permutations. Here are some additional links for those who may find this thread later:

    Nested sets article by Joe Celko: http://www.developersdex.com/gurus/articles/112.asp?Page=1

    Article by Mike Hillyer that references Joe Celko (mysql, but the theory is applicable) http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

    and for me, my post from awhile ago: http://www.sqlservercentral.com/Forums/Topic506063-373-1.aspx

  • Chad Crawford (12/18/2008)


    Jeff - I really appreciate your time and if you think the nested set is superior, I would appreciate some additional thoughts to help steer me in that direction. So is the hybrid model you were suggesting a combination of both the adjacency and nested set models? The adjacency model used to double check and completely overhaul the nested set model if it ever gets out of sync?

    That's basically it in a nutshell except that I let people maintain the Adjacency model because that's what they understand. Most folks just don't understand the Nested Set model even when they see it. So, I combine the two to make it easy on the human on the one side and good for performance on the other side.

    The real key is that if you absolutely sure about the scale of the DB, the Path Model Barry and you used will likely be just fine and not have much in the line of performance problems if you only go 14 nodes deep for the whole shootin match.

    You're also correct about recurrsion... might prove useful to build the path one time, but it just doesn't seem wise to keep calculating the same thing over and over and over.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Chad;

    A while back, I gave a presentation comparing three different methods for representing hierarchies using SQL Server; regular adjacency list, matrix path encoding, and HierarchyID. Since I included the new HierarchyID in the comparison, I was running against 2K8 (CTP 6, at the time). Interestingly, in my comparison, the HierarchyID implementation fared the worst in terms of query performance (although the queries were a little easier to formulate from an end-user perspective).

    The implementation choice will largely be influenced by the types of queries you need to run most often (find all children of path " ", etc.), and the means by which those queries are formulated (i.e., is the T-SQL query sent to the DBMS "hidden" behind a friendly user interface, or are your users expected to formulate ad-hoc SQL on their own, etc.)

    I encourage you to study Vadim Tropashko's ideas in this area. Although I ran into some limitations for my application that rendered the solution unsatisfactory for my problem, it may fit yours. Plus, it's a very interesting way of looking at the problem: http://arxiv.org/ftp/cs/papers/0402/0402051.pdf

    I have had success using the encoded path method similar to what you're discussing. One thing to note is that if you encode to a predictably-sized value, you can forego the "/" delimiter and save a bit of space that way, if it's a concern.

    HTH,

    TroyK

  • Wow, that is an interesting paper. But if most folks have trouble with Celko's Nested Sets, I don't know how I'd ever explain this one to them!

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply