What table design is better ?

  • We have a table that holds history of assignments of user_id to contact_id. What that means that we can have multiple (non-unique) user_id's, but for a given poit in time only one contact_id can be assigned to it. I have two versions of this table design in my mind.

    Version A:

    contact_id,

    user_id,

    assigned_date,

    removed_date

    In this case table will look like this:

    1 | 123 | 04/03/2009 | 07/10/2009

    2 | 123 | 07/11/2009 | 10/10/2009

    3 | 123 | 10/10/2009 | NULL

    Version B:

    contact_id,

    user_id,

    log_date,

    action

    In this case the same data will look like this:

    1 | 123 | 04/03/2009 | assigned

    1 | 123 | 07/10/2009 | removed

    2 | 123 | 07/11/2009 | assigned

    2 | 123 | 10/10/2009 | removed

    3 | 123 | 10/10/2009 | assigned

    For design A the primary key would be contact_id + user_id, while for design B it would be ...+ action. Of course, for action in version B I can put 1 and 2.

    I am trying to choose the better design, but can't find the benefits of one over another one. Can someone suggest me what are the benefits and disadvantages of these two designs and what is better to choose ?

    Thanks

  • Option "A" is known as a "Slowly Changing Dimension Type 2" and, if I recall correctly, Option "B" is "Type 1". I prefer option "A" as I don't hav to look for a MAX DATE to find the current status... I just look for NULL or a very large date (a better option) in the EndDate.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Funny, I'm going to come down just the opposite and say that I like option B for the same reasons that Jeff likes option A. I don't like having to mess with NULL logic and while getting the TOP date (usually works better than MAX, depending on your indexing) can look a bit awkward in the code, it scales extremely well (again depending on how you index).

    Clearly, either approach will work.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I prefer option A, but I would set the end date to a default value of '9999-12-31' and document it.

    Point is to avoid the NULL.

    Having the actual datetime data will help you in some cases where you have to figure out when something actually changed.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • I prefer option A, but I would set the end date to a default value of '9999-12-31' and document it.

    Point is to avoid the NULL.

    Having the actual datetime data will help you in some cases where you have to figure out when something actually changed.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • My recommendation would be to GOOGLE "Slowly Changing Dimension Type 2" and find the WikiPedia article on these two (and more) options. I do it like ALZDBA does above and Grant does it a bit different. I have a huge amount of respect for both and could argue for either method. The WikiPedia article will help you come to the best decision for your particular installation.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • From the article on Wikipedia I can conclude that my option A is certainly falls into type 2, but the option B does not look like type 1. Type 1 does not preserve history while my option B does, just by adding 2nd row for each record in option A. However, option B does not look like falling into any of these types, at least by examples presented in the article.

    Since majority of recommendations are for option A, I would probably choose this one. This table will be used mostly for various reports that collect data for some point in time.

    With option A the Where clause will look like this:

    where

    assigned_date >= 09/30/2009

    and (removed_date <= 09/30/2009 or removed_date is null)

    While with option B I either have to make self-join to this table with 1st instance is "assigned" and 2nd instance is "removed", or to loop or cursor through all the records because many records don't have "removed".

    Am I right ?

  • ALZDBA (10/21/2009)


    I prefer option A, but I would set the end date to a default value of '9999-12-31' and document it.

    Point is to avoid the NULL.

    Having the actual datetime data will help you in some cases where you have to figure out when something actually changed.

    I agree with this (option A) assuming that the only "actions" are "assigned" and "removed". I try to date-track all my records this way. "date_from" and "date_to" with "date_to" set to "9999-12-31". That way, you can just have a WHERE clause like "WHERE GETDATE() BETWEEN date_from AND date_to", or like you said, put in a date and see what was active at the time.

    If you think there might be more actions (maybe "temporarily disabled" or something) that you have to track, option B might have some merit though. Unless it was absolutely necessary, I'd probably still use option A, and just have a bit flag for other actions and maybe an "date_modified" field.

  • Ok. I know I am late weighing in on this, but there were a few extra points I wanted to make.

    1. The two examples provided do not have the same flexibility. Option A specifies the range of a given contact relationship. If you were at any other info to the row, it would have to apply to the entire range. Option B specifies a point at which a relationship changes. Info recorded at that time relates to that specific change in the relationship. It can be used to track just the assignment and removal of a contact, but could just as easily be expanded to include other types of relationship changes.

    Perhaps what you really need are two tables. One a history of relationship changes and the other your more summarized contact ranges?

    2. You indicated that you could have onnly one contact. What do they do when they are training? When the contact is sick for a day? Do they modify the table immediately? Normally I'd expect to have the application be able to handle a primary and secondary contact.

    Try not to jump into the snowflakes untill the snow has stopped falling 🙂

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply