Is a history table required for maintaining a column's previous value?

  • I have a table with names that can be updated by a monthly process. The business requirement states that we need to know which names changed the last time it was updated, but we only need to know the most recent change.

    Approach 1)

    Table: Item

    ItemID int NOT NULL

    ItemShortName varchar(50) NOT NULL

    ItemLongName varchar(50) NOT NULL

    PreviousItemLongName varchar(50) NOT NULL

    PK: ItemID

    Before running the update process for ItemLongName, we will put the all the existing ItemLongName values into PreviousItemLongName

    Approach 2)

    Table: Item

    ItemID int NOT NULL

    ItemShortName varchar(50) NOT NULL

    ItemLongName varchar(50) NOT NULL

    PK: ItemID

    Table: Item_History

    ItemID int NOT NULL

    ItemUpdateDate datetime NOT NULL

    ItemLongName varchar(50) NOT NULL

    PK: ItemID, ItemUpdateDate

    Before running the update process for ItemLongName, we will put all the current ItemLongName values into the Item_History table.

    It seems like the first approach is possibly violating first normal form (by having 2 long names stored). But the business requirements are not asking to maintain a history of changes to the field, only the last changes made. Does anyone have any suggestions on the best approach? Perhaps something completely different?

  • One question is do they always change every month?

    If not, then the first solution might not work, because you might be looking for "what was the name last month", and end up with either the current name (because it didn't change but the column was updated anyway), or data from longer ago than one month (because it didn't change and thus wasn't updated). Either way, you lose the data on when the change was made, unless you also add a "LastUpdated" column and store that there.

    Keeping it in a log table allows you to track more, and even though right now the business requirement is to only keep the last month's name, odds are you're going to need to report on frequency of changes or something of that sort, somewhere down the road.

    I'd be more inclined to generate an audit log. The chance of scope-creep on this one is pretty high. Plus, I've had really good experiences with audit logs coming in useful for a large number of things. (Enough so that I published articles on the subject on this site.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Another possible method is Type 2 SCD's (Slowly Changing Dimensions). See the following article...

    http://en.wikipedia.org/wiki/Slowly_Changing_Dimension

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • [font="Verdana"]The "type 2 slowly changing dimension" is a good approach for not needing a separate history or audit log table.

    You don't actually have to use a dimensional model to use that same concept (in fact, I have been using it for several decades now in OLTP systems.) I used this as a template for every table that required historical data trails.

    The idea is that you mark each record to indicate whether it is the current record, and from what date it applied. I used to put the following fields on my historical tables:

    IsCurrent bit not null default (1)

    CurrentFrom datetime not null default (getdate())

    CurrentTo datetime not null default ('99991231')

    You can actually make the IsCurrent field a calculated field if you wish, but hey, it's only 1 bit. Also, I filled in the value of CurrentTo so you can easily match to the correct version of a row on the date by saying something like "where mydate between CurrentFrom and CurrentTo" and not need to worry about nulls. Better for indexing too.

    When an update comes in, you do the following:

    1. update the last version of the row being updated to set the IsCurrent field to 0 and the CurrentTo field to getdate()

    2. insert the new version of the row being updated

    And yes, you can do all of that as set based. I'd do the updates first, because step 2 can also insert any new rows.

    Now, check out the MERGE statement on SQL Server 2008, because there's an option there to use an OUTPUT clause, and use that clause as the input to an insert. Seriously rocks! You end up doing step 1 and step 2 all in one statement!

    [/font]

  • Here is my understanding of the Slowly Changing Dimension article as applied to the original example. There are multiple methodologies used to track historical data and each one is specified by a type number (I've left out Type 0):

    --Type 1

    ItemID ItemShortName ItemLongName

    123 PST Pasta

    -- ** after update **

    ItemID ItemShortName ItemLongName

    123 PST Pasta W/Sauce

    --Type 2

    ItemID ItemShortName ItemLongName IsCurrent FromDt ToDt

    123 PST Pasta 0 20090101 20090615

    124 PST Pasta W/Sauce 1 20090616 99991231

    --Type 3

    ItemID ItemShortName ItemLongName PreviousLongName

    123 PST Pasta W/Sauce Pasta

    --Type 4

    [Item]

    ItemID ItemShortName ItemLongName

    123 PST Pasta W/Sauce

    [ItemHistory]

    ItemID ItemLongName CreateDate

    123 Pasta 20090101

    -- Hybrid (my interpretation of BWC's recommendation)

    RowKey ItemID ItemShortName ItemLongName IsCurr FromDt ToDt

    1 123 PST Pasta 0 20090101 20090615

    2 123 PST Pasta w/Sauce 1 20090616 99991231

    My understanding of the difference between the Hybrid and Type 2 is that in the Hybrid you don't have to increment the ItemID which would be useful for integrating the new table with original queries.

    Management seems to want us to take the approach of Type 3 since our current requirement is only to show the last value previous to the update (not necessarily last month's value). But I posed the question on the forum because Type 3 struck me as problematic due to the first normal form violation (although nobody has mentioned it so maybe it's not a valid concern). It also seems limiting in regards to future requirements.

    That leaves Type 2, 4, and the Hybrid. My only concern for our system with Type 2 and the Hybrid is that it seems like all our existing tables with foreign keys against ItemID would have to be re-written (as well as our queries).

    I'm curious about Type 2 and Hybrid; do you have to have some kind of check constraint with a column to protect against date gaps between rows? Or do you simply assume it will be handled by the logic in the stored procedures.

    Meanwhile, thanks for all the great input.

  • 8kb (6/17/2009)


    Management seems to want us to take the approach of Type 3 since our current requirement is only to show the last value previous to the update (not necessarily last month's value). But I posed the question on the forum because Type 3 struck me as problematic due to the first normal form violation (although nobody has mentioned it so maybe it's not a valid concern). It also seems limiting in regards to future requirements.

    [font="Verdana"]I'm not entirely sure that it violates first normal firm, as technically the data isn't repeated: you would store the differences between the last version and the current version, which are (by definition) different.

    I agree: the type 3 approach is limiting for future requirements, but it is also easier to implement.[/font]

    8kb (6/17/2009)


    That leaves Type 2, 4, and the Hybrid. My only concern for our system with Type 2 and the Hybrid is that it seems like all our existing tables with foreign keys against ItemID would have to be re-written (as well as our queries).

    [font="Verdana"]True, to some degree. By creating views that only list the "current" rows, you can avoid having to rewrite queries. But I used this as a design template, so the idea of historical versioning was built in right from the start. You're patching it into an existing design, so minimilising rework is obviously a concern.[/font]

    8kb (6/17/2009)


    I'm curious about Type 2 and Hybrid; do you have to have some kind of check constraint with a column to protect against date gaps between rows? Or do you simply assume it will be handled by the logic in the stored procedures.

    [font="Verdana"]It's actually really hard to build it in as a constraint. You can sort of semi get it with unique constraints. However, my designs tended to use procedural gateways to the database (no direct table data changes), so it was fairly easy to capture a check there. I suppose you could do it in trigger code if you don't want to enforce procedural gateways, but personally I prefer the gateway option.[/font]

  • It looks like the Type 3 approach has won out. But on the flipside, if management changes their mind for new features in the future, I have a much better understanding of new some approaches. Thanks for the detailed explanations...

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply