Data design considerations

  • The scenario is that I want a primary key to identify a new entry in a main table. Call it "record_id." This will also be a foreign key in many other tables.

    The most important of those tables will hold a complete history of all changes to the main record. Each record in the history table will have both the record_id and a unique history_id.

    So I'm envisioning a main table that generates the record_id, and a history table that FKs to the main table and generates its own unique internal PK, history_id.
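
    In rough T-SQL, something like this sketch (table and column names are just placeholders for illustration):

        CREATE TABLE Main (
            record_id INT IDENTITY(1,1) PRIMARY KEY
        );

        CREATE TABLE History (
            history_id INT IDENTITY(1,1) PRIMARY KEY,
            record_id  INT NOT NULL REFERENCES Main (record_id),
            status     VARCHAR(20) NOT NULL,  -- example workflow attribute
            changed_at DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
        );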

    The thing I'm struggling with is that the "main" table seems to have only one purpose: generating the PK. All the other data would be stored in the history table, because each piece of it might potentially change (this stores data that users alter in something like a workflow application). So the main table is reduced to functioning the way a sequence does in Oracle. It holds no data, and there's no real reason even to use it beyond its ability to create the record_id.

    This isn't really a problem, but it seems odd. Every time I think of other fields that might belong in the main table, such as a create-date, that argument disappears, because that date will be in the first record of the history table as well as in another audit table.

    I get the feeling that something obvious is slipping right past my thought process. Maybe a composite key fits here somehow.

    Though maybe I should just relax and allow this "main" table to hold only the record_id and recognize that its usefulness is limited to that - an SQL Server version of an Oracle sequence.

    Your thoughts are welcome!

  • The main table would still hold the entire current record. Even if the history table contains the current record, you don't want to get it from there, because of the additional overhead. That table could have dozens or hundreds (or thousands, or more) of copies of each row, one for each change. When you need the current version of a row, which is presumably the most common read, it's vastly less overhead to read it from a table that contains only the current version of each row instead of the entire history of all rows.

    - ScottPletcher, SQL DBA, SQL Server MVP (07, 08, 09)

  • Thanks for your thoughts. I'm not sure I agree: with proper indexes and queries, I don't think that getting the current record's data from the history table has to be a performance hit. But it's helpful to consider this from multiple perspectives.

    In the meantime, I have identified some attributes that would not change along with other updates, so they could reside in the main table to keep the record_id company.

  • Getting the current record HAS to be a performance hit. What you're talking about is a logging table. I use them all the time, and I prefer them for audit tables, but you only go into them when you need point-in-time data. Why?

    Because of the way indexes work and the mandatory sub-computation to find the most recent record.

    Whenever you use a logging (audit, whatever) table, to find the most recent record you have to go in on the key + Max(DateTimeStamp). So, every time you go in, you have to first aggregate, then pull the current record.
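
    The shape of that query, as a sketch (assuming the history table is named History and the audit column is DateTimeStamp):

        -- Aggregate first to find each record's latest stamp,
        -- then join back to pull the current rows.
        SELECT h.*
        FROM History h
        JOIN (
            SELECT record_id, MAX(DateTimeStamp) AS LastStamp
            FROM History
            GROUP BY record_id
        ) cur ON cur.record_id = h.record_id
             AND cur.LastStamp = h.DateTimeStamp;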

    This gets worse when you want to, say, pull all items with information in a secondary field, say products in category Z. There are two approaches, and both, at best, abuse indexing in ways it really doesn't want to go.

    First, you go in, find all recent records, then scan the results.

    Second, you go in and find all records with said category, then go in a second time for most recent record evaluation, then match up the two sets.
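
    Sketched with the same assumed columns (the category column is hypothetical), the "match up the two sets" version comes out something like:

        -- Category-filtered rows matched against most-recent rows.
        SELECT h.*
        FROM History h
        JOIN (
            SELECT record_id, MAX(DateTimeStamp) AS LastStamp
            FROM History
            GROUP BY record_id
        ) cur ON cur.record_id = h.record_id
             AND cur.LastStamp = h.DateTimeStamp
        WHERE h.category = 'Z';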

    You'll want to keep your histories out of the primary data points if you intend to do anything but work with a single 'record_id' at a time.


    - Craig Farrell


  • Thanks, Evil Kraig F.

    I see your points.

    I had thought to avoid some of that by having a column in the main table that would hold the current (i.e., max) history_id for that record_id, and then joining the tables on history_id.

    Rather than storing all the data for the current record twice, just get the data out of the history table where history_id equals the one stored in the main table. So every time the history table is updated, so is the main table, but not with the entire row of data, just the new history_id. With an index on history_id in the history table, pulling the current record this way should not be expensive, and it seems like a correct way to use an index.
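
    In rough T-SQL, extending the sketch from my first post (names are still placeholders):

        ALTER TABLE Main
            ADD current_history_id INT NULL REFERENCES History (history_id);

        -- Each change inserts a history row and moves the pointer,
        -- inside the same transaction (@record_id / @new_status are
        -- whatever the application passes in):
        BEGIN TRAN;
        INSERT INTO History (record_id, status)
            VALUES (@record_id, @new_status);
        UPDATE Main
            SET current_history_id = SCOPE_IDENTITY()
            WHERE record_id = @record_id;
        COMMIT;

        -- Reading the current version becomes a single-row join:
        SELECT h.*
        FROM Main m
        JOIN History h ON h.history_id = m.current_history_id
        WHERE m.record_id = @record_id;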

    Wrong?

    We will also be doing some flattening/denormalization for reporting purposes, so there won't be a need to crawl the history table on a regular basis.

  • I worked with a design for several years that allowed for versioned data and queries that ran extremely fast. You just have to make sure that you index the tables appropriately and write the queries in an optimal fashion. I've got two articles on Simple-Talk about these mechanisms. The first talks about the index structure. The second talks about the various methods of querying the versioned data.

    Short answer: instead of a main table and a history table, just have one table that stores everything. The methods outlined above have been in production, successfully, for years.
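
    For flavor only (the articles cover the real details), one common shape for pulling current versions out of a single versioned table, with purely illustrative names:

        -- Latest version of each root object from one versioned table.
        WITH ranked AS (
            SELECT d.*,
                   ROW_NUMBER() OVER (PARTITION BY d.document_id
                                      ORDER BY d.version DESC) AS rn
            FROM Document d
        )
        SELECT *
        FROM ranked
        WHERE rn = 1;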

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • There must be more fragmentation in a history table than in a current-data-only table.

    For single row lookups, this is not really a performance hit.

    But for table scans, even partial ones, it will be significant. Whether or not it will actually be noticeable depends on your environment: if you have extra capacity in your system to handle the added overhead, you may not notice the performance degradation. You could reduce fillfactor to reduce fragmentation, but that causes potential "dead space" issues, so it's not a panacea.
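
    For example (index name and value purely illustrative):

        -- Trade some dead space for fewer page splits on the history table.
        ALTER INDEX PK_History ON History
            REBUILD WITH (FILLFACTOR = 80);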

    - ScottPletcher

  • Replying to ScottPletcher (3/28/2013):

    Some, but what we found was that if the clustered index was built right, the storage was pretty consistent. The key structure had two values: the identity of the root object and the version of that object. That distributed the data very evenly in a known pattern. It really did work well, and it's been going for about eight years now. I haven't seen it personally in the last two, but it works.
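
    As a sketch of that key structure (names illustrative, not the actual production schema):

        CREATE TABLE Document (
            document_id INT NOT NULL,        -- identity of the root object
            version     INT NOT NULL,        -- version of that object
            body        NVARCHAR(MAX) NULL,  -- example payload column
            CONSTRAINT PK_Document
                PRIMARY KEY CLUSTERED (document_id, version)
        );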

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Replying to Grant Fritchey (3/28/2013):

    But some root objects will change a lot, and other root objects not at all. It's essentially impossible to handle that without fragmentation in the table. You could thus have extreme fragmentation in some areas of the table and virtually none in others. For example, newer rows would typically get updated much more frequently than very old rows.

    - ScottPletcher

  • Replying to ScottPletcher (3/28/2013):

    Yep. You will get some fragmentation, no question. You get fragmentation with almost any structure. But in this instance the fragmentation was never so horrible that it was noticeable, let alone that it brought the system down. The worst problem we had to deal with in the whole thing was keeping statistics up to date, but that was no different from the issues you already get with an identity column.
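
    For example, something as blunt as a scheduled full-scan update (table name illustrative):

        UPDATE STATISTICS Document WITH FULLSCAN;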

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning
