Tracking Record Changes in a History Table

  • I'll preface this by saying the table is terribly constructed and what I'm trying to do is a short-term solution. I've been given a table that is archived quarterly; it has about 300 columns and around 3 million records. The problem is that the source contains all records, active or inactive, and the archived tables need to be saved as-is so that they hold the same data as the source at quarter end. Of the 3 million records, maybe 200,000 have changes made to them during the quarter, so there is a lot of redundant, stale data throughout the archived tables. My solution is to create one history table and add effective date, expiration date, and current row indicator fields. From there I can create views to match the current quarterly structure while drastically cutting down on redundancy. To do this, I will need to match the source to the history table on the primary key and find whether any of the other 300 columns changed during the quarter. What is the best method/tool to use to do this?
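
    For illustration, a minimal sketch of the comparison I have in mind, with made-up names (dbo.SourceTable, dbo.HistoryTable, RecordId as the primary key, and Col1/Col2 standing in for the ~300 data columns). EXCEPT compares every listed column, NULLs included, without writing out 300 individual equality tests:

    DECLARE @QuarterEnd date = '20240331';   -- hypothetical quarter-end date

    -- 1) Keys whose source values differ from (or are missing in) the current history row
    DROP TABLE IF EXISTS #Changed;

    SELECT s.RecordId
    INTO #Changed
    FROM dbo.SourceTable s
    WHERE EXISTS
    (
        SELECT s.Col1, s.Col2 /* ...all ~300 data columns... */
        EXCEPT
        SELECT h.Col1, h.Col2 /* ...same columns... */
        FROM dbo.HistoryTable h
        WHERE h.RecordId = s.RecordId
          AND h.CurrentRowInd = 1
    );

    -- 2) Expire the current history rows for those keys
    UPDATE h
    SET h.ExpirationDate = @QuarterEnd
       ,h.CurrentRowInd = 0
    FROM dbo.HistoryTable h
    JOIN #Changed c
        ON c.RecordId = h.RecordId
    WHERE h.CurrentRowInd = 1;

    -- 3) Insert the new versions as the current rows
    INSERT dbo.HistoryTable (RecordId, Col1, Col2 /* ... */, EffectiveDate, ExpirationDate, CurrentRowInd)
    SELECT s.RecordId, s.Col1, s.Col2 /* ... */, @QuarterEnd, '99991231', 1
    FROM dbo.SourceTable s
    JOIN #Changed c
        ON c.RecordId = s.RecordId;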

  • I assume, therefore, that the table doesn't have some kind of "last updated date" field? Considering that you have an inactive/active marker, this suggests the inactive rows are history values, which would in turn suggest you do know when a row was last updated. I would start by looking there.

    Otherwise, comparing 300 columns and seeing if any of them differ will be an incredibly expensive task, especially when you have over 3 million rows. You could end up with logic where the data engine has to compare 300 values per row, and do that 3 million times (once for each row): around 900 million comparisons in total. OUCH.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Is there any possibility that you could add a RowVersion column to your source table?

    This, in combination with your PK, should allow you to identify rows which contain one or more modifications.
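
    A minimal sketch of the kind of query that makes possible, with hypothetical names (dbo.SourceTable, dbo.TargetTable, RecordId), assuming the target keeps a BINARY(8) copy of the source's VersionCol from the last load:

    SELECT s.RecordId
    FROM dbo.SourceTable s
    LEFT JOIN dbo.TargetTable t
        ON t.RecordId = s.RecordId
    WHERE t.RecordId IS NULL               -- new row
       OR t.VersionCol <> s.VersionCol;    -- modified since it was last copied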

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

    A last updated field would be too convenient, so naturally this doesn't have one. I can assume that a large majority of records are inactive based solely on some of the initial date fields in the row; it's definitely not 100% accurate, but it's enough to provide an estimate.

    That's what I was afraid of, but I was hoping for a miracle. When thinking about it a little differently, I have another question. Suppose a daily file with the same structure (300 columns, 3 million rows, and no last modified date field) is used to create a daily reporting table. Is it more beneficial to load the data into SQL and then truncate/insert into the reporting table, or to load into SQL and then only update/insert changed records based on the primary key?

  • Phil - From what I'm reading about RowVersion, that could be a potential solution going forward but won't help looking back at all of the archived tables to determine if changes have been made quarter-over-quarter.

  • RonMexico wrote:

    When thinking about it a little differently, I have another question. Suppose a daily file with the same structure (300 columns, 3 million rows, and no last modified date field) is used to create a daily reporting table. Is it more beneficial to load the data into SQL and then truncate/insert into the reporting table, or to load into SQL and then only update/insert changed records based on the primary key?

    If by 'more beneficial' you mean 'faster', I'd say that the answer is 'probably' ... but you'd have to test it.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • RonMexico wrote:

    Phil - From what I'm reading about RowVersion, that could be a potential solution going forward but won't help looking back at all of the archived tables to determine if changes have been made quarter-over-quarter.

    True, but if you fix it for the future, maybe you can accept the one-off hit for the initial load.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • To phrase my question a little better: after thinking about RowVersion or adding my own date added field, wouldn't I still have to load the file to SQL, compare the source to the target using the primary key, update/insert any changed/new records, and track accordingly with RowVersion or date added? To me that seems the same as what I'm trying to do quarterly, since it has to compare all fields and all rows to detect changes. I'm wondering if there is a better way to handle that.
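
    One possible way to avoid re-comparing ~300 columns on every load (just a sketch, with made-up names: dbo.StagingTable, dbo.ReportingTable, RecordId, Col1...) is to persist a per-row hash at load time and compare only that on subsequent loads:

    -- RowHash columns added to both tables; SHA2_256 returns 32 bytes.
    -- CONCAT_WS needs SQL Server 2017+, and it skips NULLs, so consider
    -- ISNULL/sentinel values on nullable columns to avoid false matches.
    ALTER TABLE dbo.ReportingTable ADD RowHash VARBINARY(32) NULL;
    ALTER TABLE dbo.StagingTable   ADD RowHash VARBINARY(32) NULL;

    -- After loading the daily file into staging, hash the non-key columns once
    UPDATE st
    SET st.RowHash = HASHBYTES('SHA2_256',
                               CONCAT_WS('|', st.Col1, st.Col2, st.Col3 /* ...all data columns... */))
    FROM dbo.StagingTable st;

    -- Update only the rows whose hash differs
    UPDATE rt
    SET rt.Col1 = st.Col1
       ,rt.Col2 = st.Col2
       /* ...remaining columns... */
       ,rt.RowHash = st.RowHash
    FROM dbo.ReportingTable rt
    JOIN dbo.StagingTable st
        ON st.RecordId = rt.RecordId
    WHERE ISNULL(rt.RowHash, 0x) <> st.RowHash;

    -- Insert keys that are new
    INSERT dbo.ReportingTable (RecordId, Col1, Col2 /* ... */, RowHash)
    SELECT st.RecordId, st.Col1, st.Col2 /* ... */, st.RowHash
    FROM dbo.StagingTable st
    WHERE NOT EXISTS (SELECT 1 FROM dbo.ReportingTable rt WHERE rt.RecordId = st.RecordId);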

  • It's not the same; see the following example:

    DROP TABLE IF EXISTS #Source;

    -- The "real" table: ROWVERSION is generated automatically on every insert and update
    CREATE TABLE #Source
    (
        CrapId INT PRIMARY KEY CLUSTERED
       ,VersionCol ROWVERSION NOT NULL
       ,SomeText VARCHAR(50) NOT NULL
    );

    INSERT #Source
    (
        CrapId
       ,SomeText
    )
    VALUES
        (1, 'Some really useful stuff that will not change')
       ,(2, 'This, however, will get overwritten');

    SELECT c.CrapId
          ,c.VersionCol
          ,c.SomeText
          ,Tablename = 'Source'
    FROM #Source c;

    DROP TABLE IF EXISTS #Target;

    -- The copy: VersionCol is plain BINARY(8), so it keeps whatever value is copied in
    -- rather than generating its own
    CREATE TABLE #Target
    (
        CrapId INT PRIMARY KEY CLUSTERED
       ,VersionCol BINARY(8) NOT NULL
       ,SomeText VARCHAR(50) NOT NULL
    );

    INSERT #Target
    (
        CrapId
       ,VersionCol
       ,SomeText
    )
    SELECT s.CrapId
          ,s.VersionCol
          ,s.SomeText
    FROM #Source s;

    SELECT t.CrapId
          ,t.VersionCol
          ,t.SomeText
          ,Tablename = 'Target'
    FROM #Target t;

    -- Changing a source row bumps its ROWVERSION value automatically
    UPDATE #Source
    SET SomeText = 'Some new useful text'
    WHERE CrapId = 2;

    SELECT s.CrapId
          ,s.VersionCol
          ,s.SomeText
          ,Tablename = 'Source'
    FROM #Source s;

    -- Only the rows whose version values differ are touched; no need to compare
    -- every data column individually
    UPDATE t
    SET t.VersionCol = s.VersionCol
       ,t.SomeText = s.SomeText
    FROM #Target t
    JOIN #Source s
        ON s.CrapId = t.CrapId
    WHERE s.VersionCol <> t.VersionCol;

    SELECT t.CrapId
          ,t.VersionCol
          ,t.SomeText
          ,Tablename = 'Target'
    FROM #Target t;

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Have a look at this link for a far better example than mine!

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • I understand the concept but not how it applies to the situation I'm working with. Suppose today I create a new target table with the VersionCol field you explained above by using a file to populate the data. That specific field is going to be assigned a random value. The disconnect I'm having is tomorrow I will have to import the data file (majority of it is unchanged, some will be new or updated records) into a staging/source table which will be assigned a different, random VersionCol value.  Therefore when I join the source and target tables on ID where the VersionCol does not match, all records will return as a result of the source table being assigned new VersionCol values at import. How do I prevent that from happening?

  • RonMexico wrote:

    I understand the concept but not how it applies to the situation I'm working with. Suppose today I create a new target table with the VersionCol field you explained above by using a file to populate the data. That specific field is going to be assigned a random value. The disconnect I'm having is tomorrow I will have to import the data file (majority of it is unchanged, some will be new or updated records) into a staging/source table which will be assigned a different, random VersionCol value.  Therefore when I join the source and target tables on ID where the VersionCol does not match, all records will return as a result of the source table being assigned new VersionCol values at import. How do I prevent that from happening?

    You never mentioned that the source of the data was a file ... in fact, you heavily implied that it was a table.

    The Rowversion has to be on the source table, of course.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • I called it a daily file in my hypothetical scenario, one that has the same structure as the tables. Sorry if that was not clear or got lost in my rambling. If the data is loaded from a file, am I stuck comparing on the ID in order to capture the change tracking?

  • RonMexico wrote:

    I called it a daily file in my hypothetical scenario, one that has the same structure as the tables. Sorry if that was not clear or got lost in my rambling. If the data is loaded from a file, am I stuck comparing on the ID in order to capture the change tracking?

    OK, that stuff about "Suppose a daily file ..." did puzzle me a bit!

    Good, SQL Server is; but do miracles, it cannot.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin
