Incremental Load with INSERT, UPDATE and DELETE on Fact Table with 100+ million rows

  • Hi,

    I have to design an incremental load on a SQL Server 2016 fact table with 100+ million rows.  In most data warehouses this wouldn't be a problem since I could take the existing maximum date from the fact table and search the source for records that have been created or updated on or after the maximum date with time included of course.  And then I would insert it into a staging table and run a merge for the new and modified records.  Nonetheless the product owner has informed me that deleted records have to be considered as well.  So if a record is deleted at the source it should also be done in the fact on the next load.  In other words every insert, update and delete has to be carried through.

    This is a problem because the source team (they use CRM dynamics) have told me that no change tracking or CDC is available nor will it be implemented.  If this was a SQL Server source I would use CDC.  Nonetheless this part is non-negotiable unfortunately.  I've thought about this and searched the net and have only come up with the following very inefficient strategy in my data flow task:

    - Use Script Component to query the source and OLE DB to query the SQL Server fact table
    - Use a MERGE JOIN on the two with a full outer join on full Source and Destination Table
    - Do a conditional Split:  New Rows = SourceID != NULL AND FactTableID == NULL, Deleted Rows = SourceID == NULL AND FactTableID != NULL, Updated Rows = SourceID @= NULL AND FactTableID != NULL
    - Use a derived column to distinguish rows (Deleted = 0, Inserted = 1, Updated = 2)
    - Insert into staging table
    - In the control flow run a merge to either insert, update and delete rows

    What bothers me is that this approach relies on fully joining a big source with a big fact table which will take ages.  Given the following considerations:

    - I'm using SQL Server 2016 Enterprise Edition, SSIS
    - The load is to happen once nightly 
    - Source = CRM but CDC or Change Tracking is not available and will not be implemented
    - I cannot use replication or mirroring or anything like that because certain fields are modified, left out when importing from the source to the fact table
    - I am not allowed to use TRUNCATE TABLE und then INSERT because the load has to be incremental

    Is there a better/faster way to do this?  

  • This is a tough requirement, I sympathise.
    MERGE JOIN in SSIS is not something I would recommend for such high volumes.

    Maybe you could split this into two parts: the INSERT/UPDATE bit, followed by a DELETE bit.

    INSERT/UPDATE is standard:
    data from source --> Conditional split --> Inserts to fact table --> updates to staging table, followed by
    UPDATE from the staging table to the fact table.

    The DELETE is more tricky. One idea is
    Primary key data (only) from source --> staging table, followed by
    DELETE FactTable
    WHERE NOT EXISTS (select from StagingTable match on PK)
    If the deletes are very numerous, you may want to batch the deletes to avoid excessive trans log bloat.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • A few thoughts: Understand that CDC is not going to be implemented, but have a close look at the source schema again. It is my understanding that Dynamics CRM contains a Last Modified Date by default in all entities (tables), and deleted records are not deleted from the CRM back-end tables but rather flagged with a State Code of "1" for inactive/deleted records.

    For large fact tables, I would also recommend that you implement partitions. Deleted records can be flagged in your fact table, and moved out to a separate partition during off-line hours and the partition removed. Partitions will also help during your load, as you can split the ETL process up as well as process partitions in parallel.

    As Phil suggested, I would also not recommend performing functions like full outer joins and merges in SSIS. In my opinino, staging the data and using t-sql will perform much better.

  • CRM can be a pain for this.

    I assume that this CRM database is either OnPrem or Cloud PaaS.
    If not then no access to the database will be allowed so you can't use anything else other than the supplied CRM Api to extract data.

    Before I go any further just one bit of clarification on what Martin Schoombee said.
    Versions of CRM prior to CRM 2011 did indeed do a logical delete of the records.

    Starting with 2011 that is no longer the case and the records are physically deleted.

    Back to the process.

    I suspect that the fact that CT or CDC is not available or will not be implemented is because it is not supported by Microsoft CRM Dynamics - neither is using replication (although this can be implemented with certain constraints).

    Depending on the CRM version you have you may be able to turn on CRM tracking itself - CRM 2015 and higher have it out of the box and it is supported. This could enable you to do the deltas in a possibly easier way as it would give you create/insert/deletes in one go.
    This particular functionality does have some issues and has an impact on the CRM database so use with care if available on your CRM version.

    The other option is to use the standard process of identifying new/modified and deletions.

    As you probably know by now CRM used uniqueidentifiers for its primary keys - these will always be unique throughout all the tables on CRM.

    For new/modified field modifiedon is the one to use - every CRM entity (not tables as depending on the CRM version a Entity can be made of several tables) has it and it is always updated on both creation and modifying of any field on the table.
    Once you extract the records from CRM where records have a modified date greater than your last extract and when joining to your fact table you know if a record is new if the primary key of the entity does not exist on your fact table.

    CRM also has another table that contains a list of all deleted records (SubscriptionTrackingDeletedObject) - this table contains the objecttypecode and the primary key value of the deleted records. these are kept on this table for awhile until they are deleted by the cleanup job (Dynamics CRM 2011 DeletionService Async Maintenance Operation)
    - this may be a option to identify the deleted records assuming the tables you are extracting from are ones that do have the records added to this table and assuming you do perform the extract before the cleaning job acts.

    And the final option to identify deletes, as Phil mentioned, is to extract the primary key only from the source and then deleting from the fact table any record that does not exist on the primary key table

    With regards to the modified records - be aware that any modification at all of the record does cause the modified on date to be changed - and many times the modification can be on a field that you do not care about on your fact table so always check if the modification is one that requires you to do a update and/or a SCD change.

  • Thank you very much for your help and responses for this tough requirement.  After many emails sent back and forth the product owner agreed that a sliding window solution is acceptable.  By this I mean that I pick up data from the source where Date >= MAXDATE - 1 month and then do a delete on the corresponding fact table (where Date >= MAXDATE - 1 month as well) if the IDs don't match.  There will be a business rule that data older than one month cannot be deleted.

    Martin Schoombee - Thursday, June 8, 2017 8:38 AM

    For large fact tables, I would also recommend that you implement partitions. Deleted records can be flagged in your fact table, and moved out to a separate partition during off-line hours and the partition removed. Partitions will also help during your load, as you can split the ETL process up as well as process partitions in parallel.

    I considered partitioning but this seems to work well if you are just doing a DELETE or INSERT.  For example:
    - INSERT new records into staging table
    - SWITCH partition to point to fact table

    But my question is how would you do this for INSERT, UPDATE, and DELETE all at once.  As far as I know you can't partition updates.

  • Addicted2SQL - Friday, June 9, 2017 3:57 AM

    But my question is how would you do this for INSERT, UPDATE, and DELETE all at once.  As far as I know you can't partition updates.

    My note on the use of partitions is not so much to switch them in and out (although it would be possible to do that if you decide to soft delete records), but rather to optimize the costly update/delete processes as it would then be possible to multi-thread and process in parallel.

  • Ok a problem has come up with the sliding window strategy.  Say we have a fact table and a dimension table:  FactEntry and DimProduct.  If I were to consider all data for deletion where Date >= MAXDATE(for the specific table) - 90 days, consider the following example (by MAXDATE I mean the last time it was updated in the source):

    DimProduct (MAXDATE = 20160701) - entries not matching source
    (ID, LastUpdated)
    (10, 20160601) - will be deleted
    (20, 20160401) - will be deleted
    (30, 20160301) - will NOT be deleted since LastUpdate < MAXDATE - 90 days

    FactEntry (MAXDATE = 20161001) - entries not matching source
    (ID, ProductID, LastUpdated)
    (100, 10, 20160901) - will be deleted
    (101, 10, 20160801) - will be deleted
    (102, 20, 20160701) - will be deleted
    (103, 20, 20160601) - will NOT deleted since LastUpdate < MAXDATE - 90 days

    The problem is that although DimProduct(20, 20160401) will be deleted, FactEntry(103, 20, 20160601) will not be.  That means that this FactEntry record will refer to a non-existing ProductID and hence be orphaned.  We are not allowed to enforce any foreign key integrity constraints so my question is how do you get around this issue?  I guess you can use the smallest MAXDATE from FactEntry and DimProduct so in this case MAXDATE = 20160701 for both the fact and dimension table.  However what happens if I have a dimension that rarely or sometimes changes that has a MAXDATE = 20140101.  If this is the case then I would be pulling 3 years worth of data for each table.

    Any ideas?

  • I would revise that approach - most likely you only need to do that for the fact table, not the dimensions as those should be by their own nature smaller.

    and did you had a look at the other options mentioned

    And... 100 Millions isn't that much - on my CRM installation we have several tables with over that value - one of them biggest has 600 million rows

    With regards to the deletes one thing you can speak with the other team is to have a database on their server that only holds the keys of the tables you are copying.
    This approach also makes your life easier as it also identifies the inserts and updates

    Something like

    Accountkeys
    (accountid uniqueidentifier
    ,modifiedon datetime
    )

    accountstaging
    (accountid uniqueidentifier
    ,modifiedon datetime
    ,status char(1)
    )

    -- first step on your extract executes a set of sql's to populate the above tables (one per entity you are extracting)

    truncate table accountstaging -- always start process with a clear table
    insert into accountstaging
    select coalesce(ac1.accountid, ack.accountid) as accountid
      , ac1.modifiedon
      , case
       when ac1.accountid is null then 'D'
       when ack.accountid is null then 'I'
       when ack.modifiedon <> ac1.modifiedon then 'U'
       end as status
    from Account ac1
    full outer join Accountkeys ack
    on ack.accountid = ac1.accountid
    where ac1.accountid is null -- gives deleted records
    or ack.accountid is null -- gives new records
    or ack.modifiedon <> ac1.modifiedon -- gives update records

    -- second step on your extract joins main crm table with corresponding staging keys table

    select acs.status -- retrieve this column at this point as it will tell you immediately if it is a update or insert
      , ac1.* --- select only required fields
    from accountstaging acs
    inner join account ac1
    on ac1.accountid = acs.accountid

    Third step retrieves deletes
    select accountid
    from accountstaging
    where status = 'D'

    -- final step updates the keys table so its ready for next extract

    -- could be a merge or a set of 2 statements as per below

    -- delete records from accountkeys table that have been process by this run
    delete ack
    from accountkeys ack
    inner join accountstaging acs
    on acs.accountid = ack.accountid

    insert into accountkeys
    select accountid
      , modifiedon
    from accountstaging acs
    where acs.status <> 'D'

    This assumes that this is on premises as I mentioned before. but you didn't tell us that yet.

  • Thanks for the reply but unfortunately I was told that I can't expect any modifications from the CRM team in the near future and I have to draw up a design strategy pretty soon therefore the sliding window strategy seems to be my only alternative so far.  Your solution seems really cool and I wish I could implement it but the CRM team doesn't have the budget to support me in the next year.  :crying:  And yes the database and CRM are on premises.

    frederico_fonseca - Saturday, June 10, 2017 4:06 AM

    I would revise that approach - most likely you only need to do that for the fact table, not the dimensions as those should be by their own nature smaller.

    Could you elaborate on this?  The problem is that deletes also need to be considered for DimProduct so if I just leave the MAXDATE out and do a MERGE whereby I consider the whole table, I will still have this problem all entries not found in the source would be deleted.  Have I misunderstood what you were trying to say?  In either case I would like to hear more on your thoughts.

  • Addicted2SQL - Sunday, June 11, 2017 12:57 AM

    Thanks for the reply but unfortunately I was told that I can't expect any modifications from the CRM team in the near future and I have to draw up a design strategy pretty soon therefore the sliding window strategy seems to be my only alternative so far.  Your solution seems really cool and I wish I could implement it but the CRM team doesn't have the budget to support me in the next year.  :crying:  And yes the database and CRM are on premises.

    frederico_fonseca - Saturday, June 10, 2017 4:06 AM

    I would revise that approach - most likely you only need to do that for the fact table, not the dimensions as those should be by their own nature smaller.

    Could you elaborate on this?  The problem is that deletes also need to be considered for DimProduct so if I just leave the MAXDATE out and do a MERGE whereby I consider the whole table, I will still have this problem all entries not found in the source would be deleted.  Have I misunderstood what you were trying to say?  In either case I would like to hear more on your thoughts.

    Point 1 - CRM team would not need to support you - only the DBA team in order to create a new database on their SQL Server instance - once that one is done all remaining work is up to you and your own team.
    This solution would probably be the one with less impact on their server in terms of data selection and data transfer to your own server.
    This database would also be quite small based on the number of records you mentioned so space would also not be an issue, and in any case any disk space required should be paid by your own project.

    Point 2 - Dimension tables are normally a lot smaller so you probably could transfer them all every time.
    As for the possible deletes from the dimension tables you would need to consider 2 aspects
    - The "deleted" dimension records from CRM do not have a associated on any of the other fact tables - in this case they can also be deleted from the dimension table
    - The "deleted" dimension records are associated with at least one other record on the DW which is not getting deleted - in this case you need to keep the record.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply