SCD in SSIS

  • vcvcvcHello,

    I am new to ssis and i am a little bit confused about the SCD_Start_Date and SCD_End_Date when we use the SCD dimension wizard. on many examples that i have searched online people always use the "system start time" or "system container " start time. but i am implementing a data warehouse from a scratch but our company have multiple data sources from two or three years ago. so when i upload our data source from this two years into our dw after running the package it will obviously show that, for example, a seller (seller Dimension) changed from department in the day i ran that package but this is wrong because i want to have a reference date that is in the data source. isn't this the logic or am i missing something? how do you guys in BI upload old data source with a reference date?

  • Do you have actual historical data?  I.e. you know what the value was on Jan1, and you also know what the value was on Jan2.  Do you have each change over time captured in your data?

     

  • i do have. it is pretty much the same as this post from this user that i found later. trying to follow from here. https://dba.stackexchange.com/questions/223783/create-slowly-changing-dimension-table-from-repeated-data

    • This reply was modified 2 years, 7 months ago by  rafamaniac.
  • Sorry I didn't reply sooner.  You're in a great position to build up your SCD2 dims.

    SCD2 is designed to detect and capture change over time.  On Monday you get a record that says [Name]='Rick Danger', on Tuesday the same record says [Name]='Batman'.  SCD2 detects the change, deactivates the current record, and makes a new record.  Day after day its discovering and archiving changes.  When you first turn it on, SCD2 is starting from scratch.  It is intended to begin capturing changes to data, and creates your history as you run your process day to day.

    With source data that already HAS historical data, you don't have to do that.  Your source data IS your SCD2 data, essentially.

    Let me be clear on my assumption about your source data - you have historical dimensional information in the source.  Account names, reporting hierarchies, etc - changes are captured in new records as they occurred.  The tables would look like audit tables, tracking any change to the source table.  If that is true then...

    Process each source table as an SCD0 dimension (i.e. no change tracking mechanism).  Your source data will have a date column, you may have to compute an EndDate column (use SQL LEAD and LAG functions to set one row's end date as the next row's start date).  You could add an Active flag column if you like (case when [EndDate] is null then 1 else 0 end).  All other dimension columns should ETL as-is, except for cleanup, trimming strings, etc.  When you're done the dimension should look very similar to the source table.  You could set this up as incremental or trunc and load - either will work fine.

    Whatever reporting tool you're using can use those date columns and/or active column to get the SCD2 benefit.  If fact as a double-check on my thinking, you should be able to do SCD2-type queries against your source data right now, if indeed it is structured like I think it is.  Excel and a pivot table is all you need to validate the approach before you do any work.

    If I've misunderstood the data situation, let me know and I'll suggest a different approach.  Please ask any questions, happy to help.

    Rick

     

    Then,

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply