Slowly changing dimensions using T-SQL MERGE

  • Actually, you can do both. Put all the fields you wish to carry history on in the ON clause for matching. This tells the MERGE that you want a new record whenever something doesn't match, which is whenever there is new information in a field you carry history on. For the update-only fields, put an update clause on the fields you wish to update in the WHEN MATCHED clause. For static fields, just leave them in the insert and don't make them part of the update.

    It's easier if I just show you the template I use for my MERGE statements:

    --JA: MERGE source with destination
    MERGE dbo.DimCustomer Destination
    USING --JA: source
    (
        SELECT
             DC.[HistoryCol1]
            ,DC.[UpdateOnlyCol1]
            ,DC.[HistoryCol2]
            ,DC.[HistoryCol3]
            ,DC.[UpdateOnlyCol2]
            ,DC.[StaticCol1]
        FROM dbo.DimCustomerStage DC
    ) AS [Source]
    (
         [HistoryCol1]
        ,[UpdateOnlyCol1]
        ,[HistoryCol2]
        ,[HistoryCol3]
        ,[UpdateOnlyCol2]
        ,[StaticCol1]
    )
    ON (
            [Source].[HistoryCol1] = Destination.[HistoryCol1]
        AND [Source].[HistoryCol2] = Destination.[HistoryCol2]
        AND [Source].[HistoryCol3] = Destination.[HistoryCol3]
    )
    --JA: here I specify that, to qualify for an update, the checksums of the two rows must not be equal
    --    (a change has occurred in the source) - a checksum is faster than checking every column individually
    WHEN MATCHED
    AND BINARY_CHECKSUM
        (
             Destination.[UpdateOnlyCol1]
            ,Destination.[UpdateOnlyCol2]
        )
        != BINARY_CHECKSUM
        (
             [Source].[UpdateOnlyCol1]
            ,[Source].[UpdateOnlyCol2]
        )
    THEN UPDATE SET --JA: put only the columns you wish to update here - fewer columns = better performance
         Destination.[UpdateOnlyCol1] = [Source].[UpdateOnlyCol1]
        ,Destination.[UpdateOnlyCol2] = [Source].[UpdateOnlyCol2]
    WHEN NOT MATCHED BY TARGET THEN --JA: when no matching row can be found, insert
    INSERT
    (
         [HistoryCol1]
        ,[UpdateOnlyCol1]
        ,[HistoryCol2]
        ,[HistoryCol3]
        ,[UpdateOnlyCol2]
        ,[StaticCol1]
    )
    VALUES
    (
         [Source].[HistoryCol1]
        ,[Source].[UpdateOnlyCol1]
        ,[Source].[HistoryCol2]
        ,[Source].[HistoryCol3]
        ,[Source].[UpdateOnlyCol2]
        ,[Source].[StaticCol1]
    )
    ; --JA: MERGE must be terminated by a semicolon

    Also, if you plan on cubing your warehouse: I've been told by Will Brown, a Microsoft expert in SSAS, that you get some performance benefit from snowflaking your warehouse, because the cubes you build from it then don't have to process as many distinct values during cube processing. I guess it depends on how much of your querying will be via your cube versus directly against your warehouse.


  • Thank you! Great article!

    I have a question about SCD Type 2.

    Can we compare all the columns in the ON clause, like this:

    MERGE INTO dbo.Client_SCD2 AS DST
    USING dbo.Client AS SRC
        ON SRC.ID = DST.BusinessKey
        AND DST.ClientName = SRC.ClientName
        AND DST.Country = SRC.Country
        AND DST.Town = SRC.Town
        AND DST.Address1 = SRC.Address1
        AND DST.Address2 = SRC.Address2
        AND DST.ClientType = SRC.ClientType
        AND DST.ClientSize = SRC.ClientSize

    and then use

    WHEN NOT MATCHED BY SOURCE THEN
        UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday

    rather than having this part of the code:

    WHEN MATCHED
        AND IsCurrent = 1
        AND (   ISNULL(DST.ClientName,'')  <> ISNULL(SRC.ClientName,'')
             OR ISNULL(DST.Country,'')     <> ISNULL(SRC.Country,'')
             OR ISNULL(DST.Town,'')        <> ISNULL(SRC.Town,'')
             OR ISNULL(DST.Address1,'')    <> ISNULL(SRC.Address1,'')
             OR ISNULL(DST.Address2,'')    <> ISNULL(SRC.Address2,'')
             OR ISNULL(DST.ClientType,'')  <> ISNULL(SRC.ClientType,'')
             OR ISNULL(DST.ClientSize,'')  <> ISNULL(SRC.ClientSize,'')
            )

  • That probably won't work, because what if your keys don't match? I would imagine they won't all be the same key, so you'll end up closing the time period for every row in your dimension and marking every record as not current.

    First you have to have a match on a key. At that point you can check whether the other fields match or not.
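
    For example, here is a minimal sketch of that structure, reusing the table and column names from the question above (only two of the attribute columns are shown; the rest would be compared the same way, and @Today/@Yesterday are assumed to be declared elsewhere):

    MERGE INTO dbo.Client_SCD2 AS DST
    USING dbo.Client AS SRC
        ON SRC.ID = DST.BusinessKey                 -- match on the business key only
    WHEN MATCHED
        AND DST.IsCurrent = 1                       -- only test the current version of the row
        AND (   ISNULL(DST.ClientName,'') <> ISNULL(SRC.ClientName,'')
             OR ISNULL(DST.Country,'')    <> ISNULL(SRC.Country,''))
    THEN
        UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday   -- expire the old version
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, ClientName, Country, ValidFrom, IsCurrent)
        VALUES (SRC.ID, SRC.ClientName, SRC.Country, @Today, 1);
    -- A changed row's new current version still has to be inserted separately,
    -- e.g. by selecting from the MERGE's OUTPUT clause.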


  • OK, thank you. This is the whole code I wanted to use:

    MERGE INTO dbo.Client_SCD2 AS DST
    USING dbo.Client AS SRC
        ON SRC.ID = DST.BusinessKey
        AND DST.ClientName = SRC.ClientName
        AND DST.Country = SRC.Country
        AND DST.Town = SRC.Town
        AND DST.Address1 = SRC.Address1
        AND DST.Address2 = SRC.Address2
        AND DST.ClientType = SRC.ClientType
        AND DST.ClientSize = SRC.ClientSize
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, ClientName, Country, Town, County, Address1, Address2, ClientType, ClientSize, ValidFrom, IsCurrent)
        VALUES (SRC.ID, SRC.ClientName, SRC.Country, SRC.Town, SRC.County, SRC.Address1, SRC.Address2, SRC.ClientType, SRC.ClientSize, @Today, 1)
    WHEN NOT MATCHED BY SOURCE THEN
        UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday;

    Also, I suppose it will have problems with NULL values. And this UPDATE will always go through all the records in the target table that were deactivated before.

  • I don't see an update statement in your MERGE.

    Not sure about how to handle NULLs. My whole warehouse is NOT NULL.
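
    If NULLable columns do become a concern, one NULL-safe alternative to wrapping every column in ISNULL is the INTERSECT pattern, which treats two NULLs as equal. A sketch using a few of the columns from the example above (the remaining columns would simply be added to both SELECT lists):

    WHEN MATCHED
        AND DST.IsCurrent = 1
        AND NOT EXISTS (
            SELECT DST.ClientName, DST.Country, DST.Town   -- values currently in the dimension
            INTERSECT
            SELECT SRC.ClientName, SRC.Country, SRC.Town   -- incoming values; an empty intersection means something changed
        )
    THEN
        UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday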


  • Great article, just what I needed. Thanks

    I made a slight modification to the SCD2 variant, just to be safe, to avoid erroneous data when the procedure is executed on an earlier day than the last update.

    I admit, this does not happen when using GETDATE() to set today, but my procedure has a date input, so I decided to be sure no updates are made on records which have ValidFrom > @Today.

    So I added a statement to the WHEN MATCHED condition:

    WHEN MATCHED
        AND IsCurrent = 1
        AND ValidFrom <= @Today   -- Never allow updates to overwrite future records
        AND (ISNULL(DST.ClientName,'') <> ISNULL(SRC.ClientName,'')

    ...

    Thanks once more

    Herman

  • Thanks for that suggestion, amenjonathan. What you mention is precisely what we are doing. We are bringing a copy of the production source tables into a staging server and then using those copies to build our warehouse tables. For a number of reasons we can't use replication, log shipping, etc., mostly per our agreement with the software vendor. Many of these tables are very large, so bringing over entire copies of them each day is not desirable and would take too long. It is best to update/sync the existing tables with production. I should also mention that we use some of these copied tables to do live reporting and also to audit/track changes to some of the more important info, so they need to update throughout the day to stay in sync. Our production system stays up 24/7, so we have to use means that won't cause any performance issues with it. Since most of the data does not change from day to day, that's where we thought the MERGE command would be very useful for updating those table copies on the staging server. It seemed to be a perfect fit, but it just wouldn't perform well enough.

    As for the use of checksum, fortunately we have a last-modified datetime field on every record, so we really don't need to use that. Performance-wise, the inserts and updates run very quickly because they are easily identified. The slow part of syncing is making sure you delete the records in the copy that have also been deleted in production. In order to do that, I have to pull down a list of all the primary keys from the production table and compare those to the copy to make sure they still exist. I have sync procedures set up for each table, with parameters to control the types of updates allowed. We have a job that runs every 10 minutes to keep them all in sync for inserts and updates. Once every 3 hours it will also perform any deletes, since that takes about 4 times as long to run as a sync without them. So far I've been amazed at how well it all works without impacting production at all. We ran traces for quite a while just to make sure. What I found out in doing this is just how differently a query will perform when run locally versus when run over a linked server. There can be huge differences even if the queries are simple.
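
    A rough sketch of that delete step, with hypothetical names (dbo.CustomerCopy stands for the local copy and staging.ProductionKeys for the list of primary keys pulled down from production):

    -- Remove rows from the local copy whose keys no longer exist in production
    DELETE C
    FROM dbo.CustomerCopy AS C
    WHERE NOT EXISTS (
        SELECT 1
        FROM staging.ProductionKeys AS K
        WHERE K.CustomerID = C.CustomerID
    );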

  • Hi dwalker,

    This is extremely interesting, and thanks for sharing.

    I will try and find the time to look closer at this particular problem, as I had no idea that the speed difference was so great.

    Adam

  • Adam,

    Fantastic article. Your writing style is very concise and informative. I am forwarding this to all my teams, great job!

    Scott

  • Adam,

    Thank you very much for this article. I have been searching around for a more intelligent way to handle massive, bulk-loaded datasets of user information. We have customers that basically send us entire refreshes of their data on daily and weekly frequencies, and others that only send us updates. I need to speed up this process tremendously and stop performing delete-then-insert operations on entire groups of users. Instead, I need to post updates only when data really changes from load to load.

    I was particularly thrilled with some of the speed enhancements suggested by others, such as using CHECKSUM() over a range of columns as opposed to string compares everywhere. I will be testing it out both ways to see if I can quantify the differences in performance. Some of our data loads are small, 2,000 records or fewer, but we have some problem children in the 200,000+ range.

    Very good article, and I will make a point to search for more of these from you...your coding style makes it very easy to read and understand the T-SQL code.

    Larry

  • In trying to use this article to tailor a solution for my organization, I have encountered a bit of a snag.

    Our "user" information is kept in two tables, which we call UserAuthentication and Users. The UserAuthentication table contains the IDENTITY column and gets created first; its IDENTITY value is captured after the INSERT with SCOPE_IDENTITY(), and this value is then used when creating a new Users record with an INSERT statement.

    Assuming that I have all of the records to be merged in a staging table which contains fields destined for both the UserAuthentication and Users tables, how might I most efficiently use MERGE statements to create records in both tables? Should I structure a stored procedure to handle this as a MERGE against the UserAuthentication table first, followed by a MERGE against the Users table?

    Would another solution be to perform the MERGE against the UserAuthentication table, then capture the resulting INSERT records into a temp table that I can use as a basis for INSERTS on the Users table?
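
    One way to sketch that second approach (all object and column names here are hypothetical stand-ins for the real schema) is to let the MERGE's OUTPUT clause capture both the new IDENTITY value and a source key; unlike a plain INSERT, the OUTPUT clause of a MERGE can reference source columns:

    DECLARE @NewAuth TABLE (UserAuthenticationID INT, StageID INT);

    MERGE dbo.UserAuthentication AS Destination
    USING dbo.UserStaging AS [Source]
        ON [Source].LoginName = Destination.LoginName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (LoginName, PasswordHash)
        VALUES ([Source].LoginName, [Source].PasswordHash)
    OUTPUT inserted.UserAuthenticationID, [Source].StageID    -- new identity + staging key
        INTO @NewAuth (UserAuthenticationID, StageID);

    -- Use the captured identity values to create the matching Users rows
    INSERT INTO dbo.Users (UserAuthenticationID, FirstName, LastName)
    SELECT NA.UserAuthenticationID, US.FirstName, US.LastName
    FROM @NewAuth AS NA
    JOIN dbo.UserStaging AS US
        ON US.StageID = NA.StageID;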

  • In the Type 4 example, I see that you have a "ValidFrom" column in Client_SCD4_History. But it never gets populated, since Client_SCD1 doesn't have a ValidFrom date.

    I see you are populating the ValidTo date in the type 4 history table, which is good.

    Presumably, if you're using Type 4, all records in the main table (Client_SCD1) are "currently" valid. It would be nice to see the code that populates the ValidFrom date in the Type 4 history table. But since you're doing Type 4 based on what the Type 1 table needs, plus the history table, you haven't kept the ValidFrom date from the last time the record was updated. The Type 1 table doesn't have it.

    Am I right that there's something missing here?

    David Walker

  • You are quite right, David, this should have been added to ensure completeness. Thanks for noticing this.

  • Hi everyone, and thank you Adam for this great article.

    I'm wondering if it is possible to have a mixed approach in one table, something like a Type 1 and Type 2 SCD in the same table. Suppose that in the customer dimension you have a couple of columns that change very often and some others that rarely do. This is not a typical "slowly changing" scenario, but if for any reason you have to face this situation, what would you do?

    I would like to update only the columns that frequently change (Type 1) and record historical values for the other columns (Type 2). I think this is possible with the SSIS Slowly Changing Dimension transformation, but due to its performance issues it is not possible for me to use it. In that transformation you can mark a single column as a "Changing Attribute", "Historical Attribute", or "Fixed Attribute".

    Any ideas or clues about how to accomplish this with the MERGE statement? At the moment I've created two tables, one to store the columns that often change and one to be treated as a classic SCD.

    Any comment would be appreciated.

    Paul Hernández
  • Found this when looking for good examples. Great article! I noticed that one of the fields is left out of the WHEN MATCHED clause in the SCD2 version. "Country" is there, but not "County". If you modify any of the counties in the source table, it won't create a new record.

