Loading from incrementals and dealing with the discrepancies of projection vs join

  • Given:

    1) A source system composed of two tables, T1 and T2, where T2 foreign keys into T1

    2) A data warehouse which includes a denormalized star-schema-based dimension, T, containing the join of T1 and T2

    3) An incremental loading strategy, which, each day, provides delta(T1) and delta(T2) to the staging system (let's call the staging tables T1s and T2s)

    Take a situation where T1s has rows, but T2s is empty. (IE, back in the OLTP system, new "children" have been added to an existing T1 "parent".)

    One cannot insert into the dimension the result of full or outer joining, T1s and T2s, since either way values required for the denormalized dimension tuple will be absent.

    One possible solution would be to include the existing data warehouse data in the query as the "fallback value". Something along the lines of:

    insert T (...)

    select coalesce(T1s.column1, T.column1), coalesce(T1s.column2, T.column2), coalesce(T2s.column1, T.column3) -- ... etc

    from T1s

    full join T2s on (...)

    left join T on (...)

    However it's not the only solution. One could also use a combination of insert, update and union, or prepare the dimension in an intermediary using historical data from the staging, or from the source, or any number of other possible patterns.

    Which way have people generally found to be the most reliable/highest performing/easiest to mantain? (Pick any combination 😛 )

  • First update your normalized, history-preserving view of the data in the Warehouse with the data that changed. Then populate the denormalized star-schema Mart from that. The Mart may even be simply a layer of views onto the Warehouse.

    Avoid creating denormalized data direct from sources. Doing so invariably leads to loss of integrity and granularity and makes the model much harder to maintain.

  • Yep, that is what is being done.

    Source (T1, T2) > staging (still in shape of source, ie, T1s, T2s) > transformation > denormalized warehouse. It's at the transformation > warehouse step that this question is directed (still applies even if the warehouse is just a logical layer of views, since the same kind of query is needed)

  • Maybe I wasn't very clear. I meant that if you update the normalized warehouse first then populating a mart is as simple as an inner join between those two tables in the warehouse. It doesn't matter which table gets updated last because the warehouse should preserve every change that's of interest. When I said "history-preserving view" I simply meant that the tables in the warehouse are the Subject Oriented, Integrated, Time-variant view of all the data.

  • Sorry, I'm not following. From the way you've described it, I'm thinking that either:

    1) a step is missing, or

    2) the data mart is a complete truncate/reload, not incremental.

    Why I think this:

    If:

    - the denormalized dimension, dim_T, is logically a join of source tables T1 and T2, and

    - source data is moved into staging as daily incrementals, in the form delta_T1 and delta_T2

    then:

    - one cannot join, left join or full join delta_T1 and delta_T2 in order to generate the set required to incrementally update dim_T

    Unless:

    - delta_T1 and delta_T2 are used to update normalized tables warehouse_T1 and warehouse_T2, and then dim_T is truncated and reloaded from the join of warehouse_T1 and warehouse_T2. Is that what you are suggesting?

    Unfortunately that is not always possible. For instance, one of our BI vendors provides an API that imports incremental change files into the star schema. The incremental change files are structured per the denormalized star schema specific to the vendor, not per our source systems (which, of course, the vendor has never seen before).

    So how would you generate the necessary incremental file? This brings us back to the problem of not being able to join, left join or full join delta_T1 and delta_T2.

  • This is the sequence I would expect:

    1. Deltas come into some staging table

    2. Surrogate keys are assigned in staging table

    3. If staging table has a foreign key then that is replaced by the warehouse surrogate key - either by an INNER join to that table in the warehouse or by an INNER join to an intermediate staging table. Assuming the foreign key is enforced at source then this will always be an INNER join to the latest version of the parent row because the parent must have been created at some time in the past even if the referencing (child) rows were modified later.

    4. Table is loaded to warehouse (UPDATE and INSERT or MERGE)

    5. The mart is then the INNER join of the warehouse tables (or OUTER JOIN only if you want to include parent rows for which there are no children). This is where dim_T is created. Mart tables can be populated incrementally by UPDATE/INSERT/MERGE or may exist just as views.

    - one cannot join, left join or full join delta_T1 and delta_T2 in order to generate the set required to incrementally update dim_T

    You don't need to. Foreign keys should get converted to history-preserving surrogate keys as part of the regular staging process by either looking up the parent row in the warehouse or looking it up in some intermediate table from previous loads (the intermediate table can be efficient because it has fewer columns and exists in staging rather than warehouse but it isn't essential).

    - delta_T1 and delta_T2 are used to update normalized tables warehouse_T1 and warehouse_T2, and then dim_T is truncated and reloaded from the join of warehouse_T1 and warehouse_T2. Is that what you are suggesting?

    I wouldn't suggest recreating the entire mart. You can identify what changed just based on the change dates, which should exist in every table, not just the denormalized ones. Maybe you could post some DDL and sample data to illustrate what you mean.

  • Ah, OK I see where you are coming from.

    The problem that I was getting at is not so much the lack of integrity in the deltas so much as the lack of attribute values.

    IE, if in the souce T1 is (a, b) and T2 is (a,c,d), with T2(a) foreign keying into T1(a), then dim_T needs the attributes (a,b,c,d). However, if the delta only contains changes from, say, T2, then the best you can get from deltas alone is (a, null, c, d).

    Replacing the null with the "current" value can't be done from the deltas alone. One could solve this with a construct like the one in my OP:

    select coalesce(delta_T1.a, current_T.a), coalesce(delta_T1.b, current_T.b) -- etc

    from delta_T1

    full join delta_T2 on delta_T1.a = delta_T2.a

    left join (select current row versions from dim_T | current row versions from history-preserving staging) current_T on coalesce(delta_T1.a, delta_T2.a) = current_T.a

    (Left join required using this construct because there may be new rows in T1 that do not yet have representation in current_T)

    This problem is solved by point (3) in your reply ("INNER join to the latest version of the parent row"), since getting the warehouse keys of course also means the other attribute values are available and can be coalesced in as above. So my "standard solution" is pretty much the one you are expecting.

    This is what I was getting at: There are other constructs that can solve this problem, I was wondering if anyone else had a different approach.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply