Data Warehouse Loading

  • I am trying to load data from an OLTP system into a data warehouse using SQL Server 2005 Integration Services. I have no staging tables in my data warehouse, and I am trying to populate the surrogate keys in my fact tables.

    Can anyone help?

  • I seem to be able to load the data, but I get a primary key violation error in the destination database. I used a lookup transformation to look up the dimension tables and insert the surrogate keys!

    Your help will be much appreciated, that's ANYONE????

  • You have not posted much information, but this should be pretty straightforward.

    Select your source data using a DataReader or OLE DB Source, look up the surrogate key from your dimension table, and insert the record into your fact table using the ID from the dimension table. By default, the lookup component will fail if a matching record is not found.

    If you are getting a primary key violation while inserting into your fact table, it is either a duplicate key or a NULL value being put into the primary key field(s) of the fact table.
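The lookup-and-insert step above, and the duplicate-at-the-fact-grain case that produces the primary key violation, can be sketched in plain SQL. This is a minimal illustration run through Python and SQLite, with hypothetical table and column names, not the actual SSIS components:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical schemas: a customer dimension with a surrogate key, and a
# fact table whose primary key is (surrogate key, date) -- the fact grain.
cur.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, "
            "customer_code TEXT UNIQUE)")
cur.execute("CREATE TABLE fact_sales (customer_sk INTEGER, sale_date TEXT, "
            "amount REAL, PRIMARY KEY (customer_sk, sale_date))")
cur.execute("INSERT INTO dim_customer VALUES (1, 'C001'), (2, 'C002')")

# Source rows carry the natural (business) key; the lookup swaps it for
# the surrogate key before the row reaches the fact table.
source_rows = [("C001", "2005-01-01", 100.0), ("C002", "2005-01-01", 250.0)]
cur.executemany(
    "INSERT INTO fact_sales (customer_sk, sale_date, amount) "
    "SELECT d.customer_sk, ?, ? FROM dim_customer d WHERE d.customer_code = ?",
    [(date, amt, code) for code, date, amt in source_rows])

# Loading the same business key at the same grain again is exactly the
# duplicate that raises the primary key violation described above.
try:
    cur.execute(
        "INSERT INTO fact_sales (customer_sk, sale_date, amount) "
        "SELECT d.customer_sk, '2005-01-01', 999.0 FROM dim_customer d "
        "WHERE d.customer_code = 'C001'")
except sqlite3.IntegrityError as e:
    print("primary key violation:", e)
```

Note that a failed lookup would leave `customer_sk` empty, so a NULL in the key columns produces the same class of error.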

  • Thanks for the quick reply!

    Actually I have progressed slightly from before!

    This is the process I am following: I select data from a source, add a number of lookup transformations which check for the surrogate keys in my destination DW, and then the fact table is filled with the surrogate keys. However, for one of the lookups I am getting an error: "Row yielded no match during lookup." Oddly, the same lookup query is used for two other lookup transformations, yet only this one fails!
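One way to see exactly which source rows trip a "Row yielded no match" error is to run an anti-join against the dimension outside SSIS. A minimal sketch with hypothetical names (Python + SQLite here; the same SELECT works in T-SQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dim_account (account_sk INTEGER PRIMARY KEY, "
            "account_code TEXT)")
cur.execute("CREATE TABLE src (account_code TEXT, amount REAL)")
cur.execute("INSERT INTO dim_account VALUES (1, 'A100'), (2, 'A200')")
cur.execute("INSERT INTO src VALUES ('A100', 10), ('A999', 20)")  # A999 has no dimension row

# Rows that would fail the lookup: present in the source feed,
# absent from the dimension.
unmatched = cur.execute("""
    SELECT s.account_code
    FROM src s
    LEFT JOIN dim_account d ON d.account_code = s.account_code
    WHERE d.account_sk IS NULL
""").fetchall()
print(unmatched)  # -> [('A999',)]
```

When one lookup fails while others using the same query succeed, a frequent culprit is trailing whitespace or a case difference in the business key, since the SSIS lookup's cached comparison is exact where a T-SQL join is typically case-insensitive.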

  • I know you said you had no staging tables, but depending on what you're doing, I suggest you reconsider this idea. SSIS is great (I use it heavily), but for large data inputs like fact tables, there are some special considerations:

    - You obviously need to ensure you have no duplicate keys. This is easy to do via a GROUP BY in SQL. Load the data into a table with no primary key.

    - Have you considered any additional audit checks on the data before it goes into the warehouse, such as the number of expected records? That's tough to do if you're stuffing it directly in. (Range checks and field validation can easily be done in SSIS as the records feed through.)

    - You may have to handle missing dimension keys, in particular if you're creating surrogate keys. This is much easier if you stage the data first. Plus, you may not want any of the input to go in unless it's all clean.

    - From an audit perspective, it's helpful to have the "last file loaded" sitting in a staging table. In my case, I clear the staging table at the beginning of the load (not the end). Of course, if your load cycle is short (e.g. daily or hourly), the potential benefits diminish.

    In a data warehouse, I suggest you should use slightly different strategies for your data tables vs your dimension tables. They aren't exactly the same thing.
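The staging checks described above (load everything into a keyless table, then audit it before it touches the warehouse) can be sketched as follows; the table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The staging table deliberately has no primary key, so the raw feed
# always loads; problems are caught by queries afterwards.
cur.execute("CREATE TABLE stg_sales (account_code TEXT, sale_date TEXT, "
            "amount REAL)")
rows = [("A100", "2005-01-01", 10.0),
        ("A100", "2005-01-01", 10.0),   # a duplicate slipped into the feed
        ("A200", "2005-01-01", 5.0)]
cur.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", rows)

# Audit check 1: did we receive the number of records we expected?
(loaded,) = cur.execute("SELECT COUNT(*) FROM stg_sales").fetchone()
assert loaded == len(rows)

# Audit check 2: duplicate business keys, found with a simple GROUP BY.
dups = cur.execute("""
    SELECT account_code, sale_date, COUNT(*) AS n
    FROM stg_sales
    GROUP BY account_code, sale_date
    HAVING COUNT(*) > 1
""").fetchall()
print(dups)  # -> [('A100', '2005-01-01', 2)]
```

Only once both checks pass would the batch be moved on into the fact table.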

  • Your comments are highly valued! I have actually done the following:

    (1) I am extracting data from an accounting system using SSIS and loading it into a SQL Server 2005 mini data warehouse.

    (2) I have used a Data Flow Task which includes a source (the accounting system), a couple of lookups to load surrogate keys into the fact tables, a conditional split transformation, and the final destination.

    This mini data warehouse is actually meant to be the staging area before the data is loaded into the mega-DW, which is used for corporate reporting. I know this approach is not conventional; however, since the mega-DW is basically a KPI dashboard, and therefore very summarised, I thought I would create a mini staging area (aka DW) for internal reporting and then load from it into the final mega-DW using simple queries!

    Since my previous posts I have progressed pretty nicely.

    My current problem is with loading account information (accountid, parentid, etc.) into an account dimension: I can't just append new accounts to the dimension; it always reloads all the accounts!

    I am not completely happy with SSIS, though, especially with how rigid the sources and destinations are. For example, I am loading accounting transactions into 8 different fact tables based on account classification (I am using different tables for performance reasons), and as a result I have to define 8 different destinations! Any change I make to the original source query forces me to change every single destination, and this is just time consuming!

  • Persist, it gets better.

    To load dimension information, I recommend you use the Slowly Changing Dimension transform. I use it for all dimension tables, and it works fine. Caution: although this transform allows you to create customizations, don't. Any regeneration of the transform (e.g. due to changes) wipes out your custom changes. It is better to do the custom work in other transforms before or after. Microsoft needs to take another look at this wizard.
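This is not the SCD wizard itself, but the "insert only new members" half of a dimension load, which is also the fix for an account dimension that reloads every account instead of appending new ones, can be sketched in plain SQL (Python + SQLite, hypothetical names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dim_account (account_sk INTEGER PRIMARY KEY AUTOINCREMENT, "
            "accountid TEXT UNIQUE, parentid TEXT)")
cur.execute("CREATE TABLE src_account (accountid TEXT, parentid TEXT)")

# The dimension already holds account 1000; the source feed has it
# again plus a genuinely new account, 1100.
cur.execute("INSERT INTO dim_account (accountid, parentid) VALUES ('1000', NULL)")
cur.execute("INSERT INTO src_account VALUES ('1000', NULL), ('1100', '1000')")

# Insert only accounts not already present, instead of reloading them all.
cur.execute("""
    INSERT INTO dim_account (accountid, parentid)
    SELECT s.accountid, s.parentid
    FROM src_account s
    WHERE NOT EXISTS (SELECT 1 FROM dim_account d
                      WHERE d.accountid = s.accountid)
""")
print(cur.execute("SELECT accountid FROM dim_account "
                  "ORDER BY account_sk").fetchall())
# -> [('1000',), ('1100',)]
```

The SCD transform does essentially this comparison for you, plus the attribute-change handling on matched rows.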

    With regard to your 8 data tables and 8 transforms (I'm assuming they are exactly the same), two comments:

    - Design a package that does only this step. Abstract the table name and source (and anything else you need) out to package variables, then call this package from your master package, changing the variables. Packages can (and often should) be much smaller objects than you might otherwise expect. (It took me a while for this to click. It's normal.)

    - If the 8 tables are some sort of attempt to improve SQL Server performance (as opposed to a data design), then consider instead using MSAS. As well (or instead), consider partitioning the table along the 8 groups (you may need to add a partition column). You can still cluster the records based on your business key, and I think you'll be surprised at how well this works.
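The variable-driven approach in the first point can be sketched outside SSIS: a single parameterised load routine stands in for the child package, and the table name stands in for the package variable (all names hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One fact table per account classification, as in the thread
# (8 in the original setup; 2 here for brevity).
for cls in ("assets", "revenue"):
    cur.execute(f"CREATE TABLE fact_{cls} (accountid TEXT, amount REAL)")

def load_fact(cur, table_name, rows):
    """One parameterised load step; table_name plays the role of a
    package variable, so the loading logic exists only once."""
    cur.executemany(
        f"INSERT INTO {table_name} (accountid, amount) VALUES (?, ?)", rows)

# The 'master package' just calls the same step with different variables.
load_fact(cur, "fact_assets", [("1000", 10.0)])
load_fact(cur, "fact_revenue", [("4000", 99.0), ("4100", 1.0)])

print(cur.execute("SELECT COUNT(*) FROM fact_revenue").fetchone()[0])  # -> 2
```

A change to the source query then touches one package instead of eight destinations.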
