Loading 7 million rows in minutes... how is it possible?

  • Here I have a challenge: load 7 million rows of data from a temporary table into a fact table, along with the dimension table loads.

    What I have: a table with 50 columns and 7 million rows of data.

    What I want to achieve: 1. Load 15 dimension tables from the above data.

    2. Load 1 fact table from the above data. While loading the fact table, reference the dimension tables for the keys (IDs from the dimension tables).

    Current approach:

    1. Truncate the dimension tables before each load (this is required, as we may receive refreshed data on each load).

    2. Load the dimension tables.

    3. Load the fact table using left joins to all 15 dimension tables :crazy: (sketched below).

    4. I am leveraging a columnstore index as well 😀
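
    For reference, the join pattern in step 3 looks roughly like the hedged sketch below; all table and column names are invented, and only 3 of the 15 dimension joins are shown:

        -- Hypothetical illustration of the current approach: resolve dimension
        -- surrogate keys via left joins from the staging table.
        INSERT INTO dbo.FactSales (CustomerKey, ProductKey, DateKey, Amount)
        SELECT c.CustomerKey,
               p.ProductKey,
               d.DateKey,
               s.Amount
        FROM dbo.StagingSales AS s
        LEFT JOIN dbo.DimCustomer AS c ON c.CustomerCode = s.CustomerCode
        LEFT JOIN dbo.DimProduct  AS p ON p.ProductCode  = s.ProductCode
        LEFT JOIN dbo.DimDate     AS d ON d.CalendarDate = s.SaleDate;
        -- ...and so on for the remaining 12 dimensions.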

    Now, with the increasing amount of data, we have started facing performance and time issues; this process is taking more than 4 hours to load the data :w00t:.

    I need a quick solution (my PM is banging on my head :alien:).

    Abhijit - http://abhijitmore.wordpress.com

  • So do you truncate your fact table every load too?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Yes

    I know best practice says to use delta loads, but we have a business requirement to do a full load each time we receive the bulk data.

    Abhijit - http://abhijitmore.wordpress.com

  • Are you able to break down the load times for the different components?

    I'm guessing that it's the fact table reload that takes the most time. Have you tried using (cached) lookups rather than left joins?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Are you using SSIS or T-SQL to load the dimension/fact tables?

    And did you try disabling your indexes before loading the data?
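
    A minimal sketch of that disable/rebuild pattern, with placeholder names (note that a disabled clustered index, including a clustered columnstore, blocks inserts, so this applies to nonclustered indexes):

        -- Hypothetical example: disable a nonclustered index before the load,
        -- then rebuild it once afterwards instead of maintaining it row by row.
        ALTER INDEX IX_FactSales_Customer ON dbo.FactSales DISABLE;
        -- ... bulk load dbo.FactSales here ...
        ALTER INDEX IX_FactSales_Customer ON dbo.FactSales REBUILD;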

  • We tried both SSIS and T-SQL; both options are taking too long. With T-SQL, yes, we drop the index before the load and recreate it after the load.

    As you correctly mentioned, the time is taken up while loading the fact table.

    As per my understanding, the problem area here is loading the dimensions and then referencing them again while loading the fact table on each load.

    Abhijit - http://abhijitmore.wordpress.com

  • The columnstore index only slows down your load (but it improves query performance afterwards of course).

    Loading the dimensions should be quick, I guess (you haven't given us much information). Make sure the buffers are wide enough when you have a lot of columns (the DefaultBufferSize property).

    When loading the fact table, use lookups to find the dimension keys. When you do everything in the SSIS pipeline, you can stream the data, which is much faster. Use SQL queries in the lookup components to fetch the data, and select only the columns you need.
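
    For example, a lookup source query can be as narrow as this (hypothetical dimension and column names):

        -- Hypothetical lookup source: return only the business key used for
        -- matching plus the surrogate key to retrieve, instead of SELECT *.
        SELECT CustomerKey, CustomerCode
        FROM dbo.DimCustomer;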

    Make sure you are using the fast load option in the OLE DB destination.

    On a normal server this should be possible in under 30 minutes.
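
    If you stay on the T-SQL path instead, the rough equivalent of fast load is a minimally logged insert; a sketch assuming the SIMPLE or BULK_LOGGED recovery model and invented table names:

        -- Hypothetical sketch: WITH (TABLOCK) enables bulk-load optimizations
        -- (minimal logging under the SIMPLE or BULK_LOGGED recovery model).
        INSERT INTO dbo.FactSales WITH (TABLOCK) (CustomerKey, Amount)
        SELECT c.CustomerKey, s.Amount
        FROM dbo.StagingSales AS s
        LEFT JOIN dbo.DimCustomer AS c ON c.CustomerCode = s.CustomerCode;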

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • When you fill the fact table, how are you building the source?

    Like a T-SQL staging table joined with the dimensions to get the dimension IDs,

    or are you using lookups? There might be a case where the problem is in fetching the data from the source.

    Hopefully you have implemented this already, but just for information, you can also use the Fast Load option. If you are already using it, you can ignore this one :-).

  • Sounds like you're (in general) doing the 'right' thing: using surrogate keys for dimension rows and then using lookups/joins to populate the appropriate FK (surrogate key) in the fact.

    but - why?

    "Normally" I would be the first to tell you to use that approach, but that is when we keep the dimension intact and either insert or update the values within it. If you're truncating the dimensions and fact *every* time, then why bother with surrogate keys? Simply load your dimensions using the natural keys, then load the fact using the natural keys.

    If you wanted to, you could set a default of, say, '-1' on the dimension key columns in the fact, so that if you have nulls you can at least identify that a value was missed. You'd then want to add a single row to each dimension to hold that 'unknown' member (with a key of -1). Having this default means you don't need to left join (in fact, don't join at all) to the dimension tables when loading the fact, and it obviously negates the need for cached lookups in SSIS too (a sketch of this pattern follows after the next paragraph).

    FWIW, tearing down the dimensions and fact every time seems like overkill to me, but you're there and know the requirements, so I'll assume that the solution fits the requirement.
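
    A minimal sketch of that unknown-member pattern, with invented names:

        -- Hypothetical: seed each dimension with an 'unknown' member...
        INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName)
        VALUES (-1, 'Unknown');

        -- ...and default the fact's dimension key column to -1.
        ALTER TABLE dbo.FactSales
            ADD CONSTRAINT DF_FactSales_CustomerKey DEFAULT (-1) FOR CustomerKey;

        -- Load the fact straight from staging with no dimension joins;
        -- ISNULL covers explicit NULLs, which a DEFAULT constraint would not.
        INSERT INTO dbo.FactSales (CustomerKey, Amount)
        SELECT ISNULL(s.CustomerKey, -1), s.Amount
        FROM dbo.StagingSales AS s;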

    Steve.

  • Are you loading the fact table from parallel data flows and running them concurrently?

  • Were you disabling your indexes when employing SSIS, or only when using T-SQL?
