Loading 7 millions of data into minutes..how is it possible?

Question

Post reply

Loading 7 millions of data into minutes..how is it possible?

Abhijit More

SSCertifiable

Points: 6897
More actions
April 22, 2014 at 12:54 am

#396639

Here I have one challenge to load 7 millions of data from temporary table to Fact table followed with dimensional table load.
What I have :- Table with 50 columns having 7 millions of data
What I want to achieve :- 1. Load 15 dimensional tables from above data.
2. Load 1 Fact Table from from above data. While loading fact table reference dimensional tables for Keys (IDs from dimensional table)
Currently approach :-
1. Truncate dimensional table before each load. (this is required as we may received refreshed data on each load)
2. Load dimension table
3. Load Fact table using left join with 15 dimension table :crazy:.
4. I em leveraging columnstore index as well 😀
Now with increasing amount of data we have started facing performance and time issues. this process is taking more than 4 hrs to load data :w00t:.
Need quick solution (PM is banging on my head :alien:)
Abhijit - http://abhijitmore.wordpress.com

Viewing 11 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 247203 More actions · Answer 1

Phil Parkin

SSC Guru

Points: 247203

April 22, 2014 at 1:22 am

#1707627

So do you truncate your fact table every load too?

Abhijit More SSCertifiable Points: 6897 More actions · Answer 2

Yes

I know the best practices say about delta load but we have business requirement to load each time we receive bulk data.

Abhijit - http://abhijitmore.wordpress.com

Phil Parkin SSC Guru Points: 247203 More actions · Answer 3

Are you able to break down the load times for the different components?

I'm guessing that it's the fact table reload that takes most time. Have you tried using (cached) lookups rather than left joins?

Umer Akram SSC-Insane Points: 22209 More actions · Answer 4

are you using SSIS or T-SQL to load the dimension/Fact table?

and did you tried to disable your index before loading data ?

Abhijit More SSCertifiable Points: 6897 More actions · Answer 5

We tried both SSIS and T-SQL, both options are taking time. With T-SQL yes dropping index before load and creating after load.

As you correctly mentioned the time take is while loading FACT table.

As per my understanding the problem area here is loading dimension and referencing them again while loading FACT table on each load.

Abhijit - http://abhijitmore.wordpress.com

Koen Verbeeck SSC Guru Points: 259197 More actions · Answer 6

The columnstore index only slows down your load (but it improves query performance afterwards of course).

Loading the dimensions should be quickly I guess (you haven't given us much information). Make sure the buffers are wide when you have a lot of columns (DefaultBufferSize property).

When loading the fact table, use lookups to find the dimension keys. When you do everything in the SSIS pipeline, you can stream the data which is much faster. Use SQL queries in the lookup components to fetch the data and use only the columns you need.

Make sure you are using the fast load option in the OLE DB destination.

With a normal server this should be possible under 30 minutes.

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP

Umer Akram SSC-Insane Points: 22209 More actions · Answer 7

When you fill the FACT Table, how you are making Source,

Like a T_SQL staging table joining with dimensions to get the dimension id.

or u r using Lookups? because there might be a case that, you are having problem when you fetch the data from the source.

hopefully you might have implemented that but just for information you can also use Fast Load option. if you are using it then you can ignore this one :-).

stevefromOZ SSC-Forever Points: 43646 More actions · Answer 8

Sounds like you're (in general) doing the 'right' thing and using surrogate keys for dimension rows and then using lookups/joins to populate the appropriate FK (surrogate) to the fact.

but - why?

"Normally" i would be the first to tell you to use that approach, but that is when we keep the dimension intact and either insert or update the values within it. If you're truncating the dimension and fact *every* time, then why bother to use surrogate keys? Simply load your dimensions using the natural keys, then load the fact using the natural keys. If you wanted to, you could set a default on the dimension key columns in the fact to be like '-1' so that if you have nulls, you can at least identify that this was a missed value. You'd then want to add a single row to each dimension to hold that 'unknown' member (with a key of -1). Having this default would mean you don't need to left join (in fact, don't join at all) to the dimension table/s when loading the fact. Obviously also negates the need for cached lookups in SSIS too.

FWIW, tearing down the dimensions and fact every time seems like overkill to me, but, you're there an dknow the requirements, so will assume that the solution fits the requirement.

Steve.

onejbsmith Valued Member Points: 62 More actions · Answer 9

Are you loading the fact table from parallel data flows and running them concurrently?

MMartin1 One Orange Chip Points: 27867 More actions · Answer 10

Were you disabling your indexes when employing SSIS? Or only when using t-sql?

----------------------------------------------------