February 4, 2015 at 4:47 am
I must do a first-time import of millions of rows into a table, followed by a daily import of much less (just the differentials since the day before) after that. I am already using a staging table so that no datatype conversions, derivations, or runtime logic (except for the merge/update statement) is needed when populating the final table.
**The problem is indexing.** I need indexes on the table because it will be heavily queried. Those same indexes make loading the table very slow (hours). If I remove the indexes while loading the table, then I must add them back afterward. Adding clustered and nonclustered indexes to a table with millions of rows also takes a very long time (hours).
So, when is the best time to index a table that is heavily queried and is updated nightly with new data, without adding hours to the process?
February 4, 2015 at 7:29 am
KoldCoffee (2/4/2015)
I must do a first-time import of millions of rows into a table, followed by a daily import of much less (just the differentials since the day before) after that. I am already using a staging table so that no datatype conversions, derivations, or runtime logic (except for the merge/update statement) is needed when populating the final table. **The problem is indexing.** I need indexes on the table because it will be heavily queried. Those same indexes make loading the table very slow (hours). If I remove the indexes while loading the table, then I must add them back afterward. Adding clustered and nonclustered indexes to a table with millions of rows also takes a very long time (hours).
So, when is the best time to index a table that is heavily queried and is updated nightly with new data, without adding hours to the process?
I would start by asking: how do you know it is the indexes that make loading the table so slow?
Also, are you saying the first-time import is slow, or the differential, or both?
February 4, 2015 at 7:50 am
Hi Nevyn. In the early design phase of the ETL solution, I simply cut hours off the ETL when I removed the indexes. This allowed me to look at the results much faster. Now that I am querying, I am having to re-add the indexes (clustered and nonclustered) that I removed earlier, and that is taking hours as well, because the tables are full.
I have compared with and without the indexes, but I haven't quantified it exactly. Surely someone else knows that indexes on tables cause writes to slow down. I just want to know what strategies others have used to minimize load slowness during the ETL and maximize query speed after the ETL.
February 4, 2015 at 9:09 am
KoldCoffee (2/4/2015)
I must do a first-time import of millions of rows into a table, followed by a daily import of much less (just the differentials since the day before) after that. I am already using a staging table so that no datatype conversions, derivations, or runtime logic (except for the merge/update statement) is needed when populating the final table. **The problem is indexing.** I need indexes on the table because it will be heavily queried. Those same indexes make loading the table very slow (hours). If I remove the indexes while loading the table, then I must add them back afterward. Adding clustered and nonclustered indexes to a table with millions of rows also takes a very long time (hours).
So, when is the best time to index a table that is heavily queried and is updated nightly with new data, without adding hours to the process?
Could you please clarify whether these indexes are on the staging table or destination table?
February 4, 2015 at 9:39 am
I just left a longer reply and got an error when I attempted to post it :angry:
Anyhow, take a look at
SQL University: Advanced Indexing – Indexing Strategies
-- Itzik Ben-Gan 2001
February 4, 2015 at 9:47 am
KoldCoffee (2/4/2015)
Hi Nevyn. In the early design phase of the ETL solution, I simply cut hours off the ETL when I removed the indexes. This allowed me to look at the results much faster. Now that I am querying, I am having to re-add the indexes (clustered and nonclustered) that I removed earlier, and that is taking hours as well, because the tables are full. I have compared with and without the indexes, but I haven't quantified it exactly. Surely someone else knows that indexes on tables cause writes to slow down. I just want to know what strategies others have used to minimize load slowness during the ETL and maximize query speed after the ETL.
Yes, indexes do slow inserts.
I wouldn't necessarily expect them to slow inserts to the point where inserting millions of rows takes hours, though.
Yes, I have seen recommendations to drop indexes before a table load and recreate them afterward, but generally only when the whole table is being reloaded, not on an incremental load.
And you clearly have an issue regardless. We can't magically make index building faster for you. So if your upsert into the indexed tables takes too long, and rebuilding the indexes takes too long, you need to find another option. So it's good to challenge your assumptions.
In this particular case I am wondering about your design pattern. By having the upsert done in one big MERGE from staging, all the changes are in one transaction, and that transaction has to manage all the index changes too. I could certainly see that causing a performance issue.
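If that single transaction does turn out to be the problem, one option (purely a sketch; the table names, key, and batch size here are made up) is to break the daily upsert into smaller key-range batches so that each transaction, and its index maintenance, stays small:

```sql
-- Illustrative only: split the daily upsert into key-range batches
-- instead of one giant MERGE. Table/column names are placeholders.
DECLARE @BatchSize  int    = 500000,
        @MinId      bigint,
        @MaxId      bigint,
        @CurrentId  bigint;

SELECT @MinId = MIN(StageId), @MaxId = MAX(StageId)
FROM   dbo.Stage_Keyword;

SET @CurrentId = @MinId;

WHILE @CurrentId <= @MaxId
BEGIN
    BEGIN TRANSACTION;

    -- Update rows that already exist in the final table
    UPDATE f
    SET    f.Attribute1 = s.Attribute1,
           f.Attribute2 = s.Attribute2
    FROM   dbo.Final_Keyword AS f
           INNER JOIN dbo.Stage_Keyword AS s
                   ON s.BusinessKey = f.BusinessKey
    WHERE  s.StageId >= @CurrentId
      AND  s.StageId <  @CurrentId + @BatchSize;

    -- Insert rows that do not exist yet
    INSERT dbo.Final_Keyword (BusinessKey, Attribute1, Attribute2)
    SELECT s.BusinessKey, s.Attribute1, s.Attribute2
    FROM   dbo.Stage_Keyword AS s
    WHERE  s.StageId >= @CurrentId
      AND  s.StageId <  @CurrentId + @BatchSize
      AND  NOT EXISTS (SELECT 1
                       FROM   dbo.Final_Keyword AS f
                       WHERE  f.BusinessKey = s.BusinessKey);

    COMMIT TRANSACTION;

    SET @CurrentId = @CurrentId + @BatchSize;
END;
```

Each batch commits on its own, so the log and the index changes never have to be handled as one giant unit of work.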
February 4, 2015 at 10:07 am
What is the basis of the daily imports? Are they temporal in nature, such that if the large table had an index based on a particular datetime column it wouldn't become fragmented by the import, or are you doing a full "upsert/merge" where existing data could be updated rather than just added to the large table?
In other words, is the existing data in the large table ever changed or not? If so, how far back do the changes go?
I'm asking because I may have a solution for this.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 4, 2015 at 10:12 am
Hi Nevyn, yes, I'm doing one big MERGE statement from staging to the final table. So, you are saying there is a better approach. Will you please say more about the merge statement design element? Prior to the merge, I'm loading flat files into staging tables, which do not use indexes and only enforce data types.
February 4, 2015 at 10:18 am
KoldCoffee (2/4/2015)
Hi Nevyn, yes, I'm doing one big MERGE statement from staging to the final table. So, you are saying there is a better approach. Will you please say more about the merge statement design element? Prior to the merge, I'm loading flat files into staging tables, which do not use indexes and only enforce data types.
Based on your response above: I think if you are loading the entire history as a once-off, removing the indexes (or disabling them) and rebuilding them afterward is recommended.
For the daily loads thereafter, unless you are inserting/updating a large amount of data, there should be no need to remove the indexes. If you are inserting/updating a large number of records on a daily basis, you should maybe be looking at partitioning your destination table and switching in the partitions, if that's possible.
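For the once-off load, that pattern looks roughly like this (just a sketch; the index and table names are placeholders, and note that disabling the clustered index would make the table unreadable, so it stays in place or is dropped and recreated separately):

```sql
-- Sketch only: disable the nonclustered indexes before the bulk load,
-- then rebuild them once, after all the rows are in.
ALTER INDEX IX_Final_Keyword_BusinessKey ON dbo.Final_Keyword DISABLE;
ALTER INDEX IX_Final_Keyword_CampaignId  ON dbo.Final_Keyword DISABLE;

-- ... run the initial bulk insert from the staging table here ...

-- Rebuilding re-enables the indexes and sorts each one a single time,
-- instead of maintaining them row by row during the load.
ALTER INDEX ALL ON dbo.Final_Keyword REBUILD;
```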
February 4, 2015 at 10:19 am
KoldCoffee (2/4/2015)
Hi Nevyn, yes I'm doing one big merge statement from stage to final table. So, you are saying there is better approach.
Well, I am saying I suspect there might be, but I really know very little about your environment, the data you are loading, etc., so it is tough to say for sure.
Will you please say more about the merge statement design element? Prior to the merge, I'm loading flat files into staging tables, which do not use indexes and only enforce data types.
It's not so much the merge itself as doing everything in one transaction. And to be clear, I don't KNOW that that is the problem; it is just something I'd want to investigate.
Partitioning (as hinted at by Jeff and suggested by Martin) could also help dramatically if it is an option (although from your description it doesn't sound like it).
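For reference, in case it does become an option: you load a day's rows into an empty table that exactly matches the destination's structure and indexes, build the indexes there, and then switch it into the big table as a metadata-only operation. A rough sketch (every name, and the partition number, is invented):

```sql
-- Sketch only: load one day's rows into a standalone table that matches
-- the partitioned destination's structure, indexes, and has a check
-- constraint on the partitioning column matching the target partition.
-- Assumes dbo.Final_Keyword is partitioned by LoadDate and that
-- partition 42 (currently empty) covers the day being loaded.
INSERT dbo.Switch_Keyword (BusinessKey, LoadDate, Attribute1)
SELECT BusinessKey, LoadDate, Attribute1
FROM   dbo.Stage_Keyword;

-- Indexing a few million rows in isolation is fast compared with
-- maintaining indexes on the 50-million-row table during the load.
CREATE UNIQUE CLUSTERED INDEX PK_Switch_Keyword
    ON dbo.Switch_Keyword (BusinessKey, LoadDate);

-- The switch itself is a metadata operation and is effectively instant.
ALTER TABLE dbo.Switch_Keyword
    SWITCH TO dbo.Final_Keyword PARTITION 42;
```

The catch is that this only helps when a whole date range arrives as pure inserts; it doesn't cover updates reaching back into older partitions.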
Can you describe the problem a bit more? Obviously the merge is there for your daily incremental load. Some pertinent questions:
How many rows are typically inserted (new) in a day?
How many rows are typically updated?
Are any deleted?
How big is the table after the initial load?
How many indexes are we talking about?
February 4, 2015 at 10:36 am
Nevyn (2/4/2015)
KoldCoffee (2/4/2015)
Hi Nevyn, yes, I'm doing one big MERGE statement from staging to the final table. So, you are saying there is a better approach. Well, I am saying I suspect there might be, but I really know very little about your environment, the data you are loading, etc., so it is tough to say for sure.
Will you please say more about the merge statement design element? Prior to the merge, I'm loading flat files into staging tables, which do not use indexes and only enforce data types.
It's not so much the merge itself as doing everything in one transaction. And to be clear, I don't KNOW that that is the problem; it is just something I'd want to investigate.
Partitioning (as hinted at by Jeff and suggested by Martin) could also help dramatically if it is an option (although from your description it doesn't sound like it).
Can you describe the problem a bit more? Obviously the merge is there for your daily incremental load. Some pertinent questions:
How many rows are typically inserted (new) in a day?
How many rows are typically updated?
Are any deleted?
How big is the table after the initial load?
How many indexes are we talking about?
And, temporally speaking, how far back in time are rows updated during the "merge"?
--Jeff Moden
Change is inevitable... Change for the better is not.
February 4, 2015 at 10:42 am
I'll add one more question, if I may 🙂
Are you using the merge component in SSIS (I hope not), or a merge statement in T-SQL?
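By a merge statement in T-SQL I mean something along these lines (purely illustrative; the table and column names are made up), as opposed to the SSIS data flow component:

```sql
-- Illustrative T-SQL MERGE from a staging table into the final table.
MERGE dbo.Final_Keyword AS f
USING dbo.Stage_Keyword AS s
      ON f.BusinessKey = s.BusinessKey
WHEN MATCHED THEN
    UPDATE SET f.Attribute1 = s.Attribute1,
               f.Attribute2 = s.Attribute2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Attribute1, Attribute2)
    VALUES (s.BusinessKey, s.Attribute1, s.Attribute2);
```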
February 4, 2015 at 7:17 pm
OK, hi nevyn, jeff, martin.
Answers:
There are 14 files daily which translates to 14 tables. I will give numbers for the three biggest tables (settings_dimension/keyword/group).
How many rows are typically inserted (new) in a day? 52 million, 12 million, 3.5 million
How many rows are typically updated? 1 million, 300 thousand, 55 thousand
Are any deleted? no
How big is the table after the initial load? 52 million, 12 million, 3.5 million
How many indexes are we talking about? 30 in total; the most heavily indexed is the biggest table, with 5 indexes, then the second biggest with 4, and the third with 2.
How far back in time are rows updated during the "merge"? If a campaign has been active since the beginning of the relationship with the vendor, then any of its attributes can potentially require updating.
Other information:
For the first-time load, the single largest table (upsert/merge from stage to final) takes one hour to load and one hour to add the 5 indexes after it's been loaded with data.
The differentials are a whole different story: the whole ETL from flat file to stage (SSIS) to final tables (upsert/merge) goes a lot faster - about an hour...
February 5, 2015 at 7:45 am
KoldCoffee (2/4/2015)
OK, hi nevyn, jeff, martin. Answers:
There are 14 files daily which translates to 14 tables. I will give numbers for the three biggest tables (settings_dimension/keyword/group).
How many rows are typically inserted (new) in a day? 52 million, 12 million, 3.5 million
How many rows are typically updated? 1 million, 300 thousand, 55 thousand
Are any deleted? no
How big is the table after the initial load? 52 million, 12 million, 3.5 million
How many indexes are we talking about? 30 in total; the most heavily indexed is the biggest table, with 5 indexes, then the second biggest with 4, and the third with 2.
How far back in time are rows updated during the "merge"? If a campaign has been active since the beginning of the relationship with the vendor, then any of its attributes can potentially require updating.
Other information:
For the first-time load, the single largest table (upsert/merge from stage to final) takes one hour to load and one hour to add the 5 indexes after it's been loaded with data.
The differentials are a whole different story: the whole ETL from flat file to stage (SSIS) to final tables (upsert/merge) goes a lot faster - about an hour...
An hour to load 52 million records (depending on how much transformation is taking place) is not that bad. The only way I can think of to speed that up is to separate the data into multiple parallel streams, processing and loading them at the same time.
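As a rough illustration only (not tested against your schema; it assumes the staging table has a numeric StageId column, and the parallelism itself would come from running each stream in its own SSIS data flow or Agent job), each stream could take a slice of the rows like this:

```sql
-- Sketch: each of N parallel streams runs the same statement with a
-- different @StreamNumber (0..3 for four streams). Names are made up.
DECLARE @StreamNumber int = 0,   -- set differently per stream/job
        @StreamCount  int = 4;

INSERT dbo.Final_SettingsDimension (BusinessKey, Attribute1)
SELECT s.BusinessKey, s.Attribute1
FROM   dbo.Stage_SettingsDimension AS s
WHERE  s.StageId % @StreamCount = @StreamNumber;
```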
February 5, 2015 at 10:01 am
KoldCoffee (2/4/2015)
OK, hi nevyn, jeff, martin. [...] The differentials are a whole different story: the whole ETL from flat file to stage (SSIS) to final tables (upsert/merge) goes a lot faster - about an hour...
Is that with the indexes or without them?