September 22, 2015 at 11:29 pm
Hey guys,
I'm trying to improve the loading of some tables with large amounts of data as part of an ETL. I was going to try removing any indexes before inserting to help speed the process up, but I have some questions about whether I should include the clustered index (assuming one exists).
I was originally planning on including a step to disable all indexes on the destination table using the following:
ALTER INDEX ALL ON MyTable DISABLE
Once the load had finished I'd simply rebuild all the indexes.
Does this seem like the right approach? Or should I simply disable the non-clustered indexes?
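For reference, this is roughly the alternative I was weighing up: disable only the non-clustered indexes and rebuild everything afterwards (the index name below is just a placeholder):

-- Disable a non-clustered index by name (IX_MyTable_SomeCol is a made-up name)
ALTER INDEX IX_MyTable_SomeCol ON dbo.MyTable DISABLE;

-- ... run the load ...

-- REBUILD re-enables anything that was disabled
ALTER INDEX ALL ON dbo.MyTable REBUILD;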
September 23, 2015 at 12:49 am
Hi,
A clustered index is the table itself: its leaf level contains the actual data pages (a table without a clustered index is stored as a heap). If you disable it, the entire table becomes inaccessible.
Here is a pretty neat article:
http://blog.sqlauthority.com/2010/04/29/sql-server-disable-clustered-index-and-data-insert/
In short: do not disable the clustered indexes.
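If you want to prove it to yourself, here's a quick sketch (table and index names are made up):

CREATE TABLE dbo.DemoTable (ID INT NOT NULL, Payload VARCHAR(50));
CREATE CLUSTERED INDEX CIX_DemoTable ON dbo.DemoTable (ID);

ALTER INDEX CIX_DemoTable ON dbo.DemoTable DISABLE;

-- This now fails: the query processor cannot produce a plan
-- because the clustered index on the table is disabled.
SELECT * FROM dbo.DemoTable;

-- Rebuilding re-enables the index and makes the data accessible again.
ALTER INDEX CIX_DemoTable ON dbo.DemoTable REBUILD;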
________________________________________________________________
"Accept conditions as they exist, or accept the responsibility for changing them."
September 23, 2015 at 12:55 am
Thanks for the reply.
I had seen that article, but I'd also read a few which suggested disabling all the indexes.
I guess for now I'll just concentrate on disabling the NC indexes.
September 23, 2015 at 3:43 am
Chris-475469 (9/22/2015)
I'm trying to improve the loading of some tables with large amounts of data
Dunno if this is relevant, but can you pre-sort the imported data into Clustered Index Key Order, and then provide the appropriate hint to the import process that the data is pre-sorted?
I've only ever done this with BCP so don't know if it works with other types of bulk import.
I would definitely consider testing whether dropping/re-creating the non-clustered indexes helps.
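For what it's worth, with BULK INSERT the pre-sorted hint looks something like this (file path and column name are made up):

-- Sketch: tells SQL Server the file is already sorted on the clustered
-- index key, so it can skip the internal sort during the load.
BULK INSERT dbo.MyTable
FROM 'C:\loads\mytable.dat'
WITH (
    TABLOCK,
    ORDER (LoadDate ASC)  -- must match the clustered index key exactly
);

With BCP it's the -h "ORDER (LoadDate ASC)" hint.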
September 23, 2015 at 3:54 am
Another option is to drop the indexes instead of disabling them. Drop the non-clustered indexes, then drop the clustered index. Do your load, then recreate the clustered index followed by the non-clustered indexes. In outline, with placeholder names:
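-- Drop non-clustered first, then clustered
DROP INDEX IX_MyTable_SomeCol ON dbo.MyTable;
DROP INDEX CIX_MyTable ON dbo.MyTable;

-- ... run the load into the heap ...

-- Recreate in the reverse order: clustered first, then non-clustered,
-- so the non-clustered indexes are only built once against the final CI.
CREATE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable (LoadDate);
CREATE NONCLUSTERED INDEX IX_MyTable_SomeCol ON dbo.MyTable (SomeCol);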
John
September 23, 2015 at 6:59 am
Also, remember that if the load process does UPDATE/DELETE in addition to INSERT, the indexes, especially any primary keys, may be very necessary to assist in finding the data to be modified or removed.
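For example (names are illustrative), something like this relies on the key to seek straight to the affected rows; with the indexes gone it turns into a full scan:

UPDATE t
SET    t.Amount = s.Amount
FROM   dbo.MyTable AS t
JOIN   staging.MyTable AS s
       ON s.ID = t.ID;  -- with a PK/index on ID this is a seek, without it a scan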
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
September 23, 2015 at 4:59 pm
Thanks for all the feedback.
The table I have the biggest issue with has only a clustered index and one NC index. There are around 13M rows that get INSERTED, and this takes around 78 minutes, so I'll have to test removing all the indexes first and re-applying them afterwards.
I imagine there is also resource contention because of other large tables being loaded at the same time.
September 23, 2015 at 5:39 pm
Don't drop or disable the clustered index.
As someone already said, if you disable the clustered index, none of the data will be available.
If you drop the clustered index, the footprint of the table will temporarily double because the heap is built before the CI is dropped. Then, when you recreate the CI, the same thing happens in reverse. And BOTH operations take a pretty good bit of time.
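If you want to watch the space usage do that, run something like this before and after each step (a sketch against the standard DMVs; dbo.MyTable is a placeholder):

-- Reserved pages per structure for the table
SELECT i.type_desc, SUM(ps.reserved_page_count) AS reserved_pages
FROM   sys.dm_db_partition_stats AS ps
JOIN   sys.indexes AS i
  ON   i.object_id = ps.object_id
 AND   i.index_id  = ps.index_id
WHERE  ps.object_id = OBJECT_ID('dbo.MyTable')
GROUP BY i.type_desc;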
Tell us more about the ETL process that does this load. What are you using? SSIS, Bulk Insert, BCP, OpenQuery, OpenRowset, some 3rd party tool or ????
--Jeff Moden
Change is inevitable... Change for the better is not.
September 23, 2015 at 5:47 pm
Thanks for pointing that out, Jeff.
The process is using SSIS but I believe it has been set up quickly just to get it working, not necessarily with performance in mind. I'm also slightly limited in how much I can do because there is currently no DEV environment.
At the moment there are several large tables that all get loaded at the same time so I have some suspicions that this could be a resourcing issue. The DFTs all use fast load and take a table lock but there is no value for 'Rows per batch' specified so I might have to do some testing around this where I can.
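As a point of reference for that testing, the fast-load options map roughly onto the knobs of a T-SQL bulk load, so this is the kind of thing I'll be comparing against (file path is made up; my actual source is a query, not a file):

BULK INSERT dbo.MyTable
FROM 'C:\loads\mytable.dat'
WITH (
    TABLOCK,                 -- the 'Table lock' option in the destination
    ROWS_PER_BATCH = 100000  -- the rough analogue of 'Rows per batch'
);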
September 23, 2015 at 7:46 pm
Chris-475469 (9/23/2015)
Thanks for pointing that out, Jeff. The process is using SSIS but I believe it has been set up quickly just to get it working, not necessarily with performance in mind. I'm also slightly limited in how much I can do because there is currently no DEV environment.
At the moment there are several large tables that all get loaded at the same time so I have some suspicions that this could be a resourcing issue. The DFTs all use fast load and take a table lock but there is no value for 'Rows per batch' specified so I might have to do some testing around this where I can.
What is the source of the data? CSV, TSV, Fixed Field Format, {ugh!} XML, {quadruple UGH!} EDI, other tables, or ???
--Jeff Moden
Change is inevitable... Change for the better is not.
September 23, 2015 at 7:50 pm
Sorry I should have specified that.
It's an ADO.NET connection to an Ingres database, and the query just grabs data from one table filtered on an indexed date column.
September 25, 2015 at 12:09 am
So I've been experimenting with this ETL and I'm struggling to get much of a performance gain. I'm currently testing one of the tables that has a little over 2M rows and it's taking around 40 minutes to complete. This is a simple DFT that goes from Ingres to SQL.
I've tried playing around with the values for DefaultBufferMaxRows but I'm still getting a total time of around 40 minutes. When I trace what is happening, I also can't get a batch size bigger than around 4K rows.
I tried logging the process using the BufferSizeTuning event in SSIS but the output isn't very clear:
User:BufferSizeTuning,SERVER,USER,Data Flow Task,{7A3C49D5-00B8-485A-9186-E180FA5DB913},{6F9B3AB8-BC33-4B3A-A402-634564373AD2},25/09/2015 4:00:03 PM,25/09/2015 4:00:03 PM,0,0x,Rows in buffer type 0 would cause a buffer size greater than the configured maximum. There will be only 4354 rows in buffers of this type.
PackageEnd,SERVER,USER,Package,{C4717818-F485-438A-98BD-D7ED44DE557B},{6F9B3AB8-BC33-4B3A-A402-634564373AD2},25/09/2015 4:00:16 PM,25/09/2015 4:00:16 PM,0,0x,End of package execution.
Just wondering if anyone has any ideas on this, or whether I'm already getting the maximum performance?
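If my understanding of the buffer sizing is right, the 4,354 figure lines up with the byte cap rather than the row cap: with the default DefaultBufferSize of 10,485,760 bytes, 10,485,760 / 4,354 ≈ 2,408 bytes per estimated row. So raising DefaultBufferMaxRows on its own can't help once the byte limit is hit; I'd have to raise DefaultBufferSize (or shrink the estimated row width, e.g. by trimming unused columns) to get bigger buffers.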
September 28, 2015 at 11:09 pm
I thought I'd provide a brief update to this issue.
This appears to have been some kind of issue using an ADO.NET data source in my SSIS package. After changing this to an ODBC source the performance improved dramatically.