December 18, 2006 at 7:58 am
Hi,
I was wondering which is more efficient, creating a table with a clustered index and then populating it, or creating a table, populating it and then adding a clustered index afterwards.
My application accepts multiple text input files from our customer, each containing rows with an account number. These get loaded to simple non-indexed tables for basic reporting and then are DTSed to the production tables when they pass basic referential integrity checks. Some files are primary (in other words they should contain data with a unique account number) and others are supporting (they contain data that is supplimentary to the primary files) and therefore should have an account number that exists in one of the primary files.
The question I need to answer in my application is, do the supporting files have a valid account number? Before I accept this data into production I want to check to see if the supporting files conatin valid account numbers across all the files. To do this I figured I'd create a scratch table with a single column containing the account numbers from all primary files and compare the account numbers in the supporrting files to that table.
There may be over 1 million account numbers.
Which do you all think would have the best performance, creating the scratch table with a clustered index on the account number and then loading the data, or creating the table, loading the data to it and then adding the clustered index?
Chuck
December 18, 2006 at 8:08 am
Hi Again,
I forgot to mention that the account numbers in the many primary files may be all over the range of account numbers . In other words, input file #1 may have accounts 10,20,30,40, etc. and file number 2 may have account numbers 11,21,31,41, etc.
It's this mix of account numbers in the various files that makes me wquestion the best way to build the table, and load data.
Thanks!
Chuck
December 18, 2006 at 8:49 am
In my experience it is better to create the index after loading the data for large data loads. However 1 million records in a single columns tables isn't too bad. In my previous company I had too load tables of 80 million rows plus from time to time into the DW. If I did not remove the indexes beforehand the tables would never finish loading. However it was possible to create the index afterwards.
As usual with these things you never know until you give it a try. It shouldn't take long to try both scenario's.
Regards
Daniel
December 18, 2006 at 10:21 am
"As usual with these things you never know until you give it a try"
Actually, it is known that you better create the indexes after. During INSERT, DELETE, UPDATE operations, indexes are also updated. This is why it is usually suggested to avoid adding indexes to tables which are highly dynamic.
As mentioned above, the difference might not be noticeable in small tables, but will show up in huge tables. In conclusion: Always add indexes after data is loaded!
December 18, 2006 at 10:26 am
Thanks for the info. While it seemed like adding the index after was a better alternative, I wasn't sure.
A little prototyping and benchmarking is always a good solution!
Chuck
December 20, 2006 at 1:46 pm
Chuck,
Because you specifically asked whether it was more efficient to load the table before or after the clustered index was created, I have to take exception with some of the other recommendations.
Let's assume first that we are talking about only the clustered index, of which there can be only one per table. Now, understand that the leaf level of the clustered index consists of the data rows. So if you load the 1 million rows then create the clustered index, you will be "recreating" the data rows. Books online states that you need available space in your database equal to 1.2 times the total size of the table to accomodate "copying" the data pages to the clustered index leaf level as well as to build the index's pages. So the net time it takes to load the unindexed table then create the clustered index will without a doubt be greater than loading an empty table with the clustered index already created.
My suggestion then, if considering just a clustered index, is to create the table with the clustered index. Then to minimize fragmentation, have the data to be inserted into the table sorted in the order of the clustered index column(s).
In the case of loading a table with multiple indexes, however, I agree that it is better to not have any indexes other than the clustered index on the table while loading a substantial amount of data. Our recommended scenario in this case is:
1. Create the table with its clustered index.
2. Load the data
3. Create any additional nonclustered indexes.
Hope this helps!
Regards,
Ronna
Ronna Williams
December 20, 2006 at 2:10 pm
Hi Ronna,
Thanks for the indepth description of the issue. What you say makes sense.
Note that I'm only creating a clustered index, and I'm doing this on this "scratch" table to order the data for a more efficient means to validate account numbers in other tables.
When you say order the data prior to loading, I don't think that's going to be possible for me, because the data to be loaded is contained in multiple text files and I'm using DTS to load them to the new scratch table. So, I'd need to load to still another table prior to loading to the scratch table.
The problem I'm facing is that I have a variable number of text files, each containing account numbers that may or may not be sorted. Further, the account numbers in any of the text files may be all across the possible range of account numbers - file 1 may have account 1,3,5, ... and file 2 may have 2,4,6, etc., so I cannot even guarantee that the account numbers across all files are in any kind of order.
All that said, would it be more efficient to use a non-clustered index in this case? I'd be interested in your thoughts.
Thanks again,
Chuck
December 22, 2006 at 5:37 am
"Note that I'm only creating a clustered index, and I'm doing this on this "scratch" table to order the data for a more efficient means to validate account numbers in other tables."
Depends on what the validation looks like, but IMHO clustered index is not necessary for that. Clustered index helps especially if you work with ranges (e.g. date_created BETWEEN .... , or with conditions using < or >.
Also I understand that this table is only a staging table which will be dropped when the process is finished, so I would suggest to try it with nonclustered index (created after the table is filled with data). You'll see whether it works well or not... I'd say it should work.
December 22, 2006 at 8:07 am
You say you're sorting the data, in that case a clustered index will achieve that, using a secondary index will require you to use an order by statement which may or may not add to the work done when extracting the data.
I figure you'll have to test it, and don't foget to test scale up, and then make a decision. Depending on what your times are there may be other steps you can take, time and speed are relative, if you have an eight hour window and the process takes an hour - no real problem, if you have an eight hour window and the process currently takes seven hours then you may have a problem.
[font="Comic Sans MS"]The GrumpyOldDBA[/font]
www.grumpyolddba.co.uk
http://sqlblogcasts.com/blogs/grumpyolddba/
December 26, 2006 at 9:16 am
I agree with Vladan.
If this is a staging table to be refreshed then used for validation, it appears that loading the text files into a table with no index, then adding a NON-clustered index should suffice. That's assuming validation is random, using equate hits.
Regards,
Ronna
Ronna Williams
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply