Performance for loading data

  • CASE 1

    Usual steps to load data

    1. table is created

    2. data is inserted

    3. clustered index created

    CASE 2

    1. table is created

    2. clustered index created

    3. data is inserted (sorted by the clustered key)
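    The two cases can be sketched in T-SQL; the table and column names here are hypothetical, for illustration only:

    ```sql
    -- CASE 1: load into a heap, then build the clustered index last
    CREATE TABLE dbo.Orders (OrderID INT NOT NULL, OrderDate DATETIME, Amount MONEY);
    -- ... bulk insert the ~4 million rows here ...
    CREATE CLUSTERED INDEX IX_Orders_OrderID ON dbo.Orders (OrderID);

    -- CASE 2: build the clustered index first, then load in key order
    CREATE TABLE dbo.Orders2 (OrderID INT NOT NULL, OrderDate DATETIME, Amount MONEY);
    CREATE CLUSTERED INDEX IX_Orders2_OrderID ON dbo.Orders2 (OrderID);
    -- ... bulk insert the same rows, pre-sorted by OrderID ...
    ```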

    Additional Information:

    1. The table data is about 5 GB, around 4 million rows.

    2. In case 1 it took 26 minutes to load the data.

    3. In case 2 it took the same time (26 minutes) to load the data.

    How do these two cases affect the database and performance, with respect to each of the following?

    1. internal fragmentation of data

    2. external fragmentation of data

    3. Which case requires more processing power (8 CPU)

    4. Which case requires more memory (4GB RAM)

    5. Which case requires more disk usage (Raid array of 6 disks)

    6. Which case requires more paging (16 GB page file separated across 4 disks, 4 GB each)

    Prakash

    Prakash Heda
    Lead DBA Team - www.sqlfeatures.com
    Video sessions on Performance Tuning and SQL 2012 HA

  • 1) Internal fragmentation from an initial load is usually limited and not much of an issue. However, if the data goes into the table in an order that does not closely match the clustered index, you may see a higher number of page splits than the fill factor can absorb.
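    For example, you can leave free space in each page when the index is built so that out-of-order inserts are less likely to split pages; this is only a sketch, with a hypothetical table name and an illustrative 90 percent fill factor:

    ```sql
    -- Leave ~10% of each leaf page empty to absorb out-of-order inserts.
    CREATE CLUSTERED INDEX IX_Orders_OrderID
        ON dbo.Orders (OrderID)
        WITH FILLFACTOR = 90;
    ```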

    2) External fragmentation is bad for a DB and you should take measures to avoid it. This is why, when you plan your DB, you should size it to hold the data you expect to insert right away, plus at least a month's growth if possible. The other thing to consider is file growth. If you are going to let the file grow, pick a value that gives you a long period between growths, so SQL does not spend unnecessary time growing the file and delay inserts, updates, or deletes while it does. Many will tell you to pick a percentage for growth, but that means SQL has to take extra time to calculate the next growth size, and the increment gets larger each time. With a fixed growth amount, no calculation has to be done, and you get a better idea of how much your DB is growing over a period of time. These growths are generally the root cause of file fragmentation, which forces reads to jump between allocation areas on the disk, slowing you down: the more file fragments, the slower your true response time (though on a really good drive this may not be all that perceptible). Planning the initial size and growth increments will limit this as much as possible.
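    For example, a fixed growth increment can be set like this (the database name, logical file name, and size are illustrative only):

    ```sql
    -- Grow the data file by a fixed 500 MB each time, instead of a percentage.
    ALTER DATABASE MyDB
    MODIFY FILE (NAME = MyDB_Data, FILEGROWTH = 500MB);
    ```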

    3,4,5,6) Based on your test, case 2 will generally have the highest requirements if the data does not arrive in the order of the clustering key. The reason is that each row has to find its insertion point, and if it pushes a record from one page to another, the split can chain down through the pages, causing a performance hit on each of them while the work is done. However, if your data is going in in the order of the clustered column, then case 1 will have the highest impact: the table starts out as a heap, and once the load is done and you add the clustered index, SQL has to sort the data as a whole rather than row by row. Server settings can also cause an imbalance in what gets hit worst. For instance, did you do this?

    From SQL BOL

    quote:


    To enable Windows 2000 Advanced Server or Windows 2000 Datacenter Server to support more than 4 GB of physical memory, you must add the /pae parameter to the boot.ini file.


    If not, then you may see only 2 GB in use while the other counters are off the scale, and you have bottlenecked your server. Making sure your server is optimized is a major consideration when running SQL. Also consider RAID: with RAID 5, when inserts occur you actually get multiple writes and reads, as the data is added to the file and the parity information is written. If possible, you should invest in a RAID 10 array, which is a cross between 0 (striped set) and 1 (mirrored): it writes once to the stripe and once to the mirrored stripe, far less drive access than a RAID 5 configuration. RAID 10 is more costly, since you lose more than one drive's worth of storage, but you gain performance on inserts, updates, and deletes.

    From MSDN library

    quote:


    RAID 10 provides the performance benefits of disk striping with the disk redundancy of mirroring. RAID 10 provides the highest read/write performance of any of the RAID levels at the expense of using twice as many disks.


    So based on the run times, I would say the question of which case is better is really moot, but there are other things to consider to get the maximum benefit.

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • Overall the 2nd solution is better... IF... you can drop the indexes, which is not always possible in production. Faster is better; I would guess (I have not measured) that it uses less disk and CPU to build the index once at the end rather than keeping it correct for each insert.

    I agree about autogrow, sort of. I'm not sure how long it takes to calculate the growth amount from a percentage, but to me the more important issue is the scale. Growing a 100 MB db by 10 MB is OK, or maybe not enough; growing a 10 GB db by 1 GB is a lot, and something you probably don't want happening during the day. And if it were 1 TB...?! Right now this sizing is done on a shoot-from-the-hip basis, and without a useful built-in tool to help you "predict" it, I don't think it's worth a lot of effort. For what it's worth, my average DBs are set to grow by 100 MB, and some very small ones that I know will not grow much are set to 10 MB.

    Andy

  • Thanks Antares & Andy,

    What I am interested in knowing is which is better, CASE 1 or CASE 2, without any other changes.

    All other aspects are taken care of; the confusion is this: if data is inserted sorted by the clustered key, wouldn't it have less fragmentation and require less disk usage?

    As I understand it, if we create a clustered index after loading the data, SQL Server rebuilds the table to sort the data, and this will lead to high fragmentation.

    We do not have any fragmentation tools, so I would like to do it in the way that causes the least fragmentation.
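    For what it's worth, DBCC SHOWCONTIG ships with SQL Server and reports fragmentation without any extra tools (the table name below is hypothetical):

    ```sql
    -- Reports scan density and logical fragmentation for the table.
    DBCC SHOWCONTIG ('dbo.Orders') WITH TABLERESULTS;
    ```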

    Prakash

