May 26, 2010 at 9:17 am
Please accept my apology if this post is in the wrong forum.
I'm trying to bulk load two tables with minimal logging. I started by setting the recovery model to Bulk-logged. Here's the pseudo-code for each table:
TRUNCATE TABLE table_name;

-- TABLOCK is what makes the insert eligible for minimal logging.
INSERT INTO table_name WITH (TABLOCK)
    (Column_Names)
SELECT
    Column_Names
FROM OPENROWSET(
    BULK 'data_file_path',
    FORMATFILE = 'format_file_path') AS alias
WHERE condition
ORDER BY column_with_nonclustered_index;
I can back up the transaction log and shrink it down to 50 MB. After running the code for the two tables, the transaction log has grown to about 560 MB, which is about the same size as the primary data file (the .mdf file).
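For anyone following along, here's roughly what I run between tests to reset and then measure the log. MyDb, the logical log file name MyDb_Log, and the backup path are placeholders for my actual names:

BACKUP LOG MyDb TO DISK = 'C:\Backups\MyDb_log.trn';  -- free up the inactive portion of the log
DBCC SHRINKFILE (MyDb_Log, 50);                       -- shrink the log file down to 50 MB

-- After the load: reports log file size and percent used for every database.
DBCC SQLPERF (LOGSPACE);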
Each table has exactly one index, which is non-clustered. (There are no primary keys, and no unique constraints.) I tried dropping the index before the insert statement, and then recreating the index after the insert statement as follows:
TRUNCATE TABLE table_name;

DROP INDEX nonclustered_index ON table_name;

INSERT INTO table_name WITH (TABLOCK)
    (Column_Names)
SELECT
    Column_Names
FROM OPENROWSET(
    BULK 'data_file_path',
    FORMATFILE = 'format_file_path') AS alias
WHERE condition
ORDER BY column_with_nonclustered_index;

CREATE NONCLUSTERED INDEX nonclustered_index
    ON table_name (column_with_nonclustered_index);
Once again, I back up the transaction log and shrink it down to 50 MB. After running the code for the two tables, the transaction log stays at 50 MB.
In my mind, the second scenario achieved my goal of bulk loading the two tables with minimal logging. But I don't know why I had to drop and recreate the indexes, and I'd prefer not to. This documentation got me headed in the right direction:
http://msdn.microsoft.com/en-us/library/ms190422.aspx
But the question remains: If the tables are truncated before the bulk load, why do I have to drop/create indexes? (I am using SQL 2005, and the documentation is for SQL 2008.)
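For what it's worth, the recovery model really is set to Bulk-logged. On SQL 2005 this can be confirmed from sys.databases (MyDb is a placeholder):

SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'MyDb';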
What am I missing?
May 26, 2010 at 2:46 pm
Follow-up...
I'm really reluctant to drop and recreate indexes. Suppose another DBA or developer changes the index (adds or removes columns, changes the FILLFACTOR, etc.). My code would silently wipe out those changes, because the steps to create the index are hard-coded. Not good!
A better option is to disable the index and rebuild it later. (Thanks to Andy Warren for his article about Disabling Indexes!)
Here's my revised pseudo-code:
TRUNCATE TABLE table_name;

ALTER INDEX nonclustered_index ON table_name DISABLE;

INSERT INTO table_name WITH (TABLOCK)
    (Column_Names)
SELECT
    Column_Names
FROM OPENROWSET(
    BULK 'data_file_path',
    FORMATFILE = 'format_file_path') AS alias
WHERE condition
ORDER BY column_with_nonclustered_index;

ALTER INDEX nonclustered_index ON table_name REBUILD;
If I get really motivated, I can programmatically determine the non-clustered indexes and disable/rebuild them.
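Something like this sketch should work on SQL 2005. dbo.table_name is a placeholder, and I haven't tested it, so treat it as an outline only; it builds one ALTER INDEX statement per non-clustered index found in sys.indexes:

DECLARE @sql nvarchar(max);

-- Disable every non-clustered index on the table.
SET @sql = N'';
SELECT @sql = @sql + N'ALTER INDEX ' + QUOTENAME(name)
    + N' ON dbo.table_name DISABLE;' + CHAR(13)
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.table_name')
  AND type_desc = N'NONCLUSTERED';
EXEC sp_executesql @sql;

-- ...bulk load goes here...

-- Rebuild them all afterward (REBUILD also re-enables a disabled index).
SET @sql = N'';
SELECT @sql = @sql + N'ALTER INDEX ' + QUOTENAME(name)
    + N' ON dbo.table_name REBUILD;' + CHAR(13)
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.table_name')
  AND type_desc = N'NONCLUSTERED';
EXEC sp_executesql @sql;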
Still, after reading the documentation on MSDN, I'm confused as to why any of this matters when the tables are truncated first...
August 29, 2010 at 3:46 pm
Dave Mason (5/26/2010)
Still, after reading the documentation on MSDN, I'm confused as to why any of this matters when the tables are truncated first...
It's been a while, but in case you haven't figured it out: the reason the log files stay so much smaller when you either DROP or DISABLE the non-clustered indexes is that, if the indexes are active while you are loading data into the table, SQL Server has to log all of the updates to the indexes. As each record is inserted, it logs the index page changes and page splits. When you DROP or DISABLE the index first, and then CREATE or REBUILD it after the data is in, you are essentially creating a brand-new index, and index creation is itself a bulk operation that can be minimally logged under the Bulk-logged recovery model.
The two main beneficial side effects of doing this are:
* The fragmentation in your indexes will be much less, making for better performance. (You can put a number on this with sys.dm_db_index_physical_stats; see the sketch below.)
* It is normally faster to insert the data and then build the index: creating the index all at once is faster than modifying it row by row to keep it up to date.
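To check the fragmentation point, you can compare the indexes before and after with sys.dm_db_index_physical_stats (available in SQL 2005; dbo.table_name is a placeholder, and LIMITED is the cheapest scan mode):

-- Average fragmentation per index on the table.
SELECT i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.table_name'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id;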
I hope that helps explain things.