June 27, 2008 at 9:23 am
I am doing a bulk insert of 132,873,975 rows, ordered by zip code. The receiving table has 2 nonclustered indexes on it and could take a third, clustered index. Since the incoming data is already ordered by zip code, my thought is to put the clustered index on the zip code column (a 9-byte key) so that the bulk insert writes can push through contiguously on disk.
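To make that concrete, what I have in mind is something like this (table and column names are made up, not our real schema):

CREATE CLUSTERED INDEX IX_Addresses_Zip
ON dbo.Addresses (Zip); -- Zip as CHAR(9) keeps the clustering key 9 bytes wide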
However, my second thought is that I could lose the clustered index altogether, but with this strategy I would now have 2 indexes referencing a heap. Would maintenance of these 2 indexes go up during or immediately following the bulk insert, since they would now be referencing a heap?
I think the way it is currently working, the operation is being fully logged, but this additional I/O is not hurting because the log file is on separate storage. However, should I expect far better performance if I got the operation to be minimally logged, because of something else like CPU/memory/other?
June 30, 2008 at 11:26 am
I would cluster the index on Zip, disable the other two indexes during the import and re-enable them after, and definitely would not use bulk logging on something that size. Not because of speed, but because of what it's going to do to the log file.
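In rough T-SQL, something like this (index and table names are placeholders):

ALTER INDEX IX_Addresses_A ON dbo.Addresses DISABLE;
ALTER INDEX IX_Addresses_B ON dbo.Addresses DISABLE;

-- ...run the bulk insert here...

ALTER INDEX IX_Addresses_A ON dbo.Addresses REBUILD; -- REBUILD is what re-enables a disabled index
ALTER INDEX IX_Addresses_B ON dbo.Addresses REBUILD;

Just don't disable the clustered index; disabling that one makes the whole table inaccessible.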
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
June 30, 2008 at 1:47 pm
Recreating the indexes on 132,873,975 rows can be a better option than doing a BCP with them enabled, which will consume more time for sure.
Are you using the -b (batch) option with bulk copy? As far as I am concerned, I remove the nonclustered indexes, keep the clustered index, and then do the bulk load.
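For example (server and file names made up); -b commits every N rows as a separate batch instead of one giant transaction:

bcp MyDb.dbo.Addresses in addresses.dat -c -T -S MyServer -b 100000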
Maninder
www.dbanation.com
June 30, 2008 at 2:29 pm
The bulk insert is being done by the developers in C#, and they haven't been getting it to be minimally logged, as evidenced by the log file that grew to 60GB on the last load. There are several prerequisites for minimal logging, and I don't know all the details of their operation. The database is in the simple recovery model, so I think they must have gone with one huge batch.
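From what I have read, the load needs a table lock and a sensible batch size, something along these lines in T-SQL (path and names are made up; I know they're doing it from C#, but the idea is the same):

BULK INSERT dbo.Addresses
FROM 'D:\loads\addresses.dat'
WITH
(
TABLOCK, -- a table lock is one of the prerequisites for minimal logging
BATCHSIZE = 100000 -- commit in batches instead of one huge transaction
);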
My recommendation to them on Friday was to place the clustered index on zip like you guys suggest, but why not drop all indexes and just recreate the clustered index and then the other two after the data is loaded?
Anyway, the way they are currently doing the load (to a heap with 2 regular indexes) is giving very poor performance. It is definitely better to drop/load/build the regular indexes. However, could you tell me why this isn't the case with the clustered index?
Thanks
July 1, 2008 at 3:47 am
With data as huge as 132 million rows, it would always be better to drop the indexes and recreate them after the data is loaded.
July 1, 2008 at 8:48 am
SrikanthSv (7/1/2008)
With data as huge as 132 million rows, it would always be better to drop the indexes and recreate them after the data is loaded.
In 2005, you can accomplish much the same thing by disabling and enabling. It's just easier to do, since you don't actually have to make sure your create scripts are synchronized. That's why I suggest disable/enable, instead of drop/create.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
July 1, 2008 at 8:53 am
Toby White (6/30/2008)
...My recommendation to them on Friday was to place the clustered index on zip like you guys suggest, but why not drop all indexes and just recreate the clustered index and then the other two after the data is loaded? Anyway, the way they are currently doing the load (to a heap with 2 regular indexes) is giving very poor performance. It is definitely better to drop/load/build the regular indexes. However, could you tell me why this isn't the case with the clustered index?
If the data is going to be clustered on the Zip and the import file is already ordered by the Zip, there's not really a reason to drop and re-create the clustered index, so far as I know. I'd have to test with and without to see what it really does, but if you load the data without a clustered index and then create it afterward, that might take longer than the slowdown of having it on the table during the load. Worth testing, most likely.
The reason I say this is that, without a clustered index in place during the load, the data may not actually end up in the sequence the clustered index will impose. That means if it loads as a heap and you then add a clustered index, the engine has to sort the data and rebuild the whole table. That could cost you more time than you save by loading into a heap.
On the non-clustered indexes, seriously look at disable/enable, instead of drop/build. Saves you from accidentally ending up with contradictory create statements in multiple places.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
July 2, 2008 at 7:20 am
Thank you all very much for your posts. I will indeed recommend disable/enable for the indexes. I wanted to explore the question of whether the cluster on zip should be there during load just a little further. As for whether the initial bulk insert should be done on a heap or the zip cluster should already exist - it makes sense that we will probably have to test it with volume to get a Definitive(ish) answer on that.
However, the next step is loading a daily file of roughly 1 million more records, also ordered by zip. My thought is that inserting this into the table with all the resting data (120 million rows and growing) will work a little better with a fill factor of maybe 90%, because less page splitting will occur (versus a fill of, say, 95%+) as the incoming records are inserted into the pages of the clustered index. Then maybe rebuild the clustered index every week or every month, depending on execution time.
The daily load seems moderately straightforward to me, with no need for disabling/enabling or anything fancy, although I could consider a partition function/scheme on the table. If you see any gotchas/pitfalls please let me know, and thanks again for your commentary/suggestions thus far.
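In other words, something like this on a weekly or monthly schedule (index name is made up):

ALTER INDEX IX_Addresses_Zip ON dbo.Addresses
REBUILD WITH (FILLFACTOR = 90); -- leave ~10% free per page for the daily inserts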
July 2, 2008 at 11:58 am
Is it possible for you to have some sort of LoadDate or BatchID in the clustered index? If so, then that will really expedite your re-indexing process. If not, then a heap may be your best bet.
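Something along these lines, if the schema allows it (column and index names are hypothetical):

CREATE CLUSTERED INDEX IX_Addresses_LoadDate_Zip
ON dbo.Addresses (LoadDate, Zip);

New daily rows then land at the tail of the index instead of splitting pages throughout the table.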
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
July 2, 2008 at 1:21 pm
You should pass the "-h ORDER() " hint to your bulkload statement 😉
* Noel
July 2, 2008 at 1:47 pm
Rule of thumb says "Never BCP directly into the final table." Applies for this, as well... especially this. Load the million rows into a new staging table, then add indexes on it, then do the merge with the big table.
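Something like this, with a dummy two-column table standing in for the real schema (names and path are made up):

-- staging table is a heap; with simple recovery, heap + TABLOCK keeps the load minimally logged
CREATE TABLE dbo.AddressesStage
(Zip CHAR(9) NOT NULL, Payload VARCHAR(200) NULL);

BULK INSERT dbo.AddressesStage
FROM 'D:\loads\daily.dat'
WITH (TABLOCK);

-- index after the load, to support the merge
CREATE CLUSTERED INDEX IX_Stage_Zip ON dbo.AddressesStage (Zip);

-- then merge into the big table
INSERT INTO dbo.Addresses (Zip, Payload)
SELECT Zip, Payload
FROM dbo.AddressesStage;

DROP TABLE dbo.AddressesStage; -- staging data is disposable once merged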
--Jeff Moden
Change is inevitable... Change for the better is not.
July 2, 2008 at 2:02 pm
Not trying to be obtuse, but can you expand on this a bit, Jeff? What issues are there with BCPing directly into the final table? Also, what do you mean by merge?
BTW, don't forget to add the MVP title to your signature 🙂
July 2, 2008 at 2:22 pm
Inserting it into a staging table means you don't have to mess around with making the clustered index match the order in the import file. You can use a clustered index that's more useful to your most common selects or most common range-selects.
I'm not sure why he suggests indexing the staging table. I've never bothered with that, because when I use that method, I dump the data right back out of the staging table as soon as I have it in the persistent table.
Edit: Actually, if doing an "upsert", where some of the rows are used to update existing rows in the persistent table, indexes would make sense. But if it's a straight insert, it won't matter.
What he means by merge is moving the data from the staging table to the final table. (I call the final table "the persisted table", because that's where I hold onto the data more permanently.)
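Since this is 2005 and there's no MERGE statement yet, an "upsert" is just two statements back to back. A rough sketch, with made-up table names and Zip standing in for whatever the real matching key is:

BEGIN TRANSACTION;

-- update the rows that already exist in the persisted table
UPDATE a
SET a.Payload = s.Payload
FROM dbo.Addresses AS a
INNER JOIN dbo.AddressesStage AS s
ON s.Zip = a.Zip;

-- insert the rows that don't exist yet
INSERT INTO dbo.Addresses (Zip, Payload)
SELECT s.Zip, s.Payload
FROM dbo.AddressesStage AS s
WHERE NOT EXISTS
(SELECT 1 FROM dbo.Addresses AS a WHERE a.Zip = s.Zip);

COMMIT;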
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
July 2, 2008 at 2:27 pm
I think Jeff is referring to ALTER TABLE SWITCH ... for partitioned tables
* Noel
July 2, 2008 at 2:50 pm
I think Jeff is referring to ALTER TABLE SWITCH ... for partitioned tables
Yes, this is what I thought also, but the OP had not mentioned using SQL Server 2005 partitioning. It would make sense to use this to switch in the daily 1M inserts, and that would be the only reason the 'staging' table would need its indexing set up to match the destination table.
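If the big table were partitioned (by load date, say), switching in the daily million rows would look roughly like this (partition number and names are hypothetical):

-- the staging table has to match the destination's schema, indexes,
-- and check constraints exactly, or the switch will be refused
ALTER TABLE dbo.AddressesStage
SWITCH TO dbo.Addresses PARTITION 42;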