Bulk Insert and Index Considerations

  • John Rowan (7/2/2008)


    I think Jeff is reffering to ALTER TABLE SWITCH ... for partitioned tables

    Yes, this is what I thought also, but the OP had not mentioned using SQL Server 2005 partitioning. It would make sense to use this to switch in the daily 1M inserts and that would be the only reason why the 'staging' table would need to have the indexing set up on it as it would need to match the destination table.

    Yeah, that would make sense.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • John Rowan (7/2/2008)


    Not trying to be obtuse, but can you expand on this a bit Jeff? What issues are there with BCPing directly into the final table? Also, what do you mean by merge?

    BTW--don't forget to add the MVP title to your signature 🙂

    The biggest issue is that you are inserting what boils down to unvalidated data. If you're baking and you see an unlabeled bag of sugar that wasn't there before, do you use it as if it were sugar? Every good cook will tell you "No"... verify what's in the bag. In cooking, you do that by tasting it (unless you happen to have a chemical engineer in the house that brings things home). In DB's you do that by validating. In DB's, validating usually means a staging table.

    Other dangers are that people don't always setup for it correctly... you want to try to rollback a million row BULK INSERT if more than 10 rows fail? On a production table? On an OLTP table? Also, Bulk Insert will cause locking... and escalated locking. You want to make customers wait to use the table for a bulk insert of unverified data?

    There's also no opportunity to do a dupe check if you bulk insert into the final table. Sure, PK and, perhaps, an ignore dupes will take care of all that... on a million rows, it'll take some pretty good additional time, though.

    I dunno... for all the reasons I stated, and more, doing a bulk insert directly into a production table instead of a staging table is like drinking from an unknown cup in a urology clinic. "Hey HOMER! Iss gotta be beer... iisss gut foam on it!"...

    I don't think so. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • noeld (7/2/2008)


    I think Jeff is reffering to ALTER TABLE SWITCH ... for partitioned tables

    Not what I was referring to... some folks just feel that they've gotta put an index on things. 😉 For a million row staging table, I might add an index, too, but only if I'm going to merge the data into a permanent table. Index on an IDENTITY column comes in real handy for dupe checks, for eample. Index on a join column to do an update helps, as well.

    The big thing that I'm trying to get to is, if you have indexes, whether it's on a staging table or not, the table must be empty to even have a shot at minimal logging according to all the requirements for mimimal logging in Books Online. The other requirement is that you have to have TABLOCKX on the table according to BOL.

    I guess that' the other reason I use staging tables... I can put them in a sandbox DB that has the Bulk Insert or Simple recovery mode set and make sure the data is good and that I actually want to load the file long before any real logging on my main DB takes place.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Makes sense, thanks Jeff.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • John Rowan (7/2/2008)


    Not trying to be obtuse, but can you expand on this a bit Jeff? What issues are there with BCPing directly into the final table? Also, what do you mean by merge?

    Almost forgot... Gus is correct. What I mean by a "Merge" is a real live "UpSert"... Insert rows from staging to perm if they don't already exist and, if they do exist, update the perm table with data from the staging table.

    Gus is also correct... if all you're going to do is insert directly from the staging table to the perm table, then there's NO need for any indexes. Typically, I do some heavy duty validations including dupe checks within and dupecheck against the perm table and an index will frequently make that move along better.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Toby White (6/27/2008)


    I am doing a bulk insert of 132,873,975 rows that are ordered by zip code. The receiving table has 2 indexes on it and a 3rd possible clustered index. The data coming in is ordered by zip code, so my thought is to have a clustered index on the zip code that would be 9 bytes wide so that the bulk insert writes could push through contiguously on disk.

    However, my second thought is that I could lose the clustered index all together, but with this strategy I would now have 2 indexes referencing a heap. Would maintenance of these 2 indexes go up during or immediately following the bulk insert since they would now be referencing a heap?

    I think the way it is currently working the operation is being logged, but this additional I/O is not hurting because the log file is separated. However, should I expect far better performance if I got the operation to not be logged because of something else like CPU/Memory/other?

    Not sure we answered all your questions, Toby... Let us know if we didn't.

    Basically, working on a staging table before you insert/update a perm table would be the best way to go in this particular situation, I believe.

    Also, you need to check the rules for Minimal logging... you CAN have indexes, but only if the table is empty. You also have to have exclusive use of the table which usually means a TABLOCK if you're using Bulk Insert. AND, as you probably know, the DB cannot be in the FULL recovery mode. There are a couple of other limitations, as well.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes Jeff, thank you for the commentary/explanation/suggestions. Using a staging table will definitely be the way to go for the daily load. I guess whether the incoming records needed to be inserterd or upserted in the "persistant" table a staging table would be the way to go. The difference is that with straight inserts there would be no need to enable/create any indexes on the staging table after the load.

    My only question/challenge to what you are saying, which you supported with the cook analogy, is that if I am going to need to store all the records in a source table regardless of the data in the record then I am not sure what going to a staging table first buys me. I totally agree with everything you have said in general, but in this case (which actually isn't mine, it's just theoretical because I have upserts in my particular case), the bcp to the staging table won't be logged, but the inserts into the "Persistant" table will be, so what have I gained supposing the requirement is that ALL the incoming data needs to go to a source table?

    Thanks

  • In this case... maybe/probably nothing...

    Just so you know about the logging, BCP will be logged if there are indexes on the perm table so it won't make a difference if it goes to a staging table or direct, in this case because you eventually have to copy staging to perm just like you said... BCP has to meet all the same rules as Bulk Insert to be minimally logged, otherwise, it's fully logged. Your perm table will prevent that.

    The real key is, can you absolutely guarantee that there will always be sugar in the unmarked bag? And, do you want to take that chance? I usually don't but will admit there's always an exception. I just don't think this would be an exception... I don't know what could go wrong for sure, but I wouldn't be willing to take a chance on having to restore a backup or maybe rebuild 130M+ rows. 🙂

    Either way... thanks for the feedback and let us know how it works out! This is good stuff!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi im about to do similar transfer some large tables to a new database.

    YOu mention bcp as this uses the log...Do you have an alternative method to use when transferring large tables....

    Cheers

  • BCP uses the log only if you've not met the requirements for "minimal logging". Otherwise, it will minimally log... just like BULK INSERT which is actually faster than BCP.

    Please refer to the following Books Online URL's for all the details on how to best do bulk loads of data...

    This explains how to qualify a load for "minimal logging"...

    ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/udb9/html/bd1dac6b-6ef8-4735-ad4e-67bb42dc4f66.htm

    This gives a fantastic overview of everything having to do with bulk loads... lot's of great performance hints, too!

    ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/udb9/html/ba0a30de-1a6a-4108-98a2-db7b4918cfc0.htm

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply