Table rebuild efficiency via BCP

  • I'm going through a process in a rather large database (~2 TB) to improve the data consumption and index health of the larger tables.

    In short, the indexes are in too bad a shape to purge efficiently, and the data quantities are too high to practically reindex. Had the data been purged from the beginning, life would be grand, but...

    The theory is that in about the same time it would take to reindex, we could rename the original table, BCP out the data to be retained, recreate the table, and BCP the data back in.

    This way, in a single maintenance window, we achieve the needed purges and end up with fresh, healthy indexes.
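
    Roughly, the per-table flow I have in mind is sketched below; the table, column, file, and server names are placeholders, and the retention predicate would vary per table:

      -- 1. Move the existing table (and its objects) aside.
      EXEC sp_rename 'dbo.BigTable', 'BigTable_old';

      -- 2. Export only the rows to retain, from the command line, e.g.:
      --    bcp "SELECT * FROM MyDB.dbo.BigTable_old WHERE TranDate >= DATEADD(DAY, -45, GETDATE())" queryout D:\bcp\BigTable.dat -n -T -S MYSERVER

      -- 3. Recreate dbo.BigTable (and its clustered index) from the original DDL.

      -- 4. Import the retained rows, again from the command line, e.g.:
      --    bcp MyDB.dbo.BigTable in D:\bcp\BigTable.dat -n -T -S MYSERVER -b 10000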

    The problem I am running into during tests is BCP efficiency.

    I've got a BCP export running now that wrote about 85 million rows to a flat file in the first 20 minutes. In the hour since then, it has gotten through only about another 20 million rows.

    I've seen this with several of the larger tables I've tested this process on: as the file grows, BCP gets slower and slower to write to it.

    Any suggestions?

  • I'm a little confused, because your post first states that rebuilding the indexes would be impractical because of their size, yet you are entertaining the idea of BCP-out/BCP-in for the whole table, which would force you to rebuild the indexes whether you like it or not.

    I would schedule a maintenance window for each large table, one at a time.

    During the window I would rebuild the clustered index, then rebuild the non-clustered indexes on the affected table.

    As a result of this process you would end up with your base table reorganized and all your indexes rebuilt.

    Just my two cents.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • We are in desperate need of purging as well.

    The objective is to rebuild the indexes only on the data we plan to retain.

    We have ~5 years of transactional data and only need 45 days' worth in some of these tables.

    For example, one table started with 200 million rows, and after this process was complete, we were down to 36 million.

    Simply reindexing only covers half of our need (purging AND reindexing), and it could potentially take longer than the BCP out/in method I'm testing now.

  • LAW1143 (2/3/2009)


    We are in desperate need of purging as well.

    The objective is to rebuild the indexes only on the data we plan to retain.

    We have ~5 years of transactional data and only need 45 days' worth in some of these tables.

    For example, one table started with 200 million rows, and after this process was complete, we were down to 36 million.

    Simply reindexing only covers half of our need (purging AND reindexing), and it could potentially take longer than the BCP out/in method I'm testing now.

    In your case study you purged about 82% of the data, is that correct?

    In cases like this one I would go for reverse purging: don't delete... insert! Let me show you how.

    You have your 200-million-row table, AAA.

    Create table BBB as a SELECT from AAA WHERE you-get-your-36-million-rows.

    Rename table AAA as AAA_old.

    Rename table BBB as AAA.

    Rebuild all your indexes on your brand new 36-million-row table.

    Did you get the idea?
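
    In T-SQL terms the idea would look roughly like this (the table, column, and index names and the 45-day cutoff are placeholders):

      -- Copy only the rows you want to keep into a new table.  Under simple
      -- or bulk-logged recovery, SELECT ... INTO is minimally logged.
      SELECT *
      INTO   dbo.BBB
      FROM   dbo.AAA
      WHERE  TranDate >= DATEADD(DAY, -45, GETDATE());

      -- Swap the tables.
      EXEC sp_rename 'dbo.AAA', 'AAA_old';
      EXEC sp_rename 'dbo.BBB', 'AAA';

      -- Build the indexes on the new, much smaller table.
      CREATE CLUSTERED INDEX IX_AAA_TranDate ON dbo.AAA (TranDate);
      -- ...followed by the non-clustered indexes scripted from AAA_old.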

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • Paul, thanks for the suggestion.

    We've tried that route and not had great results either...

    What it looks like I am going to have to do is further scripting on my BCP out statement to write the data to multiple output files, maybe one per month or something.
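
    As a sketch, assuming a TranDate column in the clustered key and placeholder database, server, and file names, it would be something like one bcp command per month:

      rem One export file per month; adjust the names and date ranges to the real schema.
      bcp "SELECT * FROM MyDB.dbo.BigTable WHERE TranDate >= '20081201' AND TranDate < '20090101' ORDER BY TranDate" queryout D:\bcp\BigTable_200812.dat -n -T -S MYSERVER
      bcp "SELECT * FROM MyDB.dbo.BigTable WHERE TranDate >= '20090101' AND TranDate < '20090201' ORDER BY TranDate" queryout D:\bcp\BigTable_200901.dat -n -T -S MYSERVER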

  • Did you set the BCP parameter to commit every so many rows? If not, everything is in one huge transaction, and that would explain your slowdown as the transaction log blows up to a huge size.

    I recommend committing the data about every 100,000 rows as a starting point.
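
    For example, a load command along these lines (table, file, and server names are placeholders) would commit a batch every 100,000 rows:

      bcp MyDB.dbo.BigTable in D:\bcp\BigTable.dat -n -T -S MYSERVER -b 100000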

    The BCP parameters are documented in SQL Server Books Online.

    I have a -b10000 parameter on my BCP out, but I didn't believe it adhered to that parameter (I just had not commented it out yet).

    Is this the parameter you are speaking of? Am I incorrect in thinking it's not used for BCP output commands?

    I've got a -b10000 parameter on the BCP in as well, which inserts in batches of 10k rows.

    I'm going to have to do some subsequent tests to find the 'sweet spot' for the batch size on insert.

  • from BOL:

    -b batch_size

    Specifies the number of rows per batch of imported data. Each batch is imported and logged as a separate transaction that imports the whole batch before being committed. By default, all the rows in the data file are imported as one batch. To distribute the rows among multiple batches, specify a batch_size that is smaller than the number of rows in the data file. If the transaction for any batch fails, only insertions from the current batch are rolled back. Batches already imported by committed transactions are unaffected by a later failure.

    Do not use this option in conjunction with the -h"ROWS_PER_BATCH = bb" option.
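
    For illustration (placeholder names again), the hint form mentioned above would be used like this; per the same BOL entry it applies when -b is not specified, and the whole file then loads as a single transaction:

      bcp MyDB.dbo.BigTable in D:\bcp\BigTable.dat -n -T -S MYSERVER -h "ROWS_PER_BATCH=36000000"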

    Oh, and the database is in simple recovery to keep log sizes down and make things move more quickly.

    That was the problem with doing it via the method Paul mentioned: the transaction sizes get to be too large when it's all one transaction.

    We did one for another environment several months ago in that fashion... I don't recall the table sizes before/after or the row counts. This method was supposed to be an improvement... maybe not... but I had that one set up to do one transaction per day's worth of data to avoid doing it all in one big transaction.

  • You should also do this to speed up the load:

    1. Script out all the non-clustered indexes on the table.

    2. BCP the data out in the order of the clustered index on the new table. Make sure you use a query, not the table name, for the BCP out to ensure the order of the data file.

    3. Truncate the table.

    4. Drop all the indexes on the table, except for the clustered index, before you start the import.

    5. BCP the data in from the file.

    6. Re-create the indexes you dropped, using the scripts you created in step 1.

    The slowdown is likely from massive index fragmentation, so this should prevent that.
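
    A rough sketch of steps 2 and 5 (table, column, file, and server names are placeholders; the ORDER BY and the ORDER hint must match the clustered key of the new table):

      rem Step 2: export via a query, in clustered-index order.
      bcp "SELECT * FROM MyDB.dbo.BigTable WHERE TranDate >= DATEADD(DAY, -45, GETDATE()) ORDER BY TranDate, TranID" queryout D:\bcp\BigTable.dat -n -T -S MYSERVER

      rem Step 5: import, declaring the sort order so the engine can skip re-sorting,
      rem and committing in batches.  TABLOCK is an extra hint here; it is typically
      rem needed for a minimally logged bulk load.
      bcp MyDB.dbo.BigTable in D:\bcp\BigTable.dat -n -T -S MYSERVER -b 100000 -h "ORDER(TranDate ASC, TranID ASC), TABLOCK"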

  • What I'm doing is effectively the same as what you describe (only renaming the existing table and its objects rather than truncating)...

    The BCP out uses a query, with an ORDER BY clause that matches the columns in the clustered index.

    I could probably gain by not adding the non-clustered indexes to the newly created table until after the data is in, but we're not to that point yet (at least with this table).

  • LAW1143 (2/3/2009)


    What I'm doing is effectively the same as what you describe (only renaming the existing table and its objects rather than truncating)...

    The BCP out uses a query, with an ORDER BY clause that matches the columns in the clustered index.

    I could probably gain by not adding the non-clustered indexes to the newly created table until after the data is in, but we're not to that point yet (at least with this table).

    It sounds like you are doing everything except the real point of what I was suggesting, which is to remove the non-clustered indexes and create them again after the load, so that the load is not slowed down by massive index fragmentation.

  • By 'the load' do you mean the extraction TO the BCP file, or the load INTO the table FROM it?

  • For some reason, SELECT INTO sounds more efficient to me. It accomplishes "both" the BCP out and the BCP in, in a minimally logged fashion!

    Then drop the old table, rename the new one, and apply the indexes... but then again... I could be wrong.


    * Noel
