Updating blocks of data in a DW

  • Hi,

    I hope you can help. Our company runs a data warehouse, and half a dozen of the tables hold 15 million+ rows each. The tables are partitioned by month. When we need to update one of the columns in one of these tables, it can take a very long time. Is there a way of updating in batches, say 200,000 rows at a time?

    Normally I would just do an update table A.......where month = 'June', then do the same for each month. However, we would like to kick a few of these off but also allow a break in between, run a check that the update was successful, and stop the process if it wasn't. I can also see that we could use the partition ID as the selection criteria; however, we would ideally like a dynamic script that automatically determines how many partitions there are for that table and moves through them accordingly. The script could then be reused on the other tables.

    I have looked into partition switching, but from what I have read it seems aimed more at inserting new data or archiving than at updating existing data.

    On a side note, the tables are indexed, but recreating the indexes takes about 4 hours, so if we can avoid doing that it would save a lot of time.

    Does anyone have a way of doing this? Has what I have typed made sense?

    Jamie

    You can write a script that does a WAITFOR, or even runs a command, in between updates:

    update a
    set x = y
    where month = 1
    -- check errors here (e.g. test @@ERROR, or wrap the update in TRY/CATCH)
    go

    waitfor delay '00:05:00'  -- pause for five minutes between updates
    go

    update b
    set x = y
    where month = 2

    etc.

    If you need more checking, or user interaction, I'd have a notification sent to someone when the process finishes and then have them check. Give them a series of scripts to run in order. Or use jobs that delete themselves when they're done running, to prevent someone from running one again.
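    A minimal sketch of the self-deleting-job idea, using SQL Server Agent's built-in auto-delete option (the job name and the step's command are made up for illustration):

    -- create a job that removes itself after it completes successfully
    EXEC msdb.dbo.sp_add_job
        @job_name     = N'DW_UpdateBatch_June',
        @delete_level = 1;   -- 1 = delete the job on success

    EXEC msdb.dbo.sp_add_jobstep
        @job_name  = N'DW_UpdateBatch_June',
        @step_name = N'Run batched update',
        @subsystem = N'TSQL',
        @command   = N'UPDATE dbo.TableA SET x = y WHERE month = 1;';

    -- target the local server so the job can actually run
    EXEC msdb.dbo.sp_add_jobserver
        @job_name = N'DW_UpdateBatch_June';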

  • Steve,

    Thanks for your quick reply. The waitfor will be very useful. However, the main challenge is how to get the update done in batches: firstly so that if it does fail we don't need to start over again, and secondly so we can manage our update windows better. We would know which batch of updates failed, investigate why, and then rerun from that point.
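    For the batching itself, something along these lines could be a starting point. This is only a sketch; the table name, column names, and new value are placeholders, and it assumes updated rows can be told apart from not-yet-updated ones:

    DECLARE @BatchSize INT = 200000,
            @NewValue  INT = 1,      -- placeholder value
            @Rows      INT = 1;

    WHILE @Rows > 0
    BEGIN
        BEGIN TRY
            -- touch at most @BatchSize rows that still need the change
            UPDATE TOP (@BatchSize) dbo.BigTable
            SET    TargetCol = @NewValue
            WHERE  MonthCol  = 'June'
              AND  TargetCol <> @NewValue;  -- so each pass makes progress

            SET @Rows = @@ROWCOUNT;
        END TRY
        BEGIN CATCH
            -- stop the loop; the error tells you which batch to investigate
            THROW;
        END CATCH
    END

    Because each pass only touches rows that still hold the old value, a failed run can be restarted and will pick up where it left off.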

    Are you updating every row in a partition, or just a select number of rows within a partition? If you are updating every record, then partition switching may help: you could loop through each partition, switch it out to an empty staging table, drop the indexes on that table, update the column you need to update, and switch the table back into the old partition.

    Depending on the volume of data, this could increase performance by a lot.
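    Roughly, the switch-out/switch-in dance for one partition might look like the sketch below. The table names are placeholders, @Partition holds the partition number (see the loop that follows), and the staging table must have an identical structure on the same filegroup:

    -- switch the partition out to an empty staging table (metadata-only, so fast)
    ALTER TABLE dbo.MyPartitionedTable
        SWITCH PARTITION @Partition TO dbo.MyStagingTable;

    -- update the column on the much smaller staging table
    UPDATE dbo.MyStagingTable SET TargetCol = 'NewValue';

    -- switch it back in; the staging table needs a trusted CHECK constraint
    -- matching this partition's boundary values, and any indexes dropped for
    -- the update must be rebuilt first, since SWITCH requires matching indexes
    ALTER TABLE dbo.MyStagingTable
        SWITCH TO dbo.MyPartitionedTable PARTITION @Partition;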

    Here is a simple way of looping through each partition on a table:

    DECLARE @Partition INT, @MaxPartition INT;

    SELECT @Partition = (SELECT TOP 1 $PARTITION.PartitionFunc(PartitionCol)
                         FROM MyPartitionedTable ORDER BY PartitionCol);

    SELECT @MaxPartition = (SELECT TOP 1 $PARTITION.PartitionFunc(PartitionCol)
                            FROM MyPartitionedTable ORDER BY PartitionCol DESC);

    WHILE (@Partition <= @MaxPartition)
    BEGIN
        -- Do your thing here

        SET @Partition = @Partition + 1;
    END
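    Inside that loop, the per-partition update can just filter on the same $PARTITION function. A sketch, reusing the placeholder names from the code above:

    -- update only the rows that live in the current partition
    UPDATE MyPartitionedTable
    SET    TargetCol = 'NewValue'  -- placeholder column and value
    WHERE  $PARTITION.PartitionFunc(PartitionCol) = @Partition;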

  • Partitions might make more sense in that case. Sorry, wasn't sure what you were trying to accomplish from the first post.

    No matter how you do it, you probably want to log the work done in each loop. That way you know what is done and what isn't. So have another table that tracks which batch of rows you've completed, based on some key value. It's a little more work to develop, but easier than trying to outer join or sub-select across two large tables to figure out where you failed.
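    A sketch of that logging idea; the table and column names are invented for illustration:

    CREATE TABLE dbo.BatchUpdateLog (
        TableName    SYSNAME   NOT NULL,
        PartitionNum INT       NOT NULL,
        RowsUpdated  INT       NOT NULL,
        CompletedAt  DATETIME2 NOT NULL DEFAULT SYSDATETIME()
    );

    -- inside the loop, immediately after each successful batch:
    DECLARE @Rows INT = @@ROWCOUNT;   -- capture before anything resets it

    INSERT dbo.BatchUpdateLog (TableName, PartitionNum, RowsUpdated)
    VALUES (N'MyPartitionedTable', @Partition, @Rows);

    On restart, a quick query against this table shows the last batch that completed, so the driver script knows where to resume.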

