Batch Update Performance

  • Hi all,

    I have a batch UPDATE I'm trying to use to populate a new column for historic data on some fact tables in our data warehouse (up to around 1,000,000,000 rows per table). The code is structured as follows:


    SET NOCOUNT ON;

    DECLARE @TotalRowCount BIGINT = 0;
    DECLARE @TotalRowCountMessage VARCHAR(20);
    DECLARE @RowsAffected BIGINT = 0;

    WHILE (1 = 1)
    BEGIN

        BEGIN TRAN;

        -- Update the next batch of rows that have not yet been populated
        UPDATE TOP (100000) F
        SET    NewID = ISNULL(D.NewID, -1)
        FROM   dbo.FactTable F
               LEFT JOIN dbo.DimTable D ON D.OldID = F.OldID
        WHERE  F.NewID IS NULL;

        SET @RowsAffected = @@ROWCOUNT;

        -- Nothing left to update: commit and exit the loop
        IF @RowsAffected = 0
        BEGIN
            COMMIT TRAN;
            BREAK;
        END

        COMMIT TRAN;

        -- Print a running total; WITH NOWAIT flushes the message immediately
        SET @TotalRowCount = @TotalRowCount + @RowsAffected;
        SET @TotalRowCountMessage = CAST(@TotalRowCount AS VARCHAR(20));
        RAISERROR(@TotalRowCountMessage, 0, 1) WITH NOWAIT;

        WAITFOR DELAY '00:00:01';

    END

    GO

    I have added an index on the D.OldID column (with D.NewID included) to try to speed up the lookup for each row, and the database was already using the simple recovery model, so logging should not be an issue.
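
    For reference, the index is along these lines (the name here is just illustrative):

    CREATE NONCLUSTERED INDEX IX_DimTable_OldID -- illustrative name
    ON dbo.DimTable (OldID)
    INCLUDE (NewID);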

    The update completes the first ~10,000,000 rows in very little time, but then it slows down hugely and takes a further 20mins+ to complete the next 100,000. Could anyone suggest why this might be? Memory issues perhaps? Any ideas what I could do to ensure the performance is consistent throughout the entire execution?

    Thanks

  • Do you have an IDENTITY column as the PK of your fact table? If so, use that to drive your looping and remove the TOP (100000):

    UPDATE F
    SET    NewID = ISNULL(D.NewID, -1)
    FROM   dbo.FactTable F
           LEFT JOIN dbo.DimTable D ON D.OldID = F.OldID
    WHERE  F.Id BETWEEN X AND Y
           AND F.NewID IS NULL

    (X and Y increment with every loop iteration, of course)
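
    As a rough skeleton (the batch size and variable names are only illustrative, and it assumes Id is an ascending IDENTITY without huge gaps):

    DECLARE @BatchSize BIGINT = 100000;
    DECLARE @X BIGINT = 1;
    DECLARE @MaxId BIGINT = (SELECT MAX(Id) FROM dbo.FactTable);

    WHILE @X <= @MaxId
    BEGIN
        -- Each batch covers a fixed slice of the PK range, so SQL Server can seek
        -- straight to those rows instead of hunting for the next NULLs every time
        UPDATE F
        SET    NewID = ISNULL(D.NewID, -1)
        FROM   dbo.FactTable F
               LEFT JOIN dbo.DimTable D ON D.OldID = F.OldID
        WHERE  F.Id BETWEEN @X AND @X + @BatchSize - 1
               AND F.NewID IS NULL;

        SET @X = @X + @BatchSize;
    END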

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Ah, of course! That would make much more sense... why didn't I think of that? :ermm: Haha

    I'll give that a try - thanks very much!

  • mg_dba - Thursday, March 1, 2018 5:19 AM

    "The update completes the first ~10,000,000 rows in very little time, but then it slows down hugely and takes a further 20mins+ to complete the next 100,000." Once the cached data has been exhausted, the remaining data must be retrieved from disk. If you have enough RAM to cache it all - and it's quite a big table - you might want to look into "cache priming". It can take significantly less time to read all your data into RAM then process it, rather than reading/processing small chunks at a time.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks ChrisM, that's an interesting idea. I've just tried it using Phil's suggestion and it updated 20,000,000 rows on a cold cache in almost exactly 4 minutes (and only running on a small test server), which I don't think is too awful. Performance was much more consistent too, so I'll give it a try against a whole table and see how it goes.

    Thanks both for your help.
