March 1, 2018 at 5:19 am
Hi all,
I have a batch UPDATE I'm trying to use to populate a new column for historic data on some fact tables in our data warehouse (up to around 1,000,000,000 rows per table). The code is structured as follows:
SET NOCOUNT ON;

DECLARE @TotalRowCount BIGINT = 0;
DECLARE @TotalRowCountMessage VARCHAR(20);
DECLARE @RowsAffected BIGINT = 0;

WHILE (1 = 1)
BEGIN
    BEGIN TRAN;

    UPDATE TOP (100000) F
    SET NewID = ISNULL(D.NewID, -1)
    FROM dbo.FactTable F
    LEFT JOIN dbo.DimTable D ON D.OldID = F.OldID
    WHERE F.NewID IS NULL;

    SET @RowsAffected = @@ROWCOUNT;

    IF @RowsAffected = 0
    BEGIN
        COMMIT TRAN;
        BREAK;
    END;

    COMMIT TRAN;

    SET @TotalRowCount = @TotalRowCount + @RowsAffected;
    SET @TotalRowCountMessage = CAST(@TotalRowCount AS VARCHAR(20));
    RAISERROR(@TotalRowCountMessage, 0, 1) WITH NOWAIT;

    WAITFOR DELAY '00:00:01';
END;
GO
I have added an index on D.OldID (with D.NewID as an included column) to speed up the lookup for each row, and the database was already using the simple recovery model, so logging should not be an issue.
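For reference, the index I created looks something like this (the index name is illustrative):

CREATE NONCLUSTERED INDEX IX_DimTable_OldID
    ON dbo.DimTable (OldID)
    INCLUDE (NewID);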
The update completes the first ~10,000,000 rows in very little time, but then it slows down hugely and takes a further 20+ minutes to complete the next 100,000. Could anyone suggest why this might be? Memory issues, perhaps? Any ideas on what I could do to keep performance consistent throughout the entire execution?
Thanks
March 1, 2018 at 6:01 am
Do you have an IDENTITY column as the PK of your fact table? If so, use that to drive your looping and remove the TOP (100000):
UPDATE F
SET NewID = ISNULL(D.NewID, -1)
FROM dbo.FactTable F
LEFT JOIN dbo.DimTable D ON D.OldID = F.OldID
WHERE F.Id BETWEEN X AND Y
    AND F.NewID IS NULL;
(X and Y increment with every loop iteration, of course)
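Putting that together, a minimal sketch of the full loop, assuming an IDENTITY primary key column named Id on the fact table (the column name, variable names, and batch size are illustrative):

SET NOCOUNT ON;

DECLARE @BatchSize BIGINT = 100000;
DECLARE @X BIGINT, @Y BIGINT, @MaxId BIGINT;

-- Seed the window from the actual key range of the table.
SELECT @X = MIN(Id), @MaxId = MAX(Id) FROM dbo.FactTable;

WHILE @X <= @MaxId
BEGIN
    SET @Y = @X + @BatchSize - 1;

    UPDATE F
    SET NewID = ISNULL(D.NewID, -1)
    FROM dbo.FactTable F
    LEFT JOIN dbo.DimTable D ON D.OldID = F.OldID
    WHERE F.Id BETWEEN @X AND @Y
        AND F.NewID IS NULL;

    -- Advance the window: each iteration seeks directly to its own
    -- key range instead of re-scanning rows that are already updated.
    SET @X = @Y + 1;
END;
GO

That seek into a fixed key range is likely why this stays fast: the TOP (100000) version has to read past ever more already-updated (non-NULL) rows before it finds its next batch.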
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
March 1, 2018 at 6:44 am
Ah, of course! That would make much more sense... why didn't I think of that? Haha
I'll give that a try - thanks very much!
March 1, 2018 at 6:52 am
mg_dba - Thursday, March 1, 2018 5:19 AM
"The update completes the first ~10,000,000 rows in very little time, but then it slows down hugely and takes a further 20mins+ to complete the next 100,000." Once the cached data has been exhausted, the remaining data must be retrieved from disk. If you have enough RAM to cache it all - and it's quite a big table - you might want to look into "cache priming". It can take significantly less time to read all your data into RAM then process it, rather than reading/processing small chunks at a time.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 1, 2018 at 7:34 am
Thanks ChrisM, that's an interesting idea. I've just tried the update using Phil's suggestion and it updated 20,000,000 rows on a cold cache in almost exactly 4 minutes (and that was only on a small test server), which I don't think is too awful. Performance was much more consistent too, so I'll give it a try against a whole table and see how it goes.
Thanks both for your help.