March 1, 2018 at 5:19 am
Hi all,
I have a batch UPDATE I'm trying to use to populate a new column for historic data on some fact tables in our data warehouse (up to around 1,000,000,000 rows per table). The code is structured as follows:
SET NOCOUNT ON;

DECLARE @TotalRowCount BIGINT = 0;
DECLARE @TotalRowCountMessage VARCHAR(20);
DECLARE @RowsAffected BIGINT = 0;

WHILE (1 = 1)
BEGIN
    BEGIN TRAN;

    UPDATE TOP (100000) F
    SET NewID = ISNULL(D.NewID, -1)
    FROM dbo.FactTable F
    LEFT JOIN dbo.DimTable D ON D.OldID = F.OldID
    WHERE F.NewID IS NULL;

    SET @RowsAffected = @@ROWCOUNT;

    IF @RowsAffected = 0
    BEGIN
        COMMIT TRAN;
        BREAK;
    END;

    COMMIT TRAN;

    SET @TotalRowCount = @TotalRowCount + @RowsAffected;
    SET @TotalRowCountMessage = CAST(@TotalRowCount AS VARCHAR(20));
    RAISERROR(@TotalRowCountMessage, 0, 1) WITH NOWAIT;

    WAITFOR DELAY '00:00:01';
END;
GO
I have added an index on D.OldID (with D.NewID as an included column) to speed up the lookup for each row, and the database was already using the simple recovery model, so logging should not be an issue.
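For reference, the index I created looks something like this (the index name is illustrative):

CREATE NONCLUSTERED INDEX IX_DimTable_OldID
    ON dbo.DimTable (OldID)
    INCLUDE (NewID);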
The update completes the first ~10,000,000 rows in very little time, but then it slows down hugely and takes a further 20+ minutes to complete the next 100,000. Could anyone suggest why this might be? Memory issues, perhaps? Any ideas on what I could do to keep performance consistent throughout the entire execution?
Thanks
March 1, 2018 at 6:01 am
Do you have an IDENTITY column as the PK of your fact table? If so, use that to drive your looping and remove the TOP (100000):
UPDATE F
SET NewID = ISNULL(D.NewID, -1)
FROM dbo.FactTable F
LEFT JOIN dbo.DimTable D ON D.OldID = F.OldID
WHERE F.Id BETWEEN X AND Y
    AND F.NewID IS NULL;
(X and Y increment with every loop iteration, of course)
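Putting that together, a minimal sketch of the full loop, assuming an IDENTITY primary key column named Id on the fact table (the column name, variable names, and batch size are illustrative):

SET NOCOUNT ON;

DECLARE @BatchSize BIGINT = 100000;
DECLARE @X BIGINT, @Y BIGINT, @MaxId BIGINT;

-- Seed the window from the actual key range of the table.
SELECT @X = MIN(Id), @MaxId = MAX(Id) FROM dbo.FactTable;

WHILE @X <= @MaxId
BEGIN
    SET @Y = @X + @BatchSize - 1;

    UPDATE F
    SET NewID = ISNULL(D.NewID, -1)
    FROM dbo.FactTable F
    LEFT JOIN dbo.DimTable D ON D.OldID = F.OldID
    WHERE F.Id BETWEEN @X AND @Y
        AND F.NewID IS NULL;

    -- Advance the window: each iteration seeks directly to its own
    -- key range instead of re-scanning rows that are already updated.
    SET @X = @Y + 1;
END;
GO

That seek into a fixed key range is likely why this stays fast: the TOP (100000) version has to read past ever more already-updated (non-NULL) rows before it finds its next batch.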
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
March 1, 2018 at 6:44 am
Ah, of course! That would make much more sense... why didn't I think of that? Haha
I'll give that a try - thanks very much!
March 1, 2018 at 6:52 am
mg_dba - Thursday, March 1, 2018 5:19 AM
"The update completes the first ~10,000,000 rows in very little time, but then it slows down hugely and takes a further 20mins+ to complete the next 100,000." Once the cached data has been exhausted, the remaining data must be retrieved from disk. If you have enough RAM to cache it all - and it's quite a big table - you might want to look into "cache priming". It can take significantly less time to read all your data into RAM then process it, rather than reading/processing small chunks at a time.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 1, 2018 at 7:34 am
Thanks ChrisM, that's an interesting idea. I've just tried the update using Phil's suggestion and it updated 20,000,000 rows on a cold cache in almost exactly 4 minutes (and that was only on a small test server), which I don't think is too awful. Performance was much more consistent too, so I'll give it a try against a whole table and see how it goes.
Thanks both for your help.