Table Migration script copying over more rows than intended

  • All,

I am attempting to perform a table migration for one table (tableA) from my source database (dbsrc) into my destination database (dbdest) based on my primary key clustered index column (tableAID), which is the identity column.

I am getting my MIN tableAID into a variable (@MinID) and my MAX tableAID into a variable (@MaxID) and using a WHILE loop that inserts batches of 25,000 rows at a time.  The caveat here is that my source table has data in it up to Dec 16th 2023, but I only want to bring data over to the destination table up to Oct 31st 2023.

My MinID is 1000000 and my MaxID is 59755723 (the very last record for Oct 31st 2023).  I have verified the MaxID is for Oct 31st 2023, and the very next value is Nov 1st 2023 (which I don't want in my destination table).

The issue is that my destination table is pulling in data for November and December 2023, and I don't want this.

    My code for this table migration is:

    DECLARE @MinID BIGINT
    ,@MaxID BIGINT
    ,@Batch INT = 25000;


    --Get the very first MIN ID
    SELECT @MinID = MIN(tableAID)
    FROM dbsrc.dbo.tableA;


    --Get the MAX ID for EndDate
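--(the DATEADD/DATEDIFF expression below resolves to the last day of the month two months before the current month - e.g. 2023-10-31 when run in December 2023)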
    SELECT @MaxID = MAX(tableAID)  --59755723
    FROM dbsrc.dbo.tableA
    WHERE CAST(LastUpdateDateTime AS DATE) <= DATEADD(MONTH, DATEDIFF(MONTH, -1, GETDATE()) - 2, -1);


WHILE @MinID <= @MaxID
BEGIN
    BEGIN TRY

        INSERT INTO dbdest.dbo.tableA
        SELECT a.*
        FROM dbsrc.dbo.tableA a
        WHERE tableAID >= @MinID AND tableAID < @MinID + @Batch;

        SELECT @MinID = @MinID + @Batch;

    END TRY
    BEGIN CATCH
        SELECT ERROR_MESSAGE();
    END CATCH
END

My question here is: how can I ensure that I truly only migrate data into the destination table up to my MaxID (Oct 31st 2023)?

    Thank you for any and all help here!

your issue is that while you are checking the variable against @MaxID in the loop condition, you aren't checking it in the WHERE clause - so you can get up to 25k - 1 rows that are over the @MaxID value (assuming your current increment of 25k)

and I have to say that 25k rows is a rather small batch size - unless your server is crap I would use a higher value (100k to 500k or even 1 million)

see changes to your code below - unless I messed it up, it should work.

    DECLARE @MinID BIGINT
    ,@MaxID BIGINT
    ,@workid bigint -- new variable
    ,@Batch INT = 25000;


    --Get the very first MIN ID
    SELECT @MinID = MIN(tableAID)
    FROM dbsrc.dbo.tableA;


    --Get the MAX ID for EndDate
    SELECT @MaxID = MAX(tableAID) --59755723
    FROM dbsrc.dbo.tableA
    WHERE CAST(LastUpdateDateTime AS DATE) <= DATEADD(MONTH, DATEDIFF(MONTH, -1, GETDATE()) - 2, -1);


WHILE @MinID <= @MaxID
BEGIN
    BEGIN TRY

        -- cap the upper bound for the final batch; +1 because the "<" below is exclusive and the row with tableAID = @MaxID must still be included
        SET @workid = CASE WHEN @MinID + @Batch > @MaxID THEN @MaxID + 1 ELSE @MinID + @Batch END;

        INSERT INTO dbdest.dbo.tableA WITH (TABLOCK) -- added so it may allow for bulk logging - handle with care if the destination table is a columnstore table and this is executed within an explicit transaction - may need to be removed in that case due to a SQL Server bug (not yet resolved in SQL 2019 although MS says it is)
        SELECT a.*
        FROM dbsrc.dbo.tableA a
        WHERE a.tableAID >= @MinID AND a.tableAID < @workid; -- changed to use the new calculated variable

        SELECT @MinID = @MinID + @Batch; -- leave as is

    END TRY
    BEGIN CATCH
        SELECT ERROR_MESSAGE();
    END CATCH
END
  • Thank you for the reply, fredrico_fonseca...

Unfortunately, I am still pulling in data past Oct 31st 2023 with this additional logic, so I'm still digging in.  With regards to the batch size, once I get the data nailed down, I will experiment with increasing it.

    Thanks Again!

  • I don't see why there should be a cast-iron association between tableAID and LastUpdateDateTime, assuming that any row, old or new, can be updated (and thus have its LastUpdateDateTime modified).

    Instead, you'll have to explicitly exclude these rows from your INSERT query:

    INSERT INTO dbdest.dbo.tableA with (tablock)
    SELECT a.*
    FROM dbsrc.dbo.tableA a
    WHERE a.tableAID >= @MinID AND a.tableAID < @workid and -- add date filter condition here
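For example (a sketch only - the LastUpdateDateTime column and the cutoff literal are assumptions based on this thread), the completed query might look like this:

INSERT INTO dbdest.dbo.tableA with (tablock)
SELECT a.*
FROM dbsrc.dbo.tableA a
WHERE a.tableAID >= @MinID
AND a.tableAID < @workid
AND a.LastUpdateDateTime < '20231101'; -- half-open upper bound keeps every row on Oct 31st, whatever its time component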

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • stop playing with copying the data and focus on getting the filtering correct.

maybe the column you are using isn't the right one to use - maybe it is only set at a very specific point in time, and only during an explicit transaction.

the code change I gave assumed you had already tested all your other criteria and were only looking into the remaining issue.

also do note that if your column "LastUpdateDateTime" contains time values, your filter will exclude all entries that happened on the 31st after 00:00:00
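a quick standalone demo of that midnight cut-off (the timestamp is hypothetical):

DECLARE @ts DATETIME = '2023-10-31T14:30:00';
SELECT CASE WHEN @ts <= '20231031' THEN 'kept' ELSE 'dropped' END AS inclusive_compare, -- 'dropped': the date literal means midnight on the 31st
       CASE WHEN @ts <  '20231101' THEN 'kept' ELSE 'dropped' END AS half_open_compare; -- 'kept': the half-open range covers the whole day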


Thank you all for the help!  Phil's suggestion worked for me; I was trying to avoid it as there is no leading index on the date columns.

I also need to work on getting my date logic to be Oct 31st 23:59:59.999.

    Thanks again!

  • GBeezy wrote:

Thank you all for the help!  Phil's suggestion worked for me; I was trying to avoid it as there is no leading index on the date columns.

I also need to work on getting my date logic to be Oct 31st 23:59:59.999.

    Thanks again!

If that proves to be a performance killer, it might be worth testing doing the copy as you had it previously and then performing a delete on the target table to remove the unwanted rows that were copied over ... perhaps after adding an index on the date column.
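Something along these lines, perhaps (a sketch only - the '20231101' cutoff and the batch size are assumptions; deleting in chunks keeps the transaction log manageable):

--remove the November/December rows that were copied over, in batches
WHILE 1 = 1
BEGIN
    DELETE TOP (100000)
    FROM dbdest.dbo.tableA
    WHERE LastUpdateDateTime >= '20231101';

    IF @@ROWCOUNT = 0 BREAK;
END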


Phil, adding the date logic to the WHERE clause really didn't affect performance at all.  I thought about deleting records after the INSERT as well, but again, I was trying to avoid that.  I believe what I have is sufficient, so thank you again for the help.  I am going to test the different batch sizes as fredrico suggested.

  • GBeezy wrote:

Thank you all for the help!  Phil's suggestion worked for me; I was trying to avoid it as there is no leading index on the date columns.

I also need to work on getting my date logic to be Oct 31st 23:59:59.999.

    Thanks again!

I'm going to recommend not using that particular logic.  Instead, use a date criterion of < '20231101'.  And add an index to the original table to support it.
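For instance (the index name here is made up - use your own naming convention):

CREATE NONCLUSTERED INDEX IX_tableA_LastUpdateDateTime
    ON dbsrc.dbo.tableA (LastUpdateDateTime);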

Last but not least, you could be creating a huge log file with all of this, even if you were in the SIMPLE recovery model.  At the very least, your "chunking" method is all but guaranteeing that things run about twice as slow.

    Please post the CREATE TABLE for both the source and destination tables including any and all indexes and let's see if we can get this into a "Minimally Logged" state.
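As a rough sketch of where that could end up (assuming the destination table is empty and the database is in the SIMPLE or BULK_LOGGED recovery model - two of the usual preconditions for minimal logging):

INSERT INTO dbdest.dbo.tableA WITH (TABLOCK) -- TABLOCK on an empty target is another of the minimal-logging requirements
SELECT a.*
FROM dbsrc.dbo.tableA a
WHERE a.LastUpdateDateTime < '20231101';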

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
