Big data: transfer x-number of rows per batch?

  • Super_Grover (10/24/2016)


    Okay we're trying to solve this but I'll mark Phil's post as solution.

    Thanks again!

    If you find that you cannot run this as a distributed transaction, it should be possible to tweak the code.

    Won't be quite as robust, but should get the job done.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Since you've posted to a SQL 2012 forum, you might want to use a feature that was introduced in SQL 2012 specifically for paging: OFFSET/FETCH.

    DECLARE @pg INT = 1,

    @pg_size INT = 1000

    @rows INT;

    SET @rows = @pg_size;

    WHILE @rows = @pg_size

    BEGIN

    INSERT INTO [P2WIDB002].[CS_ConversieAanlevering_VWB].[dbo].[MULTIMED_INHOUD]

    SELECT * FROM [NLDSLASSQLTST].[CS_ConversieAanlevering].[dbo].[MULTIMED_INHOUD]

    ORDER BY doc_number

    OFFSET @pg_size * (@pg - 1) ROWS

    FETCH NEXT @pg_size ROWS ONLY;

    SELECT @rows = @@ROWCOUNT, @pg += 1;

    END

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Another consideration is whether the process/progress could/should be interrupted mid-stream. Some additional flexibility might come from using a real table instead of temp where you could add a field for datetime when the records were transferred. Then you could pull batches of records from the set with no transferred datetime. This way, you could split the operation up over more than one day or if a batch got interrupted, delete them from the destination and refill. If the server catches on fire, you could stop your transfers and deal with it.

  • Another thing to strongly consider when transferring such large amounts of data is "Minimal Logging" and the proper use of Trace Flag 610, which has been available since 2008 R2. It can save not only a relatively huge amount of time, but a huge amount of log file space and, sometimes, MDF space depending on when you build the Clustered Index).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You should use bcp

  • wtren (10/25/2016)


    You should use bcp

    Please justify this comment.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Bcp can control batch size directly, write very fast, have very little disturb and burden to product database, after that you can copy data to the destine and do your data loading slowly, and no need to read data from source again if you have any trouble, it's a different strategy.

  • I was going to recommend SSIS for the same reasons, but it seemed like the conversation was focused on the solution we were discussing. It also sounds like a one - off, so I'm not sure how much of a learning curve is acceptable or the time frame.

  • yes, it's split the process into two step, and do it one by one, as I say, it's just a different strategy to do a work, depend on your requirement.

  • If you willing to go the CLR route, and the LOB is compressible, you can use compression functionality from System.IO.Compression to compress into a staging table locally, push smaller staging table across network, then inflate at the destination.

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • Super_Grover (10/23/2016)


    Hi all,

    I must transfer a big table (231k rows not that big, but 735GB big because of blob) to another location by INSERT INTO SELECT FROM...

    It appears that first the whole source table is being loaded before any record is actually inserted into the destination table. But there's not enough diskspace because tempdb just grows and grows.

    ...

    If you're just selecting from one table into another table, then I'm thinking it shouldn't dump into tempdb, unless you're doing an ORDER BY.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Phil Parkin (10/25/2016)


    wtren (10/25/2016)


    You should use bcp

    Please justify this comment.

    Although it was a bit short, it's not a bad idea. I've done similar before. Instead of messing around with loops and all manner of other tricks, you BCP the data out as native format (very fast) and then BCP or BULK INSERT it back in where you want it. If you hold your mouth just right while you're doing it (especially with TF 610), you can even do it in a minimally logged fashion with the Clustered Index already in place.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes, very fast, for multiple tables, you can also open several windows and download different table simultaneously.

  • wtren (10/26/2016)


    Yes, very fast, for multiple tables, you can also open several windows and download different table simultaneously.

    Except, in most SAN environments, that will actually slow things down because you can't control which spindles will come into play and head-thrashing frequently becomes a performance problem for such loads. Even if parallel loads work very fast for today, they might not tomorrow.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SAN normally provide cache to ease it, and SAN is always random io pattern.

Viewing 15 posts - 16 through 30 (of 43 total)

You must be logged in to reply to this topic. Login to reply