Get over 500,000 records into a table

  • Hello Everyone, hope everyone is well and staying safe!

    I have an issue that I've been working on all week.  I hope someone can put a different perspective on my process to get it to succeed.  I currently have my blinders on and cannot think of anything else to try.

    I have the following scenario:  Database (360GB), Query that has 3 inner joins and one of those tables is 206 GB in size.  That table also contains a varbinary(max) column that hold a blob.  The record count returned by the query is 583,393 which needs to be inserted into a table within the same database.

    I have tried the following:

    - Perform a SELECT INTO

    - Tried the transfer with SSIS Data Flow task, Fast Load (adjusting the "Row per batch" and "Maximum insert commit size")

    - Tried the Export Wizard

    - requested extra disk space, up to a TB

    - requested more memory allocation 64GB for a total of 96GB

    - set max memory set to 92GB.

    Unfortunately, I cannot get it to complete.  I get to approx. 396,000 row insert into the table and it crashes with a memory error and this error.

    Information 0x4004800c: Data Flow Task 1: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 0 buffers were considered and 0 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.

    (SQL Server Import and Export Wizard)

    Information 0x4004800f: Data Flow Task 1: Buffer manager allocated 3 megabyte(s) in 1 physical buffer(s).

    (SQL Server Import and Export Wizard)

    Information 0x40048010: Data Flow Task 1: Component "Source - Query" (1) owns 3 megabyte(s) physical buffer.

    (SQL Server Import and Export Wizard)

    Error 0x80070070: Data Flow Task 1: (SQL Server Import and Export Wizard)

    Error 0xc0208266: Data Flow Task 1: (SQL Server Import and Export Wizard)

    Error 0xc0208265: Data Flow Task 1: Failed to retrieve long data for column "FileData".

    (SQL Server Import and Export Wizard)

    Error 0xc020901c: Data Flow Task 1: There was an error with Source - Query.Outputs[OLE DB Source Output].Columns[FileData] on Source - Query.Outputs[OLE DB Source Output]. The column status returned was: "DBSTATUS_UNAVAILABLE".

    (SQL Server Import and Export Wizard)

    Error 0xc0209029: Data Flow Task 1: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR. The "Source - Query.Outputs[OLE DB Source Output].Columns[FileData]" failed because error code 0xC0209071 occurred, and the error row disposition on "Source - Query.Outputs[OLE DB Source Output].Columns[FileData]" specifies failure on error. An error occurred on the specified object of the specified component. There may be error messages posted before this with more information about the failure.

    (SQL Server Import and Export Wizard)

    Error 0xc0047038: Data Flow Task 1: SSIS Error Code DTS_E_PRIMEOUTPUTFAILED. The PrimeOutput method on Source - Query returned error code 0xC0209029. The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.

    (SQL Server Import and Export Wizard)

    Any advice and input would be greatly appreciated.  Thanks in advance Rob.

  • don't use SSIS for it - if the database is on same server just do a straight insert into it - if it is on a different server then use powershell or C# with a bulkinsert object. (powershell example here https://devblogs.microsoft.com/scripting/use-powershell-to-copy-a-table-between-two-sql-server-instances/

     

    SSIS when dealing with LOBS will be very very slow and will always allocate swap and memory to deal with them in a way that is not healthy for any system.

  • Thanks Frederico for your response.  I did try "Select Into" and "Insert Into" earlier but never finished within a 5 hr. window so I finally killed the query.  I haven't tried it since the server team allocated more disk space and memory so I just ran the insert and it's been executing for about 25 minutes as I type this message.  Hopefully it will be successful.

  • I have to ask... why are you duplicating more than a half million rows from a 206GB table into the SAME database???  If we knew that, we might be able to make a better suggestion that doing the duplication.

    That, notwithstanding and according to the error you got, it DID make it more than half way.  Sooooooooo.... "Divide'n'Conquer".  Don't do it with a single query.  Do it in two or three.

    Also, what recovery model are you using during this evolution?  And, does the target need a clustered index?  If the database is in the FULL Recovery Model, try taking a log backup, shift the database to the BULK LOGGED Recovery Model, do the copy, shift back to the FULL Recovery Model, and take another log backup to minimize the period where you can't do a Point-in-Time restore.  That might allow you to do it all in a single copy thanks to Minimal Logging.

    If you come back with whether or not the new table needs a Clustered Index or not, I can show you a trick to populate the table with Minimal Logging even though the Clustered Index is already in, place.  It would also be handy if you posted the CREATE TABLE (including all indexes and constraints) for the original source table as well as the result of a "SAMPLED" run of sys.dm_db_index_physical_stats() for the original table.

    Right now, email notifications as to responses on posts like this one are broken and I don't check this site for responses but once every 2 to 6 hours, so be patient.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hello Jeff, well this is actually a migration project of a vendor database we are porting the data to a new software product.  They want this data in the table below.  I was only following the instruction from the PM...which I don't think he had a clue how much data the database contained but now he does.  This was supposed to be done last weekend but still fighting it because they said it is just a SELECT INTO table query.

    I have backed up the database because PROD resides on a cluster so we could not perform all the disk and memory allocations without an impact.  I restored the database on one of our DEV server, which is where I am doing all the work.  I don't need any logging so the database is in Simple recovery model...all I need is to get the data described in the initial post into the table for the vendor.  If you are looking for the clustered indexes on the table in which the data is being selected, I would need pull this from 4 separate tables.  Below is the table structure in which the data is going.

    CREATE TABLE [dbo].[UserDocuments](

    [fk_SysId] [INT] NOT NULL,

    [UpdatedBy] [NVARCHAR](101) NOT NULL,

    [DocumentName] [NVARCHAR](100) NOT NULL,

    [Keyword] [NVARCHAR](200) NOT NULL,

    [FileName] [NVARCHAR](100) NOT NULL,

    [FileData] [VARBINARY](MAX) NOT NULL,

    [DateCreated] [DATETIME2](7) NOT NULL,

    [DocumentId] [INT] NOT NULL

    )ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

    GO

    I would be grateful if you can explain how to break it up or anyway to get this accomplished.

  • I would try doing the insert in batches based on the primary key or another integer.  Create a while loop based off the key field and insert a smaller set of data within the while loop.  I've done this in the past for large inserts and it works well.  You can probably find an example on this site. or another site.  Hope this helps!

  • --First, I'd strongly urge you to separate the binary data from the main table, which you can do using sp_tableoption.

    --Page-compressing the main table will save considerable space, although it may somewhat slower to load.

    --Really all those strings should be encoded but I don't think you really want to do that at this point.

    CREATE TABLE [dbo].[UserDocuments](
    [fk_SysId] [INT] NOT NULL,
    [UpdatedBy] [NVARCHAR](101) NOT NULL,
    [DocumentName] [NVARCHAR](100) NOT NULL,
    [Keyword] [NVARCHAR](200) NOT NULL,
    [FileName] [NVARCHAR](100) NOT NULL,
    [FileData] [VARBINARY](MAX) NOT NULL,
    [DateCreated] [DATETIME2](7) NOT NULL,
    [DocumentId] [INT] NOT NULL
    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
    GO
    EXEC sys.sp_tableoption 'dbo.UserDocuments', 'large value types out of row', 1 --<<--
    --if fk_sysId by itself is not unique, add DocumentId OR DateCreated.
    CREATE UNIQUE CLUSTERED INDEX [UserDocuments__CL]
    ON dbo.UserDocuments ( fk_SysId )
    WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR = 99, SORT_IN_TEMPDB = ON )
    ON [PRIMARY];

    --Second, PRE-ALLOCATE enough space to the log space to handle the full load(s). This will drastically speed up the load vs the log having to dynamically extend.

    --Finally, I (we?) would like to see the actual query that loads the table. Assuming the query can write the rows as they are processed, I'm not sure why SQL is running out of memory trying to do the load. Looking at the query might help determine this. Is there something in the query that forces SQL to process everything before it starts to load the new table?

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Here is the query....provided by the vendor.

    SELECT tblCon.fk_SysId,

    CONCAT (tblPasswd.FName, ' ', tblPasswd.LName) AS UpdatedBy,

    tblDoc.DocumentName,

    tblDoc.Keyword,

    tblDoc.FileName,

    tblDoc.FileData,

    tblDoc.DateCreated,

    tblDoc.DocumentId

    FROM tblCon

    INNER JOIN tblConDoc ON tblCon.ContextId = tblConDoc.fk_ContextId

    INNER JOIN tblDoc ON tblConDoc.fk_DocumentId = tblDoc.DocumentId

    INNER JOIN tblPasswd ON tblDoc.LastUpdatedBy = tblPasswd.user_id;

    • This reply was modified 2 years, 11 months ago by  rmahon.
  • karen.ferrara wrote:

    I would try doing the insert in batches based on the primary key or another integer.  Create a while loop based off the key field and insert a smaller set of data within the while loop.  I've done this in the past for large inserts and it works well.  You can probably find an example on this site. or another site.  Hope this helps!

    This.

    When record count of the SELECT = 0, exit the loop.  TEST FIRST in DEV. Make sure you're not inserting the same group over & over.

    INSERT INTO My_Table

    SELECT top 100000 FLD1, FLD2, FLD3

    from Other_Table OTH

    Where not exists (SELECT * from My_Table MY where MY.KEY = OTH.KEY)

    ORDER BY KEY fields  --- Can be useful when testing to make sure you get the next group.

     

     

    • This reply was modified 2 years, 11 months ago by  homebrew01.
    • This reply was modified 2 years, 11 months ago by  homebrew01.
    • This reply was modified 2 years, 11 months ago by  homebrew01.
    • This reply was modified 2 years, 11 months ago by  homebrew01.
    • This reply was modified 2 years, 11 months ago by  homebrew01.
    • This reply was modified 2 years, 11 months ago by  homebrew01.
    • This reply was modified 2 years, 11 months ago by  homebrew01.
  • That load query is fairly straightforward.  I think SQL should be able to write output as it is generated, reducing memory requirements.  Would have to see the query plan to be 100% sure.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Hello, thanks you all for providing input.  I am going through all the suggestions.  I'm attempting homebrew01 suggestion now.  A quick note to ScottPletcher, yes look pretty straight forward...I started the exact query this morning while I tended to other task and went back and looked (actually forgot about it) in 1.5 hours and has only select 25,000 rows of the roughly 590,000 needed.  The DEV machines specs are as follows:

    4 cores

    96 GB or memory - 92GB to SQL

    Data drive > 1 TB

    Log drive > 260 GB

     

  • Did you pre-allocate enough log space to handle the entire insert?

    I had a load task that took 4+ hours.  I noticed the initial log size was very small.  After I pre-allocated the log, it finished in 20 mins and change.

    Also, make sure SQL doesn't have to do a sort for the INSERT.  That could force SQL to produce, and store, the entire result set before starting the INSERTs.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Hello Eveyone.  Here is the latest. I decided to try the insert of all the records WITHOUT the incredibly large field [varbinary(max) column...aka the blob of the actual pdf document]. It inserted the full amount 583,393 in less than 10 seconds.  So, we all know the column listed above it the main reason for the amount of time to load, the amount of space taken on the database, the amount memory to process! I am now trying to update the field of the loaded table with the [varbinary(max) column...aka the blob of the actual document] values from the tblDocuments table. I am doing this in batches of 70,000 however I believe I am going to kill the process as it has been running for 1hr. 30min. and the log file is beginning to run out of disk space. I am going to edit my script and try smaller increments.  Does anyone have any suggestion on updating one column in a database in batches that would run more efficient?  Update script...

    DECLARE @Rowcount INT = 1;

    WHILE (@Rowcount > 0)

    BEGIN

    PRINT @Rowcount;

    UPDATE TOP (70000)

    ud

    SET ud.FileData = d.FileData

    FROM dbo.UserDocuments ud

    INNER JOIN dbo.tblDocuments d ON ud.DocumentId = d.DocumentId;

    SET @Rowcount = @@ROWCOUNT;

    CHECKPOINT;

    END;

  • That's a lot of images at one time.  I suggest starting at 10K or even 5K and see how that goes first.

    Btw, be sure to use sp_tableoption before UPDATEing the images.  That will make the image load process vastly more efficient.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Hello Scott, can you elaborate on the sp_tableoption, please?

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply