Bulk Insert and Inserts so slow! Any suggestions?

  • Hi All,

    I have a situation, described below:

    A bulk insert job runs with the code below:

    BULK INSERT [#TempHolder] FROM '\\MyPath' WITH ( FIELDTERMINATOR = '~', DATAFILETYPE = 'CHAR', ROWTERMINATOR = '\n', KEEPIDENTITY, TABLOCK )

    The table is created: MedicService20060601

    A constraint is created as shown below:

    alter table MedicService20060601 add constraint Test_Const Primary Key Clustered (StartDate, ServiceType, CDRFingeUnique_Service_ID)

    Then I run an Insert into the table with the code below:

    INSERT MedicService20060601(Medic_ID,
       Unique_Service_ID,
       MedicType,
       StartDate,
       StartTime,
       ServiceType,
       ServiceStartDate,
       Location_id, Contact_ID, Controller_ID, ServiceMeth, Client_id,
       Rate_Card_ID,
       NseMedic_ID,
       Invoice_id,
       Status,
       Comp_Cost,
       Charge_Benefit)
      SELECT
       p.Medic_ID,
       p.Unique_Service_ID,
       p.MedicType,
       p.StartDate,
       p.StartTime,
       p.ServiceType,
       Convert(DateTime, p.ServiceStartDate),
       p.spLocation_id,
       p.Contact_ID,
       p.Controller_ID,
       p.spServiceMeth,
       p.spClient_id,
       p.spRate_Card_ID,
       p.spNseMedic_ID,
       p.spInvoice_id,
       p.spStatus,
       p.spComp_Cost,
       p.spCharge_Benefit
      From #TempHolder P
      Left join MedicService20060601 MyInsert ON p.ServiceStartDate = MyInsert.ServiceStartDate
             and p.MedicType = MyInsert.MedicType
             and p.Unique_Service_ID = MyInsert.Unique_Service_ID
      where MyInsert.ServiceStartDate IS NULL
      and p.MedicType & 8 <> 0
      and p.Medic_ID = 948193
      and RTRIM(Convert(Char(10), p.ServiceStartDate, 112)) = '20060601'
      ORDER BY p.ServiceStartDate, p.MedicType, p.Unique_Service_ID

     

    MedicService20060601 has the constraint shown above.

     

    In this scenario, almost 15 million rows will get inserted into MedicService20060601, but it's terribly slow!

     

    Any suggestions are welcome.

     


    Kindest Regards,

    John Burchel (Trainee Developer)

  • I'm sure the problem is not the Bulk Insert, but the Insert..Select/join statement.  That statement is joining and inserting 15 million rows in a single transaction!  I suggest that you look at two options. 

    First, it appears you are doing the Insert..Select/join because you want to eliminate duplicates based on ServiceStartDate, MedicType, and Unique_Service_ID.  How about removing those duplicate rows from the temp table by deleting them using the join?  Then you can insert more cleanly by doing a straight Insert..Select without the join on 15 million rows.

    But you will still have 15 million inserts in a single transaction.  How about batching the inserts?  Perhaps there is something in the data you can use to make quick 10,000-row sets, and insert those rows wrapping BEGIN TRAN and COMMIT TRAN around them.  If no appropriate data field exists to do that, add an identity column to the temp table after the import (and after the delete above), then use that to insert in 10k-row transaction blocks.

    Edited: I noticed your other similar posts.  If this is the same issue, the problem is not the BULK INSERT into a #temp table, but the logged INSERT..SELECT/join.  Bulk Insert into a temp table will not be affected by primary keys on another table.  Very large Bulk Inserts can be made faster using the BATCHSIZE parameter to break them into smaller batches.  INSERTs will be logged no matter what the recovery model is; the inserts are written to the tlog until the transaction completes, and then, if the database uses Simple recovery, the log records are removed from the tlog.  But the tlog still has to grow to hold those 15 million inserts, so if you have been shrinking the log, SQL Server has to grow it again, extent by extent.  Best is to leave the log large enough to hold all the transactions.
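    For illustration only, BATCHSIZE would be added to your original statement roughly like this (50000 is just an example value to tune against your data):

    BULK INSERT [#TempHolder] FROM '\\MyPath'
    WITH ( FIELDTERMINATOR = '~', DATAFILETYPE = 'CHAR', ROWTERMINATOR = '\n',
           KEEPIDENTITY, TABLOCK,
           BATCHSIZE = 50000 )  -- each 50,000-row batch is committed as its own transaction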

    Hope this helps



    Mark

    Thanks for your post, Mark.

    It's true that I don't want duplicates getting into the table, but the thing is that the bulk insert goes straight into the temp table, which leaves me with no room to remove duplicates.

    Also, if I were to do a transaction that inserts 10,000 rows at a time, how does that work? Can you provide some sample code?

    How about the clustered primary key index? Does that stay or go? Does it have any effect?

    Thanks in advance


    Kindest Regards,

    John Burchel (Trainee Developer)

  • Re: deleting duplicates.  I mean to delete the duplicates from the temp table after the bulk insert, by performing the join.

    DELETE P
      From #TempHolder P INNER join MedicService20060601 MyInsert 
      ON p.ServiceStartDate = MyInsert.ServiceStartDate 
             and p.MedicType = MyInsert.MedicType
             and p.Unique_Service_ID = MyInsert.Unique_Service_ID 
      where p.MedicType & 8 <> 0 
      and p.Medic_ID = 948193    
      and RTRIM(Convert(Char(10), p.ServiceStartDate , 112 )) = '20060601' 
    ALTER TABLE #TempHolder
      ADD rowid int IDENTITY(1,1)
    DECLARE @range_begin int
    DECLARE @range_end int
    
    DECLARE @rowcnt int
    SET @range_begin = 1
    SET @range_end = 10000
    SELECT @rowcnt = COUNT(*) FROM #TempHolder
    IF (@range_end > @rowcnt) SET @range_end = @rowcnt
    WHILE @range_begin <= @rowcnt
      BEGIN
        BEGIN TRAN
         INSERT MedicService20060601(Medic_ID,
           Unique_Service_ID,
           MedicType,
           StartDate, 
           StartTime,
           ServiceType,
           ServiceStartDate,
           Location_id,Contact_ID,Controller_ID,ServiceMeth,Client_id,
           Rate_Card_ID,
           NseMedic_ID,
           Invoice_id,
           Status,
           Comp_Cost,
           Charge_Benefit)
          SELECT 
           p.Medic_ID,
           p.Unique_Service_ID,
           p.MedicType,
           p.StartDate,
           p.StartTime, 
           p.ServiceType,
           Convert(DateTime,p.ServiceStartDate),
           p.spLocation_id,
           p.Contact_ID,
           p.Controller_ID,
           p.spServiceMeth,
           p.spClient_id,
           p.spRate_Card_ID,
           p.spNseMedic_ID,
           p.spInvoice_id,
           p.spStatus,
           p.spComp_Cost,
           p.spCharge_Benefit
          From #TempHolder P
          WHERE rowid BETWEEN @range_begin AND @range_end
        COMMIT TRAN
        SET @range_begin = @range_end + 1
        SET @range_end = @range_end + 10000
        IF (@range_end > @rowcnt) SET @range_end = @rowcnt
      END

    This way, your INSERT is not performing a joined query just to get a set of data to insert; your #TempHolder table is already prepared and ready to go.  And the range variables will break the insert down into 10,000-row batches.

    RE Clustered Index: It depends on what the clustered index is.  If the clustered index is on an ID field or a date, so that all 15 million rows are added to the end of the table, then there will likely be no effect beyond what is necessary for allocating table space.  If the clustered index is on something like last name, where the 15 million rows will get added into the middle of the existing rows, then you will likely have a lot of page splits, page data fragmentation, and very poor performance.  This is certainly something to consider.



    Mark

  • Mark,

    We have been dealing with the batch loading issue too.  I like your method.

    Regarding the BEGIN/COMMIT TRAN piece, somebody suggested to us that we simply issue a CHECKPOINT statement after each INSERT..SELECT instead.  Any thoughts regarding that?
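    For context, the suggestion looks roughly like this (a sketch only, using hypothetical #Staging and dbo.TargetTable names rather than our real schema, and assuming the database is in SIMPLE recovery):

    DECLARE @range_begin int, @range_end int, @rowcnt int
    SELECT @rowcnt = COUNT(*) FROM #Staging
    SET @range_begin = 1
    SET @range_end = 10000
    IF (@range_end > @rowcnt) SET @range_end = @rowcnt
    WHILE @range_begin <= @rowcnt
      BEGIN
        -- each INSERT runs as its own implicit transaction (no explicit BEGIN/COMMIT TRAN)
        INSERT dbo.TargetTable (id, val)
          SELECT id, val
          FROM #Staging
          WHERE rowid BETWEEN @range_begin AND @range_end
        -- under SIMPLE recovery, the checkpoint lets the committed log records be truncated,
        -- so the tlog stays small between batches
        CHECKPOINT
        SET @range_begin = @range_end + 1
        SET @range_end = @range_end + 10000
        IF (@range_end > @rowcnt) SET @range_end = @rowcnt
      END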

    Thanks much.
