July 7, 2006 at 3:49 am
Hi All,
I have a situation, described below:
A bulk insert job runs with the code below:
BULK INSERT [#TempHolder] FROM '\\MyPath' WITH ( FIELDTERMINATOR = '~', DATAFILETYPE= 'CHAR', ROWTERMINATOR = '\n', KEEPIDENTITY , TABLOCK )
The table is created: MedicService20060601
A Constraint is created below:
alter table MedicService20060601 add constraint Test_Const Primary Key Clustered( StartDate , ServiceType , CDRFingeUnique_Service_ID)
Then I run an Insert into the table with the code
INSERT MedicService20060601(Medic_ID,
Unique_Service_ID,
MedicType,
StartDate,
StartTime,
ServiceType,
ServiceStartDate,
Location_id,
Contact_ID,
Controller_ID,
ServiceMeth,
Client_id,
Rate_Card_ID,
NseMedic_ID,
Invoice_id,
Status,
Comp_Cost,
Charge_Benefit)
SELECT
p.Medic_ID,
p.Unique_Service_ID,
p.MedicType,
p.StartDate,
p.StartTime,
p.ServiceType,
Convert(DateTime,p.ServiceStartDate),
p.spLocation_id,
p.Contact_ID,
p.Controller_ID,
p.spServiceMeth,
p.spClient_id,
p.spRate_Card_ID,
p.spNseMedic_ID,
p.spInvoice_id,
p.spStatus,
p.spComp_Cost,
p.spCharge_Benefit
From #TempHolder P
Left join MedicService20060601 MyInsert ON p.ServiceStartDate = MyInsert.ServiceStartDate
and p.MedicType = MyInsert.MedicType
and p.Unique_Service_ID = MyInsert.Unique_Service_ID
where MyInsert.ServiceStartDate IS NULL
and p.MedicType & 8 <> 0
and p.Medic_ID = 948193
and RTRIM(Convert(Char(10), p.ServiceStartDate, 112)) = '20060601'
ORDER BY p.ServiceStartDate, p.MedicType, p.Unique_Service_ID
MedicService20060601 has the primary key constraint shown above.
In this scenario, almost 15 million rows get inserted into MedicService20060601, but it is terribly slow!
Any Suggestions welcome
July 7, 2006 at 6:57 am
I'm sure the problem is not the Bulk Insert, but the Insert..Select/join statement. That statement is joining and inserting 15 million rows in a single transaction! I suggest that you look at two options.
First, it appears you are doing the Insert..Select/join because you want to eliminate duplicates based on ServiceStartDate, MedicType, and Unique_Service_ID. How about removing those duplicate rows from the temp table first, by deleting them with that same join? Then you can insert the 15 million rows more cleanly with a straight Insert..Select, without the join.
Second, even after that you will still have 15 million inserts in a single transaction, so consider batching the inserts. Perhaps there is something in the data you can use to carve out quick 10,000-row sets, and insert each set wrapped in BEGIN TRAN and COMMIT TRAN. If no appropriate data field exists, add an identity column to the temp table after the import (and after the delete above), then use it to insert in 10k-row transaction blocks.
Edited: I noticed your other similar posts. If this is the same issue, the problem is not the BULK INSERT into a #temp table, but the logged INSERT..SELECT/join. A Bulk Insert into a temp table will not be affected by primary keys on another table. Very large Bulk Inserts can be made faster by using the BATCHSIZE parameter to commit in smaller batches. The INSERTs will be logged no matter what the recovery model is; the rows are written to the tlog until the transaction completes, and then, under Simple recovery, those log records are cleared from the tlog. But the tlog still has to grow to hold all 15 million inserts, so if you are shrinking the log, SQL Server has to grow it again, extent by extent. Best is to leave the log large enough to hold all the transactions.
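For example, something along these lines would commit the bulk load in smaller batches (the path and terminators are taken from your original post; the 50000 figure is just an illustration, not a recommendation):
BULK INSERT [#TempHolder]
FROM '\\MyPath'
WITH (
    FIELDTERMINATOR = '~',
    DATAFILETYPE = 'CHAR',
    ROWTERMINATOR = '\n',
    KEEPIDENTITY,
    TABLOCK,
    BATCHSIZE = 50000   -- each 50,000-row batch is committed as its own transaction
)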
Hope this helps
Mark
July 7, 2006 at 7:47 am
Thanks for your post Mark,
It's true that I don't want duplicates getting into the table, but the bulk insert goes straight into the temp table, so that leaves me no obvious place to remove duplicates.
Also, if I were to do a transaction that inserts 10,000 rows at a time, how does that work? Can you provide some sample code?
And what about the clustered primary key index: does that stay or go? Does it have any effect?
Thanks in advance
July 7, 2006 at 2:36 pm
Re: deleting duplicates. I meant deleting the duplicates from the temp table after the bulk insert, by performing the join:
DELETE P
FROM #TempHolder P
INNER JOIN MedicService20060601 MyInsert
    ON p.ServiceStartDate = MyInsert.ServiceStartDate
    AND p.MedicType = MyInsert.MedicType
    AND p.Unique_Service_ID = MyInsert.Unique_Service_ID
WHERE p.MedicType & 8 <> 0
    AND p.Medic_ID = 948193
    AND RTRIM(Convert(Char(10), p.ServiceStartDate, 112)) = '20060601'
ALTER TABLE #TempHolder ADD rowid int IDENTITY(1,1)
DECLARE @range_begin int
DECLARE @range_end int
DECLARE @rowcnt int

SET @range_begin = 1
SET @range_end = 10000
SELECT @rowcnt = COUNT(*) FROM #TempHolder
IF (@range_end > @rowcnt) SET @range_end = @rowcnt
WHILE @range_begin <= @rowcnt
BEGIN
    BEGIN TRAN

    INSERT MedicService20060601 (Medic_ID, Unique_Service_ID, MedicType, StartDate, StartTime,
        ServiceType, ServiceStartDate, Location_id, Contact_ID, Controller_ID, ServiceMeth,
        Client_id, Rate_Card_ID, NseMedic_ID, Invoice_id, Status, Comp_Cost, Charge_Benefit)
    SELECT p.Medic_ID, p.Unique_Service_ID, p.MedicType, p.StartDate, p.StartTime,
        p.ServiceType, Convert(DateTime, p.ServiceStartDate), p.spLocation_id, p.Contact_ID,
        p.Controller_ID, p.spServiceMeth, p.spClient_id, p.spRate_Card_ID, p.spNseMedic_ID,
        p.spInvoice_id, p.spStatus, p.spComp_Cost, p.spCharge_Benefit
    FROM #TempHolder P
    WHERE rowid BETWEEN @range_begin AND @range_end

    COMMIT TRAN

    SET @range_begin = @range_end + 1
    SET @range_end = @range_end + 10000
    IF (@range_end > @rowcnt) SET @range_end = @rowcnt
END
This way, your INSERT is not performing a joined query just to get the set of data to insert; your #TempHolder table is already prepared and ready to go, and the range variables break the work into 10,000-row batches.
RE Clustered Index: It depends on what the clustered index is on. If the clustered index is on an id field or a date, so that all 15 million rows are added to the end of the table, then there will likely be no effect beyond what is necessary to allocate table space. If the clustered index is on something like last name, where the 15 million rows get added into the middle of the existing rows, then you will likely have a lot of page splits, page data fragmentation, and very poor performance. This is certainly something to consider.
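If you want to see how much the load fragmented the clustered index afterwards, something like this will show it (DBCC SHOWCONTIG is the tool on 2000/2005; the table name is from your posts):
-- Reports scan density and logical fragmentation for the table and all its indexes
DBCC SHOWCONTIG ('MedicService20060601') WITH ALL_INDEXES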
Mark
June 29, 2007 at 1:36 pm
Mark,
We have been dealing with the batch loading issue too. I like your method.
Regarding the BEGIN/COMMIT TRAN piece, somebody suggested that we simply issue a CHECKPOINT statement after each INSERT/SELECT instead. Any thoughts on that?
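To be concrete, what was suggested to us looks roughly like this, dropped into your loop from above (only a sketch; it assumes the database is in Simple recovery, so each INSERT auto-commits and the checkpoint lets the inactive log space be reused):
WHILE @range_begin <= @rowcnt
BEGIN
    INSERT MedicService20060601 (Medic_ID, Unique_Service_ID, MedicType, StartDate, StartTime,
        ServiceType, ServiceStartDate, Location_id, Contact_ID, Controller_ID, ServiceMeth,
        Client_id, Rate_Card_ID, NseMedic_ID, Invoice_id, Status, Comp_Cost, Charge_Benefit)
    SELECT p.Medic_ID, p.Unique_Service_ID, p.MedicType, p.StartDate, p.StartTime,
        p.ServiceType, Convert(DateTime, p.ServiceStartDate), p.spLocation_id, p.Contact_ID,
        p.Controller_ID, p.spServiceMeth, p.spClient_id, p.spRate_Card_ID, p.spNseMedic_ID,
        p.spInvoice_id, p.spStatus, p.spComp_Cost, p.spCharge_Benefit
    FROM #TempHolder P
    WHERE rowid BETWEEN @range_begin AND @range_end

    CHECKPOINT   -- in place of the explicit BEGIN TRAN / COMMIT TRAN around each batch

    SET @range_begin = @range_end + 1
    SET @range_end = @range_end + 10000
    IF (@range_end > @rowcnt) SET @range_end = @rowcnt
END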
Thanks much.