November 7, 2002 at 2:19 pm
Hey there,
Here's the massive import scenario.
Every day I get 3 files from a client. Using Bulk Insert, I import them into 3 staging tables in a SQL 7.0 database. One of the files produces roughly 50,000 rows, one roughly 100,000 rows, and the third roughly 1,500,000 rows.
From these 3 staging tables, I move the newly imported data into 3 parallel staging tables. The only difference between the two sets of tables is that the second set has an ID column. The reason this is important, and the reason this becomes complicated, is the following stipulation:
You cannot consider the data ready to summarize until all three files from the client for a given day have all been successfully imported.
Once the data has been imported, another application summarizes the data from the three tables into another table on another server. It accesses the data through 3 views (listed below). When it finishes summarizing the data sets, it deletes the rows it just summarized (using the stored procedure usp_ClearSummarizedData). This process is repeated each day.
For instance, on 11-01-2002 I may have received and imported file1 and file2, but not file3;
on 11-02-2002 I may have received and imported file2 and file3, but not file1;
on 11-03-2002 I may have received and imported file1 and file3, but not file2;
on 11-04-2002 I received and imported all three files.
In this scenario I would only be allowed to summarize data from 11-04-2002. However, I would have to tiptoe around all the data imported on 11-01, 11-02, and 11-03.
The views listed below describe what data is allowed to be summarized.
create table tPrimaryStage1 (col1 varchar(255), col2 varchar(255), col3 varchar(255))
create table tPrimaryStage2 (col1 varchar(255), col2 varchar(255), col3 varchar(255))
create table tPrimaryStage3 (col1 varchar(255), col2 varchar(255), col3 varchar(255))
create table tSecondaryStage1 (ID int, col1 varchar(255), col2 varchar(255), col3 varchar(255))
create table tSecondaryStage2 (ID int, col1 varchar(255), col2 varchar(255), col3 varchar(255))
create table tSecondaryStage3 (ID int, col1 varchar(255), col2 varchar(255), col3 varchar(255))
/* The [Status] field is set to 1 when all three files for a given day [Date] have been successfully imported into the SecondaryStaging tables.*/
create table tImportStatus (ID int IDENTITY, [Date] varchar(20), [Status] bit)
create view DataSet1
AS
select S.* from tSecondaryStage1 S
join tImportStatus I on I.ID = S.ID
where I.[Status] = 1 -- ready for summarization
GO
create view DataSet2
AS
select S.* from tSecondaryStage2 S
join tImportStatus I on I.ID = S.ID
where I.[Status] = 1 -- ready for summarization
GO
create view DataSet3
AS
select S.* from tSecondaryStage3 S
join tImportStatus I on I.ID = S.ID
where I.[Status] = 1 -- ready for summarization
GO
create procedure usp_ClearSummarizedData
AS
-- remove the rows for any day that has been fully imported and summarized
delete from tSecondaryStage1
from tSecondaryStage1 S
join tImportStatus I on I.ID = S.ID
where I.Status = 1
delete from tSecondaryStage2
from tSecondaryStage2 S
join tImportStatus I on I.ID = S.ID
where I.Status = 1
delete from tSecondaryStage3
from tSecondaryStage3 S
join tImportStatus I on I.ID = S.ID
where I.Status = 1
GO
Create CLUSTERED INDEX [IX_Stage1] ON [tSecondaryStage1]([ID]) WITH FILLFACTOR = 90 ON [PRIMARY]
Create CLUSTERED INDEX [IX_Stage2] ON [tSecondaryStage2]([ID]) WITH FILLFACTOR = 90 ON [PRIMARY]
Create CLUSTERED INDEX [IX_Stage3] ON [tSecondaryStage3]([ID]) WITH FILLFACTOR = 90 ON [PRIMARY]
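To tie those pieces together, here is roughly how one day flows through the tables (a simplified sketch; the date literal and the @ID variable are just for illustration):
-- register the day in tImportStatus and pick up its new ID
declare @ID int
insert into tImportStatus ([Date], [Status]) values ('11-04-2002', 0)
select @ID = @@IDENTITY
-- as each file arrives: bulk insert into its primary table, stamp the rows with
-- the day's ID while copying them across, then clear the primary table
insert into tSecondaryStage1 select @ID, * from tPrimaryStage1
truncate table tPrimaryStage1
-- ...repeat for tPrimaryStage2 and tPrimaryStage3 as their files arrive...
-- only once all three files for the day have been imported successfully:
update tImportStatus set [Status] = 1 where ID = @ID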
This all works logically. However, the insert into the secondary staging tables, the selects through the views, and the delete procedure (usp_ClearSummarizedData) that is run after day(s) have been summarized all take an enormously long time (about 10 minutes to perform a select or a delete).
Is this the nature of dealing with 1.5 million rows?
Any ideas how to speed this up?
November 7, 2002 at 2:57 pm
Actually I think your issue is more the nature of dealing with the DELETE statement.
Bulk Insert is minimally logged, so that part of the process will go much quicker than a regular INSERT.
However, a DELETE is a fully logged operation, so every deleted row gets written to the transaction log, which grows and grows until the delete is done; then, depending on whether you have truncate log on checkpoint set for the DB, the log may truncate itself. Now, if the DELETE is removing all the data, consider using TRUNCATE TABLE instead. It is not logged row by row and will complete in a short period of time. Even if you have a few rows to keep, you might move them to another table, truncate the big table, move them back, and then truncate the table you used to hold them.
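Something along these lines, for example (just a rough sketch; tHolding is a made-up work table with no indexes, and you would repeat the same steps for the other two staging tables):
create table tHolding (ID int, col1 varchar(255), col2 varchar(255), col3 varchar(255))
-- keep only the rows that have not been summarized yet
insert into tHolding
select S.* from tSecondaryStage1 S
where S.ID not in (select ID from tImportStatus where [Status] = 1)
-- wipe the big table in one cheap operation, then put the kept rows back
truncate table tSecondaryStage1
insert into tSecondaryStage1 select * from tHolding
truncate table tHolding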
November 7, 2002 at 3:21 pm
Thanks for that suggestion... I was kind of hoping that I could use some type of conditional truncate on that table, but I guess not.
I already use the truncate table command when clearing tPrimaryStage: immediately after I bulk insert into it, I run an "insert into tSecondaryStage select @ID, * from tPrimaryStage" statement, and then I truncate tPrimaryStage.
Overall, do you think that it would be faster to complete the delete from tSecondaryStage
via method 1:
1.) delete from tSecondaryStage where ID = @ID
or via method 2:
1.) insert rows that I want to keep into tHoldingPattern
2.) truncate table tSecondaryStage
3.) insert all the rows from tHoldingPattern back into tSecondaryStage
I suppose that there's only one way to find out... I'll test it out and let you know.
Another costly time issue in the whole process:
Bulk Inserting into the tPrimaryStage table takes about 6 minutes (for approx 1 million rows)
Then Inserting that data into tSecondaryStage with the ID column takes about 4.5 minutes
Is there any faster/tricky way you can think of to add the data in the PrimaryTable to the SecondaryTable?
I have been toying around with this:
1.) get the next ID that should be applied to this dataset
2.) using Perl, append the id as the first column in each row
3.) bulk insert directly into the secondary staging table bypassing the primarystaging table altogether
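For step 3, I am picturing something like this (the file path and terminators are placeholders for illustration):
-- load the pre-stamped file straight into the ID-bearing staging table
BULK INSERT tSecondaryStage1
FROM 'c:\imports\file1_with_id.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)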
Thoughts?
Thanks,
Joel
November 7, 2002 at 4:01 pm
quote:
Overall, do you think that it would be faster to complete the delete from tSecondaryStage via method 1:
1.) delete from tSecondaryStage where ID = @ID
or via method 2:
1.) insert rows that I want to keep into tHoldingPattern
2.) truncate table tSecondaryStage
3.) insert all the rows from tHoldingPattern back into tSecondaryStage
That all depends on the amount of remaining data. I would not put an index on the holding table, so that the inserts into it run as fast as possible. Also consider my answer to your other question below.
quote:
Is there any faster/tricky way you can think of to add the data in the PrimaryTable to the SecondaryTable? I have been toying around with this:
1.) get the next ID that should be applied to this dataset
2.) using Perl, append the id as the first column in each row
3.) bulk insert directly into the secondary staging table bypassing the primarystaging table altogether
Consider how an index, especially a clustered one, works.
As you enter data, it is placed into the table in the order of the clustered index.
Once a page fills, it splits, and those page splits consume a lot of resources when you are loading a lot of rows.
What I suggest is: script out your clustered index, drop it, do the bulk insert, then add the clustered index back.
That way all the data is loaded first and the sorting is done once at the end, not during the load.
Note: this does not always help, but in most cases it can make a difference.
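In other words, wrap the load something like this (sketch only, reusing the index definition you posted above):
-- drop the clustered index so the load is a plain append
DROP INDEX tSecondaryStage1.IX_Stage1
-- ...BULK INSERT / INSERT...SELECT the day's rows here...
-- rebuild the index once, after all the rows are in
CREATE CLUSTERED INDEX [IX_Stage1] ON [tSecondaryStage1]([ID]) WITH FILLFACTOR = 90 ON [PRIMARY]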
Also, taking the first table out of the process, as you said you were playing with, should offer a further performance gain.
And another, lesser item that can still offer big gains: make sure your database is not set to auto_shrink, and keep the file size and free space large enough that fewer waits occur for the database to grow during a load. This will take a bit of monitoring of how the size changes to get an idea of what is best for you.
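For example (database and file names, and the size, are just placeholders; check sp_helpdb for yours):
-- make sure the database is not shrinking itself between loads
EXEC sp_dboption 'StagingDB', 'autoshrink', 'false'
-- pre-size the data file so the load is not stalled by repeated autogrows
ALTER DATABASE StagingDB
MODIFY FILE (NAME = 'StagingDB_Data', SIZE = 2000MB)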
November 7, 2002 at 5:12 pm
To follow up on the experimentation:
The strange thing is that the timestamps (in milliseconds) printed below indicate that the first delete method runs much faster than the second delete method.
However, if you look at the query execution plan, it would have you think that the first method takes 97% of the cost while the other takes only 3%... clearly not true.
What's up with that?
-Joel
Here's the script:
if object_id('tTester1') is not null drop table tTester1
if object_id('tTester2') is not null drop table tTester2
if object_id('tHoldingPattern') is not null drop table tHoldingPattern
create table tTester1 (ID int, col1 int)
create table tTester2 (ID int, col1 int)
create table tHoldingPattern (ID int, col1 int)
insert into tTester1
select 1, a.ones + b.tens + c.hundreds + d.thousands + e.tenthousands + f.hundredthousands from
(select 0 as ones UNION select 1 UNION select 2 UNION select 3 UNION select 4 UNION select 5 UNION select 6 UNION select 7 UNION select 8 UNION select 9) as a
join (select 0 as tens UNION select 10 as tens UNION select 20 UNION select 30 UNION select 40 UNION select 50 UNION select 60 UNION select 70 UNION select 80 UNION select 90) as b on 1=1
join (select 0 as hundreds UNION select 100 UNION select 200 UNION select 300 UNION select 400 UNION select 500 UNION select 600 UNION select 700 UNION select 800 UNION select 900) as c on 1=1
join (select 0 as thousands UNION select 1000 UNION select 2000 UNION select 3000 UNION select 4000 UNION select 5000 UNION select 6000 UNION select 7000 UNION select 8000 UNION select 9000) as d on 1=1
join (select 0 as tenthousands UNION select 10000 UNION select 20000 UNION select 30000 UNION select 40000 UNION select 50000 UNION select 60000 UNION select 70000 UNION select 80000 UNION select 90000) as e on 1=1
join (select 0 as hundredthousands UNION select 100000 UNION select 200000 UNION select 300000 UNION select 400000 UNION select 500000 UNION select 600000 UNION select 700000 UNION select 800000 UNION select 900000) as f on 1=1
update tTester1
set ID = 2 where col1 >199999
update tTester1
set ID = 3 where col1 >399999
update tTester1
set ID = 4 where col1 >599999
update tTester1
set ID = 5 where col1 >799999
insert into tTester2
select * from tTester1
create clustered index [IX_tTester1] on tTester1 ([ID]) with fillfactor=90
create clustered index [IX_tTester2] on tTester2 ([ID]) with fillfactor=90
declare @timeStamp datetime, @showTimeStamp int
select @timeStamp = convert(varchar,getDate(),109), @showTimeStamp = 1
if (@showTimeStamp = 1)
print datediff(ms,@timeStamp,convert(varchar,getDate(),109))
set @timeStamp = convert(varchar,getDate(),109)
--method 1
print 'starting ''delete from tTester1 where ID =5'''
delete from tTester1 where ID =5
if (@showTimeStamp = 1)
print datediff(ms,@timeStamp,convert(varchar,getDate(),109))
set @timeStamp = convert(varchar,getDate(),109)
--method 2
print 'starting ''insert into tHoldingPattern'''
insert into tHoldingPattern
select * from tTester2
where ID !=5
if (@showTimeStamp = 1)
print datediff(ms,@timeStamp,convert(varchar,getDate(),109))
set @timeStamp = convert(varchar,getDate(),109)
print 'starting ''truncate table tTester2'''
truncate table tTester2
if (@showTimeStamp = 1)
print datediff(ms,@timeStamp,convert(varchar,getDate(),109))
set @timeStamp = convert(varchar,getDate(),109)
print 'starting ''insert into tTester2
select * from tHoldingPattern'''
insert into tTester2
select * from tHoldingPattern
if (@showTimeStamp = 1)
print datediff(ms,@timeStamp,convert(varchar,getDate(),109))
set @timeStamp = convert(varchar,getDate(),109)
truncate table tHoldingPattern
November 7, 2002 at 5:25 pm
I'll look at it tomorrow when I am at a machine with a server.
November 8, 2002 at 6:33 am
See if this helps with the reasoning (the numbers to keep in mind are the 800,000 rows method two has to move versus the 200,000 rows method one deletes).
quote:
To follow up on the experimentation: The strange thing about this is that the time stamp (milliseconds) that I printed here indicates that the first delete method runs much faster than the second delete method
However, if you look at a query execution plan, it would have you think that the first method would take 97% of the time while the other would take up 3%... clearly not true.
First, relative batch-cost percentages are not always great things to go by, and I just ignore them; I prefer to use actual times.
Now, as for the first delete method being faster than the second, let's look at my outputs from your code and I will explain why.
Method one outputs
0
starting 'delete from tTester1 where ID =5'
(200000 row(s) affected)
7753
And method two outputs
10
starting 'insert into tHoldingPattern'
(800000 row(s) affected)
20740
starting 'truncate table tTester2'
30
starting 'insert into tTester2
select * from tHoldingPattern'
(800000 row(s) affected)
43473
Note the number of records affected in each: method one only affects 200000 records during processing, but method two affects 800000 records to fill the holding table and then those same 800000 again to move them back into tTester2.
So method one has to deal with far fewer records either way. In comparison, though, if you were removing all of the records, you can see where truncating tTester2 instead of deleting would run much faster.
You could, however, build a routine that decides which approach to use based on the counts. Note: I have not played with this in depth; it is just the theoretical concept.
Ex.
First get the percentage of rows that would be removed
DECLARE @per AS INT
SELECT @per = CAST(CAST(SUM(CASE WHEN [ID] = 5 THEN 1 ELSE 0 END) AS decimal(10,2)) / COUNT([ID]) * 100 AS int) FROM tTester1
IF @per > 76
BEGIN
	-- most of the table is going: park the keepers, truncate, move them back
	INSERT INTO tHoldingPattern SELECT * FROM tTester1 WHERE [ID] <> 5
	TRUNCATE TABLE tTester1
	INSERT INTO tTester1 SELECT * FROM tHoldingPattern
	TRUNCATE TABLE tHoldingPattern
END
ELSE
BEGIN
	-- only a minority is going: a plain delete is cheaper
	DELETE FROM tTester1 WHERE [ID] = 5
END