March 28, 2015 at 6:50 pm
Dear all,
Our system runs on a SQL Server 2012 database which has a table (table_a) with over 10M records. Every day we receive a data file from the previous system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.
The initial solution is:
1 Create a table (table_b) whose structure is the same as table_a
2 Use BCP to import updated records into table_b
3 Remove outdated data in table_a:
delete a from table_a a inner join table_b b on a.key_fields = b.key_fields
4 Append updated or new data into table_a:
insert into table_a select * from table_b
From our tests, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?
George
March 29, 2015 at 1:38 am
George Wei (3/28/2015)
Dear all, Our system runs on a SQL Server 2012 database which has a table (table_a) with over 10M records. Every day we receive a data file from the previous system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.
The initial solution is:
1 Create a table (table_b) whose structure is the same as table_a
2 Use BCP to import updated records into table_b
3 Remove outdated data in table_a:
delete a from table_a a inner join table_b b on a.key_fields = b.key_fields
4 Append updated or new data into table_a:
insert into table_a select * from table_b
From our tests, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?
George
Are there indexes in place to support the join in step 3? Can you post the actual execution plan for a SELECT version of the DELETE?
SELECT (keys for table_a) from table_a inner join table_b on table_a.key_fields = table_b.key_fields
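If no such indexes exist yet, something along these lines might help the join; this is only a sketch with illustrative names, assuming a single key column:
-- Illustrative only: index the join key on both tables so the DELETE/SELECT can seek rather than scan.
CREATE INDEX IX_table_a_key_fields ON table_a (key_fields);
CREATE INDEX IX_table_b_key_fields ON table_b (key_fields);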
For better assistance in answering your questions, please read this.
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop / Jeff Moden
March 29, 2015 at 1:47 am
Quick suggestion, use the MERGE (Transact-SQL) statement
😎
March 29, 2015 at 2:10 pm
George Wei (3/28/2015)
Dear all, Our system runs on a SQL Server 2012 database which has a table (table_a) with over 10M records. Every day we receive a data file from the previous system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.
The initial solution is:
1 Create a table (table_b) whose structure is the same as table_a
2 Use BCP to import updated records into table_b
3 Remove outdated data in table_a:
delete a from table_a a inner join table_b b on a.key_fields = b.key_fields
4 Append updated or new data into table_a:
insert into table_a select * from table_b
From our tests, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?
George
Except for the merge process you describe above, is the table static? In other words, is there anything other than the process above that causes any inserts, updates, or deletes to the table at all?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 30, 2015 at 1:20 am
Effectively, step 3 is redundant: why delete the rows in the target and then reinsert them? You're doubling the workload. It should just be a case of:
1) Import data
2) Update Changed Data
3) Insert New Data.
I agree with Eirikur; it sounds like a good candidate for MERGE.
I would also suggest that the key on both tables is clustered; it should make the process significantly faster. So, something like:
MERGE table_a AS Dest   -- assumes table_a stores a persisted CheckSum column, as the UPDATE/INSERT below imply
USING
(
    SELECT *
         , HASHBYTES('MD5', col1 + '|' + col2 + '|' + CONVERT(VARCHAR(100), col3) + .....) AS [CheckSum]
    FROM table_b
) AS Source
ON Source.[Key] = Dest.[Key]
WHEN MATCHED AND Source.[CheckSum] <> Dest.[CheckSum] THEN
    UPDATE SET
        Dest.col1 = Source.col1
       ,Dest.col2 = Source.col2
       ,Dest.col3 = Source.col3
       --:::::::::::::::::
       ,Dest.[CheckSum] = Source.[CheckSum]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Key], col1, col2, col3, ........., [CheckSum])
    VALUES (Source.[Key], Source.col1, Source.col2, Source.col3, ......., Source.[CheckSum]);
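And, purely as a sketch of the clustered-key suggestion (assuming a single key column named [Key]; drop any existing clustered index first and adjust names to the real schema):
-- Illustrative only: cluster both tables on the merge key so the join can be resolved as an ordered merge.
CREATE UNIQUE CLUSTERED INDEX CIX_table_a_Key ON table_a ([Key]);
CREATE UNIQUE CLUSTERED INDEX CIX_table_b_Key ON table_b ([Key]);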
----
@pietlinden, there would be no benefit in partitioning and then using SWITCH, as you cannot guarantee that the data being altered will all be in the same partition, and 10 million rows really doesn't justify the additional overhead of partitioning.
_________________________________________________________________________
SSC Guide to Posting and Best Practices
March 30, 2015 at 10:12 am
Jason,
I've changed the code from delete & insert to MERGE, and I'll get the test result tomorrow.
Why do you prefer a CLUSTERED index to a NON-CLUSTERED one? I guess it would cost more I/O while importing data from the data file, since the data in the file is not ordered.
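If the import I/O does turn out to be the problem, I suppose table_b could be loaded as a heap and the clustered index built only after the load; just a sketch, with illustrative names:
-- Illustrative only: bulk load into a heap, then build the clustered index once the data is in.
TRUNCATE TABLE table_b;
-- ... BCP import into table_b happens here ...
CREATE UNIQUE CLUSTERED INDEX CIX_table_b_key_fields ON table_b (key_fields);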
Jason-299789 (3/30/2015)
Effectively, step 3 is redundant: why delete the rows in the target and then reinsert them? You're doubling the workload. It should just be a case of: 1) Import data
2) Update Changed Data
3) Insert New Data.
I agree with Eirikur; it sounds like a good candidate for MERGE.
I would also suggest that the key on both tables is clustered; it should make the process significantly faster. So, something like:
----
@pietlinden, there would be no benefit in partitioning and then using SWITCH, as you cannot guarantee that the data being altered will all be in the same partition, and 10 million rows really doesn't justify the additional overhead of partitioning.
March 30, 2015 at 10:28 am
Jason-299789 (3/30/2015)
I would also suggest that the key on both tables is clustered; it should make the process significantly faster. So, something like:
"It Depends". Unless the key is ever increasing, Inserts could suffer dramatically due to page splits no matter how they are accomplished.
And I'm not so sure that MERGE will provide anything substantial in the area of performance. To be honest, I'm also not sure that it won't, so it's worth a try even if my hypothesis is that it won't. I don't use MERGE in SQL Server because there have been way too many reported problems with it, and I can tolerate doing "ol' fashioned", yet effective "upserts" until the reported problems seriously decline.
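Just to illustrate what I mean by an ol' fashioned upsert, it's nothing more than an UPDATE of the matching rows followed by an INSERT of the missing ones; a rough sketch only, with illustrative column names:
-- Illustrative only: classic two-statement upsert in one transaction.
BEGIN TRANSACTION;

UPDATE a
   SET a.col1 = b.col1,
       a.col2 = b.col2
  FROM table_a AS a
  JOIN table_b AS b
    ON a.key_fields = b.key_fields;

INSERT INTO table_a (key_fields, col1, col2)
SELECT b.key_fields, b.col1, b.col2
  FROM table_b AS b
 WHERE NOT EXISTS (SELECT 1 FROM table_a AS a WHERE a.key_fields = b.key_fields);

COMMIT;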
--Jeff Moden
Change is inevitable... Change for the better is not.
March 30, 2015 at 10:51 am
George Wei (3/28/2015)
Dear all, Our system runs on a SQL Server 2012 database which has a table (table_a) with over 10M records. Every day we receive a data file from the previous system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.
The initial solution is:
1 Create a table (table_b) whose structure is the same as table_a
2 Use BCP to import updated records into table_b
3 Remove outdated data in table_a:
delete a from table_a a inner join table_b b on a.key_fields = b.key_fields
4 Append updated or new data into table_a:
insert into table_a select * from table_b
From our tests, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?
George
A separate step 3 doesn't seem necessary. Can't you just change step 2 to only insert non-outdated data to begin with? It's wasted overhead to insert the rows just to immediately delete them.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
March 30, 2015 at 11:20 am
Jeff Moden (3/30/2015)
Jason-299789 (3/30/2015)
I would also suggest that the key on both tables is clustered; it should make the process significantly faster. So, something like: "It Depends". Unless the key is ever-increasing, inserts could suffer dramatically due to page splits no matter how they are accomplished.
And I'm not so sure that MERGE will provide anything substantial in the area of performance. To be honest, I'm also not sure that it won't, so it's worth a try even if my hypothesis is that it won't. I don't use MERGE in SQL Server because there have been way too many reported problems with it, and I can tolerate doing "ol' fashioned", yet effective "upserts" until the reported problems seriously decline.
This looks like being close to the ideal case for a MERGE, provided that a few precautions are in place, such as the source set being distinct, no conflicting match conditions, etc. If one knows and observes those prerequisites, MERGE works like a charm.
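For instance, if the daily file could contain the same key more than once, the MERGE source can be de-duplicated first; purely a sketch, assuming a key column named key_fields and no particular "latest row wins" rule:
-- Illustrative only: keep exactly one row per key so the MERGE source is guaranteed distinct.
WITH DedupedSource AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY key_fields ORDER BY (SELECT NULL)) AS rn
    FROM table_b
)
SELECT *
FROM DedupedSource
WHERE rn = 1;   -- use this result set as the USING source of the MERGE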
😎
March 30, 2015 at 12:29 pm
Most ETL jobs from source to target want to perform both inserts for new rows and updates for existing rows having the same primary key. However, if you only need to insert new rows (rows having a new primary key), and you are not concerned with updating existing rows, then this is fairly straightforward.
One option is to set IGNORE_DUP_KEY = ON on the table's primary key, which means that any incoming rows with an existing primary key are simply ignored, meaning they are not inserted. Once that is in place, you can just slam the records home using the BCP utility or INSERT ... SELECT.
This would probably be the most optimal in terms of performance, but carefully consider the implications. I'll admit to being a reductionist; by default I will implement the simplest solution unless or until the underlying assumptions change, because I've found that approach to usually be optimal in terms of performance.
A creative use of IGNORE_DUP_KEY
http://sqlblog.com/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx
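A minimal sketch of the idea, assuming a single-column key named key_fields (names illustrative); if the primary key already exists, it would have to be dropped and recreated with the option:
-- Illustrative only: duplicate keys arriving from the feed are silently discarded instead of raising an error.
ALTER TABLE table_a
    ADD CONSTRAINT PK_table_a PRIMARY KEY CLUSTERED (key_fields)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO table_a
SELECT * FROM table_b;  -- rows with an existing key are ignored; only new keys are inserted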
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
March 30, 2015 at 9:49 pm
What do you mean "quickly"?
is it "within shortest possible time" or "with shortage possible data access outage"?
_____________
Code for TallyGenerator
March 30, 2015 at 11:36 pm
I must update existing rows if the key matches, so a MERGE statement is suitable for me.
Eric M Russell (3/30/2015)
Most ETL jobs from source to target want to perform both inserts for new rows and updates for existing rows having the same primary key. However, if you only need to insert new rows (rows having a new primary key), and you are not concerned with updating existing rows, then this is fairly straightforward. One option is to set IGNORE_DUP_KEY = ON on the table's primary key, which means that any incoming rows with an existing primary key are simply ignored, meaning they are not inserted. Once that is in place, you can just slam the records home using the BCP utility or INSERT ... SELECT.
This would probably be the most optimal in terms of performance, but carefully consider the implications. I'll admit to being a reductionist; by default I will implement the simplest solution unless or until the underlying assumptions change, because I've found that approach to usually be optimal in terms of performance.
A creative use of IGNORE_DUP_KEY
http://sqlblog.com/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx