March 28, 2015 at 6:50 pm
Dear all,
Our system runs on a SQL Server 2012 database which has a table (table_a) with over 10M records. Every day we receive a data file from the previous system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.
The initial solution is:
1 Create a table (table_b) whose structure is the same as table_a
2 Use BCP to import updated records into table_b
3 Remove outdated data in table_a:
delete a from table_a a inner join table_b b on a.key_fields = b.key_fields
4 Append updated or new data into table_a:
insert into table_a select * from table_b
From our tests, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?
George
March 29, 2015 at 1:38 am
George Wei (3/28/2015)
Dear all, Our system runs on a SQL Server 2012 database which has a table (table_a) with over 10M records. Every day we receive a data file from the previous system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.
The initial solution is:
1 Create a table (table_b) whose structure is the same as table_a
2 Use BCP to import updated records into table_b
3 Remove outdated data in table_a:
delete a from table_a a inner join table_b b on a.key_fields = b.key_fields
4 Append updated or new data into table_a:
insert into table_a select * from table_b
From our tests, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?
George
Are there indexes in place to support the join in step 3? Can you post the actual execution plan for a SELECT version of the DELETE?
SELECT (keys for table_a) from table_a inner join table_b on table_a.key_fields = table_b.key_fields
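If no such indexes exist yet, something along these lines might help the join; this is only a sketch with illustrative names, assuming a single key column:
-- Illustrative only: index the join key on both tables so the DELETE/SELECT can seek rather than scan.
CREATE INDEX IX_table_a_key_fields ON table_a (key_fields);
CREATE INDEX IX_table_b_key_fields ON table_b (key_fields);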
For better assistance in answering your questions, please read this.
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop / Jeff Moden
March 29, 2015 at 1:47 am
Quick suggestion, use the MERGE (Transact-SQL) statement
😎
March 29, 2015 at 2:10 pm
George Wei (3/28/2015)
Dear all, Our system runs on a SQL Server 2012 database which has a table (table_a) with over 10M records. Every day we receive a data file from the previous system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.
The initial solution is:
1 Create a table (table_b) whose structure is the same as table_a
2 Use BCP to import updated records into table_b
3 Remove outdated data in table_a:
delete a from table_a a inner join table_b b on a.key_fields = b.key_fields
4 Append updated or new data into table_a:
insert into table_a select * from table_b
From our tests, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?
George
Except for the merge process you describe above, is the table static? In other words, is there anything other than the process above that causes any inserts, updates, or deletes to the table at all?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 30, 2015 at 1:20 am
Effectively, step 3 is redundant: why delete the rows in the target and then reinsert them? You're doubling the workload. It should just be a case of:
1) Import data
2) Update Changed Data
3) Insert New Data.
I agree with Eirikur; it sounds like a good candidate for MERGE.
I would also suggest that the key on both tables is clustered; it should make the process significantly faster. So, something like:
MERGE table_a AS Dest   -- assumes table_a stores a persisted CheckSum column, as the UPDATE/INSERT below imply
USING
(
    SELECT *
         , HASHBYTES('MD5', col1 + '|' + col2 + '|' + CONVERT(VARCHAR(100), col3) + .....) AS [CheckSum]
    FROM table_b
) AS Source
ON Source.[Key] = Dest.[Key]
WHEN MATCHED AND Source.[CheckSum] <> Dest.[CheckSum] THEN
    UPDATE SET
        Dest.col1 = Source.col1
       ,Dest.col2 = Source.col2
       ,Dest.col3 = Source.col3
       --:::::::::::::::::
       ,Dest.[CheckSum] = Source.[CheckSum]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Key], col1, col2, col3, ........., [CheckSum])
    VALUES (Source.[Key], Source.col1, Source.col2, Source.col3, ......., Source.[CheckSum]);
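And, purely as a sketch of the clustered-key suggestion (assuming a single key column named [Key]; drop any existing clustered index first and adjust names to the real schema):
-- Illustrative only: cluster both tables on the merge key so the join can be resolved as an ordered merge.
CREATE UNIQUE CLUSTERED INDEX CIX_table_a_Key ON table_a ([Key]);
CREATE UNIQUE CLUSTERED INDEX CIX_table_b_Key ON table_b ([Key]);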
----
@pietlinden, there would be no benefit in partitioning and then using SWITCH, as you cannot guarantee that the data being altered will all be in the same partition, and 10 million rows really doesn't justify the additional overhead of partitioning.
_________________________________________________________________________
SSC Guide to Posting and Best Practices
March 30, 2015 at 10:12 am
Jason,
I've changed the code from delete & insert to MERGE, and I'll get the test result tomorrow.
Why do you prefer a CLUSTERED index to a NON-CLUSTERED one? I guess it would cost more I/O while importing data from the data file, since the data in the file is not ordered.
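If the import I/O does turn out to be the problem, I suppose table_b could be loaded as a heap and the clustered index built only after the load; just a sketch, with illustrative names:
-- Illustrative only: bulk load into a heap, then build the clustered index once the data is in.
TRUNCATE TABLE table_b;
-- ... BCP import into table_b happens here ...
CREATE UNIQUE CLUSTERED INDEX CIX_table_b_key_fields ON table_b (key_fields);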
Jason-299789 (3/30/2015)
Effectively, step 3 is redundant: why delete the rows in the target and then reinsert them? You're doubling the workload. It should just be a case of: 1) Import data
2) Update Changed Data
3) Insert New Data.
I agree with Eirikur; it sounds like a good candidate for MERGE.
I would also suggest that the key on both tables is clustered; it should make the process significantly faster. So, something like:
----
@pietlinden, there would be no benefit in partitioning and then using SWITCH, as you cannot guarantee that the data being altered will all be in the same partition, and 10 million rows really doesn't justify the additional overhead of partitioning.
March 30, 2015 at 10:28 am
Jason-299789 (3/30/2015)
I would also suggest that the key on both tables is clustered; it should make the process significantly faster. So, something like:
"It Depends". Unless the key is ever increasing, Inserts could suffer dramatically due to page splits no matter how they are accomplished.
And I'm not so sure that MERGE will provide anything substantial in the area of performance. To be honest, I'm also not sure that it won't, so it's worth a try even if my hypothesis is that it won't. I don't use MERGE in SQL Server because there have been way too many reported problems with it, and I can tolerate doing "ol' fashioned", yet effective "upserts" until the reported problems seriously decline.
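Just to illustrate what I mean by an ol' fashioned upsert, it's nothing more than an UPDATE of the matching rows followed by an INSERT of the missing ones; a rough sketch only, with illustrative column names:
-- Illustrative only: classic two-statement upsert in one transaction.
BEGIN TRANSACTION;

UPDATE a
   SET a.col1 = b.col1,
       a.col2 = b.col2
  FROM table_a AS a
  JOIN table_b AS b
    ON a.key_fields = b.key_fields;

INSERT INTO table_a (key_fields, col1, col2)
SELECT b.key_fields, b.col1, b.col2
  FROM table_b AS b
 WHERE NOT EXISTS (SELECT 1 FROM table_a AS a WHERE a.key_fields = b.key_fields);

COMMIT;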
--Jeff Moden
Change is inevitable... Change for the better is not.
March 30, 2015 at 10:51 am
George Wei (3/28/2015)
Dear all, Our system runs on a SQL Server 2012 database which has a table (table_a) with over 10M records. Every day we receive a data file from the previous system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.
The initial solution is:
1 Create a table (table_b) whose structure is the same as table_a
2 Use BCP to import updated records into table_b
3 Remove outdated data in table_a:
delete a from table_a a inner join table_b b on a.key_fields = b.key_fields
4 Append updated or new data into table_a:
insert into table_a select * from table_b
From our tests, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?
George
A separate step 3 doesn't seem necessary. Can't you just change step 2 to only insert non-outdated data to begin with? It's wasted overhead to insert the rows just to immediately delete them.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
March 30, 2015 at 11:20 am
Jeff Moden (3/30/2015)
Jason-299789 (3/30/2015)
I would also suggest that the key on both tables is clustered; it should make the process significantly faster. So, something like: "It Depends". Unless the key is ever-increasing, inserts could suffer dramatically due to page splits no matter how they are accomplished.
And I'm not so sure that MERGE will provide anything substantial in the area of performance. To be honest, I'm also not sure that it won't, so it's worth a try even if my hypothesis is that it won't. I don't use MERGE in SQL Server because there have been way too many reported problems with it, and I can tolerate doing "ol' fashioned", yet effective "upserts" until the reported problems seriously decline.
This looks like being close to the ideal case for a MERGE, provided that a few precautions are in place, such as the source set being distinct, no conflicting match conditions, etc. If one knows and observes those prerequisites, MERGE works like a charm.
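For instance, if the daily file could contain the same key more than once, the MERGE source can be de-duplicated first; purely a sketch, assuming a key column named key_fields and no particular "latest row wins" rule:
-- Illustrative only: keep exactly one row per key so the MERGE source is guaranteed distinct.
WITH DedupedSource AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY key_fields ORDER BY (SELECT NULL)) AS rn
    FROM table_b
)
SELECT *
FROM DedupedSource
WHERE rn = 1;   -- use this result set as the USING source of the MERGE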
😎
March 30, 2015 at 12:29 pm
Most ETL jobs from source to target want to perform both inserts for new rows and updates for existing rows having the same primary key. However, if you only need to insert new rows (rows having a new primary key), and you are not concerned with updating existing rows, then this is fairly straightforward.
One option is to set IGNORE_DUP_KEY = ON on the table's primary key, which means that any incoming rows with an existing primary key are simply ignored, meaning they are not inserted. Once that is in place, you can just slam the records home using the BCP utility or INSERT ... SELECT.
This would probably be the most optimal in terms of performance, but carefully consider the implications. I'll admit to being a reductionist; by default I will implement the simplest solution unless or until the underlying assumptions change, because I've found that approach to usually be optimal in terms of performance.
A creative use of IGNORE_DUP_KEY
http://sqlblog.com/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx
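A minimal sketch of the idea, assuming a single-column key named key_fields (names illustrative); if the primary key already exists, it would have to be dropped and recreated with the option:
-- Illustrative only: duplicate keys arriving from the feed are silently discarded instead of raising an error.
ALTER TABLE table_a
    ADD CONSTRAINT PK_table_a PRIMARY KEY CLUSTERED (key_fields)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO table_a
SELECT * FROM table_b;  -- rows with an existing key are ignored; only new keys are inserted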
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
March 30, 2015 at 9:49 pm
What do you mean "quickly"?
is it "within shortest possible time" or "with shortage possible data access outage"?
_____________
Code for TallyGenerator
March 30, 2015 at 11:36 pm
I must update existing rows if the key matches, so a MERGE statement is suitable for me.
Eric M Russell (3/30/2015)
Most ETL jobs from source to target want to perform both inserts for new rows and updates for existing rows having the same primary key. However, if you only need to insert new rows (rows having a new primary key), and you are not concerned with updating existing rows, then this is fairly straightforward. One option is to set IGNORE_DUP_KEY = ON on the table's primary key, which means that any incoming rows with an existing primary key are simply ignored, meaning they are not inserted. Once that is in place, you can just slam the records home using the BCP utility or INSERT ... SELECT.
This would probably be the most optimal in terms of performance, but carefully consider the implications. I'll admit to being a reductionist; by default I will implement the simplest solution unless or until the underlying assumptions change, because I've found that approach to usually be optimal in terms of performance.
A creative use of IGNORE_DUP_KEY
http://sqlblog.com/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx