How to quickly update/insert 3M records in a large table?

  • Dear all,

Our system runs a SQL Server 2012 database and has a table (table_a) with over 10M records. Every day we receive a data file from an upstream system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.

The initial solution is:

1. Create a table (table_b) with the same structure as table_a.

2. Use BCP to import the updated records into table_b.

3. Remove outdated data from table_a:

DELETE a FROM table_a AS a INNER JOIN table_b AS b ON a.key_fields = b.key_fields

4. Append the updated or new data to table_a:

INSERT INTO table_a SELECT * FROM table_b

In testing, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?
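Putting the whole process together, it currently looks roughly like this (just a sketch; the database, server, file path, and key/column names are placeholders rather than our real ones):

-- 1) Staging table with the same structure as table_a (no indexes yet)
SELECT TOP (0) * INTO dbo.table_b FROM dbo.table_a;

-- 2) Bulk load the daily file (command line; path/server are placeholders)
-- bcp MyDb.dbo.table_b in "D:\feeds\daily_extract.dat" -c -T -S MyServer

-- 3) Remove rows in table_a that the new file supersedes
DELETE a
FROM dbo.table_a AS a
INNER JOIN dbo.table_b AS b
        ON a.key_fields = b.key_fields;

-- 4) Append the updated and new rows
INSERT INTO dbo.table_a
SELECT * FROM dbo.table_b;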

    George

This article might give you some ideas... Partitioning in SQL Server

    It sounds like this is a good use for it, but I'm not a partitioning expert... and it depends on the version of SQL Server you're using.

  • George Wei (3/28/2015)


    Dear all,

Our system runs a SQL Server 2012 database and has a table (table_a) with over 10M records. Every day we receive a data file from an upstream system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.

The initial solution is:

1. Create a table (table_b) with the same structure as table_a.

2. Use BCP to import the updated records into table_b.

3. Remove outdated data from table_a:

DELETE a FROM table_a AS a INNER JOIN table_b AS b ON a.key_fields = b.key_fields

4. Append the updated or new data to table_a:

INSERT INTO table_a SELECT * FROM table_b

In testing, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?

    George

    Are there indexes in place to support the join in step 3? Can you post the actual execution plan for a SELECT version of the DELETE?

SELECT (keys for table_a) FROM table_a INNER JOIN table_b ON table_a.key_fields = table_b.key_fields
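If nothing suitable exists, a supporting index would look something like this (only a sketch, assuming key_fields is a single column; adjust for the real key):

CREATE NONCLUSTERED INDEX IX_table_a_key_fields ON dbo.table_a (key_fields);
CREATE NONCLUSTERED INDEX IX_table_b_key_fields ON dbo.table_b (key_fields);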


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Quick suggestion: use the MERGE (Transact-SQL) statement.

    😎

  • George Wei (3/28/2015)


    Dear all,

Our system runs a SQL Server 2012 database and has a table (table_a) with over 10M records. Every day we receive a data file from an upstream system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.

The initial solution is:

1. Create a table (table_b) with the same structure as table_a.

2. Use BCP to import the updated records into table_b.

3. Remove outdated data from table_a:

DELETE a FROM table_a AS a INNER JOIN table_b AS b ON a.key_fields = b.key_fields

4. Append the updated or new data to table_a:

INSERT INTO table_a SELECT * FROM table_b

In testing, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?

    George

Except for the merge process you describe above, is the table static? In other words, is there anything, other than the process above, that causes any inserts/updates/deletes to the table at all?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Effectively, step 3 is redundant: why delete the rows in the target table and then reinsert them from the staging table? You're doubling the workload. It should just be a case of:

    1) Import data

    2) Update Changed Data

    3) Insert New Data.

I agree with Eirikur; it sounds like a good candidate for MERGE.

I would also suggest that the key on both the staging and destination tables is clustered; it should make the process significantly faster. So, something like:

MERGE dbo.table_a AS Dest
USING
(
    SELECT *,
           HASHBYTES('MD5', col1 + '|' + col2 + '|' + CONVERT(VARCHAR(100), col3) + .....) AS [CheckSum]
    FROM dbo.table_b
) AS Source
    ON Source.[Key] = Dest.[Key]
WHEN MATCHED AND Source.[CheckSum] <> Dest.[CheckSum] THEN
    UPDATE SET
        Dest.col1 = Source.col1,
        Dest.col2 = Source.col2,
        Dest.col3 = Source.col3,
        -- ...remaining columns...
        Dest.[CheckSum] = Source.[CheckSum]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Key], col1, col2, col3, ....., [CheckSum])
    VALUES (Source.[Key], Source.col1, Source.col2, Source.col3, ....., Source.[CheckSum]);
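The clustered keys mentioned above could be set up along these lines (a sketch; I'm assuming a single-column key, the [Key] placeholder used in the MERGE):

CREATE UNIQUE CLUSTERED INDEX CIX_table_a_Key ON dbo.table_a ([Key]);
CREATE UNIQUE CLUSTERED INDEX CIX_table_b_Key ON dbo.table_b ([Key]);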

    ----

@pietlinden, there would be no benefit to partitioning and then using SWITCH, as you cannot guarantee that the data being altered will all land in the same partition, and 10 million rows really doesn't justify the additional overhead of partitioning.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Jason,

    I've changed the code from delete & insert to MERGE, and I'll get the test result tomorrow.

Why do you prefer a CLUSTERED index to a NON-CLUSTERED one? I guess it would cost more I/O while importing data from the data file, since the data in the file is not ordered.
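One way to limit that import cost, purely as a sketch and an assumption on my part rather than something measured here, is to bulk load table_b as a heap and only build the clustered index once the load has finished:

-- Keep table_b as a heap during the bcp load (drop the clustered index first if it exists),
-- then sort the data once while building the key.
TRUNCATE TABLE dbo.table_b;
-- ... bcp the daily file into dbo.table_b here ...
CREATE UNIQUE CLUSTERED INDEX CIX_table_b_Key ON dbo.table_b ([Key]);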

    Jason-299789 (3/30/2015)


Effectively, step 3 is redundant: why delete the rows in the target table and then reinsert them from the staging table? You're doubling the workload. It should just be a case of:

    1) Import data

    2) Update Changed Data

    3) Insert New Data.

I agree with Eirikur; it sounds like a good candidate for MERGE.

I would also suggest that the key on both the staging and destination tables is clustered; it should make the process significantly faster. So, something like:

    ----

@pietlinden, there would be no benefit to partitioning and then using SWITCH, as you cannot guarantee that the data being altered will all land in the same partition, and 10 million rows really doesn't justify the additional overhead of partitioning.

  • Jason-299789 (3/30/2015)


I would also suggest that the key on both the staging and destination tables is clustered; it should make the process significantly faster. So, something like:

    "It Depends". Unless the key is ever increasing, Inserts could suffer dramatically due to page splits no matter how they are accomplished.

    And I'm not so sure that MERGE will provide anything substantial in the area of performance. To be honest, I'm also not sure that it won't, so it's worth a try even if my hypothesis is that it won't. I don't use MERGE in SQL Server because there have been way too many reported problems with it and I can tolerate doing "l' fashioned", yet effective "upserts" until that the reported problems seriously decline.
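For reference, the "ol' fashioned" upsert amounts to something like this (a sketch; key and column names are placeholders):

-- Update rows that already exist in the target
UPDATE a
   SET a.col1 = b.col1,
       a.col2 = b.col2
  FROM dbo.table_a AS a
 INNER JOIN dbo.table_b AS b
         ON a.key_fields = b.key_fields;

-- Insert rows that are new to the target
INSERT INTO dbo.table_a (key_fields, col1, col2)
SELECT b.key_fields, b.col1, b.col2
  FROM dbo.table_b AS b
 WHERE NOT EXISTS (SELECT 1 FROM dbo.table_a AS a WHERE a.key_fields = b.key_fields);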

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • George Wei (3/28/2015)


    Dear all,

Our system runs a SQL Server 2012 database and has a table (table_a) with over 10M records. Every day we receive a data file from an upstream system containing approximately 3M updated or new records for table_a. My job is to update table_a with the new data.

The initial solution is:

1. Create a table (table_b) with the same structure as table_a.

2. Use BCP to import the updated records into table_b.

3. Remove outdated data from table_a:

DELETE a FROM table_a AS a INNER JOIN table_b AS b ON a.key_fields = b.key_fields

4. Append the updated or new data to table_a:

INSERT INTO table_a SELECT * FROM table_b

In testing, this solution is very inefficient; step 3 alone takes several hours. How can I improve it?

    George

A separate step 3 doesn't seem necessary. Can't you just change step 2 to only insert non-outdated data to begin with? It's wasted overhead to insert the rows just to immediately delete them.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Jeff Moden (3/30/2015)


    Jason-299789 (3/30/2015)


I would also suggest that the key on both the staging and destination tables is clustered; it should make the process significantly faster. So, something like:

"It depends". Unless the key is ever-increasing, inserts could suffer dramatically due to page splits no matter how they are accomplished.

And I'm not so sure that MERGE will provide anything substantial in the area of performance. To be honest, I'm also not sure that it won't, so it's worth a try even if my hypothesis is that it won't. I don't use MERGE in SQL Server because there have been way too many reported problems with it, and I can tolerate doing "ol' fashioned", yet effective, "upserts" until the reported problems seriously decline.

This looks like a near-ideal case for MERGE, provided a few precautions are observed, such as the source set being distinct and there being no conflicting match conditions. If one knows and observes those prerequisites, MERGE works like a charm.

    😎

Most ETL jobs from source to target need to perform both inserts for new rows and updates for existing rows having the same primary key. However, if you only need to insert new rows (rows having a new primary key) and are not concerned with updating existing rows, then this is fairly straightforward.

One option is to set IGNORE_DUP_KEY = ON on the table's primary key, which means that any incoming rows with an existing primary key are simply ignored, i.e. not inserted. Once that is in place, you can just slam the records home using the BCP utility or INSERT ... SELECT.

This would probably be the fastest option, but carefully consider the implications. I'll admit to being a reductionist: by default I implement the simplest solution unless or until the underlying assumptions change, because I've usually found that approach to perform best.

    A creative use of IGNORE_DUP_KEY

    http://sqlblog.com/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx
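Set up along these lines, for example (a sketch; constraint and column names are placeholders):

-- Any incoming row whose key already exists is silently discarded with a
-- "Duplicate key was ignored." warning instead of failing the batch.
ALTER TABLE dbo.table_a
    ADD CONSTRAINT PK_table_a PRIMARY KEY CLUSTERED (key_fields)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO dbo.table_a
SELECT * FROM dbo.table_b;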

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • What do you mean "quickly"?

    is it "within shortest possible time" or "with shortage possible data access outage"?

    _____________
    Code for TallyGenerator

I must update existing rows when the key matches, so a MERGE statement is suitable for me.

    Eric M Russell (3/30/2015)


Most ETL jobs from source to target need to perform both inserts for new rows and updates for existing rows having the same primary key. However, if you only need to insert new rows (rows having a new primary key) and are not concerned with updating existing rows, then this is fairly straightforward.

One option is to set IGNORE_DUP_KEY = ON on the table's primary key, which means that any incoming rows with an existing primary key are simply ignored, i.e. not inserted. Once that is in place, you can just slam the records home using the BCP utility or INSERT ... SELECT.

This would probably be the fastest option, but carefully consider the implications. I'll admit to being a reductionist: by default I implement the simplest solution unless or until the underlying assumptions change, because I've usually found that approach to perform best.

    A creative use of IGNORE_DUP_KEY

    http://sqlblog.com/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx
