What is the best way to load 15 million rows into a table with a PK? There are known duplicates.

  • Good morning,

    I have a table that needed a primary key on the tableID column (tableID int).  The table has 15 million rows.
    1.  I copied the table to exportLogTable2

    select * into exportLogTable2 from exportLogTable

    2.  I then truncated my table

    Truncate table exportLogTable

    3.  Created the Primary Key on the TableID column

    4.  Now I need to add those 15 million rows back into the table.  I know there will be duplicates.  However, as soon as SQL Server hits a duplicate, it errors out and stops inserting rows.  What is the best way to load the data back in, knowing some rows will be kicked out or not written due to duplicates?

    5.  The primary key is just on the TableID column.

    Thanks.

    Things will work out.  Get back up, change some parameters and recode.

    Create a clustered index on table2 on the same columns that will form the clustering key on the main table.  You'll then need to write a query that will not only identify the duplicates, but choose which row to keep in any set of duplicates.  Then just use that query to insert back into the main table.
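
    Something along these lines, perhaps (an untested sketch; the index name is made up, and the commented-out column lists stand in for whatever other columns the table has):

    -- Non-unique clustered index on the staging copy to speed up the dedup scan
    CREATE CLUSTERED INDEX CIX_exportLogTable2_tableID
        ON exportLogTable2 (tableID);

    -- The ORDER BY inside ROW_NUMBER() is where you choose which duplicate wins
    INSERT INTO exportLogTable (tableID /* , other columns */)
    SELECT d.tableID /* , other columns */
    FROM
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY tableID ORDER BY tableID) AS rn
        FROM exportLogTable2
    ) AS d
    WHERE d.rn = 1;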

    John

  • You could: 
    1) identify the dups, and either flag them on the source, or create an 'exclusion' table containing the duplicate data that you can use to eliminate them in the next step
    2) load the bulk, but exclude the duplicate rows.
    3) process the duplicates from the exclusion table or flagged source table to bring over just the records you want (see the sketch below).
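
    A rough sketch of that approach (untested; table and column names come from earlier in the thread, and the temp table name is invented):

    -- 1) Capture the IDs that appear more than once
    SELECT tableID
    INTO #Exclusions
    FROM exportLogTable2
    GROUP BY tableID
    HAVING COUNT(*) > 1;

    -- 2) Bulk-load every row whose ID is not duplicated
    INSERT INTO exportLogTable (tableID /* , other columns */)
    SELECT t.tableID /* , other columns */
    FROM exportLogTable2 AS t
    WHERE NOT EXISTS
    (
        SELECT 1 FROM #Exclusions AS e WHERE e.tableID = t.tableID
    );

    -- 3) Then pick one row per duplicated ID and insert those separately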

    I can't create the clustered index with the duplicates. I've already tried that.  I have to get rid of the dups first.

    I guess I could write a query to show the dups.  Seems a slow way with 15 million rows in the table.  But if I identify the dups, I could go through them and delete one of the dup rows.  Again, seems a long process.

    Things will work out.  Get back up, change some parameters and recode.

  • WebTechie - Monday, June 4, 2018 8:47 AM

    I can't create the clustered index with the duplicates. I've already tried that.  I have to get rid of the dups first.

    Why not?  Just create a clustered index, not a primary key constraint.

    Actually, I don't think you need to use table2 at all.
    (1) Create a clustered index on the main table on the tableID column
    (2) Write a query that will delete the rows you don't want to keep
    (3) Drop the clustered index and then create your primary key.
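
    For example (untested; the index and constraint names are made up, and this assumes you don't care which duplicate survives):

    -- (1) Non-unique clustered index to support the delete
    CREATE CLUSTERED INDEX CIX_exportLogTable_tableID
        ON exportLogTable (tableID);

    -- (2) Delete every row except one per tableID
    WITH Dupes AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY tableID ORDER BY tableID) AS rn
        FROM exportLogTable
    )
    DELETE FROM Dupes
    WHERE rn > 1;

    -- (3) Swap the index for the real primary key
    DROP INDEX CIX_exportLogTable_tableID ON exportLogTable;
    ALTER TABLE exportLogTable
        ADD CONSTRAINT PK_exportLogTable PRIMARY KEY CLUSTERED (tableID);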

    John

  • Not exactly considered best practice, but if you don't care which dupe gets dumped, you could consider setting IGNORE_DUP_KEY = ON on your PK to ignore dupes silently (link)
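
    Something like this, for instance (a sketch only; the constraint name is invented, and if the PK already exists without the option you'd have to drop and recreate it):

    ALTER TABLE exportLogTable
        ADD CONSTRAINT PK_exportLogTable PRIMARY KEY CLUSTERED (tableID)
        WITH (IGNORE_DUP_KEY = ON);

    -- Duplicate tableIDs are discarded with a warning instead of aborting the load
    INSERT INTO exportLogTable (tableID /* , other columns */)
    SELECT tableID /* , other columns */
    FROM exportLogTable2;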

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • I wasn't able to create the index.

    The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.TableID' and the index name 'ClusteredIndex-20180604-101725'.

    Things will work out.  Get back up, change some parameters and recode.

  • Well, yes.  A unique index is (almost) the same as a primary key.  Take the UNIQUE out of the statement and everything will be good.  You only need the index at this stage to make your DELETE statement run faster - you'll trash it afterwards and then create the primary key constraint.
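
    That is, something like this (the index name is just an example):

    -- No UNIQUE keyword, so existing duplicates won't block the index build
    CREATE CLUSTERED INDEX CIX_dedup_tableID
        ON exportLogTable (tableID);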

    John

    I tried IGNORE_DUP_KEY = ON and that failed also.

    Things will work out.  Get back up, change some parameters and recode.

    Not exactly considered best practice, but if you don't care which dupe gets dumped, you could consider setting IGNORE_DUP_KEY = ON on your PK to ignore dupes silently (link)

    I don't care.  But this failed as well.

    Things will work out.  Get back up, change some parameters and recode.

  • WebTechie - Monday, June 4, 2018 9:51 AM

    Not exactly considered best practice, but if you don't care which dupe gets dumped, you could consider setting IGNORE_DUP_KEY = ON on your PK to ignore dupes silently (link)

    I don't care.  But this failed as well.

    Remove the offending rows, then set IGNORE_DUP_KEY.  Any subsequent INSERT attempt which would create a dupe will be discarded silently (with a warning) rather than raising an error.

    But the data needs to be fixed before you put it in place.

    I've used it, so I know that it works.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • WebTechie - Monday, June 4, 2018 9:48 AM

    I tried the  IGNORE_DUP_KEY = ON  and that failed also.

    Try it for yourself:

    CREATE TABLE #DupeTest
    (
      SomeId INT PRIMARY KEY CLUSTERED WITH (IGNORE_DUP_KEY = ON)
    );

    -- First insert succeeds as normal
    INSERT #DupeTest (SomeId)
    VALUES (1);

    -- Second insert of the same value is skipped with a
    -- "Duplicate key was ignored." warning instead of an error
    INSERT #DupeTest (SomeId)
    VALUES (1);

    -- Returns a single row
    SELECT *
    FROM #DupeTest dt;

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

    Can I step back for a minute?  You want to load data into the table, but is this the source of truth for the data?

    Meaning, if I have a row in the main table and a row in the copy with the same ID, but some changed values, which is to be used?  If I have dups in the copy that I've imported, which row is the one to use?  Or does it matter?  Do you just need "a" row?

    The original table is exportLogTable.  There are no rows in this table now, though.  I copied the table to exportLogTable2 in hopes of truncating the original table, creating the primary key and then reloading the rows.  I don't care which row of the duplicates is saved.  Normally, that isn't the case, of course, but it is for this table.

    The copy of the table is exportLogTable2.

    Things will work out.  Get back up, change some parameters and recode.

    How about posting the table create statements, which you can get by scripting out the table in SSMS?  That should help us figure out why your attempts keep failing.  Also post the exact code that you are trying to run, and we can go from there.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)
