Inserting Only Unique Records

  • Hello,

    I have a table with sixty columns, five of which define uniqueness for the records. Currently there are 190,775 records in the table, and one of them is a duplicate. I need to insert only the unique records from this table (all columns) into another table. I cannot use a unique nonclustered index with IGNORE_DUP_KEY on the destination table because of a problem I am having with the 'duplicate key was ignored' message. The destination table has a primary key with a clustered index on the same five columns.

    How can I put together a SELECT statement that will give me all of the columns in the source table based on uniqueness of the five key columns?

    Does my request make sense? Please let me know if you have questions.

    Thank you for your help!

    CSDunn

  • Hi,

    Which row's values of the duplicate key do you want? There are several ways you could do it... perhaps something like this, using MAX on a sixth column to pick one row per key:

    INSERT INTO dest
    SELECT o.*
    FROM tbl o
    INNER JOIN (SELECT A, B, C, D, E, MAX(F) AS F
                FROM tbl
                GROUP BY A, B, C, D, E) i
        ON o.A = i.A AND o.B = i.B AND o.C = i.C AND o.D = i.D AND o.E = i.E AND o.F = i.F

    - James

    --
    James Moore
    Red Gate Software Ltd

  • Thanks for your response. My first thought was to use MAX on all but the key fields in the SELECT statement on the source table, then group by the key fields. So the insert would look something like this:

    INSERT INTO dest
    (fields)
    SELECT keyfield1, ... keyfield5,
           MAX([Column1]) AS Column1, MAX([Column2]) AS Column2, ... (repeat for the other non-key columns)
    FROM source
    GROUP BY keyfield1, ... keyfield5
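
    Note that taking MAX of each non-key column independently can stitch together values from different rows if duplicates ever disagree in their non-key columns; with an exact duplicate it is safe. As a quick sanity check of the idea on a toy table (names made up for illustration):

    -- toy version: two key columns, one non-key column
    create table #src (k1 int, k2 int, payload varchar(10))
    insert into #src values (1, 1, 'a')
    insert into #src values (1, 1, 'a')  -- the duplicate
    insert into #src values (2, 2, 'b')

    -- one row per key combination
    select k1, k2, max(payload) as payload
    from #src
    group by k1, k2
    -- returns 2 rows: (1, 1, 'a') and (2, 2, 'b')

    drop table #src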

     

  • --Here is a method to eliminate dups:

    --First select all the columns of the existing table into a temp table (make sure you have enough disk space for this)

    --adding a new integer column that will contain a unique value

    select identity(int) as newCol, *
    into firstTempTable
    from existingTable

    --Now delete the duplicate records
    delete from firstTempTable
    where newCol in (select max(newCol)
                     from firstTempTable
                     group by col1, col2, col3 -- ...your five key columns
                     having count(*) > 1)

    --alter the temp table to get rid of the extra column
    alter table firstTempTable drop column newCol

    --copy the data to the final table (the one with the unique index)
    insert into newTable
    select * from firstTempTable

    drop table firstTempTable

    /* here is some example/test code

    if exists (select 1
               from sysobjects
               where id = object_id('existingTable_')
               and type = 'U')
        drop table existingTable_

    --I'm only going to use three columns to make it easy

    create table existingTable_ (col1_ int, col2_ int, col3_ int)

    --create 25 records with the three columns populated

    DECLARE @counter smallint
    SET @counter = 1
    WHILE @counter < 26
    BEGIN
        insert into existingTable_ values (RAND(@counter * 10) * 100000, RAND(@counter * 100) * 100000, RAND(@counter * 1000) * 100000)
        SET @counter = @counter + 1
    END

    --now create 20 records that duplicate the first 20 records created above

    SET @counter = 1
    WHILE @counter < 21
    BEGIN
        insert into existingTable_ values (RAND(@counter * 10) * 100000, RAND(@counter * 100) * 100000, RAND(@counter * 1000) * 100000)
        SET @counter = @counter + 1
    END

    --look at the data: 20 key combinations should show a count of 2 and 5 a count of 1
    select col1_, col2_, col3_, count(*) as cnt
    from existingTable_
    group by col1_, col2_, col3_

    if exists (select 1
               from sysobjects
               where id = object_id('tempTable_')
               and type = 'U')
        drop table tempTable_

    --now copy the data to a new table, adding a column with a unique value per row
    select identity(int) as newCol_, *
    into tempTable_
    from existingTable_

    --delete all dups from tempTable_
    --NOTE: this command would need to be run multiple times if there were three or four
    --      copies of a record; in that case I would copy all records that match the
    --      sub-select to a secondTempTable_ instead, but deleting is faster in your case.
    delete from tempTable_
    where newCol_ in (select max(newCol_)
                      from tempTable_
                      group by col1_, col2_, col3_
                      having count(*) > 1)
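
    --A variant that would handle any number of copies (an untested sketch): loop
    --until no key combination has more than one row left
    while exists (select 1 from tempTable_
                  group by col1_, col2_, col3_
                  having count(*) > 1)
    begin
        delete from tempTable_
        where newCol_ in (select max(newCol_)
                          from tempTable_
                          group by col1_, col2_, col3_
                          having count(*) > 1)
    end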

    --get rid of the extra column
    alter table tempTable_ drop column newCol_

    --check the results: this should return zero rows
    select col1_, col2_, col3_, count(*) as cnt
    from tempTable_
    group by col1_, col2_, col3_
    having count(*) > 1

    --Last step is to copy the data into your final table that has the unique index in place
    insert into yourFinalTable_
    select * from tempTable_

    */

     

  • I think that there are simpler solutions. Does this table have an identity column? If not, you can use the technique below. It's pretty straightforward. If it does, just skip the first statement below that creates the identity column.

    ALTER TABLE SourceTableName
    ADD MyIdentCol INT IDENTITY(1, 1)

    -- Optional: create an index on MyIdentCol. Can be handy.
    -- I'll leave this to you.

    INSERT INTO TargetTable
    SELECT a.*  -- in practice, list the original columns here so MyIdentCol is not inserted
    FROM SourceTableName a
    INNER JOIN
    (
        -- This gets all ident values for rows that should
        -- be moved to the target table.
        SELECT MAX(MyIdentCol) AS MyIdentColValuesToTransfer
        FROM SourceTableName
        GROUP BY KeyCol1, KeyCol2, ..., KeyCol5
    ) b
    ON a.MyIdentCol = b.MyIdentColValuesToTransfer
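
    Once the rows are copied, the helper column can be dropped again (same hypothetical names as above):

    ALTER TABLE SourceTableName
    DROP COLUMN MyIdentCol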

  • You say that 5 of the columns define the uniqueness. I assume, then, since you have a duplicate row, that there is no constraint of any kind here to enforce the uniqueness. This will insert only the unique rows (based on your 5 columns) into the destination table:

    INSERT INTO dest
    (fields)
    SELECT (Fields)
    FROM Source S
        INNER JOIN (
                    SELECT Column1, Column2, Column3, Column4, Column5
                    FROM Source
                    GROUP BY Column1, Column2, Column3, Column4, Column5
                    HAVING COUNT(*) = 1
                    ) S2
        ON S.Column1 = S2.Column1
            AND S.Column2 = S2.Column2
            AND S.Column3 = S2.Column3
            AND S.Column4 = S2.Column4
            AND S.Column5 = S2.Column5

     

    John Rowan

    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • Hmm... what about inserting one row out of each set of duplicate rows? That's where the challenge is.

  • You mean if the duplicates are not limited to the 5 columns and instead we have to check all 60 columns? In that case, you must scale out the GROUP BY and JOIN ON clauses of the SQL statement, as in the sketch below. What's the challenge? The option of adding an IDENTITY column would also require you to add all 60 columns to your GROUP BY clause. I am not in favor of altering the schema when simple ANSI SQL will pull the required data.
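
    For example (a sketch; the column list is abbreviated), grouping on every column and dropping the HAVING clause returns exactly one copy of each distinct row:

    SELECT Column1, Column2, ..., Column60
    FROM Source
    GROUP BY Column1, Column2, ..., Column60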

     

    John Rowan

    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • The problem posted above says that there are 5 cols that uniquely identify each row EXCEPT for one row that has dups (i.e., values appear multiple times for these 5 cols), and the goal is to write a statement that will pull all distinct values from the table.

    Say that the dup row appears 3 times in the table, and that there are 100 other rows that have no dups. Again- the dups only come about in these 5 cols. The goal is then to pull out 101 rows.

    The statement that generates S2 above will pull back 100 rows. Then you join back to the original table (which has 103 rows), and the overall select returns 100 rows. How do you get one copy of the dup row to also be selected, so that you get 101 rows?

    The problem is that there is no unique identifier for the table, since the 5 cols are unique except for that dup; therefore you have to plug in some uniqueness to be able to identify each row. Well... you don't really have to (you could iterate over all the dups and write long code to handle each dup row one at a time), but plugging in a unique identifier seems like a more "record-set"-friendly solution than loops. It can be removed easily afterwards as well (the table only has 200k rows; it's pretty small).

  • Thanks for the input!

    CSDunn

  • Just a thought: have you tried using INSERT INTO ... SELECT DISTINCT ... to get the non-duplicate rows inserted into your new table, and then joining back to the original to populate the rest of the columns? A rough sketch is below.
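
    If the duplicate rows happen to be identical in all of the columns (not just the five key columns), the simplest form of this would be something like the following (a sketch; your real table names assumed):

    INSERT INTO dest
    SELECT DISTINCT *
    FROM source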


    Regards,

    Carla Hepker
    Documentation: A Shaft of light into a coded world.
