Inserting Only Unique Records

  • Hello,

    I have a table with sixty columns, five of which define uniqueness for the records. Currently there are 190,775 records in the table, and one of them is a duplicate. I need to insert only the unique records from this table (all columns) into another table. I cannot use a unique nonclustered index with IGNORE_DUP_KEY on the destination table because of a problem I am having with the 'duplicate key was ignored' message. The destination table has a primary key with a clustered index on the same five columns.

    How can I put together a SELECT statement that will give me all of the columns in the source table based on uniqueness of the five key columns?

    Does my request make sense? Please let me know if you have questions.

    Thank you for your help!

    CSDunn

  • Hi,

    Which row's values of the duplicate key do you want? There are several ways you could do it... perhaps something like this, using MAX on a sixth column to pick one row per key:

    INSERT INTO dest
    SELECT o.*
    FROM tbl o
    INNER JOIN (SELECT A, B, C, D, E, MAX(F) AS F
                FROM tbl
                GROUP BY A, B, C, D, E) i
        ON o.A = i.A AND o.B = i.B AND o.C = i.C AND o.D = i.D AND o.E = i.E AND o.F = i.F

    - James

    --
    James Moore
    Red Gate Software Ltd

  • Thanks for your response. My first thought was to use MAX on all but the key fields in the SELECT statement on the source table, then group by the key fields. So the insert would look something like this:

    INSERT INTO dest
    (fields)
    SELECT keyfield1, ... keyfield5,
           MAX([Column1]) AS Column1, MAX([Column2]) AS Column2, ... (repeat for the other non-key columns)
    FROM source
    GROUP BY keyfield1, ... keyfield5
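
    Note that taking MAX of each non-key column independently can stitch together values from different rows if duplicates ever disagree in their non-key columns; with an exact duplicate it is safe. As a quick sanity check of the idea on a toy table (names made up for illustration):

    -- toy version: two key columns, one non-key column
    create table #src (k1 int, k2 int, payload varchar(10))
    insert into #src values (1, 1, 'a')
    insert into #src values (1, 1, 'a')  -- the duplicate
    insert into #src values (2, 2, 'b')

    -- one row per key combination
    select k1, k2, max(payload) as payload
    from #src
    group by k1, k2
    -- returns 2 rows: (1, 1, 'a') and (2, 2, 'b')

    drop table #src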

     

  • --Here is a method to eliminate dups:

    --First select all the columns of the existing table into a temp table (make sure you have enough disk space for this)

    --adding a new integer column that will contain a unique value

    select identity(int) as newCol, *
    into firstTempTable
    from existingTable

    --Now delete the duplicate records
    delete from firstTempTable
    where newCol in (select max(newCol)
                     from firstTempTable
                     group by col1, col2, col3 -- ...your five key columns
                     having count(*) > 1)

    --alter the temp table to get rid of the extra column
    alter table firstTempTable drop column newCol

    --copy the data to the final table (the one with the unique index)
    insert into newTable
    select * from firstTempTable

    drop table firstTempTable

    /* here is some example/test code

    if exists (select 1
               from sysobjects
               where id = object_id('existingTable_')
               and type = 'U')
        drop table existingTable_

    --I'm only going to use three columns to make it easy

    create table existingTable_ (col1_ int, col2_ int, col3_ int)

    --create 25 records with the three columns populated

    DECLARE @counter smallint
    SET @counter = 1
    WHILE @counter < 26
    BEGIN
        insert into existingTable_ values (RAND(@counter * 10) * 100000, RAND(@counter * 100) * 100000, RAND(@counter * 1000) * 100000)
        SET @counter = @counter + 1
    END

    --now create 20 records that duplicate the first 20 records created above

    SET @counter = 1
    WHILE @counter < 21
    BEGIN
        insert into existingTable_ values (RAND(@counter * 10) * 100000, RAND(@counter * 100) * 100000, RAND(@counter * 1000) * 100000)
        SET @counter = @counter + 1
    END

    --look at the data: 20 key combinations should show a count of 2 and 5 a count of 1
    select col1_, col2_, col3_, count(*) as cnt
    from existingTable_
    group by col1_, col2_, col3_

    if exists (select 1
               from sysobjects
               where id = object_id('tempTable_')
               and type = 'U')
        drop table tempTable_

    --now copy the data to a new table, adding a column with a unique value per row
    select identity(int) as newCol_, *
    into tempTable_
    from existingTable_

    --delete all dups from tempTable_
    --NOTE: this command would need to be run multiple times if there were three or four
    --      copies of a record; in that case I would copy all records that match the
    --      sub-select to a secondTempTable_ instead, but deleting is faster in your case.
    delete from tempTable_
    where newCol_ in (select max(newCol_)
                      from tempTable_
                      group by col1_, col2_, col3_
                      having count(*) > 1)
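
    --A variant that would handle any number of copies (an untested sketch): loop
    --until no key combination has more than one row left
    while exists (select 1 from tempTable_
                  group by col1_, col2_, col3_
                  having count(*) > 1)
    begin
        delete from tempTable_
        where newCol_ in (select max(newCol_)
                          from tempTable_
                          group by col1_, col2_, col3_
                          having count(*) > 1)
    end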

    --get rid of the extra column
    alter table tempTable_ drop column newCol_

    --check the results: this should return zero rows
    select col1_, col2_, col3_, count(*) as cnt
    from tempTable_
    group by col1_, col2_, col3_
    having count(*) > 1

    --Last step is to copy the data into your final table that has the unique index in place
    insert into yourFinalTable_
    select * from tempTable_

    */

     

  • I think that there are simpler solutions. Does this table have an identity column? If not, you can use the technique below. It's pretty straightforward. If it does, just skip the first statement below that creates the identity column.

    ALTER TABLE SourceTableName
    ADD MyIdentCol INT IDENTITY(1, 1)

    -- Optional: create an index on MyIdentCol. Can be handy.
    -- I'll leave this to you.

    INSERT INTO TargetTable
    SELECT a.*  -- in practice, list the original columns here so MyIdentCol is not inserted
    FROM SourceTableName a
    INNER JOIN
    (
        -- This gets all ident values for rows that should
        -- be moved to the target table.
        SELECT MAX(MyIdentCol) AS MyIdentColValuesToTransfer
        FROM SourceTableName
        GROUP BY KeyCol1, KeyCol2, ..., KeyCol5
    ) b
    ON a.MyIdentCol = b.MyIdentColValuesToTransfer
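
    Once the rows are copied, the helper column can be dropped again (same hypothetical names as above):

    ALTER TABLE SourceTableName
    DROP COLUMN MyIdentCol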

  • You say that 5 of the columns define the uniqueness. I assume, then, since you have a duplicate row, that there is no constraint of any kind here to enforce the uniqueness. This will insert only the unique rows (based on your 5 columns) into the destination table:

    INSERT INTO dest
    (fields)
    SELECT (Fields)
    FROM Source S
        INNER JOIN (
                    SELECT Column1, Column2, Column3, Column4, Column5
                    FROM Source
                    GROUP BY Column1, Column2, Column3, Column4, Column5
                    HAVING COUNT(*) = 1
                    ) S2
        ON S.Column1 = S2.Column1
            AND S.Column2 = S2.Column2
            AND S.Column3 = S2.Column3
            AND S.Column4 = S2.Column4
            AND S.Column5 = S2.Column5

     

    John Rowan

    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • Hmm... what about inserting one row out of each set of duplicate rows? That's where the challenge is.

  • You mean if the duplicates are not limited to the 5 columns and instead we have to check all 60 columns? In that case, you must scale out the GROUP BY and JOIN ON clauses of the SQL statement, as in the sketch below. What's the challenge? The option of adding an IDENTITY column would also require you to add all 60 columns to your GROUP BY clause. I am not in favor of altering the schema when simple ANSI SQL will pull the required data.
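
    For example (a sketch; the column list is abbreviated), grouping on every column and dropping the HAVING clause returns exactly one copy of each distinct row:

    SELECT Column1, Column2, ..., Column60
    FROM Source
    GROUP BY Column1, Column2, ..., Column60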

     

    John Rowan

    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • The problem posted above says that there are 5 cols that uniquely identify each row EXCEPT for one row that has dups (i.e., values appear multiple times for these 5 cols), and the goal is to write a statement that will pull all distinct values from the table.

    Say that the dup row appears 3 times in the table, and that there are 100 other rows that have no dups. Again- the dups only come about in these 5 cols. The goal is then to pull out 101 rows.

    The statement that generates S2 above will pull back 100 rows. Then you join back to the original table (which has 103 rows), and the overall select returns 100 rows. How do you get one copy of the dup row to also be selected, so that you get 101 rows?

    The problem is that there is no unique identifier for the table, since the 5 cols are unique except for that dup; therefore you have to plug in some uniqueness to be able to identify each row. Well... you don't really have to (you could iterate over all the dups and write long code to handle each dup row one at a time), but plugging in a unique identifier seems like a more "record-set"-friendly solution than loops. It can be removed easily afterwards as well (the table only has 200k rows; it's pretty small).

  • Thanks for the input!

    CSDunn

  • Just a thought: have you tried using INSERT INTO ... SELECT DISTINCT ... to get the non-duplicate rows inserted into your new table, and then joining back to the original to populate the rest of the columns? A rough sketch is below.
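
    If the duplicate rows happen to be identical in all of the columns (not just the five key columns), the simplest form of this would be something like the following (a sketch; your real table names assumed):

    INSERT INTO dest
    SELECT DISTINCT *
    FROM source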


    Regards,

    Carla Hepker
    Documentation: A Shaft of light into a coded world.
