April 24, 2007 at 2:01 pm
Hello,
I have a table with sixty columns, five of which define uniqueness for the records. Currently there are 190,775 records in the table, and one of them is a duplicate. I need to insert only the unique records from this table (all columns) into another table. I cannot use a unique nonclustered index with IGNORE_DUP_KEY on the destination table because of a problem I am having with the 'duplicate key was ignored' message. The destination table has a primary key with a clustered index on the same five columns.
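(For reference, the kind of index I'm ruling out would look something like this; the index and column names are placeholders:)
CREATE UNIQUE NONCLUSTERED INDEX IX_dest_keys
ON dest (KeyCol1, KeyCol2, KeyCol3, KeyCol4, KeyCol5)
WITH (IGNORE_DUP_KEY = ON)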
How can I put together a SELECT statement that will give me all of the columns in the source table based on uniqueness of the five key columns?
Does my request make sense? Please let me know if you have questions.
Thank you for your help!
CSDunn
April 24, 2007 at 2:17 pm
Hi,
Which row's values do you want to keep for the duplicated key? There are several ways you could do it... perhaps something like:
INSERT INTO dest
SELECT o.*
FROM tbl o
INNER JOIN (SELECT A, B, C, D, E, MAX(F) AS F
FROM tbl
GROUP BY A, B, C, D, E) i
ON o.A = i.A AND o.B = i.B AND ..... AND o.F = i.F
- James
--
James Moore
Red Gate Software Ltd
April 24, 2007 at 2:36 pm
Thanks for your response. My first thought was to use MAX on all but the key fields in the SELECT statement against the source table, then group by the key fields. So the insert would look something like this:
INSERT INTO dest
(fields)
SELECT keyfield1, ... keyfield5,
MAX([Column1]) AS Column1, MAX([Column2]) AS Column2...... -- repeat for the other 55 columns
FROM source
GROUP BY keyfield1, ... keyfield5
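(One caveat, not raised in the thread: when a key group contains more than one row, taking MAX column by column can stitch together values from different rows of the group. A made-up illustration:)
-- two duplicate rows for key 1: (1, 'a', 99) and (1, 'z', 5)
-- SELECT key, MAX(colA), MAX(colB) FROM ... GROUP BY key returns (1, 'z', 99),
-- a row that never existed in the source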
April 26, 2007 at 2:53 pm
--Here is a method to eliminate dups:
--First select all the columns of the existing table into a temp table (make sure you have enough disk space for this)
--adding a new integer column that will contain a unique value
select identity(int) as newCol, *
into firstTempTable
from existingTable
--Now delete out all duplicate records
delete from firstTempTable where newCol in (select max(newCol) from firstTempTable group by col1, col2, col3, etc having count(*) > 1)
-- (if a key can appear more than twice, run this delete repeatedly until it removes zero rows)
--alter the temp table to get rid of the extra column
alter table firstTempTable drop column newCol
--copy the data to the final table (the one with the unique index)
insert into newTable
select * from firstTempTable
drop table firstTempTable
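(On SQL Server 2005 and later there is also a single-pass alternative using ROW_NUMBER(); a sketch with the same made-up names, assuming col1, col2, col3 are the key columns:)
--number the rows within each key group, then keep only the first row of each group
insert into newTable
select col1, col2, col3 --list all of the columns here, but not rn
from (select *,
row_number() over (partition by col1, col2, col3 order by col1) as rn
from existingTable) d
where rn = 1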
/* here is some example/test code
if exists (select 1
from sysobjects
where id = object_id('existingTable_')
and type = 'U')
drop table existingTable_
--I'm only going to use three columns to make it easy
create table existingTable_ (col1_ int, col2_ int, col3_ int)
--create 25 records with the three columns populated
DECLARE @counter smallint
SET @counter = 1
WHILE @counter < 26
BEGIN
insert into existingTable_ values (RAND(@counter * 10) * 100000, rand(@counter * 100) * 100000,rand(@counter * 1000) * 100000)
SET @counter = @counter + 1
END
--now create 20 records that duplicate the first 20 records created above
SET @counter = 1
WHILE @counter < 21
BEGIN
insert into existingTable_ values (RAND(@counter * 10) * 100000, rand(@counter * 100) * 100000,rand(@counter * 1000) * 100000)
SET @counter = @counter + 1
END
--look at the data, should be 20 duplicate records with 5 non dups
select col1_, col2_, col3_, count(*) from existingTable_ group by col1_, col2_, col3_
if exists (select 1
from sysobjects
where id = object_id('tempTable_')
and type = 'U')
drop table tempTable_
--now copy the data to new table creating a unique data value column
select identity(int) as newCol_, *
into tempTable_
from existingTable_
--delete all dups from the temptable_
--NOTE: This command would need to be run multiple times if there were three or four copies of a record.
-- In that case I would instead copy the records that match the sub-select into a secondTempTable_,
-- but deleting is faster in your case.
delete from tempTable_ where newCol_ in (select max(newCol_) from tempTable_ group by col1_, col2_, col3_ having count(*) > 1)
--get rid of the extra column
alter table tempTable_ drop column newCol_
--check the results, should be zero dup rows
select col1_, col2_, col3_, count(*) from tempTable_ group by col1_, col2_, col3_ having count(*) > 1
--Last step is to copy the data into your final table that has the unique index in place
insert into yourFinalTable_
select * from tempTable_
*/
April 26, 2007 at 3:26 pm
I think that there are simpler solutions. Does this table have an identity column? If not, you can use the technique below. It's pretty straightforward. If it does, just skip the first statement below that creates the identity column.
ALTER TABLE SourceTableName
ADD MyIdentCol INT IDENTITY(1, 1)
-- Optional- create an index on MyIdentCol. Can be handy.
-- I'll leave this to you.
INSERT INTO TargetTable
SELECT a.KeyCol1, a.KeyCol2, ..., a.KeyCol5, ... -- list the original columns explicitly; a.* would also pull in MyIdentCol
FROM SourceTableName a
INNER JOIN
(
-- This gets all ident values for rows that should
-- be moved to the target table.
SELECT MAX(MyIdentCol) AS MyIdentColValuesToTransfer
FROM SourceTableName
GROUP BY KeyCol1, KeyCol2, ..., KeyCol5) b
ON a.MyIdentCol = b.MyIdentColValuesToTransfer
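(Once the rows are transferred, the helper column can be dropped again:)
ALTER TABLE SourceTableName
DROP COLUMN MyIdentCol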
April 26, 2007 at 3:44 pm
You say that 5 of the columns define the uniqueness. Since you have a duplicate row, I assume there is no constraint of any kind here to enforce that uniqueness. This will insert only unique rows (based on your 5 columns) into the destination table:
INSERT INTO dest
(fields)
SELECT (fields)
FROM Source S
INNER JOIN (
SELECT Column1,
Column2,
Column3,
Column4,
Column5
FROM Source
GROUP BY Column1, Column2, Column3, Column4, Column5
HAVING COUNT(*) = 1
) S2
ON S.Column1 = S2.Column1
AND S.Column2 = S2.Column2
AND S.Column3 = S2.Column3
AND S.Column4 = S2.Column4
AND S.Column5 = S2.Column5
April 26, 2007 at 3:47 pm
Hmm... what about inserting 1 row from among all the duplicate rows? That's where the challenge is.
April 26, 2007 at 4:01 pm
You mean if the duplicates are not limited to the 5 columns and we instead have to check all 60 columns? In that case, you just scale out the GROUP BY and JOIN ... ON clauses of the SQL statement. What's the challenge? The option of adding an IDENTITY column would also require you to add all 60 columns to your GROUP BY clause. I am not in favor of altering the schema when simple ANSI SQL will pull the required data (see the sketch below).
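(And if the two copies really are identical in all 60 columns, a GROUP BY over every column collapses to a plain DISTINCT; a sketch, assuming the column lists of Source and dest line up:)
INSERT INTO dest
SELECT DISTINCT *
FROM Source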
April 26, 2007 at 4:12 pm
The problem posted above says that there are 5 cols that uniquely identify each row EXCEPT for one row that has dups (i.e., values appear multiple times for these 5 cols), and the goal is to write a statement that will pull all distinct values from the table.
Say that the dup row appears 3 times in the table, and that there are 100 other rows that have no dups. Again- the dups only come about in these 5 cols. The goal is then to pull out 101 rows.
The statement that generates S2 above will pull back 100 rows. Then you join again on the original table (that has 103 rows) and the overall select will return 100 rows. How do you get 1 copy of the dup row to also be selected, to get 101 rows?
The problem is that there is no unique identifier for the table, since the 5 cols are unique except for that dup; therefore you have to plug in some uniqueness to be able to identify each row. Well... you don't really have to (you could iterate over all the dups and write long code to handle one dup row at a time), but plugging in a unique identifier seems like a more "record-set"-friendly solution than loops. It can be removed easily afterwards as well (the table only has 200k rows; it's pretty small).
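(To make the counting concrete, a minimal sketch with made-up names: one key column k, where the value k = 1 appears three times. The MAX(identity) join brings back exactly one row per key value:)
CREATE TABLE #t (k INT)
INSERT INTO #t VALUES (1)
INSERT INTO #t VALUES (1)
INSERT INTO #t VALUES (1)
INSERT INTO #t VALUES (2)
INSERT INTO #t VALUES (3)
-- plug in the uniqueness
ALTER TABLE #t ADD rid INT IDENTITY(1, 1)
-- one row per key value comes back: k = 1, 2, 3 (three rows, not five)
SELECT t.k
FROM #t t
INNER JOIN (SELECT MAX(rid) AS rid FROM #t GROUP BY k) d
ON t.rid = d.rid
DROP TABLE #t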
April 28, 2007 at 3:32 am
Thanks for the input!
CSDunn