remove dup rows within same columns

Question

Post reply

remove dup rows within same columns

Dreamsz

SSC Eights!

Points: 911
More actions
March 12, 2009 at 7:12 pm

#202262

Hi when i heard the requirement it sounded me simple as i am a newbie but as am getting deeper and deeper its gettign complicated can any 1of you please help me out ...
req :-
there is a table wiht
Col1 Col2
John trivolta
trivolta John
Kate winslet
as you can see there are 2 johns but the value is swaped within the 2 coloumns so i need remove 1 john (trivolta John)
please help me out wiht the query...
Thanks

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply

Tim Mitchell SSCoach Points: 15685 More actions · Answer 1

What have you tried so far?

Tim Mitchell
TimMitchell.net | @tmitch.net | Tyleris.com
ETL Best Practices

nimdil Say Hey Kid Points: 676 More actions · Answer 2

While this solution is probably not the most efficient one, it should work provided you have some kind of row_id to know the difference between different rows.

DELETE FROM #pom WHERE id = ANY (SELECT DISTINCT(a.id) FROM #pom a INNER JOIN #pom b ON (((a.var1 = b.var2 AND a.var2 = b.var1) OR (a.var1 = b.var1 AND a.var2 = b.var2)) AND (a.id < b.id)))

of course - you have to change table_name and columns name but that's just obvious.

ramu.valleti SSCrazy Points: 2229 More actions · Answer 3

Hi Nimdil,

I tried your query, and is perfect.

Anyway to delete duplicates either you should have a row number or one key column, right?

Dreamsz - Can you tell us how exactly you want the query(with out using row number or key)?

Regards,

Ramu

Ramu
No Dream Is Too Big....!

nimdil Say Hey Kid Points: 676 More actions · Answer 4

--TO BE DELETED; SEE BELOW--

It would be probably a lot slower without primary key in the table. And why you have no primary key in the table? Primary keys are strictly fundmental.

Let's see...

if you can perform a script with a loop this should work (provided that your table has col1 and col2 as only columns):

DECLARE @i INT;

DECLARE @col1 VARCHAR(100);

DECLARE @col2 VARCHAR(100); --or whatever length

DECLARE curs CURSOR FOR SELECT col1, col2, count(1) FROM table GROUP BY col1, col2;

OPEN curs;

FETCH NEXT FROM curs INTO @col1, @col2, @i;

WHILE (@@FETCH_STATUS = 0)

BEGIN

DELETE TOP(@i - 1) FROM table WHERE @col1 = col1 AND @col2 = col2;

DELETE FROM table WHERE @col1 = col2 AND @col2 = col1;

FETCH NEXT FROM curs INTO @col1, @col2, @i;

END

CLOSE curs;

DEALLOCATE curs;

But I think it's kind of slow. You can probably also try with temporary table, s.t. like this:

SELECT 'greater' = CASE WHEN col1 1;

TRUNCATE TABLE table; -- or DELETE FROM table;

INSERT INTO table (col1, col2) SELECT greater, smaller FROM #temp1

Note, that the second code is a bit of guess as I don't have access to SQL Server until wednesday. You should probably test it and - probably - correct a mistake or two. Also - it's a bit of tricky, as you literally move all data to temporary table. You should probably perform the operation within transaction. And it may be a bit tricky on really HUGE table.

nimdil Say Hey Kid Points: 676 More actions · Answer 5

It would be probably a lot slower without primary key in the table. And why you have no primary key in the table? Primary keys are strictly fundmental.

Let's see...

if you can perform a script with a loop this should work (provided that your table has col1 and col2 as only columns):

DECLARE @i INT;

DECLARE @col1 VARCHAR(100);

DECLARE @col2 VARCHAR(100); --or whatever length

DECLARE curs CURSOR FOR SELECT col1, col2, count(1) FROM table GROUP BY col1, col2;

OPEN curs;

FETCH NEXT FROM curs INTO @col1, @col2, @i;

WHILE (@@FETCH_STATUS = 0)

BEGIN

DELETE TOP(@i - 1) FROM table WHERE @col1 = col1 AND @col2 = col2;

DELETE FROM table WHERE @col1 = col2 AND @col2 = col1;

FETCH NEXT FROM curs INTO @col1, @col2, @i;

END

CLOSE curs;

DEALLOCATE curs;

But I think it's kind of slow. You can probably also try with temporary table, s.t. like this:

SELECT 'greater' = CASE WHEN col1 < col2 THEN col2 ELSE col1,

'smaller' = CASE WHEN col2 < col1 THEN col2 ELSE col1,

count(1) as no INTO #temp1

FROM table GROUP BY CASE WHEN col1 < col2 THEN col2 ELSE col1,

CASE WHEN col2 < col1 THEN col2 ELSE col1

HAVING COUNT(1) > 1;

TRUNCATE TABLE table; -- or DELETE FROM table;

INSERT INTO table (col1, col2) SELECT greater, smaller FROM #temp1

Note, that the second code is a bit of guess as I don't have access to SQL Server until wednesday. You should probably test it and - probably - correct a mistake or two. Also - it's a bit of tricky, as you literally move all data to temporary table. You should probably perform the operation within transaction. And it may be a bit tricky on really HUGE table.

Dreamsz SSC Eights! Points: 911 More actions · Answer 6

Thanks nimdil for the query as well for the time...

Nimdil the query is good but the thing is it is deleting both the duplicate rows whereas we need to get rid of 1 duplicate row and keep 1 ....

nimdil Say Hey Kid Points: 676 More actions · Answer 7

Can you explain why you can't just add primary key for the table? Every table -should- have primary key. It is even - funny thing, really - written in rule2 of Codd's Rules of Relational DB. Please - seriously consider adding PK - even simple integer column (in case of unbelievably huge db you can use bigint).

I admit that my suggestions are flawed. I will try to correct them.

ChiragNS One Orange Chip Points: 26137 More actions · Answer 8

Edit:- I posted a query which was not correct. I have removed that query. I could not delete this post.

"Keep Trying"

arr.nagaraj SSCertifiable Points: 6528 More actions · Answer 9

If you can afford to create an identity column/ transfer rows to an temp table

with identity column then this can work

create table #t1 ( id int identity(1,1) ,cid1 varchar(100), cid2 varchar(200))

insert into #t1(cid1,cid2) select 'john','candy'

insert into #t1(cid1,cid2) select 'candy','john'

insert into #t1(cid1,cid2) select 'irfan','khan'

delete x from #t1 x,#t1 y where x.cid2 = y.cid1 and x.id > y.id

Select * from #t1

drop table #t1

Basically your pblm is u need to differentiate which one you want to delete.

( Note: SQL Server doesnt understand that 'john' row came 1st and hence it needs to retained. )

Hey, just forgot. Are you using SQL 2005 then row_number shd solve ur problem easily.

if its 2000 then i guess the above one is the solution

Regards,
Raj

http://Strictlysql.blogspot.com

nimdil Say Hey Kid Points: 676 More actions · Answer 10

I think that you can run out of memory with temporary table.

arr.nagaraj SSCertifiable Points: 6528 More actions · Answer 11

It heavily depends on how much RAM you have and how many users are going to access, how much of data on your query..what kind of settings on SQL Server etc..Too many things to put down under this context..

Btw, It was a small query issue..why bring such a huge topic our here? 😉

Regards,
Raj

http://Strictlysql.blogspot.com

arr.nagaraj SSCertifiable Points: 6528 More actions · Answer 12

for 2005 this one shd work..

create table #t1(col1 varchar(200),col2 varchar(200))

insert into #t1 select 'john','trovolta'

insert into #t1 select 'trovolta','john'

insert into #t1 select 'carlos','moya'

delete x from

( select row_number() OVER (order by col1,col2) as id,col1,col2 from #t1 ) x,

( select row_number() OVER (order by col1,col2) as id,col1,col2 from #t1) Y

where x.col2 = y.col1 and x.id > y.id

select * from #t1

Regards,
Raj

http://Strictlysql.blogspot.com

arr.nagaraj SSCertifiable Points: 6528 More actions · Answer 13

create table #t1(col1 varchar(200),col2 varchar(200))

insert into #t1 select 'john','trovolta'

insert into #t1 select 'trovolta','john'

insert into #t1 select 'carlos','moya'

insert into #t1 select 'john','candy'

insert into #t1 select 'candy','john'

delete x from

( select row_number() OVER (order by getdate()) as id,col1,col2 from #t1 ) x,

( select row_number() OVER (order by getdate()) as id,col1,col2 from #t1) Y

where x.col2 = y.col1 and x.id > y.id

This one is better than the previous one as it ensures the order in which it is inserted

ie it deletes the duplicate inserted at the last and not the 1st one. So its not influencd by the aplbhabetical order of the data

this one deletes candy john and not john candy like the last one.

Regards,
Raj

http://Strictlysql.blogspot.com

nigel. SSChampion Points: 11636 More actions · Answer 14

This doesn't work. If you add the following to the list of inserts and try you'll see why:

insert into #t1 select 'travolta', 'carlos'

insert into #t1 select 'sainz', 'john'

Nigel
How to post questions