remove dup rows within same columns

  • Hi when i heard the requirement it sounded me simple as i am a newbie but as am getting deeper and deeper its gettign complicated can any 1of you please help me out ...

    req :-

    there is a table wiht

    Col1 Col2

    John trivolta

    trivolta John

    Kate winslet

    as you can see there are 2 johns but the value is swaped within the 2 coloumns so i need remove 1 john (trivolta John)

    please help me out wiht the query...

    Thanks

  • What have you tried so far?

    Tim Mitchell, Microsoft Data Platform MVP
    Data Warehouse and ETL Consultant
    TimMitchell.net | @Tim_Mitchell | Tyleris.com
    ETL Best Practices

  • While this solution is probably not the most efficient one, it should work provided you have some kind of row_id to know the difference between different rows.

    DELETE FROM #pom WHERE id = ANY (SELECT DISTINCT(a.id) FROM #pom a INNER JOIN #pom b ON (((a.var1 = b.var2 AND a.var2 = b.var1) OR (a.var1 = b.var1 AND a.var2 = b.var2)) AND (a.id < b.id)))

    of course - you have to change table_name and columns name but that's just obvious.

  • Hi Nimdil,

    I tried your query, and is perfect.

    Anyway to delete duplicates either you should have a row number or one key column, right?

    Dreamsz - Can you tell us how exactly you want the query(with out using row number or key)?

    Regards,

    Ramu

    Ramu
    No Dream Is Too Big....!

  • --TO BE DELETED; SEE BELOW--

    It would be probably a lot slower without primary key in the table. And why you have no primary key in the table? Primary keys are strictly fundmental.

    Let's see...

    if you can perform a script with a loop this should work (provided that your table has col1 and col2 as only columns):

    DECLARE @i INT;

    DECLARE @col1 VARCHAR(100);

    DECLARE @col2 VARCHAR(100); --or whatever length

    DECLARE curs CURSOR FOR SELECT col1, col2, count(1) FROM table GROUP BY col1, col2;

    OPEN curs;

    FETCH NEXT FROM curs INTO @col1, @col2, @i;

    WHILE (@@FETCH_STATUS = 0)

    BEGIN

    DELETE TOP(@i - 1) FROM table WHERE @col1 = col1 AND @col2 = col2;

    DELETE FROM table WHERE @col1 = col2 AND @col2 = col1;

    FETCH NEXT FROM curs INTO @col1, @col2, @i;

    END

    CLOSE curs;

    DEALLOCATE curs;

    But I think it's kind of slow. You can probably also try with temporary table, s.t. like this:

    SELECT 'greater' = CASE WHEN col1 1;

    TRUNCATE TABLE table; -- or DELETE FROM table;

    INSERT INTO table (col1, col2) SELECT greater, smaller FROM #temp1

    Note, that the second code is a bit of guess as I don't have access to SQL Server until wednesday. You should probably test it and - probably - correct a mistake or two. Also - it's a bit of tricky, as you literally move all data to temporary table. You should probably perform the operation within transaction. And it may be a bit tricky on really HUGE table.

  • It would be probably a lot slower without primary key in the table. And why you have no primary key in the table? Primary keys are strictly fundmental.

    Let's see...

    if you can perform a script with a loop this should work (provided that your table has col1 and col2 as only columns):

    DECLARE @i INT;

    DECLARE @col1 VARCHAR(100);

    DECLARE @col2 VARCHAR(100); --or whatever length

    DECLARE curs CURSOR FOR SELECT col1, col2, count(1) FROM table GROUP BY col1, col2;

    OPEN curs;

    FETCH NEXT FROM curs INTO @col1, @col2, @i;

    WHILE (@@FETCH_STATUS = 0)

    BEGIN

    DELETE TOP(@i - 1) FROM table WHERE @col1 = col1 AND @col2 = col2;

    DELETE FROM table WHERE @col1 = col2 AND @col2 = col1;

    FETCH NEXT FROM curs INTO @col1, @col2, @i;

    END

    CLOSE curs;

    DEALLOCATE curs;

    But I think it's kind of slow. You can probably also try with temporary table, s.t. like this:

    SELECT 'greater' = CASE WHEN col1 < col2 THEN col2 ELSE col1,

    'smaller' = CASE WHEN col2 < col1 THEN col2 ELSE col1,

    count(1) as no INTO #temp1

    FROM table GROUP BY CASE WHEN col1 < col2 THEN col2 ELSE col1,

    CASE WHEN col2 < col1 THEN col2 ELSE col1

    HAVING COUNT(1) > 1;

    TRUNCATE TABLE table; -- or DELETE FROM table;

    INSERT INTO table (col1, col2) SELECT greater, smaller FROM #temp1

    Note, that the second code is a bit of guess as I don't have access to SQL Server until wednesday. You should probably test it and - probably - correct a mistake or two. Also - it's a bit of tricky, as you literally move all data to temporary table. You should probably perform the operation within transaction. And it may be a bit tricky on really HUGE table.

  • Thanks nimdil for the query as well for the time...

    Nimdil the query is good but the thing is it is deleting both the duplicate rows whereas we need to get rid of 1 duplicate row and keep 1 ....

  • Can you explain why you can't just add primary key for the table? Every table -should- have primary key. It is even - funny thing, really - written in rule2 of Codd's Rules of Relational DB. Please - seriously consider adding PK - even simple integer column (in case of unbelievably huge db you can use bigint).

    I admit that my suggestions are flawed. I will try to correct them.

  • Edit:- I posted a query which was not correct. I have removed that query. I could not delete this post.

    "Keep Trying"

  • If you can afford to create an identity column/ transfer rows to an temp table

    with identity column then this can work

    create table #t1 ( id int identity(1,1) ,cid1 varchar(100), cid2 varchar(200))

    insert into #t1(cid1,cid2) select 'john','candy'

    insert into #t1(cid1,cid2) select 'candy','john'

    insert into #t1(cid1,cid2) select 'irfan','khan'

    delete x from #t1 x,#t1 y where x.cid2 = y.cid1 and x.id > y.id

    Select * from #t1

    drop table #t1

    Basically your pblm is u need to differentiate which one you want to delete.

    ( Note: SQL Server doesnt understand that 'john' row came 1st and hence it needs to retained. )

    Hey, just forgot. Are you using SQL 2005 then row_number shd solve ur problem easily.

    if its 2000 then i guess the above one is the solution

  • I think that you can run out of memory with temporary table.

  • It heavily depends on how much RAM you have and how many users are going to access, how much of data on your query..what kind of settings on SQL Server etc..Too many things to put down under this context..

    Btw, It was a small query issue..why bring such a huge topic our here? 😉

  • for 2005 this one shd work..

    create table #t1(col1 varchar(200),col2 varchar(200))

    insert into #t1 select 'john','trovolta'

    insert into #t1 select 'trovolta','john'

    insert into #t1 select 'carlos','moya'

    delete x from

    ( select row_number() OVER (order by col1,col2) as id,col1,col2 from #t1 ) x,

    ( select row_number() OVER (order by col1,col2) as id,col1,col2 from #t1) Y

    where x.col2 = y.col1 and x.id > y.id

    select * from #t1

  • create table #t1(col1 varchar(200),col2 varchar(200))

    insert into #t1 select 'john','trovolta'

    insert into #t1 select 'trovolta','john'

    insert into #t1 select 'carlos','moya'

    insert into #t1 select 'john','candy'

    insert into #t1 select 'candy','john'

    delete x from

    ( select row_number() OVER (order by getdate()) as id,col1,col2 from #t1 ) x,

    ( select row_number() OVER (order by getdate()) as id,col1,col2 from #t1) Y

    where x.col2 = y.col1 and x.id > y.id

    This one is better than the previous one as it ensures the order in which it is inserted

    ie it deletes the duplicate inserted at the last and not the 1st one. So its not influencd by the aplbhabetical order of the data

    this one deletes candy john and not john candy like the last one.

  • This doesn't work. If you add the following to the list of inserts and try you'll see why:

    insert into #t1 select 'travolta', 'carlos'

    insert into #t1 select 'sainz', 'john'

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply