Removing duplicates

  • Here is my sample data:

    declare @t1 table (loadId int, dataId int, Code varchar(50), imageName varchar(255))

    insert into @t1
    select 12, 0, 'AS0989', '123.bmp' union all
    select 13, 0, 'AS0989', '123.bmp' union all
    select 12, 1, 'AS0990', '124.bmp' union all
    select 13, 1, 'AS0990', '124.bmp' union all
    select 12, 2, 'AS0991', null union all
    select 13, 2, 'AS0991', null

    select distinct dataId, code, imageName
    from @t1

    The data is the result of duplicates coming from different files (the files cannot be modified).

    As you can see, with DISTINCT I get the values I want. What I additionally need to fetch is the loadId; it could be either of the values 12 or 13, it doesn't matter which.

    I am unable to figure out how to do that in a single step. Any ideas?

    Thanks.

  • You can use ROW_NUMBER() for this:

    SELECT *
    FROM (
        SELECT ROW_NUMBER() OVER (PARTITION BY dataId, code, imageName ORDER BY loadId) AS RowNum, *
        FROM @t1
    ) T1
    WHERE RowNum = 1
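    For reference, a sketch of the output this corrected query should produce against the sample data above (ORDER BY loadId puts loadId 12 first in every partition, so RowNum = 1 always lands on loadId 12):

    RowNum  loadId  dataId  Code    imageName
    1       12      0       AS0989  123.bmp
    1       12      1       AS0990  124.bmp
    1       12      2       AS0991  NULL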


    Kingston Dhasian

    How to post data/code on a forum to get the best help - Jeff Moden
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • Thanks Kingston.

    I was mulling over the use of RANK and PARTITION BY; the only thing that held me back was the number of columns I would have to put in the PARTITION BY clause, 60 in an extreme case. I had simplified the example and should have mentioned this. :blush:

    Do you still think RANK is the way to go? Will there be any performance issues (the highest number of rows I envision being processed this way is somewhere around 20K)?

    If this is not ideal, what other options are there?

    Thanks again for the help.

  • If you have 60 columns in a table, there are likely to be other issues with your design. All the same, if there is some reason the table needs to be designed that way, ROW_NUMBER should be as fast as or faster than any other way to achieve the same result.

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help

  • Thank you for weighing in on the idea.

    The 60 columns will never be in a single table; they are spread across 6 tables, with one having 5 columns and the biggest having 14. To get rid of the duplicates I'm trying to make a flat view of all the data that has been parsed. Explaining the "why" of this approach would require a whole lot of background, and there's still a chance I may not be able to convey the idea. Suffice it to say that the source files were never written with the understanding that those logs could be parsed and presented by an automated process; the authors presumed they would always be read by human eyes trained "in" the project. :doze:

    If ROW_NUMBER is the only way out, I will take that route. But before committing to it, wouldn't it be nice to tap into the vast intelligence out there 🙂
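    For illustration only, a minimal sketch of that kind of flattening view; the table and column names here are hypothetical, not from the poster's actual schema:

    -- hypothetical source tables; names are illustrative only
    create view dbo.vParsedFlat
    as
    select loadId, dataId, Code, imageName from dbo.ParsedLogA
    union all
    select loadId, dataId, Code, imageName from dbo.ParsedLogB;

    -- the dedupe techniques discussed in this thread (ROW_NUMBER, GROUP BY)
    -- would then run against the view instead of @t1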

  • select dataId, code, imageName, min(loadid)
    from @t1
    group by dataId, code, imageName
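    Against the sample data, this should return loadId 12 for every group, since MIN picks the smaller of the two loadId values (alias the MIN column if you need a name for it). A sketch of the expected output:

    dataId  code    imageName  min(loadid)
    0       AS0989  123.bmp    12
    1       AS0990  124.bmp    12
    2       AS0991  NULL       12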

    Scott Pletcher, SQL Server MVP 2008-2010

  • scott.pletcher (8/9/2010)

    select dataId, code, imageName, min(loadid)
    from @t1
    group by dataId, code, imageName

    I have to admit it didn't occur to me that grouping would work.

    I would assume that since the GROUP BY clause doesn't involve any computation, it should be faster in extreme cases than the RANK function. Is that right?

    The execution plan doesn't show any cost attached to the Compute Scalar, Segment, Sort, Top, and Sequence Project operators (in the RANK-function query). Does that give any indication of which approach to take?

  • I would think GROUP BY would have to be faster than ROW_NUMBER(), since SQL doesn't have to partition and generate row numbers, but it would be interesting to compare the query plans and execution results.

    Scott Pletcher, SQL Server MVP 2008-2010
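    One way to make that comparison, as a minimal sketch (standard T-SQL session settings; run it in the same batch as the sample data above, since @t1 is a table variable):

    set statistics time on;
    set statistics io on;

    -- GROUP BY version
    select dataId, code, imageName, min(loadId)
    from @t1
    group by dataId, code, imageName;

    -- ROW_NUMBER version
    select *
    from (
        select row_number() over (partition by dataId, code, imageName order by loadId) as RowNum, *
        from @t1
    ) t
    where RowNum = 1;

    set statistics time off;
    set statistics io off;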

  • For sample data with 60 rows and 31 columns (SSMS client statistics, times in milliseconds):

    Time Statistics                Group By    Rank
    Client processing time               63     188
    Total execution time                453    1156
    Wait time on server replies         390     968

    This seems to me indisputable: GROUP BY should be the way to go.

    The execution plan, however, didn't show any difference in cost percentage.

  • 60 rows is probably not a large enough sample to show a performance effect.

    Scott Pletcher, SQL Server MVP 2008-2010
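    To test at something closer to the 20K rows mentioned earlier, a sketch for inflating the sample table with a numbers CTE (the row count and generated values are arbitrary choices, not from the thread):

    -- assumes @t1 from the sample batch above; adds ~20K rows,
    -- every dataId duplicated under two loadIds, mirroring the original pattern
    ;with n as (
        select top (10000) row_number() over (order by (select null)) as i
        from sys.all_columns a cross join sys.all_columns b
    )
    insert into @t1 (loadId, dataId, Code, imageName)
    select l.loadId,
           n.i,
           'AS' + right('0000' + cast(n.i as varchar(10)), 5),
           cast(n.i as varchar(10)) + '.bmp'
    from n
    cross join (values (12), (13)) as l(loadId);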
