August 9, 2010 at 11:53 am
Here is my sample data:
declare @t1 table (loadId int, dataId int, Code varchar(50), imageName varchar(255))
insert into @t1
select 12, 0, 'AS0989', '123.bmp'
union all
select 13, 0, 'AS0989', '123.bmp'
union all
select 12, 1, 'AS0990', '124.bmp'
union all
select 13, 1, 'AS0990', '124.bmp'
union all
select 12, 2, 'AS0991', null
union all
select 13, 2, 'AS0991', null
select distinct dataId, code, imageName
from @t1
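For reference, against the sample data above that DISTINCT query returns three rows (row order may vary); the missing piece is a loadId column alongside them:
dataId  code    imageName
0       AS0989  123.bmp
1       AS0990  124.bmp
2       AS0991  NULL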
The data is the result of duplicates coming from different files (the files cannot be modified).
As you can see, with a DISTINCT I get the values I want. What I additionally need to fetch is the "loadId". It could be either "12" or "13"; it doesn't matter which.
I am unable to figure out a way to do that in a single step. Any ideas?
Thanks.
August 9, 2010 at 12:05 pm
You can use ROW_NUMBER() for this:
SELECT *
FROM(
SELECT ROW_NUMBER() OVER ( PARTITION BY dataId, code, imageName ORDER BY loadId ) RowNum, *
FROM @t1
) T1
WHERE RowNum = 1
How to post data/code on a forum to get the best help - Jeff Moden
http://www.sqlservercentral.com/articles/Best+Practices/61537/
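Against the sample data above, the subquery numbers the rows within each (dataId, code, imageName) group, and ORDER BY loadId assigns RowNum = 1 to the smallest loadId in each group, so the outer filter keeps the loadId 12 rows:
RowNum  loadId  dataId  code    imageName
1       12      0       AS0989  123.bmp
1       12      1       AS0990  124.bmp
1       12      2       AS0991  NULL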
August 9, 2010 at 12:35 pm
Thanks Kingston.
I was mulling over the use of RANK and PARTITION BY; the only thing that held me back was the number of columns I would have to put in the PARTITION BY clause, 60 in an extreme case. I had simplified the example and should have mentioned this. :blush:
Do you still think RANK is the way to go? Will there be any performance issues (the highest number of rows I envision processing this way is around 20K)?
If this is not ideal, what other options are there?
Thanks again for the help.
August 9, 2010 at 12:44 pm
If you have 60 columns in a table, there are likely to be other issues with your design. All the same, if there is some reason why the table needs to be designed that way, ROW_NUMBER should be as fast as or faster than any other way to achieve the same functionality.
August 9, 2010 at 1:08 pm
Thank you for weighing in on the idea.
The 60 columns will never be in a single table; they are spread across 6 tables, with one having 5 columns and the biggest having 14. To get rid of the duplicates I'm trying to make a flat view of all the data that has been parsed. To explain the "why" of this approach would require a whole lot of explaining, and there's still a chance I might not be able to convey the idea. Suffice it to say that the source files were never written with an understanding that those log files could be used by an automated process to parse and present them. The authors presumed it would always be human eyes trained "in" the project. :doze:
If ROW_NUMBER is the only way out, I will take that route. But before committing to it, wouldn't it be nice to tap into the vast intelligence out there? 🙂
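For what it's worth, here is a minimal sketch of how the ROW_NUMBER pattern scales to the many-column case, assuming a hypothetical flat view named vw_ParsedData with columns col01 through col60 plus loadId (all of these names are made up for illustration, and the elided middle columns would have to be typed out in full):
select *
from (
    select row_number() over ( partition by col01, col02, /* ...col03 through col59... */ col60
                               order by loadId ) as RowNum, *
    from vw_ParsedData
) t
where RowNum = 1
The partition list is long but mechanical; it is the same column list a DISTINCT or GROUP BY would need, so the width of the view does not change the shape of the query.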
August 9, 2010 at 4:33 pm
select dataId, code, imageName, min(loadid)
from @t1
group by dataId, code, imageName
Scott Pletcher, SQL Server MVP 2008-2010
August 10, 2010 at 9:19 am
scott.pletcher (8/9/2010)
select dataId, code, imageName, min(loadid)
from @t1
group by dataId, code, imageName
I have to admit it didn't occur to me that grouping would work.
I would assume that since the GROUP BY clause doesn't involve any computation, it should be faster in extreme cases than the RANK function. Is that right?
The execution plan doesn't show any cost attached to the Compute Scalar, Segment, Sort, Top, and Sequence Project operators (in the RANK function query). Does that give much of an idea of which approach to take?
August 10, 2010 at 9:22 am
I would think GROUP BY would have to be faster than ROW_NUMBER(), since SQL doesn't have to partition and generate row numbers, but it would be interesting to compare the query plans and execution results.
Scott Pletcher, SQL Server MVP 2008-2010
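One straightforward way to get those numbers, using the sample table from the first post (@t1 is a table variable, so the DECLARE/INSERT and the two queries below must run in the same batch); SET STATISTICS TIME ON reports parse/compile and execution times for each statement:
set statistics time on;
-- GROUP BY version
select dataId, code, imageName, min(loadId) as loadId
from @t1
group by dataId, code, imageName;
-- ROW_NUMBER version
select loadId, dataId, code, imageName
from (
    select row_number() over ( partition by dataId, code, imageName
                               order by loadId ) as RowNum, *
    from @t1
) t
where RowNum = 1;
set statistics time off;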
August 10, 2010 at 10:07 am
For sample data with 60 rows and 31 columns:
Time Statistics (ms)            Group By    Rank
Client processing time                63     188
Total execution time                 453    1156
Wait time on server replies          390     968
This seems to me indisputable: GROUP BY should be the way to go.
The execution plan, however, didn't show any difference in cost percentage.
August 10, 2010 at 11:05 am
60 rows is probably not a large enough sample to show a performance effect.
Scott Pletcher, SQL Server MVP 2008-2010
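A small sample is easy to fix; here is a sketch for generating a test set of roughly the 20K rows mentioned earlier (this assumes SQL Server 2008 or later for the VALUES row constructor, and uses a cross join of sys.all_objects purely as a convenient row source; any numbers table would do):
declare @t1 table (loadId int, dataId int, Code varchar(50), imageName varchar(255));
insert into @t1 (loadId, dataId, Code, imageName)
select loads.loadId,
       nums.n,
       'AS' + right('00000' + cast(nums.n as varchar(10)), 5),
       cast(nums.n as varchar(10)) + '.bmp'
from (values (12), (13)) as loads(loadId)
cross join (
    select top (10000) row_number() over (order by (select null)) as n
    from sys.all_objects a
    cross join sys.all_objects b
) as nums;
-- 2 loadIds x 10000 dataIds = 20,000 rows, with every dataId duplicated
-- across both loadIds, mirroring the duplicate pattern in the original post.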