August 9, 2010 at 11:53 am
Here is my sample data:
declare @t1 table (loadId int, dataId int, Code varchar(50), imageName varchar(255))
insert into @t1
select 12, 0, 'AS0989', '123.bmp'
union all
select 13, 0, 'AS0989', '123.bmp'
union all
select 12, 1, 'AS0990', '124.bmp'
union all
select 13, 1, 'AS0990', '124.bmp'
union all
select 12, 2, 'AS0991', null
union all
select 13, 2, 'AS0991', null
select distinct dataId, code, imageName
from @t1
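For reference, against the sample data above that DISTINCT query returns three rows (row order may vary); the missing piece is a loadId column alongside them:
dataId  code    imageName
0       AS0989  123.bmp
1       AS0990  124.bmp
2       AS0991  NULL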
The data is the result of duplicates coming from different files (the files cannot be modified).
As you can see, with a DISTINCT I get the values I want. What I additionally need to fetch is the "loadId". It could be either "12" or "13"; it doesn't matter which.
I am unable to figure out a way to do that in a single step. Any ideas?
Thanks.
August 9, 2010 at 12:05 pm
You can use ROW_NUMBER() for this:
SELECT *
FROM(
SELECT ROW_NUMBER() OVER ( PARTITION BY dataId, code, imageName ORDER BY loadId ) RowNum, *
FROM @t1
) T1
WHERE RowNum = 1
How to post data/code on a forum to get the best help - Jeff Moden
http://www.sqlservercentral.com/articles/Best+Practices/61537/
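Against the sample data above, the subquery numbers the rows within each (dataId, code, imageName) group, and ORDER BY loadId assigns RowNum = 1 to the smallest loadId in each group, so the outer filter keeps the loadId 12 rows:
RowNum  loadId  dataId  code    imageName
1       12      0       AS0989  123.bmp
1       12      1       AS0990  124.bmp
1       12      2       AS0991  NULL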
August 9, 2010 at 12:35 pm
Thanks Kingston.
I was mulling over the use of RANK and PARTITION BY; the only thing that held me back was the number of columns I would have to put in the PARTITION BY clause, 60 in an extreme case. I had simplified the example and should have mentioned this. :blush:
Do you still think RANK is the way to go? Will there be any performance issues (the highest number of rows I envision processing this way is around 20K)?
If this is not ideal, what other options are there?
Thanks again for the help.
August 9, 2010 at 12:44 pm
If you have 60 columns in a table, there are likely to be other issues with your design. All the same, if there is some reason why the table needs to be designed that way, ROW_NUMBER should be as fast as or faster than any other way to achieve the same functionality.
August 9, 2010 at 1:08 pm
Thank you for weighing in on the idea.
The 60 columns will never be in a single table; they are spread across 6 tables, with one having 5 columns and the biggest having 14. To get rid of the duplicates I'm trying to make a flat view of all the data that has been parsed. To explain the "why" of this approach would require a whole lot of explaining, and there's still a chance I might not be able to convey the idea. Suffice it to say that the source files were never written with an understanding that those log files could be used by an automated process to parse and present them. The authors presumed it would always be human eyes trained "in" the project. :doze:
If ROW_NUMBER is the only way out, I will take that route. But before committing to it, wouldn't it be nice to tap into the vast intelligence out there? 🙂
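For what it's worth, here is a minimal sketch of how the ROW_NUMBER pattern scales to the many-column case, assuming a hypothetical flat view named vw_ParsedData with columns col01 through col60 plus loadId (all of these names are made up for illustration, and the elided middle columns would have to be typed out in full):
select *
from (
    select row_number() over ( partition by col01, col02, /* ...col03 through col59... */ col60
                               order by loadId ) as RowNum, *
    from vw_ParsedData
) t
where RowNum = 1
The partition list is long but mechanical; it is the same column list a DISTINCT or GROUP BY would need, so the width of the view does not change the shape of the query.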
August 9, 2010 at 4:33 pm
select dataId, code, imageName, min(loadid)
from @t1
group by dataId, code, imageName
Scott Pletcher, SQL Server MVP 2008-2010
August 10, 2010 at 9:19 am
scott.pletcher (8/9/2010)
select dataId, code, imageName, min(loadid)
from @t1
group by dataId, code, imageName
I have to admit it didn't occur to me that grouping would work.
I would assume that since the GROUP BY clause doesn't involve any computation, it should be faster in extreme cases than the RANK function. Is that right?
The execution plan doesn't show any cost attached to the Compute Scalar, Segment, Sort, Top, and Sequence Project operators (in the RANK function query). Does that give much of an idea of which approach to take?
August 10, 2010 at 9:22 am
I would think GROUP BY would have to be faster than ROW_NUMBER(), since SQL doesn't have to partition and generate row numbers, but it would be interesting to compare the query plans and execution results.
Scott Pletcher, SQL Server MVP 2008-2010
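One straightforward way to get those numbers, using the sample table from the first post (@t1 is a table variable, so the DECLARE/INSERT and the two queries below must run in the same batch); SET STATISTICS TIME ON reports parse/compile and execution times for each statement:
set statistics time on;
-- GROUP BY version
select dataId, code, imageName, min(loadId) as loadId
from @t1
group by dataId, code, imageName;
-- ROW_NUMBER version
select loadId, dataId, code, imageName
from (
    select row_number() over ( partition by dataId, code, imageName
                               order by loadId ) as RowNum, *
    from @t1
) t
where RowNum = 1;
set statistics time off;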
August 10, 2010 at 10:07 am
For sample data with 60 rows and 31 columns:
Time Statistics (ms)            Group By    Rank
Client processing time                63     188
Total execution time                 453    1156
Wait time on server replies          390     968
This seems to me indisputable: GROUP BY should be the way to go.
The execution plan, however, didn't show any difference in cost percentage.
August 10, 2010 at 11:05 am
60 rows is probably not a large enough sample to show a performance effect.
Scott Pletcher, SQL Server MVP 2008-2010
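A small sample is easy to fix; here is a sketch for generating a test set of roughly the 20K rows mentioned earlier (this assumes SQL Server 2008 or later for the VALUES row constructor, and uses a cross join of sys.all_objects purely as a convenient row source; any numbers table would do):
declare @t1 table (loadId int, dataId int, Code varchar(50), imageName varchar(255));
insert into @t1 (loadId, dataId, Code, imageName)
select loads.loadId,
       nums.n,
       'AS' + right('00000' + cast(nums.n as varchar(10)), 5),
       cast(nums.n as varchar(10)) + '.bmp'
from (values (12), (13)) as loads(loadId)
cross join (
    select top (10000) row_number() over (order by (select null)) as n
    from sys.all_objects a
    cross join sys.all_objects b
) as nums;
-- 2 loadIds x 10000 dataIds = 20,000 rows, with every dataId duplicated
-- across both loadIds, mirroring the duplicate pattern in the original post.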