December 1, 2008 at 2:49 am
Sample code is below.
What I want to do is list the top 2 records that have the same name for a given batch, so if there are say five records marked "RED" I want to see two of them, if there is only one "PURPLE" I want to see that as well.
I've got the query working using row_number(), but I'm not sure this is the most efficient way ? I need to return the whole record (including unique columns), hence not using group.
This will eventually be used on fairly large sets of data - maybe a couple of million rows, although its more likely to be used on far smaller amounts of data on a regular basis.
Any suggestions ?
thanks
-- Create table
if exists ( select 1 from SYSOBJECTS where NAME = 'TOPTEN' and TYPE = 'U')
drop table TOPTEN
go
create table TOPTEN
(
ID int not null primary key clustered,
NAME varchar(10),
BATCH int
)
-- create data
insert into TOPTEN (ID, NAME, BATCH)
select 3456, 'BLUE', 10 union all
select 91, 'YELLOW', 10 union all
select 3, 'RED', 10 union all
select 278, 'BLUE', 10 union all
select 69, 'YELLOW', 10 union all
select 10, 'RED', 10 union all
select 560, 'BLUE', 10 union all
select 567, 'YELLOW', 10 union all
select 12, 'RED', 10 union all
select 66, 'BLACK', 10 union all
select 1235, 'BLACK', 10 union all
select 9, 'PURPLE', 10
-- Return all colours in a batch, but do not return more than two of the same
select * from
(
select *, row_number() over (partition by NAME order by NAME desc) as counter from TOPTEN where BATCH = 10
) pis
where pis.Counter < 3
order by pis.NAME
December 1, 2008 at 3:11 am
Eight seconds for around 12 million (simple) rows looks good to me:
[font="Courier New"]-- Return all colours in a batch, but do not return more than two of the same
SELECT * FROM
(
SELECT *, row_number() OVER (partition BY NAME ORDER BY NAME DESC) AS counter
FROM TOPTEN a
CROSS JOIN Numbers b WHERE BATCH = 10 AND number < 1000000
) pis
WHERE pis.Counter < 3
ORDER BY pis.NAME
-- 8 seconds
;WITH pis AS (SELECT *, row_number() OVER (partition BY NAME ORDER BY NAME DESC) AS counter
FROM TOPTEN a
CROSS JOIN Numbers b
WHERE BATCH = 10 AND number < 1000000)
SELECT * FROM pis WHERE Counter < 3
-- 8 seconds
[/font]
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 1, 2008 at 3:53 am
It must be faster. If not compare the execution paln by applying these
http://sqlblogcasts.com/blogs/madhivanan/archive/2008/09/12/return-top-n-rows.aspx
Failing to plan is Planning to fail
December 1, 2008 at 4:08 am
Madhivanan (12/1/2008)
It must be faster. If not compare the execution paln by applying thesehttp://sqlblogcasts.com/blogs/madhivanan/archive/2008/09/12/return-top-n-rows.aspx
Only the third method would work - and it's exactly the same as the OP's suggestion.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 1, 2008 at 7:01 am
I'd suggest using the TOP and an ORDER BY within the sub-query instead. Assuming there is an index for the ORDER BY clause to use, I think you'll see better performance for this. If you can get ahold of the latest copy of the SQL Standard (discontinued as of this month), I wrote an article comparing TOP, MAX & ROW_NUMBER for exactly this type of query. TOP works a bit better, especially with large data sets.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 1, 2008 at 7:45 am
Grant Fritchey (12/1/2008)
I'd suggest using the TOP and an ORDER BY within the sub-query instead. Assuming there is an index for the ORDER BY clause to use, I think you'll see better performance for this. If you can get ahold of the latest copy of the SQL Standard (discontinued as of this month), I wrote an article comparing TOP, MAX & ROW_NUMBER for exactly this type of query. TOP works a bit better, especially with large data sets.
Thanks for the tip, Grant, Ill try and get hold of that article. Do you mean this...
SELECT o.*
FROM #TOPTEN AS o
WHERE BATCH = 10
AND [ID] IN
(SELECT TOP 2 [ID]
FROM #TOPTEN
WHERE [NAME]=o.[NAME] AND BATCH = 10
ORDER BY [ID] DESC)
ORDER BY o.[NAME]
...which is the same as Madhivanan's Option 1?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 1, 2008 at 7:52 am
Thanks for all the replies,
I will give the sub-query method a go.
Grant I will try and hunt down a copy of your article, thanks for the tip.
December 1, 2008 at 8:10 am
Chris Morris (12/1/2008)
Thanks for the tip, Grant, Ill try and get hold of that article. Do you mean this...
SELECT o.*
FROM #TOPTEN AS o
WHERE BATCH = 10
AND [ID] IN
(SELECT TOP 2 [ID]
FROM #TOPTEN
WHERE [NAME]=o.[NAME] AND BATCH = 10
ORDER BY [ID] DESC)
ORDER BY o.[NAME]
...which is the same as Madhivanan's Option 1?
Yeah, that's basically it. In most of our instances, we're looking for the latest version of a record, so we're going for one value in each set, not a list. The IN clause may change this negatively.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 1, 2008 at 8:17 am
Grant Fritchey (12/1/2008)
Yeah, that's basically it. In most of our instances, we're looking for the latest version of a record, so we're going for one value in each set, not a list. The IN clause may change this negatively.
Well well. You'd never guess that this construct would be performant. Shows the importance of testing. Thanks!
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 1, 2008 at 8:44 am
Well now, our methods worked, but they were predicated off of joins and a single row out of each set, not multiple rows out of each set. I still think the IN construct might be a pain. However, my testing showed that with smaller data sets the ROW_NUMBER method worked fine, but it started breaking down at larger data sets. MAX worked OK, but frequently not as well as TOP.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 2, 2008 at 9:44 am
Chris Morris (12/1/2008)
Madhivanan (12/1/2008)
It must be faster. If not compare the execution paln by applying thesehttp://sqlblogcasts.com/blogs/madhivanan/archive/2008/09/12/return-top-n-rows.aspx
Only the third method would work - and it's exactly the same as the OP's suggestion.
Apologies, Madhivanan; not only would your third method work but your first method would also be appropriate for the task. My mistake.:blush:
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply