Top N plus

Question

Post reply

Top N plus

Viewing 15 posts - 16 through 30 (of 41 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

Ranking the whole result set is probably your only option.

One last idea, have you tried:

select top 5 ...

from ...

order by ranking desc

union

select ...

from ...

where product = @param

So the first query just gets the top N, and the Union gets the one you want?

I'm not sure that would solve anything, but have you tried it?

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 2

You're missing the point I'm trying to make: the two are related. By not giving it something to work with that it "likes" (i.e. appropriate indexing, etc...), and by requiring the RANK() function over the 1.2M to just to get the rankings, it is going to force a scan of all 1.2M rows, just to find the distinct values and create said ranks.

Depending on how the view is set up - it's very possible that it by itself is forcing the entire view to be materialized just so a really small subset can be used, which is yet another spot to look at.

Try this on:

Use the Top(N) with TIES to get you the records you need, and THEN create the rankings on the results of the Top(N).

such as

declare @g datetime;

select @g=getdate();

select RANK() OVER (ORDER BY orderamount DESC) rnk,

*

from (

select top(10) WITH ties

*

from pivottable

Order by orderamount DESC

) pvt;

select datediff(ms,@g,getdate());

In this way - at least you're running the RANK() on a much smaller subset, versus creating a hash table that might contain 1.2Million values that then need to be ordered at run-time.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 3

You're missing the point I'm trying to make: the two are related. By not giving it something to work with that it "likes" (i.e. appropriate indexing, etc...), and by requiring the RANK() function over the 1.2M to just to get the rankings, it is going to force a scan of all 1.2M rows, just to find the distinct values and create said ranks.
...snip...
Use the Top(N) with TIES to get you the records you need, and THEN create the rankings on the results of the Top(N).
...snip...

I understand your point, but you missed mine. The RANK() is already based on the dataset corresponding to the given criteria, not all records in the table(s). So if someone provides criteria as size=mid, engine=v6, qtr=3Q 07 they'll get a different rank for Toyota Camry over 3yr_lease_value than they would for engine=4cyl, qtr=3Q 07 ranked over 2yr_lease_value. TOP (99) WITH TIES will not include Toyota Camry if its rank is > 99. Also nearly all rankable metrics are summarizations or calculations that include business rules so the only way to sort/order/rank is to calculate the metric.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 4

I did catch that, and I'm not denying in any way that your metric is required to get to the top 10, 15, 20, etc... choices they might want. What I'm trying to impress upon you is that once you have calculated the personalized rating or metric as you call it, using the Rank() function against the TOP(N) with TIES subquery will return your results faster than the RANK() function against the whole set, since the top(N) is an ordering function, and not an aggregation-type function. And you'll still end up with the same results - just quite a bit faster.

In other words - if you replicated the query above , and replaced "orderamount" with the name of your calculated metric, by itself you should see a noticeable difference in efficiency in reads and execution time. RANK() is pretty much always going to go out an build a hash table of all of the values, so it's not going to be good for performance against a large set.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 5

... using the Rank() function against the TOP(N) with TIES subquery will return your results faster than the RANK() function against the whole set, since the top(N) is an ordering function, and not an aggregation-type function. And you'll still end up with the same results - just quite a bit faster.

No it won't be any faster because RANK() can only be calculated on the entire candidate data set (what's identified by the where clause). If the where clause boils down to 2000 rows, it's going to calculate RANK() for 2000 items; if its 2M rows, all 2M will be ranked.

But maybe I'm wrong. So, here's a simple example.

create table #z ( model varchar(20), metric1 int, metric2 int )
insert into #z values ('honda accord', 1540, 200 )
insert into #z values ('honda civic', 1530, 250 )
insert into #z values ('chevy malibu', 1520, 26 )
insert into #z values ('toyota camry', 1250, 27 )
insert into #z values ('toyota avalon', 1350, 30 )
insert into #z values ('toyota rav4', 1450, 40 )
insert into #z values ('honda pilot', 150, 50 )
insert into #z values ('honda odyssey', 550, 55 )
insert into #z values ('chevy suburban', 750, 600 )
insert into #z values ('pontiac gto', 850, 100 )
insert into #z values ('toyota land cruiser', 950, 300 )
insert into #z values ('pontiac g6', 650, 155 )
insert into #z values ('ford fusion', 250, 255 )
insert into #z values ('ford edge', 440, 270 )
insert into #z values ('mazda 3', 430, 300 )
insert into #z values ('mazda 6', 180, 205 )
insert into #z values ('mazda rx7', 70, 200 )
insert into #z values ('kia spectra', 90, 205 )
insert into #z values ('hyundai tucson', 80, 25 )
insert into #z values ('nissan murano', 110, 50 )
insert into #z values ('nissan altima', 130, 55 )

Toyota Camry is #6 when ranked by metic1, but #19 when ranked by metric2. What's the query to produce this top 10 by metric1

honda accord15401

honda civic15302

chevy malibu15203

toyota rav414504

toyota avalon13505

toyota camry12506

toyota land cruiser9507

pontiac gto8508

chevy suburban7509

pontiac g665010

and this top 10 by metric 2 with Toyota Camry included in both?

chevy suburban6001

toyota land cruiser3002

mazda 33002

ford edge2704

ford fusion2555

honda civic2506

mazda 62057

kia spectra2057

mazda rx72009

honda accord2009

toyota camry2719

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 6

ACtually - no. RANK just needs to be run over the top N and in everything from "camry" upwards. Everything else is worthless.

with that in mind:

declare @g datetime;

--create index ix_pivot4 on testpivot(orderamount) include(coid,prodid)

declare @camryID int;

declare @camryamount int;

select @camryID=5466326; --#50000 in the ranking

--select @camryID=5466326; --#2415667 in the ranking

select @g=getdate();

select @camryamount=orderamount from testpivot where rid=@camryID

select *

from (

select Top(20) RANK() over (order by orderamount desc) rnk,

*

from (

select top(20) with ties

coid,prodid,orderamount,rid

fromtestpivot

wherecoid between 1 and 5

and prodid between 1 and 40

order by orderamount desc

) pvt

order by orderamount desc

) topN

UNION

select * from

(select Top 1 *

from(

select RANK() over (order by orderamount desc) rnk,

coid,prodid,orderamount,rid

fromtestpivot

where(orderamount>@camryamount

and coid between 1 and 10

and prodid between 1 and 40)

or rid=@camryID

) camry

order by orderamount) c;

select datediff(ms,@g,getdate())

dbcc freeproccache

---and now - for the RANK solution

select @g=getdate();

;With RANKCTE as (

select RANK() over (order by orderamount desc) rnk,

coid,prodid,orderamount,rid

fromtestpivot

wherecoid between 1 and 10

and prodid between 1 and 40)

select * from RANKCTE

where rnk<=20

UNION

select * from RANKCTE where rid=@camryid

order by orderamount desc;

select datediff(ms,@g,getdate())

Results:

WITH the index in place

the TOP(n) syntax : from 190ms to 19453ms

the CTE: 21780 to 24570

WITHOUT any helper index in place

the TOP(n) syntax : from 14760ms to 37516ms

the CTE: 71733 to 78943

(variations seem to have to do with how far down on the list the "camry" is).

So - yes it's a little bit more to write, but you can outpace RANK() fairly dramatically if you should so desire.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 7

This will give you the result you seek.

set nocount on

declare @z table ( model varchar(20), metric1 int, metric2 int )

insert into @z values ('honda accord', 1540, 200 )

insert into @z values ('honda civic', 1530, 250 )

insert into @z values ('chevy malibu', 1520, 26 )

insert into @z values ('toyota camry', 1250, 27 )

insert into @z values ('toyota avalon', 1350, 30 )

insert into @z values ('toyota rav4', 1450, 40 )

insert into @z values ('honda pilot', 150, 50 )

insert into @z values ('honda odyssey', 550, 55 )

insert into @z values ('chevy suburban', 750, 600 )

insert into @z values ('pontiac gto', 850, 100 )

insert into @z values ('toyota land cruiser', 950, 300 )

insert into @z values ('pontiac g6', 650, 155 )

insert into @z values ('ford fusion', 250, 255 )

insert into @z values ('ford edge', 440, 270 )

insert into @z values ('mazda 3', 430, 300 )

insert into @z values ('mazda 6', 180, 205 )

insert into @z values ('mazda rx7', 70, 200 )

insert into @z values ('kia spectra', 90, 205 )

insert into @z values ('hyundai tucson', 80, 25 )

insert into @z values ('nissan murano', 110, 50 )

insert into @z values ('nissan altima', 130, 55 )

;WITH MyCTE1 (Model, Row, Ranking, Metric1, Metric2)

AS (

SELECTModel,

ROW_NUMBER() OVER (ORDER BY Metric1 DESC),

RANK() OVER (ORDER BY Metric1 DESC),

Metric1,

Metric2

FROM@z

)

-- Show the expected result

SELECTTOP 10 WITH TIES

Ranking,

Model,

Metric1,

Metric2

FROMMyCTE1

WHERERow <= 10

OR Model = 'Toyota Camry'

ORDER BYCASE WHEN Model = 'Toyota Camry' AND Row > 10 THEN 10 ELSE Row END

;WITH MyCTE2 (Model, Row, Ranking, Metric1, Metric2)

AS (

SELECTModel,

ROW_NUMBER() OVER (ORDER BY Metric2 DESC),

RANK() OVER (ORDER BY Metric2 DESC),

Metric1,

Metric2

FROM@z

)

-- Show the expected result

SELECTTOP 10 WITH TIES

Ranking,

Model,

Metric1,

Metric2

FROMMyCTE2

WHERERow <= 10

OR Model = 'Toyota Camry'

ORDER BYCASE WHEN Model = 'Toyota Camry' AND Row > 10 THEN 10 ELSE Row END

Could you please timetest it and post back times here?

N 56°04'39.16"
E 12°55'05.25"

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 8

peso: that's the CTE version that was discussed earlier. it performs the same as the temp table version if the expression is only referred to once.

as i said before, the bulk of the time is calculating the rank. getting the top N results is a breeze (only 70ms to get them out of a temp table). i don't think there is a way to speed up a rank()ed query.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 9

Notice that my latest calls your Temp table solution, the CTE solution.

You really should give some of this a whirl. Running RANK() to the whole set just to get 10 or 11 records back is ultimately a waste. It's the same as running a count to check for existence - you're using a bazooka to swat a fly.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 10

Here is a fast algorithm. Please notice that NO INDEXES AT ALL is present.

SET NOCOUNT ON

-- Create 100000 sample records

CREATE TABLE#Sample

(

Model CHAR(20) NOT NULL,

Metric1 INT,

Metric2 INT

)

INSERT#Sample

(

Model,

Metric1,

Metric2

)

SELECTLEFT(REPLACE(CAST(NEWID() AS CHAR(36)), '-', ''), 20),

ABS(CHECKSUM(NEWID())) % 600000,

ABS(CHECKSUM(NEWID())) % 600000

FROMmaster..spt_values AS v1

INNER JOINmaster..spt_values AS v2 ON v2.Type = 'p'

WHEREv1.Type = 'p'

AND v1.Number < 100

AND v1.Number < 1000

-- Mimic user supplied parameter

DECLARE@Model VARCHAR(20)

SELECT TOP 1@Model = Model

FROM#Sample

ORDER BYNEWID()

-- Timetest staging table

DBCC FREEPROCCACHE

DBCC DROPCLEANBUFFERS

DECLARE @Time DATETIME

SET @Time = GETDATE()

-- Stage query result

DECLARE@Stage TABLE (Model VARCHAR(20), Metric INT, Ranking INT)

INSERT@Stage

(

Model,

Metric

)

SELECT TOP 15Model,

Metric1

FROM#Sample

ORDER BYMetric1 DESC

UPDATEs

SETs.Ranking = 1 + (SELECT COUNT(*) FROM @Stage AS x WHERE x.Metric > s.Metric)

FROM@Stage AS s

IF NOT EXISTS (SELECT * FROM @Stage WHERE Model = @Model)

BEGIN

INSERT@Stage

(

Model,

Metric

)

SELECTModel,

Metric1

FROM#Sample

WHEREModel = @Model

UPDATEs

SETs.Ranking = 1 + (SELECT COUNT(*) FROM #Sample AS x WHERE x.Metric1 > s.Metric)

FROM@Stage AS s

WHEREs.Model = @Model

END

SELECT*

FROM@Stage

-- Average at 80 ms

SELECT DATEDIFF(MILLISECOND, @Time, GETDATE())

-- Timetest CTE

DBCC FREEPROCCACHE

DBCC DROPCLEANBUFFERS

SET @Time = GETDATE()

;WITH MyCTE (Model, Row, Ranking, Metric1)

AS (

SELECTModel,

ROW_NUMBER() OVER (ORDER BY Metric1 DESC),

RANK() OVER (ORDER BY Metric2 DESC),

Metric1

FROM#Sample

)

SELECTTOP 15 WITH TIES

Ranking,

Model,

Metric1

FROMMyCTE

WHERERow <= 15

OR Model = @Model

ORDER BYCASE WHEN Model = @Model AND Row > 15 THEN 15 ELSE Row END

-- Average at 983 ms

SELECT DATEDIFF(MILLISECOND, @Time, GETDATE())

N 56°04'39.16"
E 12°55'05.25"

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 11

Statistics IO for

"Staging algorithm"

Table '#2180FB33'. Scan count 0, logical reads 15, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#Sample_____________________________________________________________________________________________________________00000000001C'. Scan count 1, logical reads 940, physical reads 0, read-ahead reads 30, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#2180FB33'. Scan count 16, logical reads 31, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 1, logical reads 33, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#2180FB33'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

"CTE algorithm"

Table '#Sample_____________________________________________________________________________________________________________00000000001C'. Scan count 3, logical reads 940, physical reads 0, read-ahead reads 30, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

N 56°04'39.16"
E 12°55'05.25"

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 12

Matt Miller (2/5/2008)
Notice that my latest calls your Temp table solution, the CTE solution.
You really should give some of this a whirl. Running RANK() to the whole set just to get 10 or 11 records back is ultimately a waste. It's the same as running a count to check for existence - you're using a bazooka to swat a fly.

Hey guys, thanks for all your suggestions. I was hoping we'd find some syntax or technique that could produce a rank() and allow the query engine to skip/ignore some of the candidate rows, but it looks like that's not possible.

Ranking is extremely valuable for certain uses. Raw figures aren't that useful unless they're put into context with the competition. For example, if sales dropped 15% last month, the first thing a sales manager might do is run a ranking report to figure out where they are relative to everyone else. So if their rank hasn't changed, there's reason to panic (since everyone else's sales dropped also). Top N Plus will let them see the top performers and themselves even if they aren't in the top cut and avoid having to run two seperate reports. This is valuable since once you go past the top 5 or 10, everyone is bunched up very closely.

Ranking also let's them put a spin on numbers and find out where their strengths are. So they've also asked for reports that will locate the categories in which they're ranked N or higher (we're #3 in the southeast and #2 in Michigan) and alerts that will notify them if their category rank changes.

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 13

antonio.collins (2/5/2008)
Hey guys, thanks for all your suggestions. I was hoping we'd find some syntax or technique that could produce a rank() and allow the query engine to skip/ignore some of the candidate rows, but it looks like that's not possible.

Huh?

What about the algorithm posted by me 34 minutes earlier?

http://www.sqlservercentral.com/Forums/FindPost451626.aspx

N 56°04'39.16"
E 12°55'05.25"

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 14

peso: you're trying to speed up the selection of items from the cte/temp table and that's not the issue. even in my original temp table approach, it took 150s to load the temp table and only 70ms to get the proper data out of it.

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 15

Well, in my temp table algorithm above it takes an average of 80 milliseconds to fill the temptable and get the rankings calculated...

Are you sure we are talking about the same thing?

In my previous post, the population of #Sample table is just to mimic your environment, to get the 100000 records to work with.

Have you checked the code below that?

From the "Timetest staging" line to before "Timetest CTE" ?

CREATE PROCEDURE dbo.uspGetMyModels

(

@Top TINYINT,

@Model VARCHAR(20)

)

AS

SET NOCOUNT ON

-- Stage query result

DECLARE @Stage TABLE

(

Model VARCHAR(20),

Metric INT,

Ranking INT

)

-- Populate stagin table

INSERT@Stage

(

Model,

Metric

)

SELECTTOP (@Top)

Model,

Metric1

FROM{YourTableNameHere}

ORDER BYMetric1 DESC

-- Update ranking for top n records

UPDATEs

SETs.Ranking = 1 + (SELECT COUNT(*) FROM @Stage AS x WHERE x.Metric > s.Metric)

FROM@Stage AS s

-- If user supplied model is not among the top n records, add it

IF NOT EXISTS (SELECT * FROM @Stage WHERE Model = @Model)

BEGIN

INSERT@Stage

(

Model,

Metric

)

SELECTModel,

Metric1

FROM{YourTableNameHere}

WHEREModel = @Model

UPDATEs

SETs.Ranking = 1 + (SELECT COUNT(*) FROM {YourTableNameHere} AS x WHERE x.Metric1 > s.Metric)

FROM@Stage AS s

WHEREs.Model = @Model

END

-- Show the expected result

SELECTModel,

Metric

FROM@Stage

N 56°04'39.16"
E 12°55'05.25"