Returning the Top X row for each group

  • Thanks much for the info. I tried both versions on production data and found the row_number to be much faster, which took a second to run compared to 14 seconds to run using cross apply.

    Account table has 121,000 records and call has 651,000 records. Due to account type clause there are only 8,000 acount and 98,000 call records affected. I wonder why the results are so different, acctid is primary key and index does exist on callactid.

    14 seconds

    select acctservareaid, acctid, acctname, acctstorenumber, callconfirmation, callaccttypeid, work_order

    from account with (nolock)

    cross apply (

    select top 1 calltime, callconfirmation, callaccttypeid, callcokeagent + callpo work_order

    from call with (nolock)

    where callacctid = acctid

    and callaccttypeid = 'c'

    order by calltime desc) as call

    where accttypeid = 'c'

    1 second

    with cteCall

    as

    (

    select acctservareaid, acctid, acctname, acctstorenumber, calltime, callconfirmation, callaccttypeid, callcokeagent + callpo work_order

    ,row_number() over (partition by AcctID order by CallTime desc ) as RowN

    from call with (nolock)

    inner join account with (nolock) on callacctid = acctid

    and callaccttypeid = 'c'

    and accttypeid = 'c'

    )

    Select * from cteCall

    where RowN <=1

    order by AcctID, CallTime Desc

  • Idea Deadbeat (12/6/2010)


    Not directly related to article - but - an alternative to spt_values:

    create table #Number (number int);

    with N4000 as (select 0 as Number union all select Number+1 from N4000 where Number <4000

    )insert into #number select * from N4000 option (MAXRECURSION 4000);

    create index ix_N on #Number (Number);

    I know you're only building it once but I'd never use the hidden RBAR of a recursive CTE to build an on-the-fly Tally Table for no other reason than someone may copy the code to do something larger and really end up with a performance problem.

    I'd also use a clustered index on the table with a FILLFACTOR = 100 just in case someone wants to use the code on a server where the default FILLFACTOR has been changed to something other than 0 (100).

    Here's what I mean... the following code includes several different methods including the original recursive CTE. Yeah... I even threw in a WHILE loop because it's a big surprise for a lot of folks when compared to a recursive CTE: (Didn't include index builds because they're all the same)

    --===== Drop Tables ===========================================================================

    DROP TABLE #Number, #Number1, #Number2, #Tally, #Tally1;

    SET NOCOUNT ON;

    DBCC FREEPROCCACHE;

    WAITFOR DELAY '00:00:05'; --Let the system "settle" for several seconds

    GO

    --===== Recursive CTE w/Predefined Table (original)

    create table #Number (number int);

    with N4000 as (select 1 as Number union all select Number+1 from N4000 where Number <4000

    )insert into #number select * from N4000 option (MAXRECURSION 4000);

    GO

    --===== Recursive CTE w/SELECT/INTO

    with N4000 as (select ISNULL(1,0) as Number union all select Number+1 from N4000 where Number <4000

    )select * INTO #Number1 from N4000 option (MAXRECURSION 4000);

    GO

    --===== WHILE Loop

    CREATE TABLE #Number2 (Number INT NOT NULL);

    DECLARE @Counter INT;

    SELECT @Counter = 1;

    BEGIN TRANSACTION;

    WHILE @Counter <= 4000

    BEGIN

    INSERT INTO #Number2

    (Number)

    SELECT @Counter;

    set @Counter = @Counter + 1;

    END;

    COMMIT;

    GO

    --===== Cross Join

    SELECT TOP 4000

    N = IDENTITY(INT,1,1)

    INTO #Tally

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    GO

    --===== Modified Itzek Method

    WITH

    E1(N) AS ( --=== Create Ten 1's

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 --10

    ),

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000

    cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT N)) FROM E4)

    SELECT N = ISNULL(N,0)

    INTO #Tally1

    FROM cteTally

    WHERE N <= 4000

    ;

    And, here are the results from SQL Profiler. Even the WHILE loop beat the recursive CTE! 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • swoveland21 (12/6/2010)

    Account table has 121,000 records and call has 651,000 records. Due to account type clause there are only 8,000 acount and 98,000 call records affected. I wonder why the results are so different, acctid is primary key and index does exist on callactid.

    Without seeing the plan its hard to say for definite , but as you require a large portion of the data (651K / 121 K = 20% roughly assuming that each account has 1 call ) , then the single scan through the call table is more efficient that the seeks to find the first call. In my example we were required to find a tiny portion on the data 1mill / 100 = 0.01% .



    Clear Sky SQL
    My Blog[/url]

  • Ok , so how could you achieve this in SQL Server 2000? Again, just need the top x from each group ordered by the ID field.

    Thanks

  • Tim Widdup (12/9/2010)


    Ok , so how could you achieve this in SQL Server 2000? Again, just need the top x from each group ordered by the ID field.

    Thanks

    You will have to loop or cursor through the data. A "quirky update" may be possible, cant remember if possible in 2000 , sorry.



    Clear Sky SQL
    My Blog[/url]

  • Dave Ballantyne (12/9/2010)


    Tim Widdup (12/9/2010)


    Ok , so how could you achieve this in SQL Server 2000? Again, just need the top x from each group ordered by the ID field.

    Thanks

    You will have to loop or cursor through the data. A "quirky update" may be possible, cant remember if possible in 2000 , sorry.

    Or you can do it with a self-join: http://davidsoussan.co.uk/2009/10/10/how-to-sequence-entries-in-each-sub-set/

    SELECT t1.userid, COUNT(t1.tableid) AS sequence, t1.tableid, t1.tableid >= t2.tableid AS flg

    FROM table t1 INNER JOIN table t2 ON t1.userid = t2.userid

    GROUP BY t1.userid, t1.tableid, flg

    HAVING flg = TRUE

    That method is platform-agnostic so it will work on SQL2000, Access, MySQL, etc. I'd be interested to hear how it compares in performance to the platform-specific row_number(), it runs very efficiently on MySQL and even on Access/JET provided that the join column is properly indexed. You can use any column for the join, or even multiple columns with a multi-colmun index, just so long as it results in a unique index within each subset. Add your condition using the count/sequence number to return the x number of records for each group. My blog shows how to vary the condition and one or two usefull variations and applications of the method.

  • Hi David ,

    Pretty poorly. Your method also performs poorly in relation to a traditional sub-select.

    This may or may not be platform agnostic, although an important point its not a concern for me personally. BTW your code does need mods to run in a SQLServer environment 😉

    See the attached image for a screen shot from profiler

    drop table #testtab

    go

    create table #testtab

    (

    userid integer,

    tableid integer

    )

    go

    create unique clustered index idx1 on #testtab(userid,tableid)

    go

    with cter

    as

    (

    select top(1000) row_number() over (order by (select null)) as r

    from syscolumns a cross join syscolumns b cross join syscolumns c

    )

    insert into #testtab

    Select r1.r,r2.r

    from cter r1 cross join cter r2

    select @@rowcount -- 1,000,000 rows

    go

    -- David Soussan Method

    SELECT t1.userid, COUNT(t1.tableid) AS sequence, t1.tableid, case when t1.tableid >= t2.tableid then 1 else 0 end AS flg

    FROM #testtab t1 INNER JOIN #testtab t2 ON t1.userid = t2.userid

    GROUP BY t1.userid, t1.tableid, case when t1.tableid >= t2.tableid then 1 else 0 end

    HAVING case when t1.tableid >= t2.tableid then 1 else 0 end = 1

    go

    -- count(*) sub select

    select *,(Select count(*)

    from #testtab t2

    where t1.userid = t2.userid

    and t2.TableId <= t1.tableId)

    from #testtab t1

    go

    -- 2005 + , row_number()

    Select * ,row_number() over (partition by userid order by tableid)

    from #testtab



    Clear Sky SQL
    My Blog[/url]

  • Well Dave, I was hoping for some real benchmark data, but thanks anyway.

    Of course the sql needs to be changed to the dialect specific to the platform - that is always a given. I'm not sure why you have to use CASE on sql server, but then I am a sql server newbie (not an sql newbie just a sql server newbie). Perhaps that is what makes it soooo slow and the syntax so ugly. On Access you have to use the HAVING clause for this to work but the boolean calculation is only a very small performance hit. On MySQL v 4 other problems arise, like the lack of table-type subquery support. v5 on the other hand would allow the sub-query select so that is just fine.

    As I said, if all else fails then my method will work on any platform, just adjust the sql to the local dialect. If the platform has other tools then use them.

  • david.soussan (12/10/2010)


    Well Dave, I was hoping for some real benchmark data, but thanks anyway.

    Setup SQL Profiler and you'll have it. 😉

    So far as true portability across platforms goes, it's a myth.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • BTW... the problem with the first two queries (after the test data setup) is that they both have full blown Triangular Joins in them.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/12/2010)


    Setup SQL Profiler and you'll have it.;-)

    Dave Ballantyne (12/10/2010)


    See the attached image for a screen shot from profiler

    😉



    Clear Sky SQL
    My Blog[/url]

  • Dave, whoops, looks like we both missed the attachment, thanks.

  • Dave Ballantyne (12/12/2010)


    Jeff Moden (12/12/2010)


    Setup SQL Profiler and you'll have it.;-)

    Dave Ballantyne (12/10/2010)


    See the attached image for a screen shot from profiler

    😉

    Yep... I definitely missed the attachment.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Madhivanan-208264 (12/6/2010)


    More methods are available here

    http://beyondrelational.com/blogs/madhivanan/archive/2008/09/12/return-top-n-rows.aspx

    Awesome article Dave, and a great follow up (and hollistic) resource Old Hand. Thanks for sharing guys!

    -----------------
    ... Then again, I could be totally wrong! Check the answer.
    Check out posting guidelines here for faster more precise answers[/url].

    I believe in Codd
    ... and Thinknook is my Chamber of Understanding

  • Why not DENSE_RANK instead of ROW_NUMBER?

Viewing 15 posts - 16 through 30 (of 44 total)

You must be logged in to reply to this topic. Login to reply