Selecting top N rows until sum(column X) > y

  • Hi - I need to write a query to select a set of rows until I reach a certain value. An example might be to select the top N employees (from a table sorted by descending salary) whose combined salary is greater than, say, $600,000.

    The table might be as follows

    Name    Salary
    John    100,000
    Fred    100,000
    Jake    100,000
    Mary    100,000
    Joan    100,000
    Ethan    80,000
    Jacob    50,000
    Tom      30,000
    Grant    30,000

    In this case, the query should select everyone from John down to Jacob, whose salaries add up to slightly more than 600,000.

    Is there a way to do this without writing a loop?

    Thanks

    ncs

  • There are at least two methods: triangular join and "quirky update".

    I suggest you take the quirky update if the number of rows involved is big, especially if there is already a clustered index in place that orders the data accordingly.

    --Prepare sample data
    IF OBJECT_ID('tempdb..#Salaries') IS NOT NULL DROP TABLE #Salaries

    CREATE TABLE #Salaries (
        Name varchar(50),
        Salary int
    )

    INSERT INTO #Salaries
    SELECT 'John', 100000
    UNION SELECT 'Fred', 100000
    UNION SELECT 'Jake', 100000
    UNION SELECT 'Mary', 100000
    UNION SELECT 'Joan', 100000
    UNION SELECT 'Ethan', 80000
    UNION SELECT 'Jacob', 50000
    UNION SELECT 'Tom', 30000
    UNION SELECT 'Grant', 30000

    --Triangular join method
    ;WITH Qry (Name, Salary, ROW_NUM)
    AS (
        SELECT Name, Salary, ROW_NUMBER() OVER(ORDER BY Salary DESC)
        FROM #Salaries
    ),
    RunningSalaries (Name, Salary, ROW_NUM, RunningSalary)
    AS (
        SELECT *, RunningSalary = (SELECT SUM(Salary) FROM Qry WHERE ROW_NUM <= A.ROW_NUM)
        FROM Qry AS A
    )
    SELECT *
    FROM RunningSalaries
    WHERE ROW_NUM <= (
        SELECT MIN(ROW_NUM)
        FROM RunningSalaries
        WHERE RunningSalary >= 600000
    )

    --"Quirky update" method

    IF OBJECT_ID('tempdb..#Temp_RunningSalaries') IS NOT NULL DROP TABLE #Temp_RunningSalaries

    CREATE TABLE #Temp_RunningSalaries (

    Name varchar(50),

    Salary int,

    ROW_NUM int,

    RunningSalary int

    )

    CREATE CLUSTERED INDEX IX_ROW_NUM ON #Temp_RunningSalaries (ROW_NUM)

    INSERT INTO #Temp_RunningSalaries (Name, Salary, ROW_NUM)

    SELECT Name, Salary, ROW_NUMBER() OVER(ORDER BY Salary DESC)

    FROM #Salaries

    DECLARE @RunningSalary int

    SET @RunningSalary = 0

    UPDATE #Temp_RunningSalaries

    SET @RunningSalary = RunningSalary = @RunningSalary + Salary

    FROM #Temp_RunningSalaries WITH(INDEX(IX_ROW_NUM))

    OPTION (MAXDOP 1)

    SELECT *

    FROM #Temp_RunningSalaries

    WHERE ROW_NUM 600000

    )

    hope this helps

    Gianluca

    -- Gianluca Sartori

  • I forgot to mention that the "quirky update" method for running totals is explained very clearly in an article by SQL Server MVP Jeff Moden, which is currently being rewritten.

    http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    Attached to the article there's a .sql file that demonstrates the technique quite well. Basically, it relies on the clustered index order to perform a table update, using a variable to propagate the running total from one row to the following one.
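
    To make that concrete, here is a stripped-down sketch of just the core of the technique, reusing the #Temp_RunningSalaries table and the IX_ROW_NUM index from my previous post - nothing new, only the three-part assignment with a few comments:

    DECLARE @RunningSalary int
    SET @RunningSalary = 0

    --the rows are visited in clustered index order (ROW_NUM); on each row the
    --variable is incremented first and its new value is then written to the
    --RunningSalary column, so the total flows from one row to the next
    UPDATE #Temp_RunningSalaries
    SET @RunningSalary = RunningSalary = @RunningSalary + Salary
    FROM #Temp_RunningSalaries WITH(INDEX(IX_ROW_NUM))
    OPTION (MAXDOP 1)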

    -- Gianluca Sartori

  • While working on a similar requirement (involving running totals) I came upon http://www.sqlteam.com/article/calculating-running-totals.

    Indeed the cursor solution described there is very fast (even though during my tests there was no index on the Salary table). I tried it on a table (in my example the Salary table) which I populated with ~3,000,000 rows. At the moment I'm on Windows Vista Business, Intel Pentium 4 CPU 2.8 GHz, 2 GB RAM, 32-bit OS, SQL 2008 Developer.

    CREATE TABLE Salary(ID int IDENTITY(1, 1), sal decimal)

    CREATE TABLE #rt (id int, Salary decimal, RunningTotal decimal)

    DECLARE @id int,
            @salary decimal,
            @RunningTotal decimal

    SET @RunningTotal = 0

    DECLARE rt_cursor CURSOR
    FOR
    SELECT id, sal
    FROM Salary
    --add ORDER BY sal DESC here to walk the salaries from highest to lowest,
    --as in the original question

    OPEN rt_cursor

    FETCH NEXT FROM rt_cursor INTO @id, @salary

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @RunningTotal = @RunningTotal + @salary

        IF @RunningTotal >= 30000000 BREAK
        --the limit would be 600,000 in your example; note that the row that
        --crosses the limit is not inserted - move the INSERT above the IF
        --to include it, as in the original example

        INSERT #rt VALUES (@id, @salary, @RunningTotal)

        FETCH NEXT FROM rt_cursor INTO @id, @salary
    END

    CLOSE rt_cursor
    DEALLOCATE rt_cursor
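
    When the loop ends, #rt holds the rows accumulated before the limit was reached, so the final result is simply something along these lines:

    --rows captured before the running total reached the limit
    SELECT id, Salary, RunningTotal
    FROM #rt
    ORDER BY id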

  • It depends on how many rows you have to compute the running total over, especially in a large-table scenario.

    If you are only interested in finding the top N rows until a predetermined total is reached, and you expect N to be fairly small, you'd better work with a cursor.

    If N can be very high, or you want the running total over the whole table, you'd better go with one of the two methods I mentioned. Remember that copying the whole table into a #temp to perform the quirky update can be an expensive operation, so, if it can be performed in place, go that way.
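
    Just to sketch what "in place" could look like on the #Salaries sample table from my first post (the RunningSalary column, the IX_Salary index and the 600,000 filter are only there to illustrate the idea - adapt them to your real table):

    --add the running total column and an index that gives the scan order
    ALTER TABLE #Salaries ADD RunningSalary int
    CREATE CLUSTERED INDEX IX_Salary ON #Salaries (Salary DESC)
    GO

    DECLARE @RunningSalary int
    SET @RunningSalary = 0

    --quirky update directly on the base table: no copy into a second #temp
    UPDATE #Salaries
    SET @RunningSalary = RunningSalary = @RunningSalary + Salary
    FROM #Salaries WITH(INDEX(IX_Salary), TABLOCKX)
    OPTION (MAXDOP 1)

    --a row is kept if the total before it was still below 600,000,
    --so the row that crosses the limit is included
    SELECT Name, Salary, RunningSalary
    FROM #Salaries
    WHERE RunningSalary - Salary < 600000
    ORDER BY RunningSalary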

    -- Gianluca Sartori

  • Terrific. Thanks to both of you for replying - I will give them a shot. The number of rows is not expected to be too high, and I cannot afford it to be too expensive, so I will give the cursor a try first.

  • I think your best bet is not to worry about how many rows you have and just worry about the running total.

    create table #t (
        Name Varchar(20),
        Salary Money
    )

    insert into #t values ('John', 100000)
    insert into #t values ('Fred', 100000)
    insert into #t values ('Jake', 100000)
    insert into #t values ('Mary', 100000)
    insert into #t values ('Joan', 100000)
    insert into #t values ('Ethan', 80000)
    insert into #t values ('Jacob', 50000)
    insert into #t values ('Tom', 30000)
    insert into #t values ('Grant', 30000)

    select t1.RowNum,
           t1.Name,
           t1.Salary,
           sum(isnull(t2.RunningSum, t1.Salary)) as RunningSum
    from (select row_number() over (order by Salary desc) as RowNum, Name, Salary from #t) t1
    left outer join
         (select row_number() over (order by Salary desc) as RowNum, Name, Salary as RunningSum from #t) t2
         on t1.RowNum >= t2.RowNum
    Group by
           t1.RowNum,
           t1.Name,
           t1.Salary
    Having sum(isnull(t2.RunningSum, t1.Salary)) < 600000.00

  • mgentry (9/25/2009)


    I think your best bet is not to worry about how many rows you have and just worry about the running total.

    Let me disagree with you: it's always a matter of "how many rows". What works well in one scenario doesn't work well in others.

    The code you provided is very clear and works well on the sample data, but I suggest you try it on a 10-million-row table. It's a triangular join, exactly like the one I put in one of my previous posts.

    I'm not saying this code is bad; I'm just saying that it could be slow on a very large table. It could also be very expensive if you only need to work on the top N rows, when you already know that N is not very high.
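
    For example, you could cap the triangular join at an upper bound for N. A sketch on the #t table above, where 100 is an arbitrary cap I picked (if the cap turns out to be too low, the subquery finds no row that crosses the limit and the query returns nothing, so you know you have to raise it):

    --same triangular join, but limited to the 100 highest salaries so it
    --never runs over the whole table
    ;WITH Qry (Name, Salary, ROW_NUM)
    AS (
        SELECT TOP 100 Name, Salary, ROW_NUMBER() OVER(ORDER BY Salary DESC)
        FROM #t
        ORDER BY Salary DESC
    ),
    RunningSalaries (Name, Salary, ROW_NUM, RunningSalary)
    AS (
        SELECT *, RunningSalary = (SELECT SUM(Salary) FROM Qry WHERE ROW_NUM <= A.ROW_NUM)
        FROM Qry AS A
    )
    SELECT *
    FROM RunningSalaries
    WHERE ROW_NUM <= (SELECT MIN(ROW_NUM) FROM RunningSalaries WHERE RunningSalary >= 600000)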

    Anyway, thanks for providing another way to achieve it.

    -- Gianluca Sartori

  • I tried the cursor on the whole set of records in my Salary table (~3,000,000 rows) and it still seems to be the fastest - it takes ~6 minutes. I defined two indexes on the table - clustered on id and nonclustered on the "sal" column. The "Celko" set-based solution described in http://www.sqlteam.com/article/calculating-running-totals is much slower (it had processed only ~50,000 rows after half an hour).

    I'll also try to work this out in .NET using a SqlDataReader.

