Selecting top N rows until sum(column X) > y

  • Hi - I need to write a query to select a set of rows until I reach a certain value. An example might be to select the top N employees (from a table sorted by descending salary) whose combined salary is greater than, say, $600,000.

    The table might be as follows

    Name    Salary
    John    100,000
    Fred    100,000
    Jake    100,000
    Mary    100,000
    Joan    100,000
    Ethan    80,000
    Jacob    50,000
    Tom      30,000
    Grant    30,000

    In this case, the query should select everyone from John down to Jacob, whose salaries add up to slightly more than 600,000.

    Is there a way to do this without writing a loop?

    Thanks

    ncs

  • There are at least two methods: triangular join and "quirky update".

    I suggest you take the quirky update if the number of rows involved is big, especially if there is already a clustered index in place that orders the data accordingly.

    --Prepare sample data
    IF OBJECT_ID('tempdb..#Salaries') IS NOT NULL DROP TABLE #Salaries

    CREATE TABLE #Salaries (
        Name varchar(50),
        Salary int
    )

    INSERT INTO #Salaries
    SELECT 'John', 100000
    UNION SELECT 'Fred', 100000
    UNION SELECT 'Jake', 100000
    UNION SELECT 'Mary', 100000
    UNION SELECT 'Joan', 100000
    UNION SELECT 'Ethan', 80000
    UNION SELECT 'Jacob', 50000
    UNION SELECT 'Tom', 30000
    UNION SELECT 'Grant', 30000

    --Triangular join method
    ;WITH Qry (Name, Salary, ROW_NUM)
    AS (
        SELECT Name, Salary, ROW_NUMBER() OVER(ORDER BY Salary DESC)
        FROM #Salaries
    ),
    RunningSalaries (Name, Salary, ROW_NUM, RunningSalary)
    AS (
        SELECT *, RunningSalary = (SELECT SUM(Salary) FROM Qry WHERE ROW_NUM <= A.ROW_NUM)
        FROM Qry AS A
    )
    SELECT *
    FROM RunningSalaries
    WHERE ROW_NUM <= (
        SELECT MIN(ROW_NUM)
        FROM RunningSalaries
        WHERE RunningSalary >= 600000
    )

    --"Quirky update" method

    IF OBJECT_ID('tempdb..#Temp_RunningSalaries') IS NOT NULL DROP TABLE #Temp_RunningSalaries

    CREATE TABLE #Temp_RunningSalaries (

    Name varchar(50),

    Salary int,

    ROW_NUM int,

    RunningSalary int

    )

    CREATE CLUSTERED INDEX IX_ROW_NUM ON #Temp_RunningSalaries (ROW_NUM)

    INSERT INTO #Temp_RunningSalaries (Name, Salary, ROW_NUM)

    SELECT Name, Salary, ROW_NUMBER() OVER(ORDER BY Salary DESC)

    FROM #Salaries

    DECLARE @RunningSalary int

    SET @RunningSalary = 0

    UPDATE #Temp_RunningSalaries

    SET @RunningSalary = RunningSalary = @RunningSalary + Salary

    FROM #Temp_RunningSalaries WITH(INDEX(IX_ROW_NUM))

    OPTION (MAXDOP 1)

    SELECT *

    FROM #Temp_RunningSalaries

    WHERE ROW_NUM 600000

    )

    hope this helps

    Gianluca

    -- Gianluca Sartori

  • I forgot to mention that the "quirky update" method for running totals is explained very clearly in an article by SQL Server MVP Jeff Moden, which is currently being rewritten.

    http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    Attached to the article there's a .sql file that demonstrates the technique quite well. Basically, it relies on the clustered index order to perform a table update, using a variable to propagate the running total from one row to the following one.
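
    To make that concrete, here is a stripped-down sketch of just the core of the technique, reusing the #Temp_RunningSalaries table and the IX_ROW_NUM index from my previous post - nothing new, only the three-part assignment with a few comments:

    DECLARE @RunningSalary int
    SET @RunningSalary = 0

    --the rows are visited in clustered index order (ROW_NUM); on each row the
    --variable is incremented first and its new value is then written to the
    --RunningSalary column, so the total flows from one row to the next
    UPDATE #Temp_RunningSalaries
    SET @RunningSalary = RunningSalary = @RunningSalary + Salary
    FROM #Temp_RunningSalaries WITH(INDEX(IX_ROW_NUM))
    OPTION (MAXDOP 1)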

    -- Gianluca Sartori

  • While working on a similar requirement (involving running totals) I came upon http://www.sqlteam.com/article/calculating-running-totals.

    Indeed the cursor solution described there is very fast (even though during my tests there was no index on the Salary table). I tried it on a table (in my example the Salary table) which I populated with ~3,000,000 rows. At the moment I'm on Windows Vista Business, Intel Pentium 4 CPU 2.8 GHz, 2 GB RAM, 32-bit OS, SQL 2008 Developer.

    CREATE TABLE Salary(ID int IDENTITY(1, 1), sal decimal)

    CREATE TABLE #rt (id int, Salary decimal, RunningTotal decimal)

    DECLARE @id int,
            @salary decimal,
            @RunningTotal decimal

    SET @RunningTotal = 0

    DECLARE rt_cursor CURSOR
    FOR
    SELECT id, sal
    FROM Salary
    --add ORDER BY sal DESC here to walk the salaries from highest to lowest,
    --as in the original question

    OPEN rt_cursor

    FETCH NEXT FROM rt_cursor INTO @id, @salary

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @RunningTotal = @RunningTotal + @salary

        IF @RunningTotal >= 30000000 BREAK
        --the limit would be 600,000 in your example; note that the row that
        --crosses the limit is not inserted - move the INSERT above the IF
        --to include it, as in the original example

        INSERT #rt VALUES (@id, @salary, @RunningTotal)

        FETCH NEXT FROM rt_cursor INTO @id, @salary
    END

    CLOSE rt_cursor
    DEALLOCATE rt_cursor
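
    When the loop ends, #rt holds the rows accumulated before the limit was reached, so the final result is simply something along these lines:

    --rows captured before the running total reached the limit
    SELECT id, Salary, RunningTotal
    FROM #rt
    ORDER BY id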

  • It depends on how many rows you have to compute the running total over, especially in a large-table scenario.

    If you are only interested in finding the top N rows until a predetermined total is reached, and you expect N to be fairly small, you'd better work with a cursor.

    If N can be very high, or you want the running total over the whole table, you'd better go with one of the two methods I mentioned. Remember that copying the whole table into a #temp to perform the quirky update can be an expensive operation, so, if it can be performed in place, go that way.
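
    Just to sketch what "in place" could look like on the #Salaries sample table from my first post (the RunningSalary column, the IX_Salary index and the 600,000 filter are only there to illustrate the idea - adapt them to your real table):

    --add the running total column and an index that gives the scan order
    ALTER TABLE #Salaries ADD RunningSalary int
    CREATE CLUSTERED INDEX IX_Salary ON #Salaries (Salary DESC)
    GO

    DECLARE @RunningSalary int
    SET @RunningSalary = 0

    --quirky update directly on the base table: no copy into a second #temp
    UPDATE #Salaries
    SET @RunningSalary = RunningSalary = @RunningSalary + Salary
    FROM #Salaries WITH(INDEX(IX_Salary), TABLOCKX)
    OPTION (MAXDOP 1)

    --a row is kept if the total before it was still below 600,000,
    --so the row that crosses the limit is included
    SELECT Name, Salary, RunningSalary
    FROM #Salaries
    WHERE RunningSalary - Salary < 600000
    ORDER BY RunningSalary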

    -- Gianluca Sartori

  • Terrific. Thanks to both of you for replying - I will give them a shot. The number of rows is not expected to be too high, and I cannot afford it to be too expensive, so I will give the cursor a try first.

  • I think your best bet is not to worry about how many rows you have and just worry about the running total.

    create table #t (
        Name Varchar(20),
        Salary Money
    )

    insert into #t values ('John', 100000)
    insert into #t values ('Fred', 100000)
    insert into #t values ('Jake', 100000)
    insert into #t values ('Mary', 100000)
    insert into #t values ('Joan', 100000)
    insert into #t values ('Ethan', 80000)
    insert into #t values ('Jacob', 50000)
    insert into #t values ('Tom', 30000)
    insert into #t values ('Grant', 30000)

    select t1.RowNum,
           t1.Name,
           t1.Salary,
           sum(isnull(t2.RunningSum, t1.Salary)) as RunningSum
    from (select row_number() over (order by Salary desc) as RowNum, Name, Salary from #t) t1
    left outer join
         (select row_number() over (order by Salary desc) as RowNum, Name, Salary as RunningSum from #t) t2
         on t1.RowNum >= t2.RowNum
    Group by
           t1.RowNum,
           t1.Name,
           t1.Salary
    Having sum(isnull(t2.RunningSum, t1.Salary)) < 600000.00

  • mgentry (9/25/2009)


    I think your best bet is not to worry about how many rows you have and just worry about the running total.

    Let me disagree with you: it's always a matter of "how many rows". What works well in one scenario doesn't work well in others.

    The code you provided is very clear and works well on the sample data, but I suggest you try it on a 10-million-row table. It's a triangular join, exactly like the one I put in one of my previous posts.

    I'm not saying this code is bad; I'm just saying that it could be slow on a very large table. It could also be very expensive if you only need to work on the top N rows, when you already know that N is not very high.
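
    For example, you could cap the triangular join at an upper bound for N. A sketch on the #t table above, where 100 is an arbitrary cap I picked (if the cap turns out to be too low, the subquery finds no row that crosses the limit and the query returns nothing, so you know you have to raise it):

    --same triangular join, but limited to the 100 highest salaries so it
    --never runs over the whole table
    ;WITH Qry (Name, Salary, ROW_NUM)
    AS (
        SELECT TOP 100 Name, Salary, ROW_NUMBER() OVER(ORDER BY Salary DESC)
        FROM #t
        ORDER BY Salary DESC
    ),
    RunningSalaries (Name, Salary, ROW_NUM, RunningSalary)
    AS (
        SELECT *, RunningSalary = (SELECT SUM(Salary) FROM Qry WHERE ROW_NUM <= A.ROW_NUM)
        FROM Qry AS A
    )
    SELECT *
    FROM RunningSalaries
    WHERE ROW_NUM <= (SELECT MIN(ROW_NUM) FROM RunningSalaries WHERE RunningSalary >= 600000)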

    Anyway, thanks for providing another way to achieve it.

    -- Gianluca Sartori

  • I tried the cursor on the whole set of records in my Salary table (~3,000,000 rows) and it still seems to be the fastest - it takes ~6 minutes. I defined two indexes on the table - clustered on id and nonclustered on the "sal" column. The "Celko" set-based solution described in http://www.sqlteam.com/article/calculating-running-totals is much slower (it had processed only ~50,000 rows after half an hour).

    I'll also try to work this out in .NET using a SqlDataReader.

