September 23, 2009 at 5:52 pm
Hi - I need to write a query to select a set of rows until I reach a certain value. An example might be to select the top N employees (from a table sorted by descending salary) whose combined salary is greater than, say $600,000.
The table might be as follows:
Name Salary
John 100,000
Fred 100,000
Jake 100,000
Mary 100,000
Joan 100,000
Ethan 80,000
Jacob 50,000
Tom 30,000
Grant 30,000
In this case, the query should select all rows from John down to Jacob, whose salaries sum to slightly more than $600,000.
Is there a way to do this without writing a loop?
Thanks
ncs
September 24, 2009 at 2:10 am
There are at least two methods: triangular join and "quirky update".
I suggest you take the quirky update if the number of rows involved is big, especially if there is already a clustered index in place that orders the data accordingly.
--Prepare sample data
IF OBJECT_ID('tempdb..#Salaries') IS NOT NULL DROP TABLE #Salaries
CREATE TABLE #Salaries (
Name varchar(50),
Salary int
)
INSERT INTO #Salaries
SELECT 'John', 100000
UNION ALL
SELECT 'Fred', 100000
UNION ALL
SELECT 'Jake', 100000
UNION ALL
SELECT 'Mary', 100000
UNION ALL
SELECT 'Joan', 100000
UNION ALL
SELECT 'Ethan', 80000
UNION ALL
SELECT 'Jacob', 50000
UNION ALL
SELECT 'Tom', 30000
UNION ALL
SELECT 'Grant', 30000
--Triangular join method
;WITH Qry (Name, Salary, ROW_NUM)
AS (
SELECT Name, Salary, ROW_NUMBER() OVER(ORDER BY Salary DESC)
FROM #Salaries
),
RunningSalaries (Name, Salary, ROW_NUM, RunningSalary)
AS (
SELECT *, RunningSalary = (SELECT SUM(Salary) FROM Qry WHERE ROW_NUM <= A.ROW_NUM)
FROM Qry AS A
)
SELECT *
FROM RunningSalaries
WHERE ROW_NUM <= (
	SELECT MIN(ROW_NUM)
	FROM RunningSalaries
	WHERE RunningSalary >= 600000
)
--"Quirky update" method
IF OBJECT_ID('tempdb..#Temp_RunningSalaries') IS NOT NULL DROP TABLE #Temp_RunningSalaries
CREATE TABLE #Temp_RunningSalaries (
Name varchar(50),
Salary int,
ROW_NUM int,
RunningSalary int
)
CREATE CLUSTERED INDEX IX_ROW_NUM ON #Temp_RunningSalaries (ROW_NUM)
INSERT INTO #Temp_RunningSalaries (Name, Salary, ROW_NUM)
SELECT Name, Salary, ROW_NUMBER() OVER(ORDER BY Salary DESC)
FROM #Salaries
DECLARE @RunningSalary int
SET @RunningSalary = 0
UPDATE #Temp_RunningSalaries
SET @RunningSalary = RunningSalary = @RunningSalary + Salary
FROM #Temp_RunningSalaries WITH(INDEX(IX_ROW_NUM))
OPTION (MAXDOP 1)
SELECT *
FROM #Temp_RunningSalaries
WHERE ROW_NUM <= (
	SELECT MIN(ROW_NUM)
	FROM #Temp_RunningSalaries
	WHERE RunningSalary >= 600000
)
hope this helps
Gianluca
-- Gianluca Sartori
September 24, 2009 at 2:16 am
I forgot to mention that the "quirky update" method for running totals is explained very clearly in an article by SQL Server MVP Jeff Moden, which is currently being rewritten.
http://www.sqlservercentral.com/articles/Advanced+Querying/61716/
Attached to the article there's a .sql file that explains the technique quite well. Basically, it relies on the clustered index order to perform a table update, using a variable to carry the running total from each row to the next.
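For reference, the core of the technique can be stripped down to a few lines. This is my own minimal sketch on a throwaway #t table, not the code from Jeff's article, so treat it as an illustration only:

```sql
-- Minimal "quirky update" sketch (hypothetical #t table).
-- The clustered index on ID is what makes the update walk the rows in order.
CREATE TABLE #t (ID int PRIMARY KEY CLUSTERED, Amount int, RunningTotal int)

INSERT INTO #t (ID, Amount)
SELECT 1, 10 UNION ALL
SELECT 2, 20 UNION ALL
SELECT 3, 30

DECLARE @rt int
SET @rt = 0

-- The three-part assignment does three things per row:
-- adds the current Amount to the variable, writes the variable into the
-- RunningTotal column, and carries the value forward to the next row.
UPDATE #t
SET @rt = RunningTotal = @rt + Amount
OPTION (MAXDOP 1)

-- RunningTotal should come back as 10, 30, 60, but note this relies on the
-- clustered scan order, which is exactly the caveat Jeff's article covers.
SELECT ID, Amount, RunningTotal FROM #t ORDER BY ID
```

On a real table you would also want the index hint and the other safety measures from the article; this sketch only shows the variable-propagation mechanism.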
-- Gianluca Sartori
September 24, 2009 at 4:48 am
While working on a similar requirement (involving running totals) I came upon http://www.sqlteam.com/article/calculating-running-totals.
Indeed, the cursor solution described there is very fast (even though during my tests there was no index on the Salary table). I tried it on a table (in my example the Salary table) that I populated with ~3,000,000 rows. At the moment I'm on Windows Vista Business, Intel Pentium 4 CPU 2.8GHz, 2GB RAM, 32-bit OS, SQL Server 2008 Developer.
CREATE TABLE Salary(ID int IDENTITY(1, 1), sal decimal)
CREATE TABLE #rt (id int, Salary decimal, RunningTotal decimal)
DECLARE @id int,
@salary decimal,
@RunningTotal decimal
SET @RunningTotal = 0
DECLARE rt_cursor CURSOR
FOR
SELECT id, sal
FROM Salary
OPEN rt_cursor
FETCH NEXT FROM rt_cursor INTO @id, @salary
WHILE @@FETCH_STATUS = 0
BEGIN
SET @RunningTotal = @RunningTotal + @salary
IF @RunningTotal >= 30000000 BREAK
--the limit is 600,000 in your example
INSERT #rt VALUES (@id,@salary,@RunningTotal)
FETCH NEXT FROM rt_cursor INTO @id, @salary
END
CLOSE rt_cursor
DEALLOCATE rt_cursor
September 24, 2009 at 5:48 am
In a large-table scenario it depends on how many rows you have to compute the running total over.
If you are only interested in the top N rows until a predetermined total is reached, and you expect N to be small, you are better off with a cursor.
If N can be very high, or you want the running total over the whole table, you are better off with one of the two methods I mentioned. Remember that copying the whole table into a #temp table to perform the quirky update can be an expensive operation, so if the update can be performed in place, go that way.
-- Gianluca Sartori
September 24, 2009 at 6:43 am
Terrific. Thanks to both of you for replying - I will give these a shot. The number of rows is not expected to be very high, and I cannot afford an expensive query, so I will try the cursor first.
September 25, 2009 at 10:44 am
I think your best bet is not to worry about how many rows you have and just worry about the running total.
create table #t
(Name Varchar(20),
Salary Money)
insert into #t values ('John', 100000)
insert into #t values ('Fred', 100000)
insert into #t values ('Jake', 100000)
insert into #t values ('Mary', 100000)
insert into #t values ('Joan', 100000)
insert into #t values ('Ethan', 80000)
insert into #t values ('Jacob', 50000)
insert into #t values ('Tom', 30000)
insert into #t values ('Grant', 30000)
select t1.RowNum,
t1.Name,
t1.Salary,
sum(isnull(t2.RunningSum, t1.Salary)) as RunningSum
from (select row_number() over (order by Salary desc) as RowNum, Name, Salary from #t) t1
left outer join
(select row_number() over (order by Salary desc) as RowNum, Name, Salary as RunningSum from #t) t2
on t1.RowNum >= t2.RowNum
Group by
t1.RowNum,
t1.Name,
t1.Salary
Having sum(isnull(t2.RunningSum, t1.Salary)) - t1.Salary < 600000.00
--subtracting the current salary keeps the row that crosses the 600,000 threshold
September 28, 2009 at 1:11 am
mgentry (9/25/2009)
I think your best bet is not to worry about how many rows you have and just worry about the running total.
Let me disagree with you: it's always a matter of "how many rows". What works well in one scenario doesn't work well in others.
The code you provided is very clear and works well on the sample data, but I suggest you try it on a 10 million row table. It's a triangular join, exactly like the one I put in one of my previous posts.
I'm not saying this code is bad, I'm just saying that it could be slow on a very large table. It could also be very expensive if you only need to work on the top N rows, when you already know that N is not very high.
Anyway, thanks for providing another way to achieve it.
-- Gianluca Sartori
September 28, 2009 at 2:59 am
I tried the cursor on the whole set of records in my Salary table (~3,000,000 rows) and it still seems to be the fastest - it takes ~6 minutes. I defined 2 indexes on the table - clustered on ID and nonclustered on the "sal" column. The "celko" set-based solution described in http://www.sqlteam.com/article/calculating-running-totals is much slower (it had processed only ~50,000 rows after half an hour).
I'll also try to work this out in .NET using a SqlDataReader.