Help needed filling in values based on previous row

  • Background: I have code that calculates a daily cumulative point balance per account (from a detailed transaction table) and puts it into a temp table (#PointBalTotal). That table has the account ID and the balance per transaction date (there may be gaps, since there are not transactions every day). I have another temp table (#DATE_DIM) that has a record per account per calendar day.

    #PointBalTotal has these columns:

    - TransactionDate (datetime)

    - AccountID (int)

    - PointsEarnedTotal (numeric)

    #DATE_DIM has these columns:

    - AccountID (int)

    - Date (datetime)

    In my code, I create #DailyBalance by outer joining #PointBalTotal to #DATE_DIM in order to have a record per account, per day. Naturally, this creates NULL values in the PointsEarnedTotal column wherever #DATE_DIM has a date that #PointBalTotal does not. My challenge is populating those NULLs with the PointsEarnedTotal from the previous date.

    For example:

    RowNum  AccountID  Date        PointsEarnedTotal
    ------  ---------  ----------  -----------------
    1       55349      2008-08-11  1
    2       55349      2008-08-12  10
    3       55349      2008-08-13  19
    4       55349      2008-08-14  NULL
    5       55349      2008-08-15  NULL
    6       55349      2008-08-16  31
    7       55349      2008-08-17  NULL

    Should be:

    RowNum  AccountID  Date        PointsEarnedTotal
    ------  ---------  ----------  -----------------
    1       55349      2008-08-11  1
    2       55349      2008-08-12  10
    3       55349      2008-08-13  19
    4       55349      2008-08-14  19
    5       55349      2008-08-15  19
    6       55349      2008-08-16  31
    7       55349      2008-08-17  31

    I thought I could use a CTE to fill in the NULLs, but I'm not able to get it to work. It fills in the first NULL, but not the subsequent ones (for instance, RowNum 4 gets the 19 filled in for PointsEarnedTotal, but RowNum 5 does not; it remains NULL).

    I am attaching the sample data for #PointBalTotal, #DATE_DIM and the code I am using to create #DailyBalance. My stab at the CTE is included as well.

    I would appreciate any guidance/assistance. (Note: I am not allowed to use while loops or cursors, so I am looking for a set-based solution)

    Thank you,

    Marianne
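    What the question asks for is a "carry the last non-NULL value forward" fill. As a quick illustration of the desired result (a sketch in plain Python over the sample rows above; the `fill_forward` helper name is illustrative, not part of the posted code):

    ```python
    def fill_forward(rows):
        """Replace each NULL (None) balance with the most recent
        non-NULL balance seen for the same account."""
        last = {}   # AccountID -> last non-NULL PointsEarnedTotal
        out = []
        for account_id, date, points in rows:
            if points is None:
                points = last.get(account_id)   # carry the prior balance forward
            else:
                last[account_id] = points       # remember the new balance
            out.append((account_id, date, points))
        return out

    # Sample rows from the post: (AccountID, Date, PointsEarnedTotal)
    rows = [
        (55349, "2008-08-11", 1),
        (55349, "2008-08-12", 10),
        (55349, "2008-08-13", 19),
        (55349, "2008-08-14", None),
        (55349, "2008-08-15", None),
        (55349, "2008-08-16", 31),
        (55349, "2008-08-17", None),
    ]
    filled = fill_forward(rows)
    print([p for _, _, p in filled])   # [1, 10, 19, 19, 19, 31, 31]
    ```

    The set-based T-SQL answers below achieve the same effect without a row-by-row loop.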

  • If I understand what you are after correctly, then you just need to join from that table to itself like this:

    SELECT a.TransactionDate, a.AccountID, a.PointsEarnedTotal, b.PointsEarnedTotal as 'YesterdayPointsEarnedTotal'

    FROM #PointBalTotal a

    JOIN #PointBalTotal b ON a.AccountID = b.AccountID

    AND b.TransactionDate = DATEADD(d, -1, a.TransactionDate)

    Watch out with the DATEADD, though. If the time portions are different, you'll need to compare DAY, MONTH, and YEAR individually rather than just checking equality.
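    That caveat about time portions can be seen in a quick sketch (Python standing in for the T-SQL; the variable names are illustrative only):

    ```python
    from datetime import datetime, timedelta

    # Two transactions on adjacent days whose time-of-day portions differ:
    today = datetime(2008, 8, 12, 14, 30)    # today's row, 2:30 PM
    yesterday = datetime(2008, 8, 11, 9, 0)  # yesterday's row, 9:00 AM

    # Naive equality, analogous to b.TransactionDate = DATEADD(d, -1, a.TransactionDate):
    print(yesterday == today - timedelta(days=1))                  # False: the times differ

    # Comparing only the date parts (day/month/year) works reliably:
    print(yesterday.date() == (today - timedelta(days=1)).date())  # True
    ```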

  • Unfortunately, that won't work. #PointBalTotal has gaps in the dates (it shows the balance only for dates that actually had transactions), which is why I join to #DATE_DIM - to bring in the missing dates. In the example I posted, #PointBalTotal had a balance on 8/13/2008 and 8/16/2008. I need to be able to see that the balance on 8/14 and 8/15 was the same as the balance on 8/13 (the balance is cumulative, and there was no activity on 8/14 and 8/15, so the balance stayed the same).

    Thanks,

    marianne

  • Not allowed to use loops or cursors? I like that =). Add this after #DailyBalance is generated.

    ALTER TABLE #DailyBalance

    ALTER COLUMN RowNum INT NOT NULL

    ALTER TABLE #DailyBalance

    ALTER COLUMN AccountID INT NOT NULL

    ALTER TABLE #DailyBalance

    ADD CONSTRAINT PK_RowNum PRIMARY KEY CLUSTERED (AccountID, RowNum)

    UPDATE #DailyBalance

    SET PointsEarnedTotal = ISNULL(PointsEarnedTotal,(    SELECT TOP 1 PointsEarnedTotal

                                                       FROM #DailyBalance

                                                       WHERE RowNum < D.RowNum AND AccountID = D.AccountID

                                                           AND PointsEarnedTotal IS NOT NULL ORDER BY RowNum DESC))

    FROM #DailyBalance D WITH (INDEX(PK_RowNum))

    SELECT * FROM #DailyBalance

    It looks like a correlated subquery, but unless I screwed it up, it performs about the same as setting variables. Method based off of: http://www.sqlservercentral.com/articles/Advanced+Querying/61716/
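    The logic of that UPDATE can be exercised in isolation with an in-memory SQLite database (a sketch via Python; SQLite uses LIMIT 1 where T-SQL uses TOP 1, and there is no index hint, so this mirrors the logic rather than the execution plan):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE DailyBalance (
        RowNum INT NOT NULL, AccountID INT NOT NULL, PointsEarnedTotal NUMERIC,
        PRIMARY KEY (AccountID, RowNum))""")
    conn.executemany("INSERT INTO DailyBalance VALUES (?,?,?)",
        [(1, 55349, 1), (2, 55349, 10), (3, 55349, 19), (4, 55349, None),
         (5, 55349, None), (6, 55349, 31), (7, 55349, None)])

    # Same shape as the posted UPDATE: each NULL row pulls the nearest
    # earlier non-NULL balance for the same account.
    conn.execute("""
    UPDATE DailyBalance
    SET PointsEarnedTotal = (
        SELECT b.PointsEarnedTotal FROM DailyBalance b
        WHERE b.AccountID = DailyBalance.AccountID
          AND b.RowNum < DailyBalance.RowNum
          AND b.PointsEarnedTotal IS NOT NULL
        ORDER BY b.RowNum DESC LIMIT 1)
    WHERE PointsEarnedTotal IS NULL""")

    print([r[0] for r in conn.execute(
        "SELECT PointsEarnedTotal FROM DailyBalance ORDER BY RowNum")])
    # [1, 10, 19, 19, 19, 31, 31]
    ```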

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Seth,

    Thank you!! That appears to work perfectly.

    -Marianne

  • My solution is much like Seth's, except it doesn't require altering your tables to add row IDs. You can get the job done just by looking for the last transaction date with any points. I am including it here because the execution plan is interesting: apparently the optimizer resolves the "SELECT TOP 1" subquery for all instances in one sort operation, producing a small worktable that is joined in with a nested loop. The whole thing ran in about 16 ms with the volumes you supplied... faster than I would have expected.

    --------------------- Gaps Problem

    select row_number() over (order by d.accountID,d.date) as rowID,d.*, isnull(p.pointsEarnedTotal

    ,(select top 1 pointsEarnedTotal

    from #pointBalTotal

    where accountID = d.accountID

    and transactionDate < d.date

    and pointsEarnedTotal is not null

    order by transactionDate desc)) as pointsEarnedTotal

    from #date_dim d

    left join #pointBalTotal p on p.accountID = d.accountID and p.transactionDate = d.date

    order by d.accountID,d.date
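    This single-pass form (outer join plus a COALESCE'd TOP 1 lookup, with no intermediate table) translates almost directly to other engines. A sketch against in-memory SQLite with a tiny made-up slice of the sample data (LIMIT 1 stands in for TOP 1):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE PointBalTotal (AccountID INT, TransactionDate TEXT, PointsEarnedTotal NUMERIC);
    CREATE TABLE Date_Dim (AccountID INT, Date TEXT);
    """)
    # Sparse balances (only dates with activity) vs. one row per calendar day:
    conn.executemany("INSERT INTO PointBalTotal VALUES (?,?,?)",
        [(55349, "2008-08-13", 19), (55349, "2008-08-16", 31)])
    conn.executemany("INSERT INTO Date_Dim VALUES (?,?)",
        [(55349, d) for d in ("2008-08-13", "2008-08-14", "2008-08-15",
                              "2008-08-16", "2008-08-17")])

    # The outer join supplies every day; COALESCE backfills the gaps
    # from the latest earlier balance.
    rows = conn.execute("""
    SELECT d.Date,
           COALESCE(p.PointsEarnedTotal,
                    (SELECT PointsEarnedTotal FROM PointBalTotal
                     WHERE AccountID = d.AccountID
                       AND TransactionDate < d.Date
                       AND PointsEarnedTotal IS NOT NULL
                     ORDER BY TransactionDate DESC LIMIT 1)) AS PointsEarnedTotal
    FROM Date_Dim d
    LEFT JOIN PointBalTotal p
           ON p.AccountID = d.AccountID AND p.TransactionDate = d.Date
    ORDER BY d.AccountID, d.Date""").fetchall()
    print([p for _, p in rows])   # [19, 19, 19, 31, 31]
    ```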

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thanks for the additional option - I'll give it a try. It looks like it would allow me to eliminate one of my temp tables (#DailyBalance), which would be nice.

    Thanks,

    Marianne

  • You're very welcome.

    If you have time, please let us know if performance is comparable. I'm sure you can speed it up by making accountID/date and accountID/transaction date the primary keys when you create your two temporary tables.

    Good luck 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Marianne,

    Actually, I'd be very interested in the performance of the queries. Could you tell us how many rows ended up in the table and how many of them had NULLs that needed to be updated... I'd like to do some testing and I'll post the results. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • While my subquery seems to fit all of those criteria, I'm not sure it's creating a triangular join in this case. The subquery method is blazingly fast; it was one of a dozen different things I tried when I was evaluating your running totals method last week, and it was the only one that actually came close to equaling it. It may be that the triangular join is created, but because all the fields are covered by the clustered index, it's quick enough that it's not a big factor.

    My original tests on this in a comparison: http://www.sqlservercentral.com/Forums/Topic582878-338-1.aspx#bm583453

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • I'll try to spend some time comparing the two methods today or tomorrow. The data I posted yesterday was just a small sample, but since I'm working in dev, even when I run it against all the data, I won't have much there. This is just a small (but critical) part of a larger stored proc I am developing.

    What kind of information do you want? Just the execution plan?

    -Marianne

  • Time and IO stats would be nice too, please 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Probably:

    CPU

    Reads

    Writes

    Duration

    Number of total rows you are working with

    Number of rows being replaced by the function (number of NULLs)

    Number of rows in the final output

    Execution plan

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • ... and prevailing wind direction and velocity. Most important. 😉

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Well, most importantly - in Dallas, TX, the wind is blowing NNW at 10 mph. It's cloudy and 62 degrees. 😀

    I hope I was able to make an equitable comparison. The data that gets into #PointBalTotal and #DATE_DIM gets filtered earlier in the procedure (due to various business rules), so I removed the filter in order to get a larger volume of data with which to work. I have not included the sample data for this larger volume. If necessary, I can provide some additional data, but probably not all of it (see the record counts below).

    I created those tables in tempdb and added primary keys.

    ALTER TABLE tempdb.dbo.DATE_DIM

    ADD CONSTRAINT PK_DateDim PRIMARY KEY CLUSTERED (AccountID, Date)

    ALTER TABLE tempdb.dbo.PointBalTotal

    ADD CONSTRAINT PK_PointBal PRIMARY KEY CLUSTERED (AccountID, TransactionDate)

    tempdb.dbo.PointBalTotal has 90,302 records (37983 distinct AccountID);

    tempdb.dbo.DATE_DIM has 2,889,040 records (37520 distinct AccountID; dates from 8/1/2008 - 10/16/2008).

    Once joined, there are 2,850,672 records with NULL PointsEarnedTotal values to be updated.
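    As a quick sanity check on those counts (simple arithmetic; the day count is derived from the 8/1 - 10/16 range above):

    ```python
    # 8/1/2008 through 10/16/2008 inclusive: 31 (Aug) + 30 (Sep) + 16 (Oct) days
    days = 31 + 30 + 16                # 77 calendar days per account
    accounts = 37520                   # distinct AccountIDs in DATE_DIM
    print(accounts * days)             # 2889040 -- matches the DATE_DIM row count
    print(2889040 - 2850672)           # 38368 rows joined to an actual balance
    ```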

    I have attached the output from my queries (MethodOne.txt and MethodTwo.txt).

    Method One - #DailyBalance is created using row_number(), then the null values are updated. I have to do a separate update for null values on the first date at the beginning.

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    SET STATISTICS PROFILE ON

    --create #DailyBalance and fill in missing dates using the date dimension table

    SELECT row_number() OVER (PARTITION BY dd.AccountID ORDER BY Date) as RowNum,

    dd.AccountID, Date, PointsEarnedTotal

    INTO #DailyBalance

    FROM tempdb.dbo.PointBalTotal pbt

    RIGHT JOIN tempdb.dbo.DATE_DIM dd on Date = Transactiondate

    AND pbt.AccountID = dd.AccountID

    --if the initial record in the date range has null points, update it to 0

    UPDATE #DailyBalance

    SET PointsEarnedTotal = 0

    WHERE RowNum = 1 AND PointsEarnedTotal is null

    ALTER TABLE #DailyBalance

    ALTER COLUMN RowNum INT NOT NULL

    ALTER TABLE #DailyBalance

    ALTER COLUMN AccountID INT NOT NULL

    ALTER TABLE #DailyBalance

    ADD CONSTRAINT PK_RowNum PRIMARY KEY CLUSTERED (AccountID, RowNum)

    UPDATE #DailyBalance

    SET PointsEarnedTotal

    = ISNULL(PointsEarnedTotal,(SELECT TOP 1 PointsEarnedTotal

    FROM #DailyBalance

    WHERE RowNum < D.RowNum AND AccountID = D.AccountID

    AND PointsEarnedTotal IS NOT NULL ORDER BY RowNum DESC))

    FROM #DailyBalance D WITH (INDEX(PK_RowNum))

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    SET STATISTICS PROFILE OFF

    select 'DailyBalance Count', count(1) from #DailyBalance

    Method Two - #DailyBalance is created without null values (except where the value was null on the first date - I update those at the end)

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    SET STATISTICS PROFILE ON

    select --row_number() over (order by d.accountID,d.date) as rowID,

    d.*, isnull(p.pointsEarnedTotal

    ,(select top 1 pointsEarnedTotal

    from tempdb.dbo.pointBalTotal

    where accountID = d.accountID

    and transactionDate < d.date

    and pointsEarnedTotal is not null

    order by transactionDate desc)) as pointsEarnedTotal

    INTO #DailyBalance

    from tempdb.dbo.date_dim d

    left join tempdb.dbo.pointBalTotal p on p.accountID = d.accountID and p.transactionDate = d.date

    order by d.accountID,d.date

    UPDATE #DailyBalance

    SET PointsEarnedTotal = 0

    WHERE PointsEarnedTotal is null

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    SET STATISTICS PROFILE OFF

    select 'DailyBalance Count', count(1) from #DailyBalance

    I hope this gives enough information. If not, please let me know.
