October 15, 2008 at 11:27 am
Background: I have code that calculates the daily cumulative point balance per account (from a detailed transaction table) and puts it into a temp table (#PointBalTotal). That table has the account ID and the balance per transaction date (there may be gaps, since there are not transactions every day). I have another temp table (#DATE_DIM) that has a record per account per calendar day.
#PointBalTotal has these columns:
- TransactionDate (datetime)
- AccountID (int)
- PointsEarnedTotal (numeric)
#DATE_DIM has these columns:
- AccountID (int)
- Date (datetime)
In my code, I create #DailyBalance by using an outer join to combine #PointBalTotal and #DATE_DIM in order to have a record per account, per day. Obviously, this creates null values in the PointsEarnedTotal column where #DATE_DIM has a date that #PointBalTotal does not. My challenge is in populating the null values with PointsEarnedTotal from the previous date.
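A minimal sketch of that step (the actual code is in the attachment; this is a reconstruction using the column names above):
-- Keep every account/day row from #DATE_DIM; days with no transactions
-- come through with a NULL PointsEarnedTotal.
SELECT row_number() OVER (PARTITION BY dd.AccountID ORDER BY dd.Date) AS RowNum,
       dd.AccountID, dd.Date, pbt.PointsEarnedTotal
INTO #DailyBalance
FROM #DATE_DIM dd
LEFT JOIN #PointBalTotal pbt
  ON pbt.AccountID = dd.AccountID
 AND pbt.TransactionDate = dd.Date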
For example:
RowNum  AccountID  Date        PointsEarnedTotal
1       55349      2008-08-11  1
2       55349      2008-08-12  10
3       55349      2008-08-13  19
4       55349      2008-08-14  NULL
5       55349      2008-08-15  NULL
6       55349      2008-08-16  31
7       55349      2008-08-17  NULL
Should be:
RowNum  AccountID  Date        PointsEarnedTotal
1       55349      2008-08-11  1
2       55349      2008-08-12  10
3       55349      2008-08-13  19
4       55349      2008-08-14  19
5       55349      2008-08-15  19
6       55349      2008-08-16  31
7       55349      2008-08-17  31
I thought I could use a CTE to get the nulls filled in, but I'm not able to get it to work. It fills in the first instance of the NULL, but not the subsequent ones. (For instance, RowNum 4 gets the 19 filled in for PointsEarnedTotal, but RowNum 5 does not; it remains NULL.)
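For illustration (a hypothetical reconstruction, not my actual attached code), a fill that only looks back a single row behaves exactly this way, because RowNum 5 finds the stored NULL in RowNum 4 instead of the 19 in RowNum 3:
;WITH db AS
(
    SELECT RowNum, AccountID, Date, PointsEarnedTotal
    FROM #DailyBalance
)
SELECT cur.RowNum, cur.AccountID, cur.Date,
       ISNULL(cur.PointsEarnedTotal, prv.PointsEarnedTotal) AS PointsEarnedTotal
FROM db cur
LEFT JOIN db prv               -- looks back only one row, which is the flaw
  ON prv.AccountID = cur.AccountID
 AND prv.RowNum = cur.RowNum - 1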
I am attaching the sample data for #PointBalTotal, #DATE_DIM and the code I am using to create #DailyBalance. My stab at the CTE is included as well.
I would appreciate any guidance/assistance. (Note: I am not allowed to use while loops or cursors, so I am looking for a set-based solution)
Thank you,
Marianne
October 15, 2008 at 11:40 am
If I understand what you are after correctly, then you just need to join from that table to itself like this:
SELECT a.TransactionDate, a.AccountID, a.PointsEarnedTotal, b.PointsEarnedTotal as 'YesterdayPointsEarnedTotal'
FROM #PointBalTotal a
JOIN #PointBalTotal b ON a.AccountID = b.AccountID
AND b.TransactionDate = DATEADD(d, -1, a.TransactionDate)
Watch out with the DATEADD though. If the times are different, you'll need to compare DAY, MONTH, and YEAR each rather than just checking equality.
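One common way to handle that (a sketch, assuming a version without the DATE type) is to floor both sides to midnight before comparing, rather than testing DAY, MONTH, and YEAR separately:
-- DATEDIFF counts whole days from day 0; adding them back to day 0
-- strips the time portion from both sides of the comparison.
SELECT a.TransactionDate, a.AccountID, a.PointsEarnedTotal,
       b.PointsEarnedTotal AS YesterdayPointsEarnedTotal
FROM #PointBalTotal a
JOIN #PointBalTotal b
  ON a.AccountID = b.AccountID
 AND DATEADD(dd, DATEDIFF(dd, 0, b.TransactionDate), 0)
   = DATEADD(dd, -1, DATEADD(dd, DATEDIFF(dd, 0, a.TransactionDate), 0))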
October 15, 2008 at 11:57 am
Unfortunately, that won't work. #PointBalTotal has gaps in the dates (it shows the balance only for dates that actually had transactions), which is why I join to #DATE_DIM - to bring in the missing dates. In the example I posted, #PointBalTotal had a balance on 8/13/2008 and 8/16/2008. I need to be able to see that the balance on 8/14 and 8/15 was the same as the balance on 8/13 (the balance is cumulative, and there was no activity on 8/14 and 8/15, so the balance stayed the same).
Thanks,
marianne
October 15, 2008 at 12:00 pm
Not allowed to use loops or cursors? I like that =). Add this after #DailyBalance is generated.
[font="Courier New"]ALTER TABLE #DailyBalance
ALTER COLUMN RowNum INT NOT NULL
ALTER TABLE #DailyBalance
ALTER COLUMN AccountID INT NOT NULL
ALTER TABLE #DailyBalance
ADD CONSTRAINT PK_RowNum PRIMARY KEY CLUSTERED (AccountID, RowNum)
UPDATE #DailyBalance
SET PointsEarnedTotal = ISNULL(PointsEarnedTotal,( SELECT TOP 1 PointsEarnedTotal
FROM #DailyBalance
WHERE RowNum < D.RowNum AND AccountID = D.AccountID
AND PointsEarnedTotal IS NOT NULL ORDER BY RowNum DESC))
FROM #DailyBalance D WITH (INDEX(PK_RowNum))
SELECT * FROM #DailyBalance[/font]
It looks like a correlated subquery, but unless I screwed it up, it performs about the same as setting variables. Method based on: http://www.sqlservercentral.com/articles/Advanced+Querying/61716/
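For reference, the variable-assignment ("running update") pattern from that article looks roughly like this (a sketch only, untested here; the numeric precision is assumed, and the article covers the rules that make the ordering safe):
-- Walks the clustered index in (AccountID, RowNum) order, carrying the last
-- non-NULL balance forward in a variable as it goes.
DECLARE @PrevPoints NUMERIC(18, 2)  -- precision assumed for the sketch
DECLARE @PrevAcct   INT

UPDATE #DailyBalance
SET @PrevPoints = PointsEarnedTotal =
       CASE WHEN PointsEarnedTotal IS NULL AND AccountID = @PrevAcct
            THEN @PrevPoints                   -- gap day: repeat the prior balance
            ELSE ISNULL(PointsEarnedTotal, 0)  -- real value, or first row of a new account
       END,
    @PrevAcct = AccountID
FROM #DailyBalance WITH (TABLOCKX, INDEX(PK_RowNum))
OPTION (MAXDOP 1)  -- serial plan so the index order is honored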
October 15, 2008 at 12:27 pm
Seth,
Thank you!! That appears to work perfectly.
-Marianne
October 15, 2008 at 1:34 pm
My solution is much like Seth's, except it doesn't require altering your tables to add row IDs. You can get the job done just by looking for the last transaction date with any points. I am including it here because the execution plan is interesting: apparently the optimizer resolves the "SELECT TOP 1" subquery for all instances in one sort operation, producing a small worktable that is joined in with a nested loop. The whole thing ran in about 16 ms with the volumes you supplied... faster than I would have expected.
--------------------- Gaps Problem
select row_number() over (order by d.accountID,d.date) as rowID,d.*, isnull(p.pointsEarnedTotal
,(select top 1 pointsEarnedTotal
from #pointBalTotal
where accountID = d.accountID
and transactionDate < d.date
and pointsEarnedTotal is not null
order by transactionDate desc)) as pointsEarnedTotal
from #date_dim d
left join #pointBalTotal p on p.accountID = d.accountID and p.transactionDate = d.date
order by d.accountID,d.date
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 15, 2008 at 2:12 pm
Thanks for the additional option - I'll give it a try. It looks like it would allow me to eliminate one of my temp tables (#DailyBalance), which would be nice.
Thanks,
Marianne
October 15, 2008 at 3:40 pm
You're very welcome.
If you have time, please let us know if performance is comparable. I'm sure you can speed it up by making accountID/date and accountID/transaction date the primary keys when you create your two temporary tables.
Good luck 🙂
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 15, 2008 at 6:49 pm
Marianne,
Actually, I'd be very interested in the performance of the queries. Could you tell us how many rows ended up in the table and how many of them had NULLs that needed to be updated... I'd like to do some testing and I'll post the results. Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 15, 2008 at 6:49 pm
While my subquery seems to fit all of those criteria, I'm not sure that it's creating a triangular join in this case. The subquery method is blazingly fast; it was one of a dozen different things I tried when I was evaluating your running totals method last week, and it was the only one that actually came close to equaling it. It may be that the triangular join is created, but because all the fields are covered by the clustered index, it's quick enough that it's not a big factor.
My original tests on this in a comparison: http://www.sqlservercentral.com/Forums/Topic582878-338-1.aspx#bm583453
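(For anyone new to the term: a triangular join is a self-join on an inequality, so each row gets compared against all of its predecessors. A minimal illustration, separate from either solution in this thread:)
-- Row N of each account joins to rows 1..N-1, so an account with N days
-- produces roughly N*(N-1)/2 comparisons unless something cuts the scan short.
SELECT a.AccountID, a.RowNum, COUNT(*) AS EarlierRowsCompared
FROM #DailyBalance a
JOIN #DailyBalance b
  ON b.AccountID = a.AccountID
 AND b.RowNum < a.RowNum
GROUP BY a.AccountID, a.RowNum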
October 16, 2008 at 8:16 am
I'll try to spend some time comparing the two methods today or tomorrow. The data I posted yesterday was just a small sample, but since I'm working in dev, even when I run it against all the data, I won't have much there. This is just a small (but critical) part of a larger stored proc I am developing.
What kind of information do you want? Just the execution plan?
-Marianne
October 16, 2008 at 9:58 am
Time and IO stats would be nice too, please 🙂
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 16, 2008 at 11:06 am
Probably:
CPU
Reads
Writes
Duration
Number of total rows you are working with
Number of rows being replaced by the function. (Number of NULLS)
Number of rows in the final output.
Execution plan
October 16, 2008 at 1:39 pm
... and prevailing wind direction and velocity. Most important. 😉
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 16, 2008 at 1:58 pm
Well, most importantly - in Dallas, TX, the wind is blowing NNW at 10 mph. It's cloudy and 62 degrees. 😀
I hope I was able to make an equitable comparison. The data that gets into #PointBalTotal and #DATE_DIM gets filtered earlier in the procedure (due to various business rules), so I removed the filter in order to get a larger volume of data with which to work. I have not included the sample data for this larger volume. If necessary, I can provide some additional data, but probably not all of it (see the record counts below).
I created those tables in tempdb and added primary keys.
ALTER TABLE tempdb.dbo.DATE_DIM
ADD CONSTRAINT PK_DateDim PRIMARY KEY CLUSTERED (AccountID, Date)
ALTER TABLE tempdb.dbo.PointBalTotal
ADD CONSTRAINT PK_PointBal PRIMARY KEY CLUSTERED (AccountID, TransactionDate)
tempdb.dbo.PointBalTotal has 90,302 records (37983 distinct AccountID);
tempdb.dbo.DATE_DIM has 2,889,040 records (37520 distinct AccountID; dates from 8/1/2008 - 10/16/2008).
Once joined, there are 2,850,672 records with NULL PointsEarnedTotal values to be updated.
I have attached the output from my queries (MethodOne.txt and MethodTwo.txt).
Method One - #DailyBalance is created using row_number(), then the NULL values are updated. I have to do a separate update for NULL values on the first date in the range.
SET STATISTICS IO ON
SET STATISTICS TIME ON
SET STATISTICS PROFILE ON
--create #DailyBalance and fill in missing dates using the date dimension table
SELECT row_number() OVER (PARTITION BY dd.AccountID ORDER BY Date) as RowNum,
dd.AccountID, Date, PointsEarnedTotal
INTO #DailyBalance
FROM tempdb.dbo.PointBalTotal pbt
RIGHT JOIN tempdb.dbo.DATE_DIM dd on Date = Transactiondate
AND pbt.AccountID = dd.AccountID
--if the initial record in the date range has null points, update it to 0
UPDATE #DailyBalance
SET PointsEarnedTotal = 0
WHERE RowNum = 1 AND PointsEarnedTotal is null
ALTER TABLE #DailyBalance
ALTER COLUMN RowNum INT NOT NULL
ALTER TABLE #DailyBalance
ALTER COLUMN AccountID INT NOT NULL
ALTER TABLE #DailyBalance
ADD CONSTRAINT PK_RowNum PRIMARY KEY CLUSTERED (AccountID, RowNum)
UPDATE #DailyBalance
SET PointsEarnedTotal
= ISNULL(PointsEarnedTotal,(SELECT TOP 1 PointsEarnedTotal
FROM #DailyBalance
WHERE RowNum < D.RowNum AND AccountID = D.AccountID
AND PointsEarnedTotal IS NOT NULL ORDER BY RowNum DESC))
FROM #DailyBalance D WITH (INDEX(PK_RowNum))
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
SET STATISTICS PROFILE OFF
select 'DailyBalance Count', count(1) from #DailyBalance
Method Two - #DailyBalance is created without null values (except where the value was null on the first date - I update those at the end)
SET STATISTICS IO ON
SET STATISTICS TIME ON
SET STATISTICS PROFILE ON
select --row_number() over (order by d.accountID,d.date) as rowID,
d.*, isnull(p.pointsEarnedTotal
,(select top 1 pointsEarnedTotal
from tempdb.dbo.pointBalTotal
where accountID = d.accountID
and transactionDate < d.date
and pointsEarnedTotal is not null
order by transactionDate desc)) as pointsEarnedTotal
INTO #DailyBalance
from tempdb.dbo.date_dim d
left join tempdb.dbo.pointBalTotal p on p.accountID = d.accountID and p.transactionDate = d.date
order by d.accountID,d.date
UPDATE #DailyBalance
SET PointsEarnedTotal = 0
WHERE PointsEarnedTotal is null
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
SET STATISTICS PROFILE OFF
select 'DailyBalance Count', count(1) from #DailyBalance
I hope this gives enough information. If not, please let me know.