Updating a Column Using a Recursive Definition

  • On the basis of your stated usage of the numbers being generated, I'd be inclined to interpolate. I'd also compute the number of missing values as a percentage of the number of contributing values, so that you can attach a "confidence" number to each result as well. Having done statistics, I know that this kind of measure helps define the overall accuracy of your model and provides clear documentation of how the data is derived.
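    A minimal sketch of that confidence measure, assuming the table and column names used later in this thread (tblHistoricalData, SymbolID, Volume):

    -- Missing values as a percentage of contributing (non-NULL) values, per symbol.
    SELECT SymbolID,
           COUNT(Volume)                    AS ContributingValues,
           COUNT(*) - COUNT(Volume)         AS MissingValues,
           100.0 * (COUNT(*) - COUNT(Volume))
                 / NULLIF(COUNT(Volume), 0) AS MissingPct
      FROM tblHistoricalData
     GROUP BY SymbolID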

    One final thought - I might choose not to use float, but instead go with a decimal data type and max out the number of digits beyond the decimal point. It might slow things down a little, but when you figure you're running some 1,100 calculations per symbol per 5 years of data, and that float introduces a tiny error every time it can't represent a value exactly, those errors may pile up faster than the truncation error from the 30 or so decimal places the decimal data type can carry - in which case decimal ends up more accurate. It would be worth some testing to determine whether float loses enough precision for it to matter.
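    A rough sketch of that test, assuming a WHILE loop stands in for the real per-symbol series and borrowing the 0.03/0.97 weights used later in this thread:

    -- Run the same recursion ~1,100 times in FLOAT and in DECIMAL(38,20),
    -- then compare the two results to see how far float drifts.
    DECLARE @i INT,
            @Volume INT,
            @AvgFloat FLOAT,
            @AvgDec DECIMAL(38,20)
     SELECT @i = 1,
            @AvgFloat = 1000000,
            @AvgDec = 1000000

    WHILE @i <= 1100
    BEGIN
        SET @Volume = 1000000 + (@i * 37) % 500000  -- stand-in for real volumes
        SET @AvgFloat = 0.03E * @Volume + 0.97E * @AvgFloat
        SET @AvgDec   = 0.03  * @Volume + 0.97  * @AvgDec
        SET @i = @i + 1
    END

    SELECT @AvgFloat AS FloatResult,
           @AvgDec   AS DecimalResult,
           @AvgDec - CAST(@AvgFloat AS DECIMAL(38,20)) AS Drift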

    Steve

    (aka smunson)

    :):):)


  • smunson (12/12/2008)


    Looks like Peso nailed the concept for using the "quirky update" to get at the data,...

    Ya know what? I missed Peso's post... that's just about where I'm going with all of this.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • bkmooney (12/12/2008)


    Jeff,

    I'm attaching the file with the formatted test data. It should generate about 150 rows of data for you. And because it makes things more interesting (:)), I made sure to include some situations which can throw the recursion off (skipped days or days where various values are NULL).

    Please let me know if you have any problems receiving the data.

    Thanks again for all your help!

    All set... your data came through fine and I've got it loaded up to play with. Thanks for taking the time to do this. Now, I just need to review your fine documentation in your previous code to make sure I get all the rules right.

    --Jeff Moden



  • bkmooney (12/12/2008)


    Jeff,

    I'm attaching the file with the formatted test data. It should generate about 150 rows of data for you. And because it makes things more interesting (:)), I made sure to include some situations which can throw the recursion off (skipped days or days where various values are NULL).

    Please let me know if you have any problems receiving the data.

    Thanks again for all your help!

    Ok... I think maybe this is it... I'm using the test table called #Test_HistoricalData that you provided as an attachment. The code copies the data to a work area so we can put the Clustered Index (a PK in this case) on the necessary columns to make it work, then calculates the moving averages using the formulas you and Peso provided, and then updates the original table from that. Here's the code... yeah, I know... it looks awfully short for what it does and that's the beauty of this type of "pseudo cursor" endearingly referred to as the "quirky update"....

    --===== If the work table already exists, drop it
    IF OBJECT_ID('TempDB..#Work','U') IS NOT NULL
       DROP TABLE #Work

    --===== Create and populate the work table from the history table on the fly.
         -- The ISNULLs keep us from having to alter the columns to NOT NULL
         -- so we can add the clustered PK on those columns.
         -- The ExpAvgVol column is an empty column defined as FLOAT(53) to keep as
         -- much precision and scale as possible during the float calculations.
     SELECT RowID,
            ISNULL(SymbolID,0) AS SymbolID,
            ISNULL(MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
            Volume,
            CAST(0 AS FLOAT(53)) AS ExpAvgVol
       INTO #Work
       FROM #Test_HistoricalData

    --===== Now, we add the clustered index/key that makes the "quirky update" actually
         -- work on rows in the update order that we want. This only works when we're
         -- doing an update that results in an index scan without parallelism occurring.
      ALTER TABLE #Work
        ADD PRIMARY KEY CLUSTERED (SymbolID,MarketDate) WITH FILLFACTOR = 100

    --===== All set... define some obviously named variables...
    DECLARE @PrevExpAvgVol FLOAT,
            @PrevSymbolID  INT
     SELECT @PrevExpAvgVol = 0,
            @PrevSymbolID  = 0

    --===== ... do the "quirky update" in the order controlled by the clustered index...
         -- The rules in the CASE statement are obvious
     UPDATE #Work
        SET @PrevExpAvgVol = ExpAvgVol =
                CASE WHEN SymbolID = @PrevSymbolID AND Volume IS NOT NULL
                     THEN 0.03E * Volume + 0.97E * ISNULL(@PrevExpAvgVol,Volume)
                     WHEN SymbolID = @PrevSymbolID AND Volume IS NULL
                     THEN @PrevExpAvgVol
                     ELSE Volume
                END,
            @PrevSymbolID = SymbolID
       FROM #Work WITH(INDEX(0))
     OPTION (MAXDOP 1) -- no parallelism, per the note on the clustered index above

    --===== ... and update the original table from the results.
     UPDATE h
        SET ExpAvgVol = w.ExpAvgVol
       FROM #Test_HistoricalData h
      INNER JOIN #Work w
         ON h.RowID = w.RowID

    --===== As you said before, display what we ended up with in the work table
         -- so we can see that it actually worked
     SELECT *
       FROM #Work
      ORDER BY SymbolID, MarketDate

    If the code violates any of your rules, post back and we'll tweak it. 😉 This should update at a rate of about a million rows every 7 seconds in the work table. I've not tested the final update for performance... we may be on the verge of hitting the "tipping point" that every machine has for updates when you're talking about 5 million rows of joined updates... but only you will be able to test for that because you have the data. (If it does become a problem, see the batching sketch below.)
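    If that tipping point does bite, one common workaround (a sketch, not part of the posted solution; it assumes RowID is an integer key and a hypothetical batch size of one million) is to break the final update into RowID-keyed batches:

    -- Hypothetical batching of the final update, one RowID range at a time.
    DECLARE @MinRowID INT,
            @MaxRowID INT,
            @BatchSize INT
     SELECT @MinRowID = MIN(RowID),
            @MaxRowID = MAX(RowID),
            @BatchSize = 1000000
       FROM #Work

    WHILE @MinRowID <= @MaxRowID
    BEGIN
        UPDATE h
           SET ExpAvgVol = w.ExpAvgVol
          FROM #Test_HistoricalData h
         INNER JOIN #Work w
            ON h.RowID = w.RowID
         WHERE w.RowID >= @MinRowID
           AND w.RowID <  @MinRowID + @BatchSize

        SET @MinRowID = @MinRowID + @BatchSize
    END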

    --Jeff Moden



  • This seems to work for the most part, but the version you posted had a slight quirk: the calculations were a day off. This is a trailing volume calculation, so ExpAvgVol on 12/15/08 should be calculated from volume values for 12/14/08 and prior. (That's why I was keeping track of the previous day's volume.)

    I made a couple of adjustments (code below) and now have it returning results like:

    RowID     SymbolID  Date                     Volume   ExpAvgVol
    17159660  9         2007-01-03 16:00:00.000  2575800  NULL
    17170587  9         2007-01-04 16:00:00.000  2073700  2575800
    17181576  9         2007-01-05 16:00:00.000  2676600  2560737
    17192510  9         2007-01-08 16:00:00.000  1557200  2564212.89
    ...
    17236178  9         2007-01-12 16:00:00.000  1405900  2477273.27409432
    17158435  10        2007-01-01 16:00:00.000  NULL     NULL
    17159021  10        2007-01-02 16:00:00.000  28575    28575
    17159661  10        2007-01-03 16:00:00.000  216733   28575
    17170588  10        2007-01-04 16:00:00.000  238721   34219.74
    ...
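    (As a quick sanity check of those numbers - simply re-running the thread's formula by hand - the 2007-01-04 value for SymbolID 10 is 0.03 times the prior day's volume plus 0.97 times the prior average:)

    SELECT 0.03E * 216733 + 0.97E * 28575 AS ExpAvgVol -- = 34219.74, matching the row above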

    This is exactly what it should be, except for cases where there is a leading NULL volume, as with SymbolID=10 above. In those cases, the value shown in red in the original post (the ExpAvgVol of 28575 on 2007-01-02) should actually be NULL. The problem is, I can't seem to figure out how to make that happen without messing everything else up in the process.

    If you've got any ideas, I'd really appreciate it. I feel like I'm chasing my tail a bit with this one. (However, I am definitely learning a lot in the process, so that's always a good thing.)

    Thanks! =)

    And here's the code I'm using at the moment:

    --===== If the work table already exists, drop it
    IF OBJECT_ID('TempDB..#Work','U') IS NOT NULL
       DROP TABLE #Work

    --===== Create and populate the work table from the history table on the fly.
         -- The ISNULLs keep us from having to alter the columns to NOT NULL
         -- so we can add the clustered PK on those columns.
         -- The ExpAvgVol column is an empty column defined as FLOAT(53) to keep as
         -- much precision and scale as possible during the float calculations.
     SELECT RowID,
            ISNULL(SymbolID,0) AS SymbolID,
            ISNULL(MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
            Volume,
            CAST(0 AS FLOAT(53)) AS ExpAvgVol
       INTO #Work
       FROM Test_HistoricalData
      WHERE MarketDate <= '2007-01-15'

    --===== Now, we add the clustered index/key that makes the "quirky update" actually
         -- work on rows in the update order that we want. This only works when we're
         -- doing an update that results in an index scan without parallelism occurring.
      ALTER TABLE #Work
        ADD PRIMARY KEY CLUSTERED (SymbolID,MarketDate) WITH FILLFACTOR = 100

    --===== All set... define some obviously named variables...
    DECLARE @PrevExpAvgVol FLOAT,
            @PrevSymbolID  INT,
            @PrNonNullVol  FLOAT
     SELECT @PrevExpAvgVol = 0,
            @PrevSymbolID  = 0,
            @PrNonNullVol  = 0

    --===== ... do the "quirky update" in the order controlled by the clustered index...
         -- The rules in the CASE statements are obvious
     UPDATE #Work
        SET @PrevExpAvgVol = ExpAvgVol =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN NULL
                     WHEN SymbolID = @PrevSymbolID
                     THEN 0.03E * ISNULL(@PrNonNullVol, Volume)
                        + 0.97E * ISNULL(@PrevExpAvgVol, ISNULL(@PrNonNullVol, Volume))
                     ELSE @PrNonNullVol
                END,
            @PrNonNullVol =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN Volume
                     WHEN @PrevExpAvgVol IS NOT NULL AND @PrevSymbolID = SymbolID
                     THEN Volume
                     ELSE @PrNonNullVol
                END,
            @PrevSymbolID = SymbolID
       FROM #Work WITH(INDEX(0))
     OPTION (MAXDOP 1) -- no parallelism, per the note on the clustered index above

    /*
    --===== ... and update the original table from the results.
     UPDATE h
        SET ExpAvgVol = w.ExpAvgVol
       FROM #Test_HistoricalData h
      INNER JOIN #Work w
         ON h.RowID = w.RowID
    */

    --===== As you said before, display what we ended up with in the work table
         -- so we can see that it actually worked
     SELECT *
       FROM #Work
      ORDER BY SymbolID, MarketDate
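    For what it's worth, a diagnostic along these lines (a sketch, not part of the code above; it assumes #Work is still in scope) lists the rows that violate the trailing definition - each symbol's computed averages on or before its first non-NULL Volume:

    -- Hypothetical check: these rows should have ExpAvgVol = NULL under the
    -- trailing definition, because no prior day's volume exists yet.
    SELECT w.SymbolID, w.MarketDate, w.Volume, w.ExpAvgVol
      FROM #Work w
     INNER JOIN (
                 SELECT SymbolID,
                        MIN(CASE WHEN Volume IS NOT NULL THEN MarketDate END) AS FirstVolDate
                   FROM #Work
                  GROUP BY SymbolID
                ) f
        ON f.SymbolID = w.SymbolID
     WHERE w.MarketDate <= f.FirstVolDate
       AND w.ExpAvgVol IS NOT NULL
     ORDER BY w.SymbolID, w.MarketDate

    Against the sample above, this flags the 28575 on 2007-01-02 for SymbolID 10.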

  • Sorry... been busy...

    Were you able to flush out the anomaly?

    --Jeff Moden



  • Yeah, I figured it out. The final implementation is below. I've done the updates for all my past data, and now as new data comes in, I'll be performing daily/weekly updates using this code.

    It's a little slower than I would like - running it for 12/15/08-12/16/08 took a little over 1 minute, and updated about 70K rows.

    I think what takes the most time is the step highlighted in red below - the second INSERT, the one that pulls each symbol's most recent non-NULL row. When @updateThrough is a date early in my data sequence, that step runs very quickly. When it's at the end of my data sequence (as in the latest date available), that step alone takes quite a bit of time - about 38 seconds when I ran just that statement with @updateThrough = 12/16/08. (That's a bit more than half of the total run time for the entire query.)

    I can only imagine this getting worse as my table tblHistoricalData accumulates more and more history. Is there an index or something I should add to make that step run a bit more efficiently? (See the index sketch after the code below.)

    Thanks!

    -- Declare variables to track when the data was last updated and the date to update through.
    DECLARE @lastUpdate DATETIME,
            @updateThrough DATETIME
    SET @lastUpdate = (SELECT MAX(MarketDate) FROM tblHistoricalData WHERE ExpAvgVol20 IS NOT NULL)
    SET @updateThrough = GETDATE()

    --===== If the work table already exists, drop it
    IF OBJECT_ID('TempDB..#Work','U') IS NOT NULL
       DROP TABLE #Work

    --===== Create and populate the work table from the history table on the fly.
         -- The ISNULLs keep us from having to alter the columns to NOT NULL
         -- so we can add the clustered PK on those columns.
         -- The ExpAvgVol20 column is defined as FLOAT(53) to keep as much
         -- precision and scale as possible during the float calculations.
         -- This insertion picks up all rows where the ExpAvgs are still NULL,
         -- with dates between the last update and today.
     SELECT RowID,
            ISNULL(SymbolID,0) AS SymbolID,
            ISNULL(MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
            Volume,
            CAST(ExpAvgVol20 AS FLOAT(53)) AS ExpAvgVol20,
            CAST(ExpAvgVol05 AS FLOAT(53)) AS ExpAvgVol05,
            CAST(ExpAvgVol004 AS FLOAT(53)) AS ExpAvgVol004
       INTO #Work
       FROM tblHistoricalData
      WHERE ExpAvgVol20 IS NULL
        AND MarketDate BETWEEN @lastUpdate AND @updateThrough

    -- This insertion adds the most recent row with NON-NULL ExpAvgs and Volume for
    -- each symbol, if one exists, to serve as the 'initial value' when iterating.
    -- (This is the step highlighted in red in the original post - the slow one.)
    INSERT INTO #Work (RowID, SymbolID, MarketDate, Volume, ExpAvgVol20, ExpAvgVol05, ExpAvgVol004)
    SELECT T.RowID,
           ISNULL(T.SymbolID,0) AS SymbolID,
           ISNULL(T.MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
           T.Volume,
           T.ExpAvgVol20,
           T.ExpAvgVol05,
           T.ExpAvgVol004
      FROM tblHistoricalData T
      LEFT JOIN (
                 SELECT SymbolID,
                        MAX(MarketDate) AS LatestDate
                   FROM tblHistoricalData
                  WHERE ExpAvgVol20 IS NOT NULL
                    AND Volume IS NOT NULL
                    AND MarketDate <= @updateThrough
                  GROUP BY SymbolID
                ) AS D
        ON D.SymbolID = T.SymbolID
       AND D.LatestDate = T.MarketDate
     WHERE D.LatestDate IS NOT NULL
       AND T.MarketDate <= @updateThrough

    -- Add the clustered index/key that makes the "quirky update" actually work
    -- on rows in the update order that we want. This only works when we're doing an
    -- update that results in an index scan without parallelism occurring.
    ALTER TABLE #Work
      ADD PRIMARY KEY CLUSTERED (SymbolID,MarketDate) WITH FILLFACTOR = 100

    -- Define variables for the iteration.
    DECLARE @PrevExpAvgVol20 FLOAT,
            @PrevExpAvgVol05 FLOAT,
            @PrevExpAvgVol004 FLOAT,
            @PrevSymbolID INT,
            @PrNonNullVol FLOAT
     SELECT @PrevExpAvgVol20 = 0,
            @PrevExpAvgVol05 = 0,
            @PrevExpAvgVol004 = 0,
            @PrevSymbolID = 0,
            @PrNonNullVol = 0

    -- Update the variables and the temporary work table.
     UPDATE #Work
        SET @PrevExpAvgVol20 = ExpAvgVol20 =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN ExpAvgVol20
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol20 IS NOT NULL
                     THEN ExpAvgVol20
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol20 IS NULL AND @PrNonNullVol IS NULL
                     THEN @PrevExpAvgVol20
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol20 IS NULL AND @PrNonNullVol IS NOT NULL
                     THEN 0.20E * @PrNonNullVol + 0.80E * ISNULL(@PrevExpAvgVol20,@PrNonNullVol)
                END,
            @PrevExpAvgVol05 = ExpAvgVol05 =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN ExpAvgVol05
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol05 IS NOT NULL
                     THEN ExpAvgVol05
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol05 IS NULL AND @PrNonNullVol IS NULL
                     THEN @PrevExpAvgVol05
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol05 IS NULL AND @PrNonNullVol IS NOT NULL
                     THEN 0.05E * @PrNonNullVol + 0.95E * ISNULL(@PrevExpAvgVol05,@PrNonNullVol)
                END,
            @PrevExpAvgVol004 = ExpAvgVol004 =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN ExpAvgVol004
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol004 IS NOT NULL
                     THEN ExpAvgVol004
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol004 IS NULL AND @PrNonNullVol IS NULL
                     THEN @PrevExpAvgVol004
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol004 IS NULL AND @PrNonNullVol IS NOT NULL
                     THEN 0.004E * @PrNonNullVol + 0.996E * ISNULL(@PrevExpAvgVol004,@PrNonNullVol)
                END,
            @PrNonNullVol = Volume,
            @PrevSymbolID = SymbolID
       FROM #Work WITH(INDEX(0))
     OPTION (MAXDOP 1) -- no parallelism, per the note on the clustered index above

    -- Update the original table with the results.
     UPDATE h
        SET ExpAvgVol20 = w.ExpAvgVol20,
            ExpAvgVol05 = w.ExpAvgVol05,
            ExpAvgVol004 = w.ExpAvgVol004
       FROM tblHistoricalData h
      INNER JOIN #Work w
         ON h.RowID = w.RowID
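    On the indexing question: the slow INSERT aggregates MAX(MarketDate) per SymbolID over rows where ExpAvgVol20 and Volume are non-NULL, so a covering nonclustered index along these lines might help. This is a sketch to test, not a guaranteed fix - the index name is made up, and the filtered WHERE clause requires SQL Server 2008 or later:

    -- Hypothetical covering index for the derived-table aggregate in the slow INSERT.
    CREATE NONCLUSTERED INDEX IX_tblHistoricalData_Symbol_Date
        ON tblHistoricalData (SymbolID, MarketDate)
     WHERE ExpAvgVol20 IS NOT NULL
       AND Volume IS NOT NULL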
