Updating a Column Using a Recursive Definition

  • On the basis of your stated usage of the numbers being generated, I'd be inclined to interpolate. I'd also compute the number of missing values as a percentage of the number of contributing values, so that you can attach a "confidence" number to each result as well. Having done statistics, I know that this kind of measure helps define the overall accuracy of your model and provides clear documentation of how the data is derived.
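    A minimal sketch of that confidence measure, assuming the table and column names used later in this thread (tblHistoricalData, SymbolID, Volume):

    -- Missing values as a percentage of contributing (non-NULL) values, per symbol.
    SELECT SymbolID,
           COUNT(Volume)                    AS ContributingValues,
           COUNT(*) - COUNT(Volume)         AS MissingValues,
           100.0 * (COUNT(*) - COUNT(Volume))
                 / NULLIF(COUNT(Volume), 0) AS MissingPct
      FROM tblHistoricalData
     GROUP BY SymbolID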

    One final thought - I might choose not to use float, but instead go with a decimal data type and max out the number of digits beyond the decimal point. It might slow things down a little, but when you figure you're running some 1,100 calculations per symbol per 5 years of data, and that float introduces a tiny error every time it can't represent a value exactly, those errors may pile up faster than the truncation error from the 30 or so decimal places the decimal data type can carry - in which case decimal ends up more accurate. It would be worth some testing to determine whether float loses enough precision for it to matter.
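    A rough sketch of that test, assuming a WHILE loop stands in for the real per-symbol series and borrowing the 0.03/0.97 weights used later in this thread:

    -- Run the same recursion ~1,100 times in FLOAT and in DECIMAL(38,20),
    -- then compare the two results to see how far float drifts.
    DECLARE @i INT,
            @Volume INT,
            @AvgFloat FLOAT,
            @AvgDec DECIMAL(38,20)
     SELECT @i = 1,
            @AvgFloat = 1000000,
            @AvgDec = 1000000

    WHILE @i <= 1100
    BEGIN
        SET @Volume = 1000000 + (@i * 37) % 500000  -- stand-in for real volumes
        SET @AvgFloat = 0.03E * @Volume + 0.97E * @AvgFloat
        SET @AvgDec   = 0.03  * @Volume + 0.97  * @AvgDec
        SET @i = @i + 1
    END

    SELECT @AvgFloat AS FloatResult,
           @AvgDec   AS DecimalResult,
           @AvgDec - CAST(@AvgFloat AS DECIMAL(38,20)) AS Drift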

    Steve

    (aka smunson)

    :):):)


  • smunson (12/12/2008)


    Looks like Peso nailed the concept for using the "quirky update" to get at the data,...

    Ya know what? I missed Peso's post... that's just about where I'm going with all of this.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • bkmooney (12/12/2008)


    Jeff,

    I'm attaching the file with the formatted test data. It should generate about 150 rows of data for you. And because it makes things more interesting (:)), I made sure to include some situations which can throw the recursion off (skipped days or days where various values are NULL).

    Please let me know if you have any problems receiving the data.

    Thanks again for all your help!

    All set... your data came through fine and I've got it loaded up to play with. Thanks for taking the time to do this. Now, I just need to review your fine documentation in your previous code to make sure I get all the rules right.

    --Jeff Moden



  • bkmooney (12/12/2008)


    Jeff,

    I'm attaching the file with the formatted test data. It should generate about 150 rows of data for you. And because it makes things more interesting (:)), I made sure to include some situations which can throw the recursion off (skipped days or days where various values are NULL).

    Please let me know if you have any problems receiving the data.

    Thanks again for all your help!

    Ok... I think maybe this is it... I'm using the test table called #Test_HistoricalData that you provided as an attachment. The code copies the data to a work area so we can put the Clustered Index (a PK in this case) on the necessary columns to make it work, then calculates the moving averages using the formulas you and Peso provided, and then updates the original table from that. Here's the code... yeah, I know... it looks awfully short for what it does and that's the beauty of this type of "pseudo cursor" endearingly referred to as the "quirky update"....

    --===== If the work table already exists, drop it
    IF OBJECT_ID('TempDB..#Work','U') IS NOT NULL
       DROP TABLE #Work

    --===== Create and populate the work table from the history table on the fly.
         -- The ISNULLs keep us from having to alter the columns to NOT NULL
         -- so we can add the clustered PK on those columns.
         -- The ExpAvgVol column is an empty column defined as FLOAT(53) to keep as
         -- much precision and scale as possible during the float calculations.
     SELECT RowID,
            ISNULL(SymbolID,0) AS SymbolID,
            ISNULL(MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
            Volume,
            CAST(0 AS FLOAT(53)) AS ExpAvgVol
       INTO #Work
       FROM #Test_HistoricalData

    --===== Now, we add the clustered index/key that makes the "quirky update" actually
         -- work on rows in the update order that we want. This only works when we're
         -- doing an update that results in an index scan without parallelism occurring.
      ALTER TABLE #Work
        ADD PRIMARY KEY CLUSTERED (SymbolID,MarketDate) WITH FILLFACTOR = 100

    --===== All set... define some obviously named variables...
    DECLARE @PrevExpAvgVol FLOAT,
            @PrevSymbolID  INT
     SELECT @PrevExpAvgVol = 0,
            @PrevSymbolID  = 0

    --===== ... do the "quirky update" in the order controlled by the clustered index...
         -- The rules in the CASE statement are obvious
     UPDATE #Work
        SET @PrevExpAvgVol = ExpAvgVol =
                CASE WHEN SymbolID = @PrevSymbolID AND Volume IS NOT NULL
                     THEN 0.03E * Volume + 0.97E * ISNULL(@PrevExpAvgVol,Volume)
                     WHEN SymbolID = @PrevSymbolID AND Volume IS NULL
                     THEN @PrevExpAvgVol
                     ELSE Volume
                END,
            @PrevSymbolID = SymbolID
       FROM #Work WITH(INDEX(0))
     OPTION (MAXDOP 1) -- no parallelism, per the note on the clustered index above

    --===== ... and update the original table from the results.
     UPDATE h
        SET ExpAvgVol = w.ExpAvgVol
       FROM #Test_HistoricalData h
      INNER JOIN #Work w
         ON h.RowID = w.RowID

    --===== As you said before, display what we ended up with in the work table
         -- so we can see that it actually worked
     SELECT *
       FROM #Work
      ORDER BY SymbolID, MarketDate

    If the code violates any of your rules, post back and we'll tweak it. 😉 This should update at a rate of about a million rows every 7 seconds in the work table. I've not tested the final update for performance... we may be on the verge of hitting the "tipping point" that every machine has for updates when you're talking about 5 million rows of joined updates... but only you will be able to test for that because you have the data. (If it does become a problem, see the batching sketch below.)
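    If that tipping point does bite, one common workaround (a sketch, not part of the posted solution; it assumes RowID is an integer key and a hypothetical batch size of one million) is to break the final update into RowID-keyed batches:

    -- Hypothetical batching of the final update, one RowID range at a time.
    DECLARE @MinRowID INT,
            @MaxRowID INT,
            @BatchSize INT
     SELECT @MinRowID = MIN(RowID),
            @MaxRowID = MAX(RowID),
            @BatchSize = 1000000
       FROM #Work

    WHILE @MinRowID <= @MaxRowID
    BEGIN
        UPDATE h
           SET ExpAvgVol = w.ExpAvgVol
          FROM #Test_HistoricalData h
         INNER JOIN #Work w
            ON h.RowID = w.RowID
         WHERE w.RowID >= @MinRowID
           AND w.RowID <  @MinRowID + @BatchSize

        SET @MinRowID = @MinRowID + @BatchSize
    END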

    --Jeff Moden



  • This seems to work for the most part, but the version you posted had a slight quirk: the calculations were a day off. This is a trailing volume calculation, so ExpAvgVol on 12/15/08 should be calculated from volume values for 12/14/08 and prior. (That's why I was keeping track of the previous day's volume.)

    I made a couple of adjustments (code below) and now have it returning results like:

    RowID     SymbolID  Date                     Volume   ExpAvgVol
    17159660  9         2007-01-03 16:00:00.000  2575800  NULL
    17170587  9         2007-01-04 16:00:00.000  2073700  2575800
    17181576  9         2007-01-05 16:00:00.000  2676600  2560737
    17192510  9         2007-01-08 16:00:00.000  1557200  2564212.89
    ...
    17236178  9         2007-01-12 16:00:00.000  1405900  2477273.27409432
    17158435  10        2007-01-01 16:00:00.000  NULL     NULL
    17159021  10        2007-01-02 16:00:00.000  28575    28575
    17159661  10        2007-01-03 16:00:00.000  216733   28575
    17170588  10        2007-01-04 16:00:00.000  238721   34219.74
    ...
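    (As a quick sanity check of those numbers - simply re-running the thread's formula by hand - the 2007-01-04 value for SymbolID 10 is 0.03 times the prior day's volume plus 0.97 times the prior average:)

    SELECT 0.03E * 216733 + 0.97E * 28575 AS ExpAvgVol -- = 34219.74, matching the row above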

    This is exactly what it should be, except for cases where there is a leading NULL volume, as with SymbolID=10 above. In those cases, the value shown in red in the original post (the ExpAvgVol of 28575 on 2007-01-02) should actually be NULL. The problem is, I can't seem to figure out how to make that happen without messing everything else up in the process.

    If you've got any ideas, I'd really appreciate it. I feel like I'm chasing my tail a bit with this one. (However, I am definitely learning a lot in the process, so that's always a good thing.)

    Thanks! =)

    And here's the code I'm using at the moment:

    --===== If the work table already exists, drop it
    IF OBJECT_ID('TempDB..#Work','U') IS NOT NULL
       DROP TABLE #Work

    --===== Create and populate the work table from the history table on the fly.
         -- The ISNULLs keep us from having to alter the columns to NOT NULL
         -- so we can add the clustered PK on those columns.
         -- The ExpAvgVol column is an empty column defined as FLOAT(53) to keep as
         -- much precision and scale as possible during the float calculations.
     SELECT RowID,
            ISNULL(SymbolID,0) AS SymbolID,
            ISNULL(MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
            Volume,
            CAST(0 AS FLOAT(53)) AS ExpAvgVol
       INTO #Work
       FROM Test_HistoricalData
      WHERE MarketDate <= '2007-01-15'

    --===== Now, we add the clustered index/key that makes the "quirky update" actually
         -- work on rows in the update order that we want. This only works when we're
         -- doing an update that results in an index scan without parallelism occurring.
      ALTER TABLE #Work
        ADD PRIMARY KEY CLUSTERED (SymbolID,MarketDate) WITH FILLFACTOR = 100

    --===== All set... define some obviously named variables...
    DECLARE @PrevExpAvgVol FLOAT,
            @PrevSymbolID  INT,
            @PrNonNullVol  FLOAT
     SELECT @PrevExpAvgVol = 0,
            @PrevSymbolID  = 0,
            @PrNonNullVol  = 0

    --===== ... do the "quirky update" in the order controlled by the clustered index...
         -- The rules in the CASE statements are obvious
     UPDATE #Work
        SET @PrevExpAvgVol = ExpAvgVol =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN NULL
                     WHEN SymbolID = @PrevSymbolID
                     THEN 0.03E * ISNULL(@PrNonNullVol, Volume)
                        + 0.97E * ISNULL(@PrevExpAvgVol, ISNULL(@PrNonNullVol, Volume))
                     ELSE @PrNonNullVol
                END,
            @PrNonNullVol =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN Volume
                     WHEN @PrevExpAvgVol IS NOT NULL AND @PrevSymbolID = SymbolID
                     THEN Volume
                     ELSE @PrNonNullVol
                END,
            @PrevSymbolID = SymbolID
       FROM #Work WITH(INDEX(0))
     OPTION (MAXDOP 1) -- no parallelism, per the note on the clustered index above

    /*
    --===== ... and update the original table from the results.
     UPDATE h
        SET ExpAvgVol = w.ExpAvgVol
       FROM #Test_HistoricalData h
      INNER JOIN #Work w
         ON h.RowID = w.RowID
    */

    --===== As you said before, display what we ended up with in the work table
         -- so we can see that it actually worked
     SELECT *
       FROM #Work
      ORDER BY SymbolID, MarketDate
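    For what it's worth, a diagnostic along these lines (a sketch, not part of the code above; it assumes #Work is still in scope) lists the rows that violate the trailing definition - each symbol's computed averages on or before its first non-NULL Volume:

    -- Hypothetical check: these rows should have ExpAvgVol = NULL under the
    -- trailing definition, because no prior day's volume exists yet.
    SELECT w.SymbolID, w.MarketDate, w.Volume, w.ExpAvgVol
      FROM #Work w
     INNER JOIN (
                 SELECT SymbolID,
                        MIN(CASE WHEN Volume IS NOT NULL THEN MarketDate END) AS FirstVolDate
                   FROM #Work
                  GROUP BY SymbolID
                ) f
        ON f.SymbolID = w.SymbolID
     WHERE w.MarketDate <= f.FirstVolDate
       AND w.ExpAvgVol IS NOT NULL
     ORDER BY w.SymbolID, w.MarketDate

    Against the sample above, this flags the 28575 on 2007-01-02 for SymbolID 10.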

  • Sorry... been busy...

    Were you able to flush out the anomaly?

    --Jeff Moden



  • Yeah, I figured it out. The final implementation is below. I've done the updates for all my past data, and now as new data comes in, I'll be performing daily/weekly updates using this code.

    It's a little slower than I would like - running it for 12/15/08-12/16/08 took a little over 1 minute, and updated about 70K rows.

    I think what takes the most time is the step highlighted in red below - the second INSERT, the one that pulls each symbol's most recent non-NULL row. When @updateThrough is a date early in my data sequence, that step runs very quickly. When it's at the end of my data sequence (as in the latest date available), that step alone takes quite a bit of time - about 38 seconds when I ran just that statement with @updateThrough = 12/16/08. (That's a bit more than half of the total run time for the entire query.)

    I can only imagine this getting worse as my table tblHistoricalData accumulates more and more history. Is there an index or something I should add to make that step run a bit more efficiently? (See the index sketch after the code below.)

    Thanks!

    -- Declare variables to track when the data was last updated and the date to update through.
    DECLARE @lastUpdate DATETIME,
            @updateThrough DATETIME
    SET @lastUpdate = (SELECT MAX(MarketDate) FROM tblHistoricalData WHERE ExpAvgVol20 IS NOT NULL)
    SET @updateThrough = GETDATE()

    --===== If the work table already exists, drop it
    IF OBJECT_ID('TempDB..#Work','U') IS NOT NULL
       DROP TABLE #Work

    --===== Create and populate the work table from the history table on the fly.
         -- The ISNULLs keep us from having to alter the columns to NOT NULL
         -- so we can add the clustered PK on those columns.
         -- The ExpAvgVol20 column is defined as FLOAT(53) to keep as much
         -- precision and scale as possible during the float calculations.
         -- This insertion picks up all rows where the ExpAvgs are still NULL,
         -- with dates between the last update and today.
     SELECT RowID,
            ISNULL(SymbolID,0) AS SymbolID,
            ISNULL(MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
            Volume,
            CAST(ExpAvgVol20 AS FLOAT(53)) AS ExpAvgVol20,
            CAST(ExpAvgVol05 AS FLOAT(53)) AS ExpAvgVol05,
            CAST(ExpAvgVol004 AS FLOAT(53)) AS ExpAvgVol004
       INTO #Work
       FROM tblHistoricalData
      WHERE ExpAvgVol20 IS NULL
        AND MarketDate BETWEEN @lastUpdate AND @updateThrough

    -- This insertion adds the most recent row with NON-NULL ExpAvgs and Volume for
    -- each symbol, if one exists, to serve as the 'initial value' when iterating.
    -- (This is the step highlighted in red in the original post - the slow one.)
    INSERT INTO #Work (RowID, SymbolID, MarketDate, Volume, ExpAvgVol20, ExpAvgVol05, ExpAvgVol004)
    SELECT T.RowID,
           ISNULL(T.SymbolID,0) AS SymbolID,
           ISNULL(T.MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
           T.Volume,
           T.ExpAvgVol20,
           T.ExpAvgVol05,
           T.ExpAvgVol004
      FROM tblHistoricalData T
      LEFT JOIN (
                 SELECT SymbolID,
                        MAX(MarketDate) AS LatestDate
                   FROM tblHistoricalData
                  WHERE ExpAvgVol20 IS NOT NULL
                    AND Volume IS NOT NULL
                    AND MarketDate <= @updateThrough
                  GROUP BY SymbolID
                ) AS D
        ON D.SymbolID = T.SymbolID
       AND D.LatestDate = T.MarketDate
     WHERE D.LatestDate IS NOT NULL
       AND T.MarketDate <= @updateThrough

    -- Add the clustered index/key that makes the "quirky update" actually work
    -- on rows in the update order that we want. This only works when we're doing an
    -- update that results in an index scan without parallelism occurring.
    ALTER TABLE #Work
      ADD PRIMARY KEY CLUSTERED (SymbolID,MarketDate) WITH FILLFACTOR = 100

    -- Define variables for the iteration.
    DECLARE @PrevExpAvgVol20 FLOAT,
            @PrevExpAvgVol05 FLOAT,
            @PrevExpAvgVol004 FLOAT,
            @PrevSymbolID INT,
            @PrNonNullVol FLOAT
     SELECT @PrevExpAvgVol20 = 0,
            @PrevExpAvgVol05 = 0,
            @PrevExpAvgVol004 = 0,
            @PrevSymbolID = 0,
            @PrNonNullVol = 0

    -- Update the variables and the temporary work table.
     UPDATE #Work
        SET @PrevExpAvgVol20 = ExpAvgVol20 =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN ExpAvgVol20
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol20 IS NOT NULL
                     THEN ExpAvgVol20
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol20 IS NULL AND @PrNonNullVol IS NULL
                     THEN @PrevExpAvgVol20
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol20 IS NULL AND @PrNonNullVol IS NOT NULL
                     THEN 0.20E * @PrNonNullVol + 0.80E * ISNULL(@PrevExpAvgVol20,@PrNonNullVol)
                END,
            @PrevExpAvgVol05 = ExpAvgVol05 =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN ExpAvgVol05
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol05 IS NOT NULL
                     THEN ExpAvgVol05
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol05 IS NULL AND @PrNonNullVol IS NULL
                     THEN @PrevExpAvgVol05
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol05 IS NULL AND @PrNonNullVol IS NOT NULL
                     THEN 0.05E * @PrNonNullVol + 0.95E * ISNULL(@PrevExpAvgVol05,@PrNonNullVol)
                END,
            @PrevExpAvgVol004 = ExpAvgVol004 =
                CASE WHEN SymbolID <> @PrevSymbolID
                     THEN ExpAvgVol004
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol004 IS NOT NULL
                     THEN ExpAvgVol004
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol004 IS NULL AND @PrNonNullVol IS NULL
                     THEN @PrevExpAvgVol004
                     WHEN SymbolID = @PrevSymbolID AND ExpAvgVol004 IS NULL AND @PrNonNullVol IS NOT NULL
                     THEN 0.004E * @PrNonNullVol + 0.996E * ISNULL(@PrevExpAvgVol004,@PrNonNullVol)
                END,
            @PrNonNullVol = Volume,
            @PrevSymbolID = SymbolID
       FROM #Work WITH(INDEX(0))
     OPTION (MAXDOP 1) -- no parallelism, per the note on the clustered index above

    -- Update the original table with the results.
     UPDATE h
        SET ExpAvgVol20 = w.ExpAvgVol20,
            ExpAvgVol05 = w.ExpAvgVol05,
            ExpAvgVol004 = w.ExpAvgVol004
       FROM tblHistoricalData h
      INNER JOIN #Work w
         ON h.RowID = w.RowID
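    On the indexing question: the slow INSERT aggregates MAX(MarketDate) per SymbolID over rows where ExpAvgVol20 and Volume are non-NULL, so a covering nonclustered index along these lines might help. This is a sketch to test, not a guaranteed fix - the index name is made up, and the filtered WHERE clause requires SQL Server 2008 or later:

    -- Hypothetical covering index for the derived-table aggregate in the slow INSERT.
    CREATE NONCLUSTERED INDEX IX_tblHistoricalData_Symbol_Date
        ON tblHistoricalData (SymbolID, MarketDate)
     WHERE ExpAvgVol20 IS NOT NULL
       AND Volume IS NOT NULL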
