December 12, 2008 at 8:42 am
Based on how you've said the generated numbers will be used, I'd be inclined to interpolate. I'd also record the number of missing values as a percentage of the number of contributing values, so that you can assign a "confidence" number as well. Having done statistics work, I know that this kind of measure can help define the overall accuracy of your model, and it provides clear documentation of how the data is derived.
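Just to illustrate the idea, here's a rough Python sketch of that confidence measure (the function name and sample data are mine, not anything from your system):

```python
def interpolation_confidence(values):
    """Gauge how much of a series is interpolated: the count of missing
    (None) entries as a percentage of the contributing (non-missing)
    entries. A lower number means the derived averages rest on more
    real observations."""
    missing = sum(1 for v in values if v is None)
    contributing = len(values) - missing
    if contributing == 0:
        return None  # nothing real to base a calculation on
    return 100.0 * missing / contributing

# Two gaps against eight real observations -> 25.0
print(interpolation_confidence([10, None, 12, 13, None, 15, 16, 17, 18, 19]))
```

You could compute this per symbol and store it next to the averages so anyone reading the results later knows how much was filled in.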
One final thought - I might choose not to use float, but instead go with a decimal data type and max out the number of digits beyond the decimal point. It might slow things down a little, but when you figure you're running some 1,100 calculations per symbol per 5 years of data, and that every time float fails to represent the actual value exactly the errors can pile up, the decimal type's 30 or so digits beyond the decimal point may end up more accurate than float despite the truncation. It would be worth some testing to determine whether float loses enough precision for it to matter.
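That testing could be done outside the database first. Here's a rough Python sketch comparing a float running average against a high-precision decimal one over ~1,100 iterations (the synthetic volumes and the 0.03 smoothing factor are just assumptions for illustration):

```python
from decimal import Decimal, getcontext

getcontext().prec = 30  # roughly the 30-odd digits mentioned above

def ema_float(volumes, alpha=0.03):
    # Running exponential average in ordinary binary floating point.
    avg = volumes[0]
    for v in volumes[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg

def ema_decimal(volumes, alpha=Decimal("0.03")):
    # Same recurrence carried out in 30-digit decimal arithmetic.
    avg = Decimal(volumes[0])
    for v in volumes[1:]:
        avg = alpha * Decimal(v) + (1 - alpha) * avg
    return avg

# ~1,100 synthetic daily volumes in the millions (about 5 years of bars)
volumes = [1_000_000 + (i * 7919) % 500_000 for i in range(1100)]
drift = abs(Decimal(repr(ema_float(volumes))) - ema_decimal(volumes))
print(drift)
```

If the printed drift is negligible relative to the volumes, float is probably fine for this workload; if not, the decimal route earns its keep.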
Steve
(aka smunson)
:):):)
Rent Servers for Income (picks and shovels strategy)
December 12, 2008 at 6:36 pm
smunson (12/12/2008)
Looks like Peso nailed the concept for using the "quirky update" to get at the data,...
Ya know what? I missed Peso's post... that's just about where I'm going with all of this.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2008 at 6:54 pm
bkmooney (12/12/2008)
Jeff,I'm attaching the file with the formatted test data. It should generate about 150 rows of data for you. And because it makes things more interesting (:)), I made sure to include some situations which can throw the recursion off (skipped days or days where various values are NULL).
Please let me know if you have any problems receiving the data.
Thanks again for all your help!
All set... your data came through fine and I've got it loaded up to play with. Thanks for taking the time to do this. Now, I just need to review your fine documentation in your previous code to make sure I get all the rules right.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2008 at 7:58 pm
bkmooney (12/12/2008)
Jeff,I'm attaching the file with the formatted test data. It should generate about 150 rows of data for you. And because it makes things more interesting (:)), I made sure to include some situations which can throw the recursion off (skipped days or days where various values are NULL).
Please let me know if you have any problems receiving the data.
Thanks again for all your help!
Ok... I think maybe this is it... I'm using the test table called #Test_HistoricalData that you provided as an attachment. The code copies the data to a work area so we can put the Clustered Index (a PK in this case) on the necessary columns to make it work, then calculates the moving averages using the formulas you and Peso provided, and then updates the original table from that. Here's the code... yeah, I know... it looks awfully short for what it does and that's the beauty of this type of "pseudo cursor" endearingly referred to as the "quirky update"....
--===== If the work table already exists, drop it
IF OBJECT_ID('TempDB..#Work','U') IS NOT NULL
DROP TABLE #Work
--===== Create and populate the work table from the history table on the fly.
-- The ISNULL's are to keep from having to alter the columns to NOT NULL
-- so we can add the clustered pk on those columns.
-- The ExpAvgVol column is an empty column defined as FLOAT(53) to keep as
-- much precision and scale as possible during the float calculations.
SELECT RowID,
ISNULL(SymbolID,0) AS SymbolID,
ISNULL(MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
Volume,
CAST(0 AS FLOAT(53)) AS ExpAvgVol
INTO #Work
FROM #Test_HistoricalData
--===== Now, we add the clustered index/key that makes the "quirky update" actually work
-- on rows in the update order that we want. This only works when we're doing an
-- update that results in an index scan without parallelism occurring.
ALTER TABLE #Work
ADD PRIMARY KEY CLUSTERED (SymbolID,MarketDate) WITH FILLFACTOR = 100
--===== All set... define some obvious named variables...
DECLARE @PrevExpAvgVol FLOAT,
@PrevSymbolID INT
SELECT @PrevExpAvgVol = 0,
@PrevSymbolID = 0
--===== ... do the "quirky update" in the order controlled by the clustered index...
-- The rules in the CASE statement are obvious
UPDATE #Work
SET @PrevExpAvgVol = ExpAvgVol = CASE WHEN SymbolID = @PrevSymbolID AND Volume IS NOT NULL
THEN 0.03E * Volume + 0.97E * ISNULL(@PrevExpAvgVol,Volume)
WHEN SymbolID = @PrevSymbolID AND Volume IS NULL
THEN @PrevExpAvgVol
ELSE Volume
END,
@PrevSymbolID = SymbolID
FROM #Work WITH(INDEX(0))
--===== ... and update the original table from the results.
UPDATE h
SET ExpAvgVol = w.ExpAvgVol
FROM #Test_HistoricalData h
INNER JOIN #Work w
ON h.RowID = w.RowID
--===== As you said before, display what we ended up with in the work table
-- so we can see that it actually worked
SELECT *
FROM #Work
ORDER BY SymbolID, MarketDate
If the code violates any of your rules, post back and we'll tweak it. 😉 This should update at the rate of about a million rows every 7 seconds in the work table. I've not tested the final update for performance... we may be on the verge of hitting the "tipping point" that every machine has for updates when you're talking about 5 million rows of joined updates... but only you will be able to test for that because you have the data.
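For anyone following along, the CASE expression above maps onto a plain running-state loop. Here's a rough Python equivalent (the function name and row layout are just my illustration, not from the real tables):

```python
def quirky_ema(rows, alpha=0.03):
    """Mirror of the CASE logic: rows are (SymbolID, Volume) tuples
    already sorted the way the clustered index orders them
    (SymbolID, MarketDate). Returns the ExpAvgVol for each row."""
    prev_avg, prev_symbol, out = None, None, []
    for symbol, volume in rows:
        if symbol == prev_symbol and volume is not None:
            # Normal step: blend today's volume into the running average
            avg = alpha * volume + (1 - alpha) * (prev_avg if prev_avg is not None else volume)
        elif symbol == prev_symbol:
            # Same symbol, NULL volume: carry the running average forward
            avg = prev_avg
        else:
            # First row of a new symbol: start over from its volume
            avg = volume
        out.append(avg)
        prev_avg, prev_symbol = avg, symbol
    return out

rows = [(9, 2575800), (9, 2073700), (9, None), (9, 2676600)]
print(quirky_ema(rows))
```

Note that, just like the "quirky update", this only produces correct answers because the rows arrive in clustered-index order; feed them in any other order and the running variables blend the wrong days together.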
--Jeff Moden
Change is inevitable... Change for the better is not.
December 15, 2008 at 9:45 am
This seems to work for the most part, but the version you posted had a slight quirk in that the calculations were a day off. This is a trailing volume calculation, so ExpAvgVol on 12/15/08 should be calculated with volume values for 12/14/08 and prior. (That was why I was keeping track of the day volume the day before.)
I made a couple of adjustments (code below) and now have it returning results like:
RowID     SymbolID  MarketDate               Volume    ExpAvgVol
17159660  9         2007-01-03 16:00:00.000  2575800   NULL
17170587  9         2007-01-04 16:00:00.000  2073700   2575800
17181576  9         2007-01-05 16:00:00.000  2676600   2560737
17192510  9         2007-01-08 16:00:00.000  1557200   2564212.89
...
17236178  9         2007-01-12 16:00:00.000  1405900   2477273.27409432
17158435  10        2007-01-01 16:00:00.000  NULL      NULL
17159021  10        2007-01-02 16:00:00.000  28575     28575
17159661  10        2007-01-03 16:00:00.000  216733    28575
17170588  10        2007-01-04 16:00:00.000  238721    34219.74
...
This is exactly what it should be except for cases where there is a leading NULL volume, like with SymbolID = 10 above. In those cases, the first computed value (the ExpAvgVol of 28575 on 2007-01-02) should actually be NULL, since there is no prior volume to base it on. The problem is, I can't seem to figure out how to make that happen without messing everything else up in the process.
If you've got any ideas, I'd really appreciate it. I feel like I'm chasing my tail with this one a bit. (However, am definitely learning a lot in the process, so that's always a good thing.)
Thanks! =)
And here's the code I'm using at the moment:
--===== If the work table already exists, drop it
IF OBJECT_ID('TempDB..#Work','U') IS NOT NULL
DROP TABLE #Work
--===== Create and populate the work table from the history table on the fly.
-- The ISNULL's are to keep from having to alter the columns to NOT NULL
-- so we can add the clustered pk on those columns.
-- The ExpAvgVol column is an empty column defined as FLOAT(53) to keep as
-- much precision and scale as possible during the float calculations.
SELECT RowID,
ISNULL(SymbolID,0) AS SymbolID,
ISNULL(MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
Volume,
CAST(0 AS FLOAT(53)) AS ExpAvgVol
INTO #Work
FROM Test_HistoricalData
WHERE MarketDate <= '2007-01-15'
--===== Now, we add the clustered index/key that makes the "quirky update" actually work
-- on rows in the update order that we want. This only works when we're doing an
-- update that results in an index scan without parallelism occurring.
ALTER TABLE #Work
ADD PRIMARY KEY CLUSTERED (SymbolID,MarketDate) WITH FILLFACTOR = 100
--===== All set... define some obvious named variables...
DECLARE @PrevExpAvgVol FLOAT,
@PrevSymbolID INT,
@PrNonNullVol FLOAT
SELECT @PrevExpAvgVol = 0,
@PrevSymbolID = 0,
@PrNonNullVol = 0
--===== ... do the "quirky update" in the order controlled by the clustered index...
-- The rules in the CASE statement are obvious
UPDATE #Work
SET @PrevExpAvgVol = ExpAvgVol = CASE WHEN SymbolID <> @PrevSymbolID
THEN NULL
WHEN SymbolID = @PrevSymbolID
THEN 0.03E * ISNULL(@PrNonNullVol, Volume) + 0.97E * ISNULL(@PrevExpAvgVol,ISNULL(@PrNonNullVol, Volume))
ELSE @PrNonNullVol
END,
@PrNonNullVol = CASE WHEN SymbolID <> @PrevSymbolID
THEN Volume
WHEN @PrevExpAvgVol IS NOT NULL AND @PrevSymbolID = SymbolID
THEN Volume
ELSE @PrNonNullVol
END,
@PrevSymbolID = SymbolID
FROM #Work WITH(INDEX(0))
/*
--===== ... and update the original table from the results.
UPDATE h
SET ExpAvgVol = w.ExpAvgVol
FROM #Test_HistoricalData h
INNER JOIN #Work w
ON h.RowID = w.RowID
*/
--===== As you said before, display what we ended up with in the work table
-- so we can see that it actually worked
SELECT *
FROM #Work
ORDER BY SymbolID, MarketDate
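In case it helps to pin down the behavior I'm after, here's the trailing logic as a small Python loop - just an illustration (the function name and tuple layout aren't from the real tables), but it reproduces the SymbolID = 10 values above, with the leading-NULL rule handled the way I want:

```python
def trailing_ema(rows, alpha=0.03):
    """Trailing version: today's ExpAvgVol uses volumes only through
    *yesterday*. Leading NULL volumes leave ExpAvgVol NULL until a prior
    non-NULL volume exists to seed the average. rows are (SymbolID,
    Volume) tuples sorted by SymbolID, MarketDate."""
    prev_symbol, prev_avg, prev_vol, out = None, None, None, []
    for symbol, volume in rows:
        if symbol != prev_symbol:
            prev_avg, prev_vol = None, None        # reset state per symbol
        if prev_avg is not None and prev_vol is not None:
            avg = alpha * prev_vol + (1 - alpha) * prev_avg
        elif prev_avg is not None:
            avg = prev_avg                         # yesterday's volume NULL: carry
        elif prev_vol is not None:
            avg = prev_vol                         # seed from first real volume
        else:
            avg = None                             # still in leading NULLs
        out.append(avg)
        prev_avg, prev_vol, prev_symbol = avg, volume, symbol
    return out

# SymbolID 10 sample: NULL, NULL, 28575, 34219.74 is the desired output
rows = [(10, None), (10, 28575), (10, 216733), (10, 238721)]
result = trailing_ema(rows)
print(result)
```

The second NULL in that output is exactly the value my current T-SQL gets wrong.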
December 16, 2008 at 7:44 pm
Sorry... been busy...
Were you able to flush out the anomaly?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 17, 2008 at 7:33 am
Yeah, I figured it out. The final implementation is below. I've done the updates for all my past data, and now as new data comes in, I'll be performing daily/weekly updates using this code.
It's a little slower than I would like - running it for 12/15/08-12/16/08 took a little over 1 minute, and updated about 70K rows.
I think what takes the most time is the second INSERT below - the one that pulls each symbol's most recent row with non-NULL ExpAvgs. When @updateThrough is a date early in my data sequence, this step runs very quickly. If it is at the end of my data sequence (as in the latest date available), that step alone takes quite a bit of time - about 38 seconds when I ran just that statement with @updateThrough=12/16/08. (So that's a bit more than half of the total run time for the entire query.)
I can only imagine this getting worse as my table tblHistoricalData begins to contain older and older data. Is there an index or something I should add to try and get this step to run a bit more efficiently?
Thanks!
-- Declares variables to keep track of when data last updated and when through.
DECLARE @lastUpdate DATETIME
, @updateThrough DATETIME
SET @lastUpdate=(SELECT MAX(MarketDate) FROM tblHistoricalData WHERE ExpAvgVol20 IS NOT NULL)
SET @updateThrough=GETDATE()
--===== If the work table already exists, drop it
IF OBJECT_ID('TempDB..#Work','U') IS NOT NULL
DROP TABLE #Work
--===== Create and populate the work table from the history table on the fly.
-- The ISNULL's are to keep from having to alter the columns to NOT NULL
-- so we can add the clustered pk on those columns.
-- The ExpAvgVol20 column is an empty column defined as FLOAT(53) to keep as
-- much precision and scale as possible during the float calculations.
-- This inserts all rows whose ExpAvg columns are still NULL, with dates
-- between the last update and today.
SELECT RowID,
ISNULL(SymbolID,0) AS SymbolID,
ISNULL(MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
Volume,
CAST(ExpAvgVol20 AS FLOAT(53)) AS ExpAvgVol20,
CAST(ExpAvgVol05 AS FLOAT(53)) AS ExpAvgVol05,
CAST(ExpAvgVol004 AS FLOAT(53)) AS ExpAvgVol004
INTO #Work
FROM tblHistoricalData
WHERE ExpAvgVol20 IS NULL AND MarketDate BETWEEN @lastUpdate AND @updateThrough
-- This inserts the most recent row with NON-NULL ExpAvgs and Volume
-- for each symbol, if one exists. Serves as the 'initial value' when iterating.
INSERT INTO #Work (RowID, SymbolID, MarketDate, Volume, ExpAvgVol20, ExpAvgVol05, ExpAvgVol004)
SELECT T.RowID,
ISNULL(T.SymbolID,0) AS SymbolID,
ISNULL(T.MarketDate,CAST(0 AS DATETIME)) AS MarketDate,
T.Volume,
T.ExpAvgVol20,
T.ExpAvgVol05,
T.ExpAvgVol004
FROM tblHistoricalData T
LEFT JOIN (
SELECT SymbolID
, MAX(MarketDate) AS LatestDate
FROM tblHistoricalData
WHERE ExpAvgVol20 IS NOT NULL
AND Volume IS NOT NULL
AND MarketDate <= @updateThrough
GROUP BY SymbolID
) AS D ON D.SymbolID = T.SymbolID AND D.LatestDate=T.MarketDate
WHERE D.LatestDate IS NOT NULL
AND T.MarketDate <= @updateThrough
-- Add the clustered index/key that makes the "quirky update" actually work
-- on rows in the update order that we want. This only works when we're doing an
-- update that results in an index scan without parallelism occurring.
ALTER TABLE #Work
ADD PRIMARY KEY CLUSTERED (SymbolID,MarketDate) WITH FILLFACTOR = 100
-- Defining Variables for iteration.
DECLARE @PrevExpAvgVol20 FLOAT,
@PrevExpAvgVol05 FLOAT,
@PrevExpAvgVol004 FLOAT,
@PrevSymbolID INT,
@PrNonNullVol FLOAT
SELECT @PrevExpAvgVol20 = 0,
@PrevExpAvgVol05 = 0,
@PrevExpAvgVol004 = 0,
@PrevSymbolID = 0,
@PrNonNullVol = 0
-- Updating the variables and the temporary work table.
UPDATE #Work
SET @PrevExpAvgVol20 = ExpAvgVol20 = CASE WHEN SymbolID <> @PrevSymbolID
THEN ExpAvgVol20
WHEN SymbolID = @PrevSymbolID AND ExpAvgVol20 IS NOT NULL
THEN ExpAvgVol20
WHEN SymbolID = @PrevSymbolID AND ExpAvgVol20 IS NULL AND @PrNonNullVol IS NULL
THEN @PrevExpAvgVol20
WHEN SymbolID = @PrevSymbolID AND ExpAvgVol20 IS NULL AND @PrNonNullVol IS NOT NULL
THEN 0.20E * @PrNonNullVol + 0.80E * ISNULL(@PrevExpAvgVol20,@PrNonNullVol)
END,
@PrevExpAvgVol05 = ExpAvgVol05 = CASE WHEN SymbolID <> @PrevSymbolID
THEN ExpAvgVol05
WHEN SymbolID = @PrevSymbolID AND ExpAvgVol05 IS NOT NULL
THEN ExpAvgVol05
WHEN SymbolID = @PrevSymbolID AND ExpAvgVol05 IS NULL AND @PrNonNullVol IS NULL
THEN @PrevExpAvgVol05
WHEN SymbolID = @PrevSymbolID AND ExpAvgVol05 IS NULL AND @PrNonNullVol IS NOT NULL
THEN 0.05E * @PrNonNullVol + 0.95E * ISNULL(@PrevExpAvgVol05,@PrNonNullVol)
END,
@PrevExpAvgVol004 = ExpAvgVol004 = CASE WHEN SymbolID <> @PrevSymbolID
THEN ExpAvgVol004
WHEN SymbolID = @PrevSymbolID AND ExpAvgVol004 IS NOT NULL
THEN ExpAvgVol004
WHEN SymbolID = @PrevSymbolID AND ExpAvgVol004 IS NULL AND @PrNonNullVol IS NULL
THEN @PrevExpAvgVol004
WHEN SymbolID = @PrevSymbolID AND ExpAvgVol004 IS NULL AND @PrNonNullVol IS NOT NULL
THEN 0.004E * @PrNonNullVol + 0.996E * ISNULL(@PrevExpAvgVol004,@PrNonNullVol)
END,
@PrNonNullVol = Volume,
@PrevSymbolID = SymbolID
FROM #Work WITH(INDEX(0))
-- Updating the original table with the results
UPDATE h
SET ExpAvgVol20 = w.ExpAvgVol20,
ExpAvgVol05 = w.ExpAvgVol05,
ExpAvgVol004 = w.ExpAvgVol004
FROM tblHistoricalData h
INNER JOIN #Work w
ON h.RowID = w.RowID
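For anyone reading later, the lock-step logic of that single UPDATE can be sketched in Python like this. It's an illustration only - the names are mine, and it skips the seed rows that already carry non-NULL averages - but it shows how the three smoothing factors advance together off the same previous-day volume:

```python
# The three smoothing factors used in the UPDATE above
SMOOTHING = {"ExpAvgVol20": 0.20, "ExpAvgVol05": 0.05, "ExpAvgVol004": 0.004}

def multi_trailing_ema(rows):
    """Maintain the three trailing averages in lock-step, one pass in
    clustered-index order, the way the single UPDATE statement does.
    rows: (SymbolID, Volume) tuples sorted by SymbolID, MarketDate."""
    prev_symbol, prev_vol = None, None
    avgs = dict.fromkeys(SMOOTHING)       # running average per column
    out = []
    for symbol, volume in rows:
        if symbol != prev_symbol:         # new symbol: reset all state
            avgs = dict.fromkeys(SMOOTHING)
            prev_vol = None
        for name, alpha in SMOOTHING.items():
            prev = avgs[name]
            if prev_vol is None:          # yesterday's volume was NULL
                new = prev                # carry the average forward
            else:
                base = prev if prev is not None else prev_vol
                new = alpha * prev_vol + (1 - alpha) * base
            avgs[name] = new
        out.append(dict(avgs))
        prev_vol, prev_symbol = volume, symbol
    return out

rows = [(1, None), (1, 1000), (1, 2000), (1, 3000)]
results = multi_trailing_ema(rows)
for r in results:
    print(r)
```

Each column is just the same recurrence with a different alpha, which is why one pass through the work table can fill all three at once.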