Statistics question: Rolling window graph

  • I'm a bit stuck - statistics isn't my strong point.

    I want to create a report on a single column attribute (which can take 5 values) over a window of a fixed number of rows, then move the window on by one row and recalculate, and so on to the end of the data.

    So the output will be the percentage of each of the 5 attribute values within each window.

    Let's say N is the number of rows in the window; for each attribute value, count its occurrences (NoVals) and calculate (NoVals/N)*100 to get its percentage within that set of N rows. (A rough sketch of what I mean is at the end of this post.)

    Example output required.

    /*
    StartSeq attr1 attr2 attr3 attr4 attr5 Description
    -------- ----- ----- ----- ----- ----- ---------------------------
    1        20%   20%   20%   40%   0%    averages of rows 1 to N
    2        21%   20%   19%   40%   0%    averages of rows 2 to N+1
    3        21%   20%   19%   39%   1%    averages of rows 3 to N+2
    ...
    X        11%   10%   29%   39%   11%   averages of rows X to N+X-1
    */

    I hope this makes sense.

    Any pointers, references to articles or scripts or even the proper statistical terms for this would help.
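
    To illustrate what I mean, here is a rough sketch of the sort of query I have in mind, written as a self-join (my table is dbo.MonitorData, [Sequence Number] is the row sequence, VModel is the attribute column, and 'attr1' to 'attr5' stand in for its five values) - I'm not sure this is the right or an efficient way to do it:

    DECLARE @N INTEGER;    -- window size
    SET @N = 20;

    -- For each window start S, count each attribute value over rows S to S + N - 1
    SELECT D.[Sequence Number] AS StartSeq,
           100.0 * SUM(CASE WHEN W.VModel = 'attr1' THEN 1 ELSE 0 END) / @N AS PcntAttr1,
           100.0 * SUM(CASE WHEN W.VModel = 'attr2' THEN 1 ELSE 0 END) / @N AS PcntAttr2
           -- ...repeat the expression for 'attr3', 'attr4' and 'attr5'
    FROM dbo.MonitorData AS D
    JOIN dbo.MonitorData AS W
        ON  W.[Sequence Number] >= D.[Sequence Number]
        AND W.[Sequence Number] <  D.[Sequence Number] + @N
    GROUP BY D.[Sequence Number]
    ORDER BY StartSeq;

    (The last N - 1 window starts have fewer than N rows available, so their percentages will not add up to 100%.)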

  • Moving Averages in T-SQL

  • Thanks, exactly what I needed. I've read and inwardly digested the article.

    I created test data and worked through both methods, and on the test data I saw the performance improvement from method #2 that the article describes.

    However, on the live data method #2 takes just as long as method #1: there is the expected reduction in the number of reads, but it still runs just as slowly.

    I've attached my Test Data and the plans for both test and live - Live has more columns, different column names, and foreign keys.

    The major difference between test and live is the clustered index on [Sequence Number] on test versus a plain (nonclustered) index on [Sequence Number] on live - but surely that can't account for the huge performance difference.
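
    If it helps, the clustered index on live would be created along these lines (the table and index names below are just placeholders for the real live table, which I can't post):

    -- Placeholder names for the live table; assumes it has no clustered index yet
    CREATE CLUSTERED INDEX CIX_LiveTable_SequenceNumber
        ON dbo.LiveTable ([Sequence Number]);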

    Edit: I've had to delete the LivePlan as it contains sensitive information. I'll try and put a screenshot up instead.

  • Attached is a screenshot of the execution plans:

    Top: Live (with nonclustered index)

    Middle: Test (with clustered index)

    Bottom: Live (with clustered index)

    This shows that the clustered index makes little or no difference to the plan on live.

    Is there some way of finding out why this plan is chosen over the more efficient test version?

  • Hey Tom,

    The article has a small error when it comes to using Method 2: a clustered index is *required* to take advantage of the Quirky Update method. This is noted in the discussion thread for the article, but the article text was never corrected to add the required index.

    I'm not saying that is the reason for the behaviour you observed - but it is a very important point.

    I had a quick look at your script and the execution plans, and I believe the cause to be the table spool. Please try the following:

    SET NOCOUNT ON;
    GO

    -- moving aggregates method #1 --------------------
    DECLARE @intervals INTEGER;
    SET @intervals = 20;

    SET STATISTICS IO ON;

    IF OBJECT_ID(N'tempdb..#DataTable', N'U') IS NOT NULL
        DROP TABLE #DataTable;

    CREATE TABLE #DataTable
    (
        Seq INTEGER NOT NULL PRIMARY KEY,
        A   INTEGER NOT NULL,
        B   INTEGER NOT NULL,
        C   INTEGER NOT NULL,
        D   INTEGER NOT NULL
    );

    -- Materialize the per-sequence attribute counts as columns
    INSERT #DataTable (Seq, A, B, C, D)
    SELECT [Sequence Number], A, B, C, D
    FROM dbo.MonitorData
    PIVOT (
        COUNT(VModel)
        FOR VModel IN ([A], [B], [C], [D])
    ) AS P;

    -- Rolling totals over @intervals rows via a self-join
    WITH Totals
    AS (
        SELECT D.Seq AS StartSeq,
               CAST(SUM(T.B) AS DECIMAL(10,4)) AS TotB,
               CAST(SUM(T.A) AS DECIMAL(10,4)) AS TotA,
               CAST(SUM(T.C) AS DECIMAL(10,4)) AS TotC,
               CAST(SUM(T.D) AS DECIMAL(10,4)) AS TotD,
               CAST(SUM(T.B) + SUM(T.A) + SUM(T.C) + SUM(T.D) AS DECIMAL(10,4)) AS Tot
        FROM #DataTable AS D
        JOIN #DataTable AS T
            ON  T.Seq >= D.Seq
            AND T.Seq <  D.Seq + @intervals
        GROUP BY D.Seq
    )
    SELECT StartSeq,
           CAST((TotA * 100.0) / Tot AS DECIMAL(10,4)) AS PcntA,
           CAST((TotB * 100.0) / Tot AS DECIMAL(10,4)) AS PcntB,
           CAST((TotC * 100.0) / Tot AS DECIMAL(10,4)) AS PcntC,
           CAST((TotD * 100.0) / Tot AS DECIMAL(10,4)) AS PcntD
    FROM Totals
    ORDER BY StartSeq
    OPTION (RECOMPILE);

    DROP TABLE #DataTable;
    GO

    -- moving aggregates method #2 --------------------
    DECLARE @intervals INTEGER;
    SET @intervals = 20;

    DECLARE @anchor INTEGER;
    DECLARE @init_B DECIMAL(10,4),
            @init_A DECIMAL(10,4),
            @init_C DECIMAL(10,4),
            @init_D DECIMAL(10,4);
    DECLARE @moving_B DECIMAL(10,4),
            @moving_A DECIMAL(10,4),
            @moving_C DECIMAL(10,4),
            @moving_D DECIMAL(10,4);

    IF OBJECT_ID(N'tempdb..#DataTable', N'U') IS NOT NULL
        DROP TABLE #DataTable;

    IF OBJECT_ID(N'tempdb..#Tmp1', N'U') IS NOT NULL
        DROP TABLE #Tmp1;

    CREATE TABLE #DataTable
    (
        Seq INTEGER NOT NULL PRIMARY KEY,
        A   INTEGER NOT NULL,
        B   INTEGER NOT NULL,
        C   INTEGER NOT NULL,
        D   INTEGER NOT NULL
    );

    CREATE TABLE #Tmp1
    (
        -- CLUSTERED INDEX REQUIRED ON Seq FOR
        -- THE QUIRKY UPDATE!
        Seq    INTEGER NOT NULL PRIMARY KEY CLUSTERED,
        A      INTEGER NOT NULL,
        B      INTEGER NOT NULL,
        C      INTEGER NOT NULL,
        D      INTEGER NOT NULL,
        pcntA  DECIMAL(10,4) NULL,
        prevNA DECIMAL(10,4) NULL,
        pcntB  DECIMAL(10,4) NULL,
        prevNB DECIMAL(10,4) NULL,
        pcntC  DECIMAL(10,4) NULL,
        prevNC DECIMAL(10,4) NULL,
        pcntD  DECIMAL(10,4) NULL,
        prevND DECIMAL(10,4) NULL
    );

    -- Materialize the per-sequence attribute counts as columns
    INSERT #DataTable (Seq, A, B, C, D)
    SELECT [Sequence Number], A, B, C, D
    FROM dbo.MonitorData
    PIVOT (
        COUNT(VModel)
        FOR VModel IN ([A], [B], [C], [D])
    ) AS P;

    -- Pair each row with the row @intervals places earlier,
    -- whose values drop out of the moving window at that point
    INSERT #Tmp1
    (
        Seq,
        A, B, C, D,
        pcntA, prevNA,
        pcntB, prevNB,
        pcntC, prevNC,
        pcntD, prevND
    )
    SELECT A.Seq,
           A.A,
           A.B,
           A.C,
           A.D,
           CAST(NULL AS DECIMAL(10,4)) AS [PcntA],
           CAST(B.A  AS DECIMAL(10,4)) AS [PrevNA],
           CAST(NULL AS DECIMAL(10,4)) AS [PcntB],
           CAST(B.B  AS DECIMAL(10,4)) AS [PrevNB],
           CAST(NULL AS DECIMAL(10,4)) AS [PcntC],
           CAST(B.C  AS DECIMAL(10,4)) AS [PrevNC],
           CAST(NULL AS DECIMAL(10,4)) AS [PcntD],
           CAST(B.D  AS DECIMAL(10,4)) AS [PrevND]
    FROM #DataTable AS A
    LEFT JOIN #DataTable AS B
        ON A.Seq - @intervals = B.Seq;

    -- Get initial values (totals for the first full window)
    SELECT @init_B = SUM(B),
           @init_A = SUM(A),
           @init_C = SUM(C),
           @init_D = SUM(D)
    FROM #Tmp1
    WHERE Seq <= @intervals;

    -- Quirky update: running totals maintained in clustered index order
    UPDATE T1
    SET @moving_A = CASE
                        WHEN Seq < @intervals THEN NULL
                        WHEN Seq = @intervals THEN @init_A
                        WHEN Seq > @intervals THEN @moving_A + [A] - [PrevNA]
                    END,
        PcntA = (@moving_A * 100.0) / CAST(@intervals AS DECIMAL(10,4)),
        @moving_B = CASE
                        WHEN Seq < @intervals THEN NULL
                        WHEN Seq = @intervals THEN @init_B
                        WHEN Seq > @intervals THEN @moving_B + [B] - [PrevNB]
                    END,
        PcntB = (@moving_B * 100.0) / CAST(@intervals AS DECIMAL(10,4)),
        @moving_C = CASE
                        WHEN Seq < @intervals THEN NULL
                        WHEN Seq = @intervals THEN @init_C
                        WHEN Seq > @intervals THEN @moving_C + [C] - [PrevNC]
                    END,
        PcntC = (@moving_C * 100.0) / CAST(@intervals AS DECIMAL(10,4)),
        @moving_D = CASE
                        WHEN Seq < @intervals THEN NULL
                        WHEN Seq = @intervals THEN @init_D
                        WHEN Seq > @intervals THEN @moving_D + [D] - [PrevND]
                    END,
        PcntD = (@moving_D * 100.0) / CAST(@intervals AS DECIMAL(10,4)),
        @anchor = Seq
    FROM #Tmp1 AS T1 WITH (TABLOCKX)
    OPTION (MAXDOP 1);

    SELECT Seq - @intervals + 1 AS StartSeq,
           PcntA,
           PcntB,
           PcntC,
           PcntD
    FROM #Tmp1
    WHERE Seq >= @intervals;

    DROP TABLE #Tmp1;
    DROP TABLE #DataTable;

    SET STATISTICS IO OFF;

  • This is excellent, Paul. Thank you.

    The clustered primary key improved performance from a cost of 1.8 to 0.08.

    I must remember that the discussions on SSC are a vital part of the articles.

  • Tom Brown (4/21/2010)

    This is excellent, Paul. Thank you.

    No problem. You took the time to check out the article I referenced, so it is only fair that I should return the favour.

    The clustered primary key improved performance from a cost of 1.8 to 0.08.

    Ah. I see I was too brief in my previous reply. The performance improvement does not stem from the clustered index (alone) - the clustered index is required for correct operation of the update method used in query 2.

    The query cost was dominated by the table spool, caused by the CTE self join. My primary improvement was to materialize the intermediate result in the #DataTable temporary table.

    Second point: query plans only ever include the cost estimate calculated by the optimiser when producing the plan - they never reflect the true cost of running the query. Please do not rely on that estimate for performance comparisons. CPU time used, elapsed time, and reads and writes are all valid performance metrics (at least to some extent).
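
    For example, wrapping the query under test like this reports the actual work done (just a sketch of the sort of measurement I mean, not anything specific to your setup):

    -- Report actual reads/writes and CPU/elapsed time, rather than estimated cost
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- ...run the query being compared here...

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;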

    I must remember that the discussions on SSC are a vital part of the articles.

    Often better than the articles! (Not in this case though, I hasten to add)
