Statistics question: Rolling window graph

  • I'm a bit stuck - statistics isn't my strong point.

    I want to create a report on a single column attribute (which can take 5 values) over a window of a fixed number of rows, then move the window on by one row and recalculate, and so on to the end of the data.

    So the output will be the percentage of each of the 5 attribute values within each window.

    Let's say N is the number of rows in the window; for each attribute value, count its occurrences (NoVals) and calculate (NoVals/N)*100 to get its percentage within that set of N rows. (A rough sketch of what I mean is at the end of this post.)

    Example output required.

    /*
    StartSeq attr1 attr2 attr3 attr4 attr5 Description
    -------- ----- ----- ----- ----- ----- ---------------------------
    1        20%   20%   20%   40%   0%    averages of rows 1 to N
    2        21%   20%   19%   40%   0%    averages of rows 2 to N+1
    3        21%   20%   19%   39%   1%    averages of rows 3 to N+2
    ...
    X        11%   10%   29%   39%   11%   averages of rows X to N+X-1
    */

    I hope this makes sense.

    Any pointers, references to articles or scripts or even the proper statistical terms for this would help.
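
    To illustrate what I mean, here is a rough sketch of the sort of query I have in mind, written as a self-join (my table is dbo.MonitorData, [Sequence Number] is the row sequence, VModel is the attribute column, and 'attr1' to 'attr5' stand in for its five values) - I'm not sure this is the right or an efficient way to do it:

    DECLARE @N INTEGER;    -- window size
    SET @N = 20;

    -- For each window start S, count each attribute value over rows S to S + N - 1
    SELECT D.[Sequence Number] AS StartSeq,
           100.0 * SUM(CASE WHEN W.VModel = 'attr1' THEN 1 ELSE 0 END) / @N AS PcntAttr1,
           100.0 * SUM(CASE WHEN W.VModel = 'attr2' THEN 1 ELSE 0 END) / @N AS PcntAttr2
           -- ...repeat the expression for 'attr3', 'attr4' and 'attr5'
    FROM dbo.MonitorData AS D
    JOIN dbo.MonitorData AS W
        ON  W.[Sequence Number] >= D.[Sequence Number]
        AND W.[Sequence Number] <  D.[Sequence Number] + @N
    GROUP BY D.[Sequence Number]
    ORDER BY StartSeq;

    (The last N - 1 window starts have fewer than N rows available, so their percentages will not add up to 100%.)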

  • Moving Averages in T-SQL

  • Thanks, exactly what I needed. I've read and inwardly digested the article.

    I created test data and worked through both methods, and on the test data I saw the performance improvement from method #2 that the article describes.

    However, on the live data method #2 takes just as long as method #1: there is the expected reduction in the number of reads, but it still runs just as slowly.

    I've attached my Test Data and the plans for both test and live - Live has more columns, different column names, and foreign keys.

    The major difference between test and live is the clustered index on [Sequence Number] on test versus a plain (nonclustered) index on [Sequence Number] on live - but surely that can't account for the huge performance difference.
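
    If it helps, the clustered index on live would be created along these lines (the table and index names below are just placeholders for the real live table, which I can't post):

    -- Placeholder names for the live table; assumes it has no clustered index yet
    CREATE CLUSTERED INDEX CIX_LiveTable_SequenceNumber
        ON dbo.LiveTable ([Sequence Number]);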

    Edit: I've had to delete the LivePlan as it contains sensitive information. I'll try and put a screenshot up instead.

  • Attached is a screenshot of the execution plans:

    Top: Live (with nonclustered index)

    Middle: Test (with clustered index)

    Bottom: Live (with clustered index)

    This shows that the clustered index makes little or no difference to the plan on live.

    Is there some way of finding out why this plan is chosen over the more efficient test version?

  • Hey Tom,

    The article has a small error when it comes to using Method 2: a clustered index is *required* to take advantage of the Quirky Update method. This is noted in the discussion thread for the article, but the article text was never corrected to add the required index.

    I'm not saying that is the reason for the behaviour you observed - but it is a very important point.

    I had a quick look at your script and the execution plans, and I believe the cause to be the table spool. Please try the following:

    SET NOCOUNT ON;
    GO

    -- moving aggregates method #1 --------------------
    DECLARE @intervals INTEGER;
    SET @intervals = 20;

    SET STATISTICS IO ON;

    IF OBJECT_ID(N'tempdb..#DataTable', N'U') IS NOT NULL
        DROP TABLE #DataTable;

    CREATE TABLE #DataTable
    (
        Seq INTEGER NOT NULL PRIMARY KEY,
        A   INTEGER NOT NULL,
        B   INTEGER NOT NULL,
        C   INTEGER NOT NULL,
        D   INTEGER NOT NULL
    );

    -- Materialize the per-sequence attribute counts as columns
    INSERT #DataTable (Seq, A, B, C, D)
    SELECT [Sequence Number], A, B, C, D
    FROM dbo.MonitorData
    PIVOT (
        COUNT(VModel)
        FOR VModel IN ([A], [B], [C], [D])
    ) AS P;

    -- Rolling totals over @intervals rows via a self-join
    WITH Totals
    AS (
        SELECT D.Seq AS StartSeq,
               CAST(SUM(T.B) AS DECIMAL(10,4)) AS TotB,
               CAST(SUM(T.A) AS DECIMAL(10,4)) AS TotA,
               CAST(SUM(T.C) AS DECIMAL(10,4)) AS TotC,
               CAST(SUM(T.D) AS DECIMAL(10,4)) AS TotD,
               CAST(SUM(T.B) + SUM(T.A) + SUM(T.C) + SUM(T.D) AS DECIMAL(10,4)) AS Tot
        FROM #DataTable AS D
        JOIN #DataTable AS T
            ON  T.Seq >= D.Seq
            AND T.Seq <  D.Seq + @intervals
        GROUP BY D.Seq
    )
    SELECT StartSeq,
           CAST((TotA * 100.0) / Tot AS DECIMAL(10,4)) AS PcntA,
           CAST((TotB * 100.0) / Tot AS DECIMAL(10,4)) AS PcntB,
           CAST((TotC * 100.0) / Tot AS DECIMAL(10,4)) AS PcntC,
           CAST((TotD * 100.0) / Tot AS DECIMAL(10,4)) AS PcntD
    FROM Totals
    ORDER BY StartSeq
    OPTION (RECOMPILE);

    DROP TABLE #DataTable;
    GO

    -- moving aggregates method #2 --------------------
    DECLARE @intervals INTEGER;
    SET @intervals = 20;

    DECLARE @anchor INTEGER;
    DECLARE @init_B DECIMAL(10,4),
            @init_A DECIMAL(10,4),
            @init_C DECIMAL(10,4),
            @init_D DECIMAL(10,4);
    DECLARE @moving_B DECIMAL(10,4),
            @moving_A DECIMAL(10,4),
            @moving_C DECIMAL(10,4),
            @moving_D DECIMAL(10,4);

    IF OBJECT_ID(N'tempdb..#DataTable', N'U') IS NOT NULL
        DROP TABLE #DataTable;

    IF OBJECT_ID(N'tempdb..#Tmp1', N'U') IS NOT NULL
        DROP TABLE #Tmp1;

    CREATE TABLE #DataTable
    (
        Seq INTEGER NOT NULL PRIMARY KEY,
        A   INTEGER NOT NULL,
        B   INTEGER NOT NULL,
        C   INTEGER NOT NULL,
        D   INTEGER NOT NULL
    );

    CREATE TABLE #Tmp1
    (
        -- CLUSTERED INDEX REQUIRED ON Seq FOR
        -- THE QUIRKY UPDATE!
        Seq    INTEGER NOT NULL PRIMARY KEY CLUSTERED,
        A      INTEGER NOT NULL,
        B      INTEGER NOT NULL,
        C      INTEGER NOT NULL,
        D      INTEGER NOT NULL,
        pcntA  DECIMAL(10,4) NULL,
        prevNA DECIMAL(10,4) NULL,
        pcntB  DECIMAL(10,4) NULL,
        prevNB DECIMAL(10,4) NULL,
        pcntC  DECIMAL(10,4) NULL,
        prevNC DECIMAL(10,4) NULL,
        pcntD  DECIMAL(10,4) NULL,
        prevND DECIMAL(10,4) NULL
    );

    -- Materialize the per-sequence attribute counts as columns
    INSERT #DataTable (Seq, A, B, C, D)
    SELECT [Sequence Number], A, B, C, D
    FROM dbo.MonitorData
    PIVOT (
        COUNT(VModel)
        FOR VModel IN ([A], [B], [C], [D])
    ) AS P;

    -- Pair each row with the row @intervals places earlier,
    -- whose values drop out of the moving window at that point
    INSERT #Tmp1
    (
        Seq,
        A, B, C, D,
        pcntA, prevNA,
        pcntB, prevNB,
        pcntC, prevNC,
        pcntD, prevND
    )
    SELECT A.Seq,
           A.A,
           A.B,
           A.C,
           A.D,
           CAST(NULL AS DECIMAL(10,4)) AS [PcntA],
           CAST(B.A  AS DECIMAL(10,4)) AS [PrevNA],
           CAST(NULL AS DECIMAL(10,4)) AS [PcntB],
           CAST(B.B  AS DECIMAL(10,4)) AS [PrevNB],
           CAST(NULL AS DECIMAL(10,4)) AS [PcntC],
           CAST(B.C  AS DECIMAL(10,4)) AS [PrevNC],
           CAST(NULL AS DECIMAL(10,4)) AS [PcntD],
           CAST(B.D  AS DECIMAL(10,4)) AS [PrevND]
    FROM #DataTable AS A
    LEFT JOIN #DataTable AS B
        ON A.Seq - @intervals = B.Seq;

    -- Get initial values (totals for the first full window)
    SELECT @init_B = SUM(B),
           @init_A = SUM(A),
           @init_C = SUM(C),
           @init_D = SUM(D)
    FROM #Tmp1
    WHERE Seq <= @intervals;

    -- Quirky update: running totals maintained in clustered index order
    UPDATE T1
    SET @moving_A = CASE
                        WHEN Seq < @intervals THEN NULL
                        WHEN Seq = @intervals THEN @init_A
                        WHEN Seq > @intervals THEN @moving_A + [A] - [PrevNA]
                    END,
        PcntA = (@moving_A * 100.0) / CAST(@intervals AS DECIMAL(10,4)),
        @moving_B = CASE
                        WHEN Seq < @intervals THEN NULL
                        WHEN Seq = @intervals THEN @init_B
                        WHEN Seq > @intervals THEN @moving_B + [B] - [PrevNB]
                    END,
        PcntB = (@moving_B * 100.0) / CAST(@intervals AS DECIMAL(10,4)),
        @moving_C = CASE
                        WHEN Seq < @intervals THEN NULL
                        WHEN Seq = @intervals THEN @init_C
                        WHEN Seq > @intervals THEN @moving_C + [C] - [PrevNC]
                    END,
        PcntC = (@moving_C * 100.0) / CAST(@intervals AS DECIMAL(10,4)),
        @moving_D = CASE
                        WHEN Seq < @intervals THEN NULL
                        WHEN Seq = @intervals THEN @init_D
                        WHEN Seq > @intervals THEN @moving_D + [D] - [PrevND]
                    END,
        PcntD = (@moving_D * 100.0) / CAST(@intervals AS DECIMAL(10,4)),
        @anchor = Seq
    FROM #Tmp1 AS T1 WITH (TABLOCKX)
    OPTION (MAXDOP 1);

    SELECT Seq - @intervals + 1 AS StartSeq,
           PcntA,
           PcntB,
           PcntC,
           PcntD
    FROM #Tmp1
    WHERE Seq >= @intervals;

    DROP TABLE #Tmp1;
    DROP TABLE #DataTable;

    SET STATISTICS IO OFF;

  • This is excellent, Paul. Thank you.

    The clustered primary key improved performance from a cost of 1.8 to 0.08.

    I must remember that the discussions on SSC are a vital part of the articles.

  • Tom Brown (4/21/2010)

    This is excellent, Paul. Thank you.

    No problem. You took the time to check out the article I referenced, so it is only fair that I should return the favour.

    The clustered primary key improved performance from a cost of 1.8 to 0.08.

    Ah. I see I was too brief in my previous reply. The performance improvement does not stem from the clustered index (alone) - the clustered index is required for correct operation of the update method used in query 2.

    The query cost was dominated by the table spool, caused by the CTE self join. My primary improvement was to materialize the intermediate result in the #DataTable temporary table.

    Second point: query plans only ever include the cost estimate calculated by the optimiser when producing the plan - they never reflect the true cost of running the query. Please do not rely on that estimate for performance comparisons. CPU time used, elapsed time, and reads and writes are all valid performance metrics (at least to some extent).
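
    For example, wrapping the query under test like this reports the actual work done (just a sketch of the sort of measurement I mean, not anything specific to your setup):

    -- Report actual reads/writes and CPU/elapsed time, rather than estimated cost
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- ...run the query being compared here...

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;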

    I must remember that the discussions on SSC are a vital part of the articles.

    Often better than the articles! (Not in this case though, I hasten to add)
