Background
When we talk about statistics, what we are really talking about is a means of evaluating an outcome, or potential outcome, against a model.
The model is unlikely to provide an absolute answer, but it should be robust enough to give a good indication of behaviour in the real world.
The purpose of this article is to offer some thoughts on the statistical functions provided by SQL Server and what they actually mean.
SQL Server statistical functions
- COUNT
- SUM
- AVG
- STDEV
- STDEVP
- VAR
- VARP
You will definitely be familiar with the first two.
AVG
To the non-mathematician an average is simply the sum of the numbers divided by the number of numbers. To a mathematician this is called the "arithmetic mean".
A mathematician or statistician probably won't use the term "average" on its own, as average is a generic term that can mean a variety of things. A statistician can use any of the terms below to mean an average.
Average Term | Meaning |
---|---|
Arithmetic mean | The AVG function. The sum of the numbers divided by the number of numbers. |
Assumed mean | An estimated mean. This can be used when information is given in bands, e.g. sales figures by customer age band: 15 to 21 year olds, 22 to 25 year olds, etc. |
Median | The middle value when the values are arranged in order. When there are an even number of values, take the mid-point between the two in the middle. |
Mode | The most common value. |
An average can even be a number picked at random within a
range of values, though this isn’t generally used.
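SQL Server provides no aggregate for the median or the mode, but the mode at least is easy to derive. A minimal sketch against the Products table of the NORTHWIND database (used throughout this article), with ties between equally common prices broken arbitrarily, might be:
SELECT TOP 1 UnitPrice AS ModalPrice ,
       COUNT(*) AS Occurrences
FROM Products
GROUP BY UnitPrice
ORDER BY COUNT(*) DESC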
Problems with the AVG function
The problem with averages is that what we are trying to do
is describe a typical occurrence.
Let us suppose that you want to know the average price of an
item that a customer buys.
Take the example of a proper hi-fi shop: one selling the sort of hi-fi whose purchase could be considered reasonable grounds for a divorce!
A typical customer may buy a £1,000 amplifier and £100 of
connecting leads. Average item spend is
£550. But no-one spends £550! Your
average figure is completely useless!
To use an analogy, a man with his feet in the fridge and his
head in the oven is, on average, perfectly comfortable.
As another example take one of those catalogues selling
brushes and household knick-knacks. Prices range from £5 to £15 and all products sell equally well.
Average item spend is £10, but there is nothing special about items costing £10! They sell no better or worse than items costing
£5 or £15.
Knowing an average value is only useful if the values that
you are measuring tend to cluster around that average.
Catering for extremes
Look at the unit prices in the products table in the NORTHWIND
database.
SELECT AVG(UnitPrice)
FROM Products
Returns £28.87
If you look at the actual products in the table you will see that one product is priced way above the others. This single product drags the average up to an artificially high value.
Really we should exclude this and any other extreme values to get a more representative figure for the average. The SQL statement below chops 3 percent off either end of the selection.
SELECT TOP 97 PERCENT DT.*
FROM
( SELECT TOP 97 PERCENT *
FROM Products
ORDER BY UnitPrice ASC
) AS DT
ORDER BY UnitPrice DESC
The average of this new selection is £25.05
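If you only want the figure, the trimmed selection can be wrapped in an AVG in a single statement; a minimal sketch (the aliases are my own) is:
SELECT AVG(DT.UnitPrice) AS TrimmedAverage
FROM ( SELECT TOP 97 PERCENT DT2.UnitPrice
       FROM ( SELECT TOP 97 PERCENT UnitPrice
              FROM Products
              ORDER BY UnitPrice ASC
            ) AS DT2
       ORDER BY UnitPrice DESC
     ) AS DT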
Even this could be considered high when you consider the median value, the point at which 50% of products cost more and 50% of products cost less, which the query below returns.
SELECT TOP 1 DT.UnitPrice
FROM
( SELECT TOP 50 PERCENT *
FROM Products
ORDER BY UnitPrice ASC
) AS DT
ORDER BY UnitPrice DESC
This returns £19.50!
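As the earlier table noted, with an even number of values the median is the mid-point of the two middle values. A sketch that caters for both odd and even counts (the derived table aliases are my own) could average the top of the cheaper half with the bottom of the dearer half:
SELECT AVG(Middles.UnitPrice) AS MedianPrice
FROM (
      SELECT UnitPrice
      FROM ( -- the dearest value in the cheaper half
             SELECT TOP 1 UnitPrice
             FROM ( SELECT TOP 50 PERCENT UnitPrice
                    FROM Products
                    ORDER BY UnitPrice ASC
                  ) AS LowerHalf
             ORDER BY UnitPrice DESC
           ) AS A
      UNION ALL
      SELECT UnitPrice
      FROM ( -- the cheapest value in the dearer half
             SELECT TOP 1 UnitPrice
             FROM ( SELECT TOP 50 PERCENT UnitPrice
                    FROM Products
                    ORDER BY UnitPrice DESC
                  ) AS UpperHalf
             ORDER BY UnitPrice ASC
           ) AS B
     ) AS Middles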
Inter-Quartile Range
Another way of looking at extremes is to use something called the inter-quartile range.
The inter-quartile range is worked out by ordering the values that you want to evaluate
and splitting them into four equal parts. The inter-quartile range is the range of values represented
by the middle two parts.
The SQL to work out this range is as follows:
SELECT MIN(UnitPrice) AS LowerRange , MAX(UnitPrice) AS UpperRange
FROM (
      -- The cheapest quarter of products
      SELECT D1.ProductId
      FROM
      ( SELECT TOP 25 PERCENT ProductId
        FROM Products
        ORDER BY UnitPrice ASC
      ) AS D1
      UNION
      -- The dearest quarter of products
      SELECT D2.ProductId
      FROM
      ( SELECT TOP 25 PERCENT ProductId
        FROM Products
        ORDER BY UnitPrice DESC
      ) AS D2
     ) AS DT RIGHT JOIN Products ON DT.ProductId = Products.ProductId
-- Keep only the products that fall in neither extreme quarter
WHERE DT.ProductId IS NULL
This tells us that the middle 50% or inter-quartile range of prices is between £14.00 & £32.80.
NB Because of the combination of the TOP 25 PERCENT clause and duplicate prices, the above SQL
doesn't find the true inter-quartile range. If you know the total number of records within your recordset
then you can use a TOP n clause instead to get a more accurate answer.
As the inter-quartile range is a broad-brush measure rather than a precise one, we don't need to be too fussy about it.
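That said, if you do want something closer to the true quartiles and can capture the row count, a sketch along the following lines (assuming SQL Server 2005 or later, where TOP accepts a variable expression) discards a whole number of rows from each end:
DECLARE @n INT , @quarter INT
SELECT @n = COUNT(*) FROM Products
SET @quarter = @n / 4                      -- rows to discard from each end

SELECT MIN(UnitPrice) AS LowerRange , MAX(UnitPrice) AS UpperRange
FROM ( SELECT TOP (@n - 2 * @quarter) UnitPrice
       FROM ( SELECT TOP (@n - @quarter) UnitPrice
              FROM Products
              ORDER BY UnitPrice ASC       -- drops the dearest quarter
            ) AS D
       ORDER BY UnitPrice DESC             -- then drops the cheapest quarter
     ) AS Middle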
Within the SPSS stats package there was a useful charting tool that produced a graph called a box plot.
The box plot utilised the inter-quartile range and was a very handy way of illustrating extremes. An example is shown below.
The graph shows box plots for the prices of items bought as a result of two adverts.
- The green box represents the inter-quartile range.
- The line through the green box represents the median point.
- The T shapes above and below the box illustrate the "outliers". In SPSS an outlier was defined as a point lying more than 1.5 times the inter-quartile range beyond the edge of the box (the sketch after this list applies the same rule to the Northwind prices).
- Any points beyond these "outliers" are considered extremes.
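As a rough SQL equivalent, the sketch below applies the same 1.5 times inter-quartile range rule to the Northwind prices, plugging in the quartile figures of £14.00 and £32.80 found earlier:
DECLARE @q1 MONEY , @q3 MONEY , @iqr MONEY
SELECT @q1 = 14.00 , @q3 = 32.80           -- quartiles found earlier
SET @iqr = @q3 - @q1

SELECT ProductName , UnitPrice
FROM Products
WHERE UnitPrice < @q1 - 1.5 * @iqr         -- below the lower fence
   OR UnitPrice > @q3 + 1.5 * @iqr         -- above the upper fence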
It doesn't matter which method you use; remember that the goal is to produce a model that
acts as the most effective predictor, or most accurate representation, of real-world behaviour.
VAR, VARP, STDEV & STDEVP
The VAR, VARP, STDEV & STDEVP functions all give a single figure that describes how
much the figures we are investigating deviate from our "average".
A point to make at this stage is that the way in which a computer or electronic calculator
works out VAR, VARP, STDEV and STDEVP is an approximation.
Let us take VARP as an example.
True VARP | Computer VARP |
---|---|
Σ(x - x̄)² / n | ( Σx² - (Σx)² / n ) / n |
An explanation of the symbols is as follows:
- x = each individual value
- x̄ (x-bar) = the arithmetic mean of the values
- n = the number of values
- Σ = "the sum of"
The difference in the results between these two formulae is too small to be of much concern.
The reason the computer version of the
formula is different dates back to the days when computers were considerably
less powerful than they are today. Basically, with the computer version of the
formula you can calculate your VARP in one pass through the recordset, whereas
with the true version you would first have to calculate the average on one
iteration, and then calculate the VARP on a second iteration.
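To see the one-pass idea in action, the sketch below builds VARP from a single scan of the Northwind prices (casting to FLOAT to avoid money rounding) and compares it with the built-in function:
SELECT ( SUM(x.P * x.P) - SUM(x.P) * SUM(x.P) / COUNT(*) ) / COUNT(*) AS HandRolledVarP ,
       VARP(x.P) AS BuiltInVarP
FROM ( SELECT CAST(UnitPrice AS FLOAT) AS P
       FROM Products
     ) AS x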
The difference between VAR and VARP
The table below shows the difference between the two functions from a mathematical perspective.
Function | Formula | Description |
---|---|---|
VARP | Σ(x - x̄)² / n | The variance within the entire population. |
VAR | Σ(x - x̄)² / (n - 1) | The variance within a sample of that population. |
Let us suppose that it is not possible to measure the variance of an entire population.
What we would do is take what we hope is a representative sample of that population and measure that instead.
If we are taking a sample, wouldn't we expect to build in some factor to allow for discrepancies in our sample?
Of course we would, and that is why the two functions have differing denominators (the bottom halves of the fractions).
VAR divides by n - 1 rather than n (in the one-pass computer form of the formula the denominator works out as n² - n, i.e. n(n - 1)); the use of n - 1 is known as Bessel's correction.
As with any fraction, the bigger the denominator the smaller the result, so as we
increase our sample size the difference between what the two functions return becomes smaller.
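A quick sketch illustrates the relationship on the Northwind prices: scaling VARP by n / (n - 1), i.e. applying Bessel's correction, should reproduce VAR (the CAST avoids integer division):
SELECT VAR(UnitPrice)  AS SampleVariance ,
       VARP(UnitPrice) AS PopulationVariance ,
       VARP(UnitPrice) * CAST(COUNT(*) AS FLOAT) / (COUNT(*) - 1) AS VarPWithBesselsCorrection
FROM Products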
Standard Deviation
The Standard Deviation is the square root of the Variance, therefore:
- STDEV is simply the square root of VAR
- STDEVP is simply the square root of VARP
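The sketch below confirms this against the Northwind prices:
SELECT STDEV(UnitPrice)      AS BuiltInStDev ,
       SQRT(VAR(UnitPrice))  AS SquareRootOfVar ,
       STDEVP(UnitPrice)     AS BuiltInStDevP ,
       SQRT(VARP(UnitPrice)) AS SquareRootOfVarP
FROM Products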
Both the Variance and the Standard Deviation are conventionally written using the Greek letter sigma: σ² for the variance and σ for the standard deviation.
The standard deviation provides statisticians with a very useful tool. For example, let us go back to
the scenario where it is not possible to work with the full dataset and we are therefore working with a sample.
If you take the average of your sample you also have to allow for some variability between the average of the
population and the average of your sample.
In statistics there is a term for this, the standard error of the mean, which is the standard deviation of the sample divided by the square root of the sample size: σ / √n.
Let us suppose that we have a sample of 10,000 with an average of 5 and a standard deviation of 0.5.
Our standard error of the mean would be 0.5 / √10,000 = 0.005.
This means that, from our sample, we would expect the true average within the population
to fall within +/-0.005 of our sample average of 5, or in other words between 4.995 and 5.005.
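Against a real table the calculation is straightforward; a sketch for the Northwind prices (again casting to FLOAT) is:
SELECT AVG(x.P) AS SampleMean ,
       STDEV(x.P) / SQRT(COUNT(*)) AS StandardErrorOfMean
FROM ( SELECT CAST(UnitPrice AS FLOAT) AS P
       FROM Products
     ) AS x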
The Normal Distribution
If we were to measure the weight of a random sample of 500 employees within a company and plot the results on a graph,
we would find that the graph forms a bell shape.
In statistics we can produce a model of this shape called the normal distribution.
One of the properties of the normal distribution is that 95% of all occurrences fall within
approximately two standard deviations of the average.
If our 500 employees weigh on average 200lbs and the standard deviation is 10lbs then we know that
95% of all employees will weigh 200lbs +-20lbs.
In fact any high school maths student will tell you that there are three points that are
generally used in statistics.
Number of Standard Deviations | Percentage of occurrences. |
---|---|
+-1.65 | 90% |
+-1.96 | 95% |
+-2.58 | 99% |
The figures listed above are used for large samples, such as those you would get when measuring business-to-consumer
activity. In certain cases, e.g. when measuring business-to-business marketing statistics
where the samples are considerably smaller, we would even be interested in the 80% and 85% cases.
Returning to our weight example, if we wanted to cater for 99% of employees we could say that
we would expect an employee to weigh 200lbs +-26lbs (2.58 standard deviations of 10lbs, rounded up).
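In SQL the same calculation is straightforward; a sketch against a hypothetical EmployeeWeights table with a Weight column, using the 1.96 multiplier for the 95% range, would be:
SELECT AVG(Weight) - 1.96 * STDEVP(Weight) AS Lower95 ,
       AVG(Weight) + 1.96 * STDEVP(Weight) AS Upper95
FROM EmployeeWeights                        -- hypothetical table of employee weights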
Conclusion
From just the brief descriptions here we can see that knowing the average and the standard deviation allows us to describe the
characteristics of our sample in far more depth than a single figure can.
We can also see that it is important to prevent our samples being distorted by extreme values.
I tried to apply the above techniques to measuring the performance of SQL Server; however, I found that,
even with no-one else connected to the server, its performance was apparently random.
I measured the time per iteration, over 1,000 iterations, of an UPDATE statement updating 10,000 rows.
Examining the results showed a huge variation and no particular pattern.
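A rough sketch of that sort of test harness (TestTable and TimingResults are hypothetical stand-ins for the tables actually used) looks like this:
DECLARE @i INT , @start DATETIME
SET @i = 1
WHILE @i <= 1000
BEGIN
    SET @start = GETDATE()

    UPDATE TestTable                        -- hypothetical 10,000 row table
    SET SomeColumn = SomeColumn

    INSERT INTO TimingResults ( Iteration , DurationMs )
    VALUES ( @i , DATEDIFF(ms , @start , GETDATE()) )

    SET @i = @i + 1
END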
The techniques did, however, work well when examining the characteristics and purchasing habits of customers.