By Steve Bolton
…………This is the last of six articles I’ve segregated in this middle of my mistutorial series on identifying outlying values with SQL Server, because they turned out to be difficult to apply to the typical use cases DBAs encounter. After this detour we’ll get back on track with outlier detection methods like Interquartile Range that are likely to be as useful as the ones the series started with, such as Benford’s Law and Z-Scores, but I’ll first give a brief explanation of Chauvenet’s Criterion for the sake of completeness and the offhand chance that it might prove useful in the right circumstances. Those circumstances are normally those suitable for statistical hypothesis testing, in which researchers attempt to prove narrow, specific points of evidence using relatively small datasets – not exploratory data mining or data quality examinations on datasets of thousands or even billions of rows, as in a typical relational table. This subset is designed with different use cases in mind, so it is not surprising that they come with some common limitations that make them difficult to apply to big tables. Among these are the necessity of prior goodness-of-fit testing to ensure that the data follows a Gaussian or “normal” distribution, i.e. the bell curve, without which the outlier tests are invalid. Furthermore, the lookup tables that many of these tests require for comparisons are plentiful on the Internet and old texts, but finding ones without gaps or that extend beyond a few hundred records are difficult to find; worse still, the formulas for calculating the missing values are often performance hogs or require precisions and scales that choke T-SQL code with arithmetic overflows errors. Drawbacks like these also restrict the usefulness of Chauvenet’s Criterion, which was among the first outlier detection methods ever developed. Naval Academy mathematician William Chauvenet (1820-1870) formulated it in the Civil War era, but recognized from the beginning that there were already more trustworthy means available, like Peirce’s Criterion. Perhaps its crudest limitation is that it calls for recursive reexamination of data after carrying out automatic deletion of data points without further investigation, which as I have discussed in prior articles, is unwise and sometimes even unethical. Thankfully, we can apply the same type of interpretation-hack used in last week’s article to make the Modified Thompsons Tau test more useful and valid to Chavuenet’s formula, which is still apparently in common use today despite all of these well-known issues.
…………Repetetively deleting records until no more outliers remain to test is probably not going to fly with either users or IT managers in a SQL Server environment, but as we saw last week, it is possible to simply flag records as potential outliers and still recursively recalculate the underlying aggregates that the formula is based on, as if they had been deleted. It is easier to have our cake and eat it too thanks to new T-SQL windowing clauses like ROWS UNBOUNDED PRECEDING, which make the code for this week’s stored procedure much shorter and easier to follow. The T-SQL in Figure 2 closely resembles that of the Modified Thompson Tau procedure for another reason: the Chauvenet Criterion is also a hybrid method that marries some of the logic of Z-Scores with that of hypothesis testing. The key difference is that we need to compare the absolute deviation against the standard normal distribution rather than Student’s T-distribution; we’re really just substituting one statistical part for another in order to address different use cases, just as we would swap out a computer component or an automotive part in a car engine. That substitution requires the use of a different lookup table than the ones we’ve used in recent articles, but we only need one of them, since we only need to input the absolute deviation rather than the degrees of freedom and an alpha value. That in turn means we can use a single join rather than a function call, further decomplicating the procedure. The main problem I encountered when implementing this is that it is impossible to find complete lookup tables for the standard normal distribution, which typically only accept just one decimal point of precision despite the fact that possible to the calculate Z-Scores fed into them to far higher precisions. Part of the problem is that they’re continuous values, but as I’ve found out the hard way, it is surprisingly difficult to calculate them to higher precisions with the original cumulative distribution function (CDF). Until I can come up with a more precise approximation for high-precision values, the clumsy lookup table defined in Figure 1 will have to do. I designed it to host the table cited at the Wikipedia page “68–95–99.7 Rule,”[1] which includes the probabilities that values will occur within one to seven standard deviations, at intervals of 0.5. Once I overcome my difficulties with CDFs and can get more accurate measures, it will be possible to replace the clumsy BETWEEN clause and CASE in the procedure that crudely peg data points to these wide limits. Once the probability value has been retrieved, we only need to multiply it by the number of data points and flag the record as an outlier if the result is less than 0.5. The PopulationOutsideRange that the procedure in Figure 2 joins to is a calculated column (which is renamed as Probability in the stored procedure) while the RN ROW_NUMBER value acts as a running count.
Figure 1: Code for the Standard Normal Deviation Lookup Table
CREATE TABLE [Calculations].[StandardNormalDeviationTable](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[StandardDeviations] [decimal](2, 1) NULL,
[PopulationInRange] [decimal](16, 15) NULL,
[PopulationOutsideRange] AS ((1)-[PopulationInRange]),
[ExpectedFrequency] [bigint] NULL,
[ApproximateFrequencyForDailyEvent] [nvarchar](400) NULL,
CONSTRAINT [PK_StandardNormalDeviationTable]
PRIMARY KEY CLUSTERED ( [ID] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE
= OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]) ON [PRIMARY]
Figure 2: Code for the Chauvenet Criterion Procedure
CREATE PROCEDURE [Calculations].[ChauvenetCriterionSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @DecimalPrecision AS nvarchar(50)
AS
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @DatabaseName = @DatabaseName + ‘.’
SET @SchemaAndTableName = ISNull(@DatabaseName, ”) + @SchemaName + ‘.’ + @TableName
SET @SQLString = ‘SELECT’ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, RN, AbsoluteDeviation, Probability, ”IsOutlier”
= CASE WHEN (RN * Probability) < 0.5 THEN 1 ELSE 0 END
FROM (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, RN, AbsoluteDeviation, ”Probability” = CASE WHEN PopulationOutsideRange IS NOT NULL THEN PopulationOutsideRange
WHEN PopulationOutsideRange IS NULL AND AbsoluteDeviation < 1 THEN 1 ELSE 0 END
FROM (SELECT T1.’ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, CAST(ROW_NUMBER() OVER (ORDER BY ‘ + @ColumnName + ‘ ASC) AS bigint) AS RN,
Abs(‘ + @ColumnName + ‘ – Avg(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))) OVER (ORDER BY ‘ + @ColumnName + ‘ ASC ROWS UNBOUNDED PRECEDING)) /
NullIf(StDev(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))) OVER (ORDER BY ‘ + @ColumnName + ‘ ASC ROWS UNBOUNDED PRECEDING), 0) AS AbsoluteDeviation
FROM ‘ + @SchemaAndTableName + ‘
WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1
LEFT JOIN Calculations.StandardNormalDeviationTable AS T2
ON T1.AbsoluteDeviation BETWEEN T2.StandardDeviations – 0.25 AND T2.StandardDeviations + 0.25
WHERE AbsoluteDeviation IS NOT NULL) AS T3
ORDER BY IsOutlier DESC, AbsoluteDeviation
DESC, Probability DESC‘
–SELECT @SQLString — uncomment this to debug string errors
EXEC (@SQLString)
…………If you’ve been following this series, most of the rest is self-explanatory: the first few parameters allow users to apply the test to any column in any database for which they have permissions, while the @DecimalPrecision parameter allows them to adjust the precision and scale to avoid arithmetic overflows in the calculations. The rest is just the usual dynamic SQL, with a couple of nested subqueries to allow calculations like the Z-Score to be bubbled up and handled at higher levels. As usual, I’ve omitted any code to handle spaces in object names (which I never allow when I can get away with it) or SQL injection protections.
Figure 3: Results for the Chauvenet Criterion Procedure
EXEC [Calculations].[ChauvenetCriterionSP]
@DatabaseName = N’DataMiningProjects‘,
@SchemaName = N’Health‘,
@TableName = N’DuchennesTable‘,
@ColumnName = N’Hemopexin‘,
@PrimaryKeyName = N’ID’,
@DecimalPrecision = N’5,1′
…………Chauvenet’s Criterion turned out to be much more conservative than I expected it be, at least in terms of flagging outliers in columns without a lot of repeating values. Only a single value was identified as an outlier in the Hemopexin column of a 209-row dataset on the Duchennes form of muscular dystrophy, which I downloaded from Vanderbilt University’s Department of Biostatistics and have been using for practice throughout this series. On another practice dataset with 1,600 rows it found only three, far fewer than any of the other procedures tested to date. I was startled at how well the procedure performed against the first float column in the Higgs Boson dataset, which I downloaded from the University of California at Irvine’s Machine Learning Repository and turned into a nearly 6-gigabyte table. Despite the fact that the logic for both procedures is quite similar, Chauvenet’s test took only 3 minutes and 15 seconds to run on my poor beat-up six-core imitation of a workstation, compared to an hour and fifteen minutes for the Modified Thompson Tau test procedure. The execution plan in last week’s tutorial was too small to bother to post, since it consisted mostly of a single Sort operation that sucked 95 percent of the computational cost, whereas the Chauvenet execution plan was too wide to fit here and was comprised of several other operators like Compute Scalar, Nested Loops and Parallelism (Gather Streams). It also included a single Sort, but it only accounted for 38 percent of the expense.
…………It may perform surprisingly well and have its uses on columns with few repeating values when a conservative identification threshold is called for, but this century-and-a-half old test has many drawbacks that should not be understated. The requirement of a Gaussian distribution and the difficulty in getting accurate probability values for the size of the datasets DBAs work with are only the tip of the iceberg. The Central Limit Theorem on which it is based is mathematically based is much more trustworthy than other probabilistic methods extant today, but it is a fallacy to believe that probabilities represent guarantees or have any effect on causation. I’ve barely touched on this issue yet, but the aforementioned Wikipedia article on the 68–95–99.7 Rule put it a lot better than I can: “…it is important to be aware of the fact that there is actually nothing in the process of drawing with replacement that specifies the order in which the unlikely events should occur, merely their relative frequency, and one must take care when reasoning from sequential draws. It is a corollary of the gambler’s fallacy to suggest that just because a rare event has been observed, that rare event was not rare. It is the observation of a multitude of purportedly rare events that undermines the hypothesis that they are actually rare.”[2] The remedy for this uncertainty is the same as for the more serious issue of deletion: further investigation, not knee-jerk deletion of records. As geneticist David M. Glvoer and oceanographers Scott Christopher Doney and Wiliam J. Jenkins put it in their 2011 book, Modeling Methods for Marine Science:
“Now the truly clever researcher might be tempted to perform this rejection iteratively. That is, why not compute a mean and standard deviation, Z-score the data and reject the fliers, then compute an even better mean and standard deviation and do the same thing all over again, rejecting more data. The advice of all the statistical sages and texts is do Chauvenet rejection only once in a given distribution. If the data were normally distributed, and there weren’t many fliers, you’ll probably find that the second iteration will not yield any more rejectable points. If it does, then it suggests that your data may not be normally distributed. The philosophy is that filtering once is a valid thing to do, but iterative filtering may dramatically alter the data distribution in a fundamental way, invalidating the assumptions behind your statistical calculations, and leading to erroneous results. Moreover, you may accused of being a Chauvenet Chauvinist.”[3]
This is professional confirmation of the Catch-22 I’ve always fretted about with the normal distribution: the more outliers that are found, the less likely it is that a Gaussian bell curve is active, in which case most of these hypothesis-testing based outlier detection methods are invalid. Another Catch-22 is operative when we’re recklessly deleting data in a recursive routine like Chauvenet’s Criterion and the Modified Thompson Tau test: the more we delete, the bigger the impact on the dataset will be. If we follow Glover et al.’s suggestion and limit the criterion to a single use, it’s hardly applicable to a SQL Server database where we may need to find tens of thousands of outliers, while looking for data quality issues or doing exploratory data mining. Such a wide scope also calls for degrees of precision that aren’t readily available in regular lookup tables and would probably be quite costly to compute. The criterion may have been better than nothing when Chauvenet wrote his paper back in the Civil War era, but it’s really hard to justify its use, even in many of the hypothesis testing scenarios it was designed for. Nevertheless, academia and research labs across the planet are apparently still staffed by many of those “Chauvenet Chauvinists” today. While researching this article (including reading parts of Chauvenet’s original paper in .pdf format, which I’ve since lost) I ran across many comments like this one from Stephen Ross, a professor of mechanical engineering at the University of New Haven:
“Peirce’s criterion has been buried in the scientific literature for approximately 150 years. It is virtually unknown today in the scientific community. In its place, Chauvenet’s criterion is commonly used for rational elimination of “outlier” data by government laboratories, (e.g., Environmental Protection Agency, U.S. Army Corps of Engineers, Agency for Toxic Substances and Disease Registry, Institute for Telecommunication Sciences), industry (e.g., Boeing, Sikorsky), foreign laboratories (e.g., Laboratoire National Henri Becquerel, Joint Astronomy Centre), and universities (e.g., research and courses at University of Michigan, Texas A&M, University of California, Vanderbilt, University of Alberta, Ohio State). Methods of elimination of data “outliers” are useful for anyone working in industry or in an educational institution where statistical information concerning product runs or experimental data is of interest. In an engineering, technology or science program, laboratory courses in chemistry, physics and engineering can, and do, find use for rational spurious data elimination. In the BSME program at the University of New Haven, we have used Chauvenet’s criterion in our instrumentation and fluid/thermal laboratory courses for many years. Other universities have similarly used this criterion in their undergraduate laboratories. Typically, students take several measurements of a quantity, say pressure, at one setting (meaning the experimental conditions are maintained at the same level). Assuming the systematic errors are negligible, each measurement will vary slightly due to random errors (e.g., reading instrument values, flow rate may change slightly, etc.). Often, however, one or two datum points seem to fall “far” outside the range of the others obtained. These outliers greatly impact the mean and standard deviation of the, measurements. A data elimination method can be used to obtain a realistic average value of pressure and an “uncertainty” in the true value given by the standard deviation…Chauvenet’s criterion is in common use today for elimination of suspect data.”[4]
…………Ignorance is bliss. I started off this series with some dire warnings about how haphazardly statistics are handled today, especially in fields like medicine where they can do the most damage. The more I’ve learned while writing this series, the less reassured I’ve become. One of the clearest lessons I’ve learned from this exercise is that, if the SQL Server community and the rest of the database server field get in the habit of routinely doing outlier detection (as I suspect they will, in time), they really need to avoid simply copying the means used in other fields. Chauvenet’s Criterion and the other five hypothesis-testing based methods don’t seem to be well-suited to the Big Data buzzword at all, but it doesn’t stop there: in many cases, they’re not even applied correctly in industries where they’re used on a daily basis, such as medical research. So far in this series, only Benford’s Law and Z-Scores appear to fit our use cases well, although I have high hopes for upcoming topics like Interquartile Range, Cook’s distance and Mahalanobis distance, as well as the various visual means that can be implemented in Reporting Services. Next week’s article on Peirce’s Criterion is also likely to be more valuable to DBAs. As Ross points out in an article on that topic, even Chauvenet recommended it in place of his own test: “Chauvenet himself believed that Peirce’s work was more rigorous and could be applied more generally, and in Chauvenet’s words, ‘For the general case….. when there are several unknown quantities and several doubtful observations, the modifications which the rule (meaning his own criterion) requires renders it more troublesome than Peirce’s formula……What I have given may serve the purpose of giving the reader greater confidence in the correctness and value of Peirce’s Criterion.’”
…………What’s good enough for Chauvenet is good enough for me. Why his advice not to use his own test apparently isn’t heeded in academia and private sector research is beyond me. Perhaps it is only a matter of habit, like the completely arbitrary custom of using confidence levels like 95 percent in hypothesis testing. Hopefully it is a custom that DBAs won’t adopt without some thought; perhaps Chauvenet’s Criterion has a place in our tool belts for unusual use cases, but it ought to be a very small place, considering how many more fitting outlier detection methods we have available to us.
[1] See the Wikipedia pages “68–95–99.7 Rule” and “Standard Deviation” at http://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule “68–95–99.7 Rule” and
http://en.wikipedia.org/wiki/Standard_deviation “Standard Deviation” respectively.
[2] IBID.
[3] p. 29, Glover, David M.; Jenkins, William J. and Doney, Scott Christopher, 2011, Modeling Methods for Marine Science. Cambridge University Press: New York. I found this reference at the Google Books web address http://books.google.com/books?id=OYAkMs85848C&q=chauvenet#v=snippet&q=chauvenet&f=false
[4] pp. 3-4, Ross, Stephen M. “Peirce’s Criterion for the Elimination of Suspect Experimental Data,” pp. 1-12 in the Journal of Engineering Technology, Fall 2003. Vol. 2, No. 2. http://newton.newhaven.edu/sross/piercescriterion.pdf