By Steve Bolton
…………There are apparently many subtle variations on Z-Scores, a ubiquitous measure that is practically a cornerstone in the foundation of statistics. The popularity and ease of implementation of Z-Scores are what made me decide to tackle them early on in this series of amateur self-tutorials on using various components of SQL Server to detect those aberrant data points we call "outliers." As discussed in the last two posts, there are many different means of identifying outliers, which may be understood best by categorizing them by their use cases; the right choice of detection tools is essentially a function of the questions one wants to ask of the data, the number and types of inputs, the desired mathematical properties of the outputs and, in between, the performance and other requirements involved in transforming the inputs into outputs. From my scant understanding of what little literature I've read on the topic, statisticians and other researchers commonly encounter use cases where the sensitivity of ordinary measurements to outliers has to be toned down, often in response to fat-tailed (i.e. heavy-tailed) distributions. The Modified Z-Scores developed by Temple University Prof. Boris Iglewicz and University of Massachusetts Prof. David C. Hoaglin are one means of adjusting Z-Scores for such cases, but hardly the only one. I'm highlighting them merely because I was introduced to them early on, while trying to learn the subject of stats from the National Institute of Standards and Technology's Engineering Statistics Handbook, one of the best online resources for anyone trying to wade through this notoriously dry subject.
Iglewicz and Hoaglin suggest specific cut-off criteria for their measure, which can of course be adjusted as needed by users, and that raises the whole question of researchers "moving the goalposts" or setting them haphazardly when using any means of outlier detection. Correct classification is a thorny issue with every method we'll discuss in this series; I'm merely using the hard boundary associated with Modified Z-Scores as an introduction to the all-important topic of subjectivity in category definitions. As an amateur, I can't give any guidance on whether or not to use a specific cut-off point for classifying outliers, although it would seem to be common sense that more nuanced rough and fuzzy sets ought to be used in place of hard limits more often than they are. It is worth reiterating that outliers also vary significantly in their causes and the responses made to them, not just the means of differentiating them. The frequent mismatches between causes and responses, and the lack of attention paid to discerning them, both leave the door wide open to innumerable fallacies and sometimes outright fraud, which as I discussed earlier in this series is frighteningly common in certain fields. The definition of an outlier is subjective, depending on the kind of investigation a researcher chooses to perform, but whether or not a particular data point meets the chosen criteria is wholly objective. Fallacies and fraud arise when the distinction between the proper places of subjectivity and objectivity is blurred; the whole history of human philosophy demonstrates that when the former is loosed from such bonds, the result is always maniacal madness. For example, a person can choose to affix the name "pepperoni pizza" to anything they want; but once they've set tomato sauce, bread, cheese and the like as part of the criteria, then they can't pretend that a pencil sharpener or a Planck length fits the definition, because whether or not something consists of the ingredients set forth in the criteria is an objective matter. That's plain common sense, which suddenly becomes uncommon when the labeler has an incentive to fudge their definitions, or worse yet, a pedantic justification for it, like solipsism (i.e., one of the major symptoms of schizophrenia). Outlier detection presents a serious temptation to simply ignore the distinctions between causes and to put no effort into differentiating the correct responses to them, so that data miners and others who use these tools frequently just delete records that don't fit their models and theories, or adjust their definitions of the term to achieve the same purpose. I'll delve into outlier deletion in more depth a few posts from now, but the issue of subjective limits can serve as a transition into my usual dire disclaimer: math formulas, including those underpinning data mining algorithms and outlier detection, reside in a Pandora's Box. The Modified Z-Scores under discussion today do not open the box any wider than any other formula; this is merely the context in which all statistical measures naturally reside, in which the slightest logical deviation in their use may lead to erroneous, misleading or even fraudulent conclusions. Data mining tools can be used quite easily by amateurs like myself for exploratory data analysis, but need to be handled like scalpels when attempting to prove a specific point. Nevertheless, they're often employed carelessly like sledge hammers by professionals in many different fields, particularly health care.
The specter of fallacious reasoning hems us in on all sides, and wielding these tools properly for this purpose requires more skill than the old board game of Operation. The difference with math and logic is that there is no buzzer to warn us when we've used them wrong; there may be terrible consequences down the line in the form of falling bridges and adverse medical reactions, but the intelligentsia also has the intellectual power to explain those away using the same poor reasoning. What is called for here is not intelligence, but wisdom; without it, outlier detection methods merely prove the old adage, "There are three kinds of lies: lies, damned lies and statistics."[ii] No data mining tool or math formula is going to provide a quick fix for this overarching problem, which hangs like a Sword of Damocles over everything that researchers, scientists, mathematicians and data miners do; the only fix is to apply the use of reason rigorously, which requires a deep understanding of logical fallacies and, in turn, painful self-examination. Neither I nor most of the DBAs who read this probably have that kind of training, so our use cases ought to be limited to exploratory analysis – which can be a highly productive exercise, even for the unqualified – rather than hypothesis testing and the like.
…………The point of using Modified Z-Scores is to address situations where it is desirable to reduce the sensitivity to outliers, so that there are for all intents and purposes fewer false positives when classifying them. Whether or not such reduced sensitivity is a wise choice for the problem at hand is one question; whether or not Modified Z-Scores succeed in providing it seems to be an open and shut case. In this series I'm trying to grasp the mechanisms that make these algorithms and formulas work as they do, which is something I didn't delve into adequately in my series on SQL Server Data Mining (SSDM). The reason why Iglewicz and Hoaglin's Modified Z-Scores are less sensitive to outliers without being completely blind to them is that they use medians rather than means; the median is an alternate measure of central tendency that is known for being less affected by unusual values. Both medians and means are basically primordial forms of clustering that identify a specific location near the center of a dataset, but the former is less affected by the most distant points. The formula given at the NIST website is not all that difficult to decipher or code in T-SQL; I was unable to get ahold of a copy of their original paper to see the reasoning behind the constant that appears in it, but it is child's play to simply include it in the code and be done with it.[iii] This was my first introduction to the median absolute deviation (MAD), which is "a variation of the average absolute deviation that is even less affected by extremes in the tail because the data in the tails have less influence on the calculation of the median than they do on the mean."[iv] I initially confused it with a more common calculation, the mean absolute deviation, because of the similar names. The idea is basically the same though: instead of averaging deviations from the mean, we compare each data point to the median of the whole dataset, then take a new median of the absolute values of those distances. Then we subtract the median from each data point again, multiply that result by a constant, 0.6745, and divide the result by the MAD. The equations are actually quite easy to read; most of the T-SQL involved in implementing them is dedicated to calculating the two medians, using some subqueries and windowing functions. I've precalculated both in Common Table Expressions (CTEs) at the beginning of this week's T-SQL stored procedure, because this reduces them to one-time operations and makes the complicated dynamic SQL a little more legible. The T-SQL in Figure 1 could be streamlined further to suit your needs by removing the DENSE_RANK calculation, the OutlierCandidate column and the @OrderByCode logic, which are dispensable elements I've added as conveniences.
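In plainer notation, the calculation the procedure below implements works out to roughly the following, where X is the column being tested and the medians are taken over its non-NULL values (this is my own paraphrase of the NIST formula, not a quotation):

MAD = Median( |X - Median(X)| )
Modified Z-Score = 0.6745 * (X - Median(X)) / MAD

Iglewicz and Hoaglin then flag any point whose Modified Z-Score has an absolute value greater than 3.5 as a potential outlier, which is the same cut-off the OutlierCandidate column in Figure 1 tests against.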
Figure 1: Code for the Modified Z-Score Stored Procedure[v]
CREATE PROCEDURE [Calculations].[ModifiedZScoreSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50)
-- @OrderByCode: 1 is by PK ASC, 2 is by PK DESC, 3 is by ColumnName ASC, 4 is by ColumnName DESC, 5 is by ZScore ASC, 6 is by ZScore DESC
AS
SET @DatabaseName = @DatabaseName + '.'
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName --I'll change this value one time, mainly for legibility purposes

SET @SQLString =
'DECLARE @OrderByCode as tinyint, -- pass the outer value like a parameter of sorts
@Median AS decimal(' + @DecimalPrecision + '),
@MedianAbsoluteDeviation AS decimal(' + @DecimalPrecision + ')

-- PRECALCULATED STATS
--------------------------
-- precalculating these two stats not only makes the code more legible, but is more efficient because it is a one-time operation
-- first get the median
WITH MedianCTE
(' + @ColumnName + ', RN, DenseRank)
AS
(
SELECT ' + @ColumnName + ', RN, DENSE_RANK() OVER (PARTITION BY 1 ORDER BY RN DESC) AS DenseRank
FROM (SELECT ' + @ColumnName + ', ROW_NUMBER() OVER (ORDER BY ' + @ColumnName + ') AS RN
       FROM ' + @SchemaAndTableName + '
       WHERE ' + @ColumnName + ' IS NOT NULL) AS T1
)
SELECT @Median = Avg(' + @ColumnName + ') FROM MedianCTE WHERE RN BETWEEN DenseRank - 1 AND DenseRank + 1;

-- get the MedianAbsoluteDeviation
WITH MedianAbsoluteDeviationCTE
(' + @ColumnName + ', RN, DenseRank)
AS
(
SELECT NewMedian, RN, DENSE_RANK() OVER (PARTITION BY 1 ORDER BY RN DESC) AS DenseRank
FROM (SELECT NewMedian, ROW_NUMBER() OVER (ORDER BY NewMedian) AS RN
       FROM (SELECT ABS(' + @ColumnName + ' - @Median) AS NewMedian
              FROM ' + @SchemaAndTableName + '
              WHERE ' + @ColumnName + ' IS NOT NULL) AS T1) AS T2
)
SELECT @MedianAbsoluteDeviation = Avg(' + @ColumnName + ')
FROM MedianAbsoluteDeviationCTE
WHERE RN BETWEEN DenseRank - 1 AND DenseRank + 1;

--SELECT @Median
--SELECT @MedianAbsoluteDeviation

SET @OrderByCode = ' + CAST(@OrderByCode AS nvarchar(50)) + '

-- OUTLIER COMPARISON OPERATIONS
-- now check each data point
SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', ModifiedZScore, DENSE_RANK () OVER (ORDER BY ModifiedZScore) AS GroupRank,
''OutlierCandidate'' = CASE WHEN Abs(ModifiedZScore) > 3.5 THEN 1
       ELSE 0
       END
FROM (SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', ((0.6745 * (' + @ColumnName + ' - @Median)) / @MedianAbsoluteDeviation) AS ModifiedZScore
       FROM ' + @SchemaAndTableName + ') AS T1
ORDER BY
CASE WHEN @OrderByCode = 1 THEN ' + @PrimaryKeyName + ' END ASC,
CASE WHEN @OrderByCode = 2 THEN ' + @PrimaryKeyName + ' END DESC,
CASE WHEN @OrderByCode = 3 THEN ' + @ColumnName + ' END ASC,
CASE WHEN @OrderByCode = 4 THEN ' + @ColumnName + ' END DESC,
CASE WHEN @OrderByCode = 5 THEN ModifiedZScore END ASC,
CASE WHEN @OrderByCode = 6 THEN ModifiedZScore END DESC'

--SELECT @SQLString -- uncomment this line to debug string errors
EXEC (@SQLString)
…………Note that I'm using almost all of the same parameters and dynamic SQL format as in the article on regular Z-Scores. The combination of the first three parameters allows you to execute the procedure against any table in any database, assuming you have the requisite permissions. Just like in the last tutorial, the @OrderByCode parameter allows you to sort the results by 1) the primary key values, ascending; 2) the primary key values, descending; 3) the column values, ascending; 4) the column values, descending; 5) the Modified Z-Score, ascending; and 6) the Modified Z-Score, descending. You'll have to supply your own @DecimalPrecision values and tweak them to avoid arithmetic overflows, which are tricky to handle when multiple calculations can change the number of decimal places repeatedly. I usually try setting these values to the original precision and scale of decimal and numeric columns if they're high enough, but when working with integers you'll have to decide how many decimal places are appropriate for your output. You can debug the dynamic SQL by uncommenting the next-to-last line and the two other lines that begin with comment marks and SELECTs. I've also used the DENSE_RANK windowing function again to assign identical results to specific groups by their Modified Z-Score values, which comes in handy with columns that have few distinct values repeated many times. The OutlierCandidate column is merely a bit flag that reveals whether or not the ModifiedZScore falls outside the ±3.5 threshold set by Iglewicz and Hoaglin. Your requirements may be different, so feel free to change the threshold or eliminate it altogether; it wouldn't be all that difficult either to replace hard thresholds like this with more flexible fuzzy set criteria with graded memberships. If you use @OrderByCode 5 or 6, values where OutlierCandidate = 1 will be sorted to the top and bottom of the results. As usual, you'll have to add your own brackets and logic to handle spaces if you allow them in your object names (I have a ruthless ban on them in my own code, for legibility purposes) and program in your own security to handle risks like SQL injection; the sketch below illustrates one way to start.
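As a rough illustration of what that hardening might look like (this is only a sketch under my own assumptions, not part of the procedure above, and the sample values are the hypothetical ones from Figure 2), QUOTENAME can bracket-delimit each object name and escape any embedded brackets before the names are concatenated into the dynamic SQL, while a quick metadata check can reject names that don't resolve to real objects:

-- a minimal sanitization sketch; the variable names mirror the procedure's parameters for illustration only
DECLARE @SchemaName nvarchar(128) = N'Physics',
       @TableName nvarchar(128) = N'HiggsBosonTable',
       @ColumnName nvarchar(128) = N'Column1';

-- QUOTENAME wraps each identifier in brackets and doubles any embedded ] characters,
-- so a malicious parameter value cannot break out of the identifier position in the dynamic SQL
DECLARE @SafeSchemaAndTableName nvarchar(600) = QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName);
DECLARE @SafeColumnName nvarchar(258) = QUOTENAME(@ColumnName);

-- reject anything that doesn't resolve to a real column before building @SQLString at all
IF COL_LENGTH(@SafeSchemaAndTableName, @ColumnName) IS NULL
   RAISERROR(N'The specified column or table does not exist.', 16, 1);

In the procedure itself, the quoted versions would then be substituted wherever @SchemaAndTableName, @ColumnName and @PrimaryKeyName are concatenated into the string.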
Figure 2: Results for Column1 of the HiggsBosonTable
EXEC [Calculations].[ModifiedZScoreSP]
@DatabaseName = N'DataMiningProjects',
@SchemaName = N'Physics',
@TableName = N'HiggsBosonTable',
@ColumnName = N'Column1',
@PrimaryKeyName = N'ID',
@OrderByCode = 6,
@DecimalPrecision = N'33,29'
Figure 3: Client Statistics for the Modified Z-Scores Procedure
…………In last week's tutorial, I tested my Z-Score stored procedure on the first float column of a nearly 6-gigabyte table from the Higgs Boson Dataset, which is made publicly available by the University of California at Irvine's Machine Learning Repository. In future mistutorials I will use a dataset on the Duchennes form of muscular dystrophy provided by Vanderbilt University's Department of Biostatistics, as well as transcriptions of the Voynich Manuscript, a creepy tome whose medieval author encrypted it so well that no one has been able to crack it since, including the National Security Agency (NSA). For the sake of consistency, I tested my Modified Z-Scores procedure against the same Higgs Boson column. The query at the top of Figure 2 returned the results depicted, plus about 11 million more rows that I somehow couldn't cram into the article. There were some records at the very bottom with Modified Z-Scores near -1, but none that met Iglewicz and Hoaglin's cut-off point for outliers.
…………I didn't bother to post screenshots of the execution plans because they weren't pretty, nor would they fit on an NFL scoreboard. The novel approach I took of comparing the middle point of two windowing functions moving in opposite directions added a lot of expensive sorts, which even the addition of a non-clustered index couldn't fix. As depicted in Figure 3, the index improved the Client Processing Time and Total Execution Time significantly, but the procedure still consumed far too much memory on my poor beat-up development machine and took too long for my tastes. It will do just fine on columns in small tables, but expect it to take a while if you're executing it on 11 million rows of a nearly 6-gigabyte table using an outdated workstation in place of a real server.
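If those expensive sorts become a sticking point on your own systems, one alternative that might be worth experimenting with is to let the built-in PERCENTILE_CONT function compute the two medians instead of my home-grown ranking logic. The snippet below is only a sketch under my own assumptions: it requires SQL Server 2012 or later, it hard-codes the Physics.HiggsBosonTable column from Figure 2 rather than using dynamic SQL, and I haven't benchmarked it against the stored procedure above.

-- a sketch of the same Modified Z-Score arithmetic, with PERCENTILE_CONT supplying both medians
DECLARE @Median float, @MedianAbsoluteDeviation float;

-- median of the column itself; PERCENTILE_CONT returns the same value on every row, so TOP (1) suffices
SELECT @Median = MedianValue
FROM (SELECT TOP (1) PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Column1) OVER () AS MedianValue
       FROM Physics.HiggsBosonTable
       WHERE Column1 IS NOT NULL) AS T1;

-- median of the absolute deviations from that median, i.e. the MAD
SELECT @MedianAbsoluteDeviation = MADValue
FROM (SELECT TOP (1) PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ABS(Column1 - @Median)) OVER () AS MADValue
       FROM Physics.HiggsBosonTable
       WHERE Column1 IS NOT NULL) AS T1;

-- the same formula and ±3.5 cut-off as in Figure 1
SELECT ID, Column1,
       0.6745 * (Column1 - @Median) / @MedianAbsoluteDeviation AS ModifiedZScore,
       CASE WHEN ABS(0.6745 * (Column1 - @Median) / @MedianAbsoluteDeviation) > 3.5 THEN 1 ELSE 0 END AS OutlierCandidate
FROM Physics.HiggsBosonTable;

Whether this actually performs better on a table of that size is an open question I haven't tested.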
…………That drawback ought to refocus attention on one of the caveats I want to stress in this series: I'm posting these articles because I don't know what I'm doing and want to learn, not because I have any real expertise. As with my series A Rickety Stairway to SQL Server Data Mining, I'm passing on my misadventures so that others don't repeat them. Another error I rectified along the way was accidentally substituting a mode for the median while wool-gathering; that procedure might actually be useful in catching certain outliers and I will post it if anyone thinks they can benefit from it, but the bottom line is that I almost posted an article based on the wrong formula. Just keep in mind that my code samples in this series will always need further testing before going into a production environment. Consider these posts an introduction to the topic, not the last word. If all goes according to plan, I'll be introducing both myself and my readers to Chauvenet's Criterion, a means of outlier detection that is intrinsically dependent on a Gaussian distribution. I may follow that up by going off on a tangent with some fairly easy means of outlier detection, like Grubbs' Test, the Tietjen-Moore Test, the Generalized Extreme Studentized Deviate (ESD) Test, Interquartile Range and Dixon's Q-Test. At some point I'll also get into a discussion of Visual Outlier Detection with Reporting Services (featuring a lot of eye candy) and do a quick recap of Clustering with SSDM. Towards the end of the series I'll tackle Cook's Distance and the Modified Thompson Tau Test, then end with the daunting task of coding Mahalanobis Distance. I hope to use that last post as a springboard towards a much longer and more difficult series months down the line, Information Measurement with SQL Server.
That’s more “dry humor,” but not as bad as the hydrology joke in the last column.
[ii] Although frequently attributed to Mark Twain and Benjamin Disraeli, the quip apparently originated with British politician Leonard H. Courtney in 1895. See the TwainQuotes webpage “Statistics” at http://www.twainquotes.com/Statistics.html.
[iii] See "Detection of Outliers," an undated article published at the National Institute of Standards and Technology's Engineering Statistics Handbook website. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm. The page in turn cites Iglewicz, Boris and Hoaglin, David, 1993, "Volume 16: How to Detect and Handle Outliers," The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
[iv] See "Measures of Scale," an undated article published at the National Institute of Standards and Technology's Engineering Statistics Handbook website. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda356.htm#MAD
[v] While coding this, I forgot how to use modulo properly and consulted Byers, Mark, 2009, response to the thread "How Can I Determine is Digit Even Number?" published Nov. 26, 2009 at StackOverflow.com. Available online at http://stackoverflow.com/questions/1805420/how-can-i-determine-is-digit-even-number. I also double-checked my median calculations against the MathIsFun webpage "How to Find the Median Value" at http://www.mathsisfun.com/median.html.