By Steve Bolton
…………In the last two installments of this series of amateur self-tutorials (apart from the Fisher Information article, which I deliberately published out of order), I introduced a family of distance measures that incorporate the ordinary Euclidean Distance in their underlying formulas. Like the L1, the L2 family I’ll touch on today also builds on the Euclidean Distance, but in its squared form; essentially, each member subtracts one set of data points or probabilities from another and squares the result, then applies a couple of other simple math operations like multiplication or addition to the same values in the dividend and divisor, before or after feeding them to a SUM aggregate. As I’ve addressed in recent articles, there are many ways of organizing the confusing array of distance measures used in data mining activities (particularly clustering), but I’ve settled on Computer Science Prof. Cha Sung-Hyuk’s well-written taxonomy for this segment of the Information Measurement series.[1] Another good source with a more exhaustive list of this hodge-podge of measures is Deza’s Dictionary of Distances, which is chock full of simple formulas but light on explanation.[2] As Sung-Hyuk points out, the “cornerstone”[3] of the L2 group is the Pearson Divergence, which seems to be mentioned in the data mining and information theory literature far more often than any of its kin, except perhaps the Squared Euclidean Distance it is derived from. Fortunately, calculating these distances in SQL Server is surprisingly cheap, to the point of being trivial; the difficulty consists mainly in matching the right measures to the problems at hand, which can be tricky even for trained professionals, which I am not. To establish some criteria for selecting measures from the L2 group, it helps to understand some of the properties of the Pearson Divergence, the pole star around which the others revolve. The main justification for its kin seems to be to supply additional properties that the Pearson Divergence sorely lacks.
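…………To make that pattern concrete, the Squared Euclidean Distance the rest of the family is built upon can be written as follows, where P and Q stand for the two sets of probabilities being compared and n for the number of values (the notation is a loose paraphrase of the kind used in Sung-Hyuk’s survey); the other members essentially divide each squared difference by some combination of P_i and Q_i before the summation:

d_{SquaredEuclidean}(P, Q) = \sum_{i=1}^{n} (P_i - Q_i)^2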
…………Almost all of the members of the L2 have the symbol χ2 baked right into their names to link them to the Chi-Squared tests and distribution, which were developed by pioneering statistician Karl Pearson back around 1900; this is also why the L2 is referred to as the χ2 family. In Pearson’s own words, the point of developing these statistical devices was essentially to gauge the “probability of improbability of a given system of deviations.”[4] The χ2 tests and distribution have since earned a place among the most popular and indispensable tools in statistics, particularly for cases where distributions outside the “Gaussian” normal case (i.e. the bell curve) are involved. In fact, the formula for the Pearson Divergence is identical to the one I discussed in Goodness-of-Fit Testing with SQL Server, Part 5: The Chi-Squared Test, except that we’re plugging in probabilities (raw values might also work in some circumstances) from two different distributions to assess how far apart they are, rather than observed and expected frequencies from the same distribution. The χ2 test of independence does a good job of detecting linkages between variables, but its formula is intrinsically different, although it operates on similar principles. For these reasons, the Pearson Divergence might be my first choice in DIY data mining scenarios where we’re measuring the similarities between two sets of variables or probabilities in which a χ2 distribution might be involved, or where we need a general tried-and-true method of differentiating between categories. Ascertaining exactly when to use particular distance measures is a tricky job even for professionals (which I am not – yet), but this might at least give us a starting point. The fatal flaw in the Pearson Divergence is that it’s precisely that, a divergence, which in a technical sense is a distance measure that lacks the desirable metric properties of symmetry and subadditivity. To grasp how useful the first quality can be, imagine the difficulties that might arise if you measured the distance from one point of your house to another – then ended up with a completely different figure when you measured the same span in reverse. Subadditivity performs a similar service, by guaranteeing that the distance between any two of three points can never exceed the sum of the other two distances, hence the alternate term “triangle inequality.” The raison d’etre of the other six members of the group seems to be to add these missing properties to the Pearson Divergence. Sung-Hyuk says point-blank that “of particular concern to mathematicians is that Pearson χ2 divergence is asymmetric.”[5] If a particular use case required symmetry, I’d try the Probabilistic Symmetric χ2 and Additive Symmetric χ2 versions next. As we shall see, it is quite easy to calculate them all at once, so we can afford to indulge in some experimentation to meet whatever requirements our data demands, whether it comes from product inventories, Customer Relationship Management (CRM) information, order-taking systems or any other common SQL Server application.
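…………For reference, the Pearson Divergence and its two symmetric substitutes take the following forms, in my own loose paraphrase of the notation in Sung-Hyuk’s survey; note how the symmetric versions replace the lone Q_i divisor with terms that treat P and Q interchangeably, which is precisely what buys them the symmetry the original lacks:

d_{Pearson}(P, Q) = \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{Q_i}

d_{ProbabilisticSymmetric\,\chi^2}(P, Q) = 2 \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{P_i + Q_i}

d_{AdditiveSymmetric\,\chi^2}(P, Q) = \sum_{i=1}^{n} \frac{(P_i - Q_i)^2 \, (P_i + Q_i)}{P_i \, Q_i}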
Implementing the Latest Members of the L2 in T-SQL
Some alternatives of greater complexity in the L2 class might also be worthy of investigation. In his original paper, Jerzy Neyman (1894-1981) – a Polish mathematician who was probably fortunate to have emigrated to California in 1939, ahead of the Second World War[6] – focused on using the χ2 variant that now bears his name to derive Maximum Likelihood Estimates (MLEs) for several variants of the χ2 tests.[7] Honestly, however, I had trouble wrapping my head around MLEs until my Fisher Information tutorial, so I won’t comment much further on it at this point. One of the measures is known simply as “Divergence,” which makes finding information on it in casual Internet searches akin to looking for a needle in a haystack, since there are so many other divergences in use (particularly the Kullback-Leibler, Bregman and Rényi). A couple of articles ago, I noted how the ordinary χ2 Distance is something of an oddball, since it has certain features in common with the Minkowski, L1 and L2. Among these relationships is the fact that it represents half of this Divergence, so this might also be a candidate for investigation. I also can’t find many references on the Clark Distance (probably because I’m an amateur who’s still unfamiliar with the material). Suffice it to say that the square root in its formula is useful in providing subadditivity[8] and it is the only member of the L2 with one in its definition, so I strongly suspect that it is designed to enhance the χ2 distance measures with that property. The absence of a root in the Squared Euclidean formula is probably why all of the measures that depend upon it also seem to lack one, unless more math operations are tacked on to restore it; the simple Euclidean Distance, in contrast, remains a true metric because it is based on the root of the Squared Euclidean. I’ve included the ordinary Euclidean Distance in Figure 1 for comparison purposes even though it’s not part of the family, for the same reason I included the Squared Euclidean in Information Measurement with SQL Server Part 4.3: Euclidean Distance and the Minkowski Family even though it’s actually a member of the L2. As noted in that article, one of the important distinctions between the Euclidean and Squared Euclidean is that the latter basically weights values on a quadratic rather than a linear scale, so that the distances between points increase by leaps and bounds rather than at a steady pace. This property can be useful in certain use cases and might be preserved in some of the χ2 measures derived from it. Keep in mind, however, that the squaring operations are more problematic on platforms like SQL Server, where we have to contend with hard limits on our float and numeric data types, unlike in the realm of ordinary statistics; plug a “Big Data” dataset with large values into them and you might just run into arithmetic overflow errors. Mathematicians routinely add squaring operations as a convenient means of rescaling, so I’ve often wondered whether or not it would be safe to replace them with ABS operations that perform the same service, but aren’t susceptible to overflows. In addition, seven out of the eight members run the risk of divide-by-zero errors – more than in any other family in this series – although it is a piece of cake to prevent them with the kind of CASE logic already included in Figure 1.[9] I’ve also included the Kumar-Johnson Distance that Sung-Hyuk classifies as part of a separate family based on combinations of other distances, in part because that family’s too small and obscure to discuss separately and partly because its definition includes a Squared Euclidean-style dividend.[10]
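…………For reference, the remaining measures implemented in Figure 1 can be written as follows, once again loosely paraphrasing Sung-Hyuk’s notation (the Kumar-Johnson Distance belongs to his separate “combinations” family); note the square root that supplies the Clark Distance with subadditivity and the (P_i Q_i)^{3/2} term in the Kumar-Johnson divisor, which makes its divide-by-zero guard especially important:

d_{Neyman}(P, Q) = \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{P_i}

d_{Squared\,\chi^2}(P, Q) = \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{P_i + Q_i}

d_{Divergence}(P, Q) = 2 \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}

d_{Clark}(P, Q) = \sqrt{\sum_{i=1}^{n} \left( \frac{|P_i - Q_i|}{P_i + Q_i} \right)^2}

d_{Kumar-Johnson}(P, Q) = \sum_{i=1}^{n} \frac{(P_i^2 - Q_i^2)^2}{2 \, (P_i \, Q_i)^{3/2}}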
…………The T-SQL in Figure 1 is structured almost exactly like my sample code in the last article, except that some of the simple math operations like squaring, subtraction, addition and multiplication are rearranged a little. It also executes in pretty much the same time, around 23 milliseconds, which improves to 3 milliseconds on a third or fourth use; it’s been a while since I re-read the copy of Kalen Delaney’s Microsoft SQL Server 2008 Internals[11] I won at the Rochester PASS conference a couple of years back, but I suspect it has something to do with the query engine caching the execution plan and the underlying data pages after the first few uses. I’m not going to look a gift horse in the mouth, though. The data is ultimately derived from the same two float columns in the same 11-million-row Higgs Boson dataset I’ve used for stress-testing for several tutorial series[12], except that I used the code found in the End Notes of Information Measurement with SQL Server, Part 4.4: Sørensen Distance and Its Relatives to precalculate the probabilities and standardize the raw values along the same 0 to 1 scale used in the Implementing Fuzzy Sets in SQL Server series. That might seem like cheating, but even that takes just 3 seconds on my desktop machine, a time that would vanish to nothing on a true server; as we saw three articles ago, retrieving the raw values from the original table is actually more costly, to the point of taking 23 seconds. The CrossProductCTE calculates some underlying sums and differences on the precalculated probabilities retrieved from the Physics.HiggsBosonProbabilityTable and bubbles them up to the outer SELECT for further refinement. The various types of difference produced in that common table expression (CTE) are summed in the following SELECT, then refined a little more and displayed in the last.
Figure 1: Sample T-SQL for the Squared L2 Distances
DECLARE @EuclideanDistance float,
@SquaredEuclideanDistance float,
@ChiSquareDistance float,
@SquaredChiSquareDistance float,
@PearsonDistance float,
@NeymanDistance float,
@ClarkDistance float,
@AdditiveSymmetricDistance float,
@KumarJohnsonDistance float
DECLARE @Count bigint
SELECT @Count=Count(*)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL
; WITH CrossProductCTE
(Probability1, Probability2, SquaredDifference, AbsoluteDifference, ChiSquareDifference, SquaredChiSquareDifference, PearsonChiSquareDifference,
NeymanChiSquareDifference, ClarkDifference, AdditiveSymmetricDifference, KumarJohnsonInput)
AS
(
SELECT Probability1, Probability2,
Power(DifferenceAmount, 2) AS SquaredDifference,
Abs(DifferenceAmount) AS AbsoluteDifference,
CASE WHEN AdditionResult = 0 THEN 0 ELSE Power(DifferenceAmount, 2) / Power(AdditionResult, 2) END AS ChiSquareDifference, — the same divide-by-zero guard the other terms use
CASE WHEN AdditionResult = 0 THEN 0 ELSE Power(DifferenceAmount, 2) / AdditionResult END AS SquaredChiSquareDifference,
CASE WHEN Probability2 = 0 THEN 0 ELSE Power(DifferenceAmount, 2) / Probability2 END AS PearsonChiSquareDifference,
CASE WHEN Probability1 = 0 THEN 0 ELSE Power(DifferenceAmount, 2) / Probability1 END AS NeymanChiSquareDifference,
CASE WHEN AdditionResult = 0 THEN 0 ELSE Power(Abs(DifferenceAmount) / AdditionResult, 2) END AS ClarkDifference, — each term is squared here; the square root of the SUM is taken in the final SELECT, per the Clark Distance definition
CASE WHEN Probability1 * Probability2 = 0 THEN 0 ELSE (Power(DifferenceAmount, 2) * AdditionResult) / (Probability1 * Probability2) END AS AdditiveSymmetricDifference,
CASE WHEN Power(Probability1 * Probability2, 1.5) = 0 THEN 0 ELSE Power(Power(Probability1, 2) - Power(Probability2, 2), 2) / (2 * Power(Probability1 * Probability2, 1.5)) END AS KumarJohnsonInput
FROM (SELECT Probability1, Probability2, Probability1 - Probability2 AS DifferenceAmount, Probability1 + Probability2 AS AdditionResult
FROM Physics.HiggsBosonProbabilityTable) AS T1
)
SELECT @SquaredEuclideanDistance = Sum(SquaredDifference),
@ChiSquareDistance= SUM(ChiSquareDifference),
@SquaredChiSquareDistance = SUM(SquaredChiSquareDifference),
@PearsonDistance = SUM(PearsonChiSquareDifference),
@NeymanDistance = SUM(NeymanChiSquareDifference),
@ClarkDistance = SUM(ClarkDifference),
@AdditiveSymmetricDistance = SUM(AdditiveSymmetricDifference),
@KumarJohnsonDistance = SUM(KumarJohnsonInput)
FROM CrossProductCTE
SELECT @EuclideanDistance = Power(@SquaredEuclideanDistance, 0.5)
— by separating this into two statements, we can reduce this to a single math operation, instead of having to do the SUM all over again as we would if both measures were computed on the same line
SELECT @EuclideanDistance AS EuclideanDistance,
@EuclideanDistance / CAST(@Count as float) as EuclideanRatio,
@SquaredEuclideanDistance AS SquaredEuclideanDistance,
@SquaredEuclideanDistance / @Count as SquaredEuclideanRatio,
@ChiSquareDistance AS ChiSquareDistance,
@SquaredChiSquareDistance AS SquaredChiSquareDistance,
2 * @SquaredChiSquareDistance AS ProbabilisticSymmetricChiSquareDistance,
2 * @ChiSquareDistance AS Divergence,
@PearsonDistance AS PearsonDistance,
@NeymanDistance AS NeymanDistance,
Power(@ClarkDistance, 0.5) AS ClarkDistance, — the square root of the summed terms completes the Clark Distance formula
@AdditiveSymmetricDistance AS AdditiveSymmetricDistance,
@KumarJohnsonDistance AS KumarJohnsonDistance
Figure 2: Results for the First Two Float Columns in the Higgs Boson Dataset
…………Note how the Squared Euclidean Distance is several orders of magnitude smaller than the Euclidean in Figure 2, whereas it was several orders larger in the tutorial on the Minkowski family. This interesting property of the Squared Euclidean only emerges when we standardize the values from 0 to 1 this way, because squaring a number between 0 and 1 shrinks it rather than magnifies it, which reverses the usual relationship between the two measures. This behavior could be helpful in certain applications, but detrimental in others. Also note how the ChiSquareDistance and Clark Distance return similar results, while the former equals half of the plain “Divergence,” as expected. The rest of the measures fall between 0 and 2 and in some cases might be comparable to each other. As an amateur, I’m hardly aware of all of the intricate relationships between these distances, many of which are still being discovered by theoreticians to this day; I only know that each of them works in our favor in the same way that buttressing bridges with more concrete and steel supports leads to greater stability. Each new connection may require more mental energy, but each also provides new opportunities to validate the associated measures and to “trap” new information. Look at it as weaving a net of metrics that becomes progressively more capable of ensnaring the valuable underlying information as we add new stitches. I also know that it is quite practical to code these things right in SQL Server, without having to rely on buggy open source software that doesn’t scale, or waiting for the analytics marketplace to catch up with the theoreticians, who are decades ahead of the current software. I’m not entirely certain exactly which of these metrics should be the default choice, because we’re on ground that hasn’t been trodden very often before; in fact, you may be the first to perform DIY data mining using these particular tools on your particular types of data. Some experimentation is going to be needed to figure out exactly which measure matches your specific problem set best. In fact, I believe there are mathematical proofs to the effect that no single clustering method can serve most applications, and distance measures are one of the key components of these algorithms. There are at least several hundred distance metrics out in the wild, along with several thousand clustering algorithms, so don’t expect any software company to make them all commercially available anytime soon; even if one succeeded, such a package would probably be ridiculously overpriced in comparison to the cost of coding them yourself in T-SQL or Multidimensional Expressions (MDX).[13] In the next couple of articles, I’ll introduce distances like the Jaccard, Bhattacharyya and Hellinger, which seem to meet a wider range of use cases, given the frequency with which they’re mentioned in the data mining and information theory literature. The Jaccard is of particular interest, in part because it has many associations with the Sørensen Distance from last month’s article, while also overlapping the L2.
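…………As a final aside, the reversal described at the top of this section can be verified with a trivial standalone query that depends on nothing but built-in arithmetic, not the Higgs Boson tables:

— squaring a fraction between 0 and 1 yields a smaller number, while squaring a value above 1 yields a larger one,
— which is why the Squared Euclidean comes out smaller than the Euclidean once the inputs are rescaled to the 0-1 range
SELECT Power(CAST(0.25 AS float), 2) AS SquaredFraction, — returns 0.0625, less than 0.25
Power(CAST(4 AS float), 2) AS SquaredWholeNumber — returns 16, greater than 4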
[1] Sung-Hyuk, Cha, 2007, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” pp. 300-307 in International Journal of Mathematical Models and Methods in Applied Sciences, Vol. 1, No. 4.
[2] Deza, Elena and Deza, Michel Marie, 2006, Dictionary of Distances. Elsevier: Amsterdam.
[3] p. 303, Sung-Hyuk.
[4] p. 159, Pearson, Karl, 1900, “On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling,” pp. 157-172 in Philosophical Magazine, Vol. 50. Available online at the University of Southampton web address http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf
[5] p. 303, Sung-Hyuk.
[6] See the Wikipedia page “Jerzy Neyman” at http://en.wikipedia.org/wiki/Jerzy_Neyman
[7] Neyman, Jerzy, 1949, “Contributions to the Theory of the χ2 Test,” pp. 239-273 in Proceedings of the First Berkeley Symposium on Mathematical Statistics and Probability, Neyman, Jerzy, ed. University of California Press: Berkeley, California. Available online at the University of California, Berkeley web address http://digitalassets.lib.berkeley.edu/math/ucb/text/math_s1_article-14.pdf
[8] See the reply by the user named Zyx at the StackExchange Mathematics thread “Is Symmetric Chi-squared Distance “A” Metric?” on Nov. 29, 2013 at http://math.stackexchange.com/questions/586151/is-symmetric-chi-squared-distance-a-metric
[9] p. 304, Sung-Hyuk.
[10] IBID., pp. 303-304.
[11] Delaney, Kalen, 2009, Microsoft SQL Server 2008 Internals. Microsoft: Redmond, Washington.
[12] This was downloaded long ago from University of California at Irvine’s Machine Learning Repository and converted to a SQL Server table of nearly 6 gigabytes.
[13] I don’t often include Data Analysis Expressions (DAX) in these discussions because I’m just not crazy about it as a language. I think it came along with some Microsoft acquisition and was well-matched to the underlying architecture, but I wish it had been feasible to scrap it in favor of MDX.