By Steve Bolton
…………As I discussed in more depth in the last three articles (not counting last month’s digression into Fisher Information, which is out of numerical order), theoreticians are continually on the lookout for better distance measures, which can be differentiated by their mathematical properties and tailored to meet specific use cases. This is particularly useful in clustering, where certain theorems have established that no single distance metric can possibly be optimal for all scenarios; it is useless to search for a “Holy Grail” of distances that can be applied to them all. That is the motivation that has driven researchers to develop a whole taxonomy of distances, many of which can be implemented quite efficiently in T-SQL for DIY data mining purposes, without having to wait for the analytics software marketplace to catch up with academia. One of the main methods of classification is to separate them by whether or not they possess the properties known as symmetry and subadditivity, i.e. whether they arrive at the same answer when measured from either direction and whether the distance between any two points is always less than or equal to the sum of their distances to any third point (a.k.a. the triangle inequality). Technically speaking, a “metric” is a distance measure that possesses both of these properties (in addition to positivity and positive definiteness), while a true “distance” is missing subadditivity and a divergence lacks symmetry as well.
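…………For quick reference, the four textbook properties in question can be stated compactly as follows; these are the standard definitions rather than anything specific to Sung-Hyuk’s survey, so take my paraphrase with the usual grain of salt:
$d(x,y) \geq 0$ (positivity)
$d(x,y) = 0$ if and only if $x = y$ (positive definiteness)
$d(x,y) = d(y,x)$ (symmetry)
$d(x,z) \leq d(x,y) + d(y,z)$ (subadditivity, a.k.a. the triangle inequality)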
…………Many other properties can emerge from the underlying equations, some of which can be deleterious, as absolute continuity proved to be in Information Measurement with SQL Server, Part 4.1: The Kullback-Leibler Divergence. One of the most readable guides I’ve found on the subject to date is a taxonomy published by Pace University Computer Science Prof. Cha Sung-Hyuk[1], who organizes distance measures for probabilities by the structure of their underlying formulas, which gives rise to eight families of distances. This week, I’ll tackle the L1 family, whose members all incorporate the sums of absolute differences that define Manhattan Distance in their definitions. As I mentioned last time around, Euclidean Distance (the ordinary measure across a straight line we all use every day) would normally be our first choice when matching distance measures to problems, but in certain cases alternatives do a better job – as Manhattan Distance does in measuring distance across grid patterns like city blocks, where straight lines aren’t an option. The use cases for the L1 family are a step beyond Euclidean and Manhattan Distance in sophistication; in fact, the chief difficulty consists in discerning when to match them to particular problems, since calculating them in T-SQL on large datasets turns out to be surprisingly easy.
The L1 Taxonomy
The leading member of the L1 family is Sørensen Distance, which is closely related to a measure of similarity so ubiquitous that it has garnered a dozen nicknames. These include the Sørensen–Dice Coefficient, Dice’s Coefficient, the Sørensen Index, the Dice Index and the Bray-Curtis Similarity, a proliferation that can provoke the same kind of confusion seen in regression, with its unending array of synonyms for response and explanatory variables.[2] Additional confusion can arise over the fact that the underlying formulas change somewhat if we plug sets into them, which a set-based language like T-SQL might be ideal for. In such cases the term “index” is used more frequently, for example in connection with related measures like the Jaccard Index. I’ll be limiting my discussion here to the versions in Sung-Hyuk’s guide, which is technically focused on distances between probabilities, although the same formulas can often be applied to raw data values with little alteration. As he points out, Sørensen Distance is “widely used in ecology,”[3] probably because classification is its main strength. It also has complex interrelationships with other measures we’ll be covering throughout this segment, like the Hellinger, Jaccard and Czekanowski Distances, but this actually works to our advantage, in the same way that scaffolding is strengthened by the addition of more bars. It is also a superset that subsumes five members of the Intersection family I’ll be covering at the end of this segment.[4] Not only does it allow us to cast a more complete net over the information hidden in our databases, but some of the relationships are useful in validation; for example, the accuracy of the Sørensen Distance can be cross-checked against the closely related Czekanowski Similarity, because each is equal to 1 minus the other. On the other hand, one of the drawbacks of Sørensen Distance is that it is not subadditive, and therefore only qualifies as a semimetric, i.e. just one step above divergences in sophistication.
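…………For reference, the version of the formula in Sung-Hyuk’s survey boils down to the following, transcribed in my own notation (so double-check it against the original before relying on it), with the sums running over all $d$ dimensions:
$d_{Sørensen}(P,Q) = \dfrac{\sum_{i=1}^{d} |P_i - Q_i|}{\sum_{i=1}^{d} (P_i + Q_i)}$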
…………Sørensen Distance is basically identical to the Canberra, except that the former calculates sums for both the numerator and denominator before the division operation, while Canberra divides term by term and takes a single sum afterwards. This apparently changes the scale, so that it is not comparable to the other ratios, hence my addition of a CanberraDistanceRatio to my sample T-SQL; I also added a ratio calculation for Kulczynski’s Distance for the same reason. This small change also renders Canberra Distance more “sensitive to small changes near zero,” which could be desirable in some applications but detrimental in others.[5] “It is a weighted version of L1 (Manhattan) distance… used as a metric for comparing ranked lists and for intrusion detection in computer security.”[6] I seem to see more cautions with regard to the Kulczynski Distance, which requires standardization and is not a true metric.[7] It is among several distance measures, like the Jaccard and Sørensen, that have corresponding indexes based on intersections and unions of sets rather than differences and sums, but are not identical to them; I may discuss these indexes separately at a later date. The Kulczynski Similarity is simply the reciprocal of its distance counterpart, which is how I derive it in the sample T-SQL below; more often, analysts switch back and forth between dissimilarity and similarity measures by subtracting either one from 1 to produce the other, a trick that greatly simplifies the confusing jumble of distance measures in a way that even a high school math student can follow. The difference is really quite intuitive: dissimilarities and similarities measure the same things in opposite directions, so that two objects are judged to be further apart as a similarity declines but closer together if a dissimilarity does the same.
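…………The contrast is easier to see with the formulas side by side; again, this is my own transcription from the survey rather than gospel:
$d_{Canberra}(P,Q) = \sum_{i=1}^{d} \dfrac{|P_i - Q_i|}{P_i + Q_i}$ (each term is divided before the sum is taken)
$d_{Kulczynski}(P,Q) = \dfrac{\sum_{i=1}^{d} |P_i - Q_i|}{\sum_{i=1}^{d} \min(P_i, Q_i)}$, with the Kulczynski Similarity equal to its reciprocal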
…………The Gower Similarity has been used in hierarchical clustering and principal coordinate analysis since the ‘60s and “has been found sufficiently flexible to cope with nearly all forms of character coding so far encountered, and unlike many coefficients currently in use does not require any recoding for multistate or quantitative characters.”[8] When there are no null values, it exhibits positive semi-definiteness (a more flexible variant of positive definiteness in which the defining inequalities are allowed to hold with equality, i.e. greater-than-or-equal rather than strictly greater-than) that is useful in translating it to Euclidean Distance.[9] Soergel Distance is the same as the Kulczynski, except that we take the MAX of each pair of values in the denominator rather than the MIN; this small change is apparently enough to turn it into a true metric (meaning it possesses all four metric properties, including subadditivity and symmetry) that has applications in chemistry.[10] Lorentzian Distance has properties that are useful in hyperdimensional structures like Riemannian manifolds, which are used in both quantum physics (a topic not often encountered by SQL Server DBAs) and information geometry (a topic SQL Server data miners are more likely to hear of in the future). It’s a topic I’ve run across before, but when trying to brush up on it in a casual Internet search, I found that it has weird relationships to causality and non-commutative dimensions.
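…………The remaining members of the family follow the same pattern of simple operations on absolute differences; as I’ve transcribed them from the survey (with $d$ standing for the number of dimensions), they are:
$d_{Gower}(P,Q) = \dfrac{1}{d} \sum_{i=1}^{d} |P_i - Q_i|$
$d_{Soergel}(P,Q) = \dfrac{\sum_{i=1}^{d} |P_i - Q_i|}{\sum_{i=1}^{d} \max(P_i, Q_i)}$
$d_{Lorentzian}(P,Q) = \sum_{i=1}^{d} \ln(1 + |P_i - Q_i|)$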
Surprising SQL Speed on the Sørensen, Soergel and Similar Measures
Like all of the families we’ll cover in this segment of the Information Measurement series, the difficulty with the L1 consists mainly in discerning these properties and matching them to the right problems, which takes some hard thinking. The calculations are a breeze in comparison. I almost always omit the underlying equations to avoid burdening end users, but the gist ought to be self-evident in the sample T-SQL in Figure 1, which is structured much like the query in the last amateur tutorial on information distances. The CrossProductCTE calculates a couple of simple differences and sums, then performs some simple tweaks to them, like squaring, ABS and MAX operations, then feeds them to the first SELECT, where simple aggregates like SUMs are taken over the results and a few cosmetic math operations are applied. The key difference from the tutorial on Euclidean Distance is that this time around, we’re retrieving probabilities calculated on the actual proportions of the same two float columns in the same Higgs Boson dataset,[11] rather than the raw values in the original tables. This allowed me to cut out the first common table expression (CTE) used to retrieve the raw values directly from the HiggsBosonTable.
…………Yet theoretically, this ought to be a lot more expensive, if we factor in the scaling method used in the Implementing Fuzzy Sets in SQL Server tutorial series plus the derivation of the probabilities, all of which requires a whole new layer of simple math operations on top of the raw values. The results in Figure 2 took a mere 23 milliseconds to execute on Physics.HiggsBosonProbabilityTable, compared to about 23 seconds for the simpler measures in the last tutorial, but not all of the savings were due to the precalculation of the aggregates; the real shocker is that the code used to standardize all of the values on the same 0 to 1 scale, plus derive the probabilities, took only 3 seconds. I’ve buried the Physics.HiggsBosonProbabilityTable table definition[12] and the MERGE query to populate it[13] in the end notes to avoid distracting from the main topic of this article, but somehow they managed to shave about 20 seconds off of the execution time compared to last week’s sample, even though they perform a whole new layer of calculations on precisely the same data. I’m not given to looking a gift horse in the mouth, but I haven’t yet managed to isolate the source of this serendipity. Strangely, Figure 1 represents the first time I’ve ever seen an execution plan where 99 percent of the costs were accounted for in the retrieval of a simple Count(*). Ninety percent of the other query in the batch is taken care of in a single scan on a non-clustered index I put on the probability table, which really isn’t many milliseconds faster than a scan of its clustered index. I could also retrieve this “costly” @Count by taking a SUM over either of the ValueCount columns in the Physics.HiggsBosonProbabilityTable, but I didn’t want to complicate the lesson further; nor do I think it would improve things much, given that on the third or fourth use, this query’s execution time was reduced to a meager 3 milliseconds. This improvement on a third or fourth use occurred consistently in the sample code used for the next four distance tutorials, which get progressively more sophisticated yet still execute in just 3 milliseconds. This was done on a poor beat-up desktop machine, so imagine how blazingly fast it would be to calculate these distances together on a real server, even on billion-row Big Data tables.
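…………For the curious, a minimal sketch of that alternative retrieval of @Count might look like the snippet below; the variable name is just for illustration, it assumes ValueCount1 was populated for every non-null row (as in the MERGE code in the endnotes), and I haven’t benchmarked it, so treat it as a sketch rather than a tuning recommendation.

-- hypothetical alternative: derive the row count from the precalculated ValueCount1 column
-- instead of rescanning the much larger Physics.HiggsBosonTable
DECLARE @CountFromProbabilityTable bigint
SELECT @CountFromProbabilityTable = SUM(ValueCount1)
FROM Physics.HiggsBosonProbabilityTable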
Figure 1: Sample T-SQL for the L1 Family of Distances
DECLARE @SquaredEuclideanDistance float,
@SørensenDistance float,
@GowerDistance float,
@CanberraDistance float,
@KulczynskiDistance float,
@SoergelDistance float,
@LorentzianDistance float
DECLARE @Count bigint
SELECT @Count=Count(*)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL
-- CALCULATE THE DIFFERENCES FIRST
; WITH CrossProductCTE
(Probability1, Probability2, SquaredDifference, AbsoluteDifference, LesserOfTwoValues, GreaterOfTwoValues, SumAmount)
AS (
SELECT Probability1, Probability2,
Power(DifferenceAmount, 2) AS SquaredDifference,
Abs(DifferenceAmount) AS AbsoluteDifference,
CASE WHEN Probability1 <= Probability2 THEN Probability1 ELSE Probability2 END AS LesserOfTwoValues,
CASE WHEN Probability1 <= Probability2 THEN Probability2 ELSE Probability1 END AS GreaterOfTwoValues, SumAmount
FROM (SELECT Probability1, Probability2, Probability1 - Probability2 AS DifferenceAmount,
Probability1 + Probability2 AS SumAmount
FROM Physics.HiggsBosonProbabilityTable) AS T1
), MaxAbsoluteDifferenceCTE
(MaxAbsoluteDifference)
AS
(
SELECT Max(AbsoluteDifference) AS MaxAbsoluteDifference
FROM CrossProductCTE
)
-- PLUG THE DIFFERENCES INTO VARIOUS SIMPLE AGGREGATES
SELECT @SquaredEuclideanDistance = Sum(SquaredDifference),
@SørensenDistance = SUM(AbsoluteDifference) / SUM(SumAmount),
@CanberraDistance = SUM(AbsoluteDifference/ SumAmount),
@GowerDistance = SUM(AbsoluteDifference) / @Count,
@KulczynskiDistance = SUM(AbsoluteDifference) / SUM(LesserOfTwoValues),
@SoergelDistance = SUM(AbsoluteDifference) / SUM(GreaterOfTwoValues),
@LorentzianDistance = SUM(Log(1 + AbsoluteDifference)) -- the log is applied to each absolute difference before summing, per the survey's formula
FROM CrossProductCTE AS T1
INNER JOIN (SELECT MaxAbsoluteDifference FROM MaxAbsoluteDifferenceCTE) AS T2
ON 1 = 1
GROUP BY MaxAbsoluteDifference
SELECT Power(@SquaredEuclideanDistance, 0.5) AS EuclideanDistance,
@SquaredEuclideanDistance AS SquaredEuclideanDistance,
@SørensenDistance AS SørensenDistance,
@CanberraDistance AS CanberraDistance,
@CanberraDistance / CAST(@Count as float) AS CanberraDistanceRatio,
@GowerDistance AS GowerDistance,
@KulczynskiDistance AS KulczynskiDistance,
1 / @KulczynskiDistance AS KulczynskiSimilarity,
@KulczynskiDistance / CAST(@Count as float) AS KulczynskiDistanceRatio,
@SoergelDistance AS SoergelDistance,
@LorentzianDistance AS LorentzianDistance
Figure 2: Results from the Higgs Boson Dataset
…………This performance represents a powerful illustration of one of the main points I want to get across in this series: many of these information metrics can be coded quite efficiently in T-SQL, without having to rely on analytics vendors to code them for us and charge an arm and a leg for the privilege. In some respects, the analytics marketplace is decades behind the research, which has such a large backlog of algorithms, stats and indexes that no single company will be able to implement them all for a long time to come, let alone corner the market. At some point, organizations that can bridge this huge gulf by coding their own metrics on the fly will enjoy certain competitive advantages over their rivals, at least in certain niches of analytics. As I discussed in the occasional Integrating Other Data Mining Tools with SQL Server series, open source and third-party products can bridge some of the gap, but they generally don’t scale well at all – unlike T-SQL and other set-based languages like Multidimensional Expressions (MDX), which may be the most efficient way to perform certain analytics tasks on Big Data-sized sets.
…………I’m writing amateur tutorials like this in order to acquire the skills to translate the underlying formulas into these SQL Server languages, which is why I always recommend that my sample code be rechecked for inevitable mistakes. Yet along the way, I’ve discovered that the real challenge consists of proper interpretation, even among brilliant analysts who have been putting data science and statistics into practice their whole lives. Many of the properties of these metrics are unknown or at least not fully understood, even among older ones like the Kulczynski and Jaccard Distances. I’ve mentioned some of the implementations and applications associated with the L1 family, but some groping through trial-and-error will be called for to ascertain which are ideal for your particular use cases. Both the questions we want to ask through analytics tools and their answers might be unique to our particular data and whatever drives it, whether it is customer preferences, online ordering of niche products, manufacturing processes or any SQL Server uses in between. If we already knew the answers to the questions we want to ask, we wouldn’t need data science, which by definition entails venturing into the unknown. The good news is that calculating these distance measures on large databases is shockingly cheap and might be made cheaper still, through the addition of columnstore indexes, expert tuning and parallel replicas. More good tidings can be found in the fact that all of the complex interrelationships can be leveraged to our advantage for validation and eliciting new information, even though they might seem confusing at first. Look at it this way: a toolbox stocked with rulers, tape measures and 3D laser distance meters will probably prove much more useful in building a house than gauging everything by thumb. Sometimes the uses of such tools are not clear until we’ve gained some practice in applying them to novel problems. In the next installment, we’ll add the L2 family to our toolbelt, which isn’t much more difficult to explain than the L1, in large part because its underlying formulas are built upon ordinary Euclidean Distance in much the same way the L1’s are built upon Manhattan Distance.
[1] Sung-Hyuk, Cha, 2007, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” pp. 300-307 in International Journal of Mathematical Models and Methods in Applied Sciences, Vol. 1, No. 4. My second-place favorite is Deza, Elena and Deza, Michel Marie, 2006, Dictionary of Distances. Elsevier: Amsterdam.
[2] For a list, see the Wikipedia article “Sørensen–Dice Coefficient” at http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
[3] IBID., p. 301.
[4] IBID.
[5] IBID.
[6] See the Wikipedia entry “Canberra Distance” at http://en.wikipedia.org/wiki/Canberra_distance
[7] pp. 48-49, McCune, Bruce; Grace, James B. and Urban, Dean L., 2002, Analysis of Ecological Communities. MjM Software Design: Gleneden Beach, Oregon.
[8] p. 865, Gower, J. C., 1971, “A General Coefficient of Similarity and Some of Its Properties,” pp. 857-871 in Biometrics, December, 1971. Vol. 27, No. 4. Available online at the ResearchGate web address http://www.researchgate.net/profile/John_Gower2/publication/232128574_A_General_Coefficient_of_Similarity_and_Some_of_Its_Properties/links/0c960524e95bedf928000000.pdf
[9] IBID., p. 860.
[10] p. 71, Jalal-Kamali, Ali; Shahriar Hossain, M. and Kreinovich, Vladik, 2014, “How to Understand Connections Based on Big Data: From Cliques to Flexible Granules,” pp. 63-88 in Information Granularity, Big Data, and Computational Intelligence, Vol. 8. Pedrycz, Witold and Chen, Shyi-Ming, eds. Springer International Publishing: Cham, Switzerland.
[11] Which I downloaded several tutorial series ago from the University of California at Irvine’s Machine Learning Repository and converted to a SQL Server table, which now takes up close to 6 gigs of space in a sham DataMiningProjects database.
[12] CREATE TABLE [Physics].[HiggsBosonProbabilityTable](
[StandardizedValue] [decimal](38, 37) NOT NULL,
[Value1] [decimal](33, 29) NULL,
[Value2] [decimal](33, 29) NULL,
[ValueCount1] [bigint] NULL,
[ValueCount2] [bigint] NULL,
[Probability1] [decimal](38, 37) NULL,
[Probability2] [decimal](38, 37) NULL,
CONSTRAINT [PK_HiggsBosonTable2] PRIMARY KEY CLUSTERED ([StandardizedValue] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [Physics].[HiggsBosonProbabilityTable]
ADD CONSTRAINT [DF_HiggsBosonProbabilityTable_ValueCount1] DEFAULT ((0)) FOR [ValueCount1]
GO
ALTER TABLE [Physics].[HiggsBosonProbabilityTable]
ADD CONSTRAINT [DF_HiggsBosonProbabilityTable_ValueCount2] DEFAULT ((0)) FOR [ValueCount2]
GO
ALTER TABLE [Physics].[HiggsBosonProbabilityTable]
ADD CONSTRAINT [DF_HiggsBosonProbabilityTable_Probability1] DEFAULT ((0)) FOR [Probability1]
GO
ALTER TABLE [Physics].[HiggsBosonProbabilityTable]
ADD CONSTRAINT [DF_HiggsBosonProbabilityTable_Probability2] DEFAULT ((0)) FOR [Probability2]
GO
[13] DECLARE @Count bigint, @Max1 float, @Min1 float, @Max2 float, @Min2 float
SELECT @Count = Count(*), @Max1 = Max(Column1), @Min1 = Min(Column1),
@Max2 = Max(Column2), @Min2 = Min(Column2)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL
-- the outer selects are just so we can calculate the probabilities without having to make another trip across the table in an UPDATE
-- since we're using a global @Count variable, a computed column won't fly
INSERT INTO Physics.HiggsBosonProbabilityTable --@ProbabilityTable
(StandardizedValue, Value1, ValueCount1, Probability1)
SELECT StandardizedValue, Value1, ValueCount1, ValueCount1 / CAST(@Count AS float) AS Proportion
FROM (SELECT CAST((Column1 - @Min1) / (@Max1 - @Min1) AS decimal(38,37)) AS StandardizedValue, Column1 AS Value1, Count(*) AS ValueCount1
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL
GROUP BY Column1) AS T1
MERGE Physics.HiggsBosonProbabilityTable AS T1 --@ProbabilityTable AS T1
USING (SELECT StandardizedValue, Value2, ValueCount2, ValueCount2 / CAST(@Count AS float) AS Proportion2
FROM (SELECT CAST((Column2 - @Min2) / (@Max2 - @Min2) AS decimal(38,37)) AS StandardizedValue, Column2 AS Value2, Count(*) AS ValueCount2
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL
GROUP BY Column2) AS T0) AS T2
ON T1.StandardizedValue = T2.StandardizedValue
WHEN MATCHED THEN UPDATE
SET T1.Value2 = T2.Value2, ValueCount2 = T2.ValueCount2, Probability2 = T2.Proportion2
WHEN NOT MATCHED BY TARGET THEN INSERT (StandardizedValue, Value2, ValueCount2, Probability2)
VALUES (T2.StandardizedValue, T2.Value2, T2.ValueCount2, T2.Proportion2);
-- this table definition is unusual in that I do not use a bigint ID, since the StandardizedValue does the trick
SELECT *
FROM Physics.HiggsBosonProbabilityTable --@ProbabilityTable
ORDER BY StandardizedValue