by Steve Bolton
…………In the last installment of this amateur series of self-tutorials on DIY data mining metrics, we saw how a mathematical property known as absolute continuity hampered the Kullback-Leibler Divergence. It is still perhaps the most widely mentioned information distance measure based on Shannon’s Entropy, but a search has been under way since its development in 1951 to find alternatives that have less restrictive properties. In this article I’ll cover some of the most noteworthy alternatives, which are often used for mind-blowing applications like quantum physics, but which can easily be ported for ordinary day-to-day data mining uses in SQL Server databases; in fact, one of the main points I want to get across in this series is that it doesn’t take a rocket scientist to harness these Space Age information measures for more down-to-earth purposes. The key is to match the right metrics to the right use cases, a decision that is often strongly influenced by their mathematical properties.
…………As we saw in the last article, absolute continuity means that we can’t use the KL Divergence to measure the similarity between distributions in cases where one value has a probability of 0 and its counterpart does not. This is often the culprit when the KL Divergence and its relatives don’t intersect with other distance measures at the expected points, or stay within the expected bounds, which also constitute properties in their own right. Towards the end of that article, I also pointed out that divergences are technically differentiated from distances by the fact that they’re not required to have the usually desirable property of symmetry, whereas true metrics possess symmetry in addition to subadditivity. The motivation behind the most widely known metric I’ll cover today is to calculate a variant of the KL Divergence that possesses symmetry, which means we can measure it in either direction and get the same result. Imagine how cumbersome it would be to take a few tables of projected sales for various products and use the KL Divergence to measure the information gain based on the actual results, to discern which model would fit best – only to discover that the results depended on which direction we made the comparison in. The Jensen-Shannon Divergence also “provides both the lower and upper bounds to the Bayes probability of error” and can be weighted in ways that are useful in Decision Theory.[1] It is also used in assessing random graphs and electron orbits, among others.[2]
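At the risk of breaking my usual habit of avoiding equations, the two properties at stake can be stated in a couple of lines; this is my own loose notation, not anything standardized:

$$D(P, Q) = D(Q, P) \quad \text{(symmetry)}$$
$$D(P, R) \leq D(P, Q) + D(Q, R) \quad \text{(subadditivity, i.e. the triangle inequality)}$$

The KL Divergence, $D_{KL}(P \| Q) = \sum_i p_i \log (p_i / q_i)$, generally fails the first test, which is precisely the gap a symmetric variant like the Jensen-Shannon Divergence is meant to fill.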
The Jeffreys Divergence and Other Jensen-Shannon Kin
The primary motivation for development of the KL’s contemporaneous cousin, the Jeffreys Divergence[3], was to measure Bayesian a priori probabilities in a way that exhibited the property of invariance.[4] This means that it remains consistent when the underlying scale is stretched or shifted, in the same way that ordinary arithmetic operations like multiplying by threes or adding twos behave predictably; obviously, it would be awkward to compare projections of things like customer satisfaction across data models if they were warped like carnival mirrors by ordinary math operations. The Jeffreys Divergence is a "special case" of the measure I discussed in Outlier Detection with SQL Server, part 8: Mahalanobis Distance, which is merely one example of the multifarious interrelationships such measures exhibit. Don’t let this put you off though: the more we can connect them together, the easier it is to construct a web of sorts out of them, to ensnare the actionable information hidden in our tables and cubes.
…………I’ll only mention a handful of the boundaries and intersections these measures must obey, such as the fact that the Jensen-Shannon Divergence is bounded by 1 when bits are used as the units (i.e. a logarithm base of 2). The K-Divergence is bounded by the Kullback-Leibler Divergence and a simpler measure known as the Variational Distance, which is just the absolute difference between the two probabilities. It exhibits a different slope, which fits within the concentric curves of relatives like the KL Divergence, as depicted in a handy graph in the original paper.[5] One of Jianhua Lin’s primary motivations was to devise a measure like the KL that didn’t require absolute continuity, thereby allowing us to make comparisons of information distance between models that previously weren’t valid[6]. The Topsøe Distance is just a symmetric form of the K-Divergence, whereas the Jensen Difference leverages the concavity of the Jensen-Shannon Divergence, meaning that on a graph, it would bow outwards towards more extreme values rather than inwards towards 0. There is a whole smorgasbord of information metrics out there in the wild, exhibiting all kinds of useful properties that can be easily leveraged for productive purposes on SQL Server tables and cubes.[7] I’m introducing myself to the topic and merely passing along my misadventures to readers along the way, but it is plainly clear from sifting through the literature that there is a yawning gap between the research and the potential applications in real-world situations. Not only is this low-hanging fruit going unpicked, but I strongly suspect that set-based languages like T-SQL are the ideal tools to harvest it with. I cannot possibly code all of the available metrics or even adequately explain the fine points of validating and interpreting them, but I will consider this segment a success if I can at least convey the idea that they can and probably ought to be implemented in SQL Server on a DIY basis. Don’t let the arcane formulas and jargon fool you: it takes very little code and very few server resources to calculate many of these entropic measures. In fact, it is a pleasant surprise to discover that if you calculate one of them across a large table, it costs very little to calculate the rest of them in the same pass.
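A few of these relationships are compact enough to restate in symbols; these are merely my own transcriptions of the bounds and identities just described, so double-check them before leaning on them:

$$0 \leq D_{JS}(P, Q) \leq 1 \quad \text{(when the logarithm base is 2, i.e. bits)}$$
$$D_{Top}(P, Q) = D_{K}(P \| Q) + D_{K}(Q \| P) = 2\,D_{JS}(P, Q)$$
$$V(P, Q) = \sum_i \lvert p_i - q_i \rvert \quad \text{(the Variational Distance)}$$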
Looking Past the Mathematical Formulas
The Jeffreys Divergence is the simplest of them, in that we merely have to replace the multiplicand (the first term in a multiplication operation) with the difference between the two probabilities. In the K-Divergence, we instead replace the term inside the KL Divergence’s LOG operation with two times the first probability divided by the sum of the two probabilities. The Topsøe is similar, in that we merely take the same input to the K-Divergence and add it to the result we’d get if we swapped the positions of the two probabilities in the same K-Divergence formula. The Jensen-Shannon Divergence is almost identical to the Topsøe, except that we SUM over the two sets of inputs separately before adding them together, then halve the result. The Jensen Difference is the most complex of the lot, in the sense that it requires precalculating the Shannon’s Entropy terms for both probabilities and adding them together, then halving the result and subtracting the half-sum of the two probabilities times the LOG of that same half-sum. I always recommend checking my formulas over before putting them into a production environment, given that I’m an amateur learning as I go; I highly recommend Cha Sung-Hyuk’s excellent taxonomy of distance measures, which became my go-to source for the equations.[8]
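For anyone who wants to check those verbal descriptions against symbols, here is how I would transcribe the five formulas from Sung-Hyuk’s survey, with P standing for the actual proportions and Q for the Gaussian probabilities; treat my transcription with the same caution as my code:

$$D_{J}(P, Q) = \sum_i (p_i - q_i) \log \frac{p_i}{q_i} \quad \text{(Jeffreys)}$$
$$D_{K}(P \| Q) = \sum_i p_i \log \frac{2 p_i}{p_i + q_i} \quad \text{(K-Divergence)}$$
$$D_{Top}(P, Q) = \sum_i \left( p_i \log \frac{2 p_i}{p_i + q_i} + q_i \log \frac{2 q_i}{p_i + q_i} \right) \quad \text{(Topsøe)}$$
$$D_{JS}(P, Q) = \tfrac{1}{2} D_{Top}(P, Q) \quad \text{(Jensen-Shannon)}$$
$$D_{JD}(P, Q) = \sum_i \left( \frac{p_i \log p_i + q_i \log q_i}{2} - \frac{p_i + q_i}{2} \log \frac{p_i + q_i}{2} \right) \quad \text{(Jensen Difference)}$$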
…………As explained in previous articles, I rarely post equations, in order to avoid taxing either the brains of readers or my own any more than I have to, on the grounds that they’re certainly indispensable for helping mathematicians communicate with each other but often a hindrance to non-specialists. There is a crying need in these fields for middlemen to translate such formulas into code with as little mathematical jargon and as many user interface affordances as possible, so that end users can apply them to real-world problems; as I’ve pointed out often, it doesn’t require a PhD in algorithm design to use data mining tools, any more than a doctorate in automotive engineering is required to ride a Harley Davidson. It is preferable to have some understanding of these fields, for the same reasons that the bikers at rallies like Sturgis normally have some knowledge of how to repair their motorcycles. There is very little middle ground these days in the analytics marketplace, which is so far behind the research that it will take decades to bridge the gap – hence the need for DIY data mining skills of the kind I’m trying to acquire and pass on in series like this. These are precisely the same reasons why I usually dispense with proofs and theorems, which I typically assume to be proven; if they were not acceptable for peer-reviewed journals, there would be a greater uproar than you normally see in fields like psychology and history where controversy is a given. Suffice it to say that these differences in the internal calculations serve the purpose of deriving the mathematical properties that make these divergences useful for particular use cases, like symmetry and concavity. The one thing we can never dispense with in data mining is proper interpretation. This can be tricky, since it revolves around mind-blowing issues like “What is randomness?” or “What is the meaning of ‘order’?” It is much easier to detect errors in calculations like the ones in Figure 1 than to deal with these issues, which have humbled much smarter men than us and on occasion, have driven some of the “great minds” of academia over the edge of madness. Beyond exercising common sense, the key with this particular family of distances is that they’re based on the entropy measures introduced earlier in this series. This means that they measure a particular kind of information, which might best be termed as “newsworthiness.” Information entropy is also ultimately derived from probability theory, which stems from combinatorics. When interpreting the Jeffreys Divergence in terms of Bayesian prior and posterior probabilities, it is not unwise to think through some of the logical implications, which as I noted in Information Measurement with SQL Server, Part 3: Inverse Probability and Bayes Factors, have led to serious turf wars among statisticians. Cranking out the numbers isn’t particularly difficult, especially with the aid of set-based languages like T-SQL; assigning precisely the right meaning to the symbols is where data mining and related fields often go awry.
Figure 1: Sample T-SQL Code for the Jensen-Shannon Divergence and Related Entropic Distances
DECLARE @LogarithmBase decimal(38,36)
SET @LogarithmBase = 2 -- 2.7182818284590452353602874713526624977 -- 10
DECLARE @ShannonsEntropy1 float, @ShannonsEntropy2 float, @JensenShannonDivergence float, @JensenDifference float, @JeffreysDivergence float, @ReverseKullbackLeiblerDivergence float, @KDivergence float, @TopsoeDistance float, @IsAbsolutelyContinuous bit
DECLARE @Count bigint, @Mean float, @StDev float
SELECT @Count=Count(*), @Mean = Avg(Column2), @StDev= 0.61 --, @StDev= StDev(Column2)
FROM Physics.HiggsBosonTable
WHERE Column2 IS NOT NULL
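-- table variable to hold one row per distinct value of Column2, with the two probabilities and the per-row inputs for each entropic measure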
DECLARE @EntropyTable table
(Value decimal(33,29),
ActualProportion decimal(38,37),
GaussianProbability decimal(38,37),
InputForEntropy1 float,
InputForEntropy2 float,
InputForKDivergence1 float,
InputForKDivergence2 float,
InputForJensenDifference float,
InputForJeffreys float,
InputForReverseKL float,
InputForTopsoe float
)
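-- populate the table variable; the innermost subqueries derive the two probability distributions, while the outer levels calculate the per-row inputs that will later be summed into each divergence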
INSERT INTO @EntropyTable
(Value, ActualProportion, GaussianProbability, InputForEntropy1, InputForEntropy2, InputForKDivergence1, InputForKDivergence2, InputForJensenDifference, InputForJeffreys, InputForReverseKL, InputForTopsoe)
SELECT Value, ActualProportion, GaussianProbability,InputForEntropy1, InputForEntropy2, InputForKDivergence1, InputForKDivergence2,
((InputForEntropy1 + InputForEntropy2) / CAST(2 AS float)) - CASE WHEN ProbabilitySumDividedBY2 <=0 THEN 0 ELSE (ProbabilitySumDividedBY2 * Log(ProbabilitySumDividedBY2, @LogarithmBase)) END AS InputForJensenDifference,
InputForJeffreys, InputForReverseKL, InputForKDivergence1 + InputForKDivergence2 AS InputForTopsoe
FROM (SELECT Value, ActualProportion, GaussianProbability,
ActualProportion * Log(ActualProportion, @LogarithmBase) AS InputForEntropy1,
CASE WHEN GaussianProbability = 0 THEN 0 ELSE GaussianProbability * Log(GaussianProbability, @LogarithmBase) END AS InputForEntropy2,
CASE WHEN DivisionOperationOnProbability1 <=0 THEN 0 ELSE ActualProportion * Log(DivisionOperationOnProbability1, @LogarithmBase) END AS InputForKDivergence1,
CASE WHEN DivisionOperationOnProbability2 <=0 THEN 0 ELSE GaussianProbability * Log(DivisionOperationOnProbability2, @LogarithmBase) END AS InputForKDivergence2,
CASE WHEN GaussianProbability <=0 THEN 0 ELSE ActualProportion * Log(GaussianProbability, @LogarithmBase) END AS InputForJensenDifference,
CASE WHEN GaussianProbability <=0 THEN 0 ELSE ProbabilityDifference * Log(ActualProportion / GaussianProbability, @LogarithmBase) END AS InputForJeffreys,
CASE WHEN ActualProportion <=0 OR GaussianProbability <= 0 THEN 0 ELSE GaussianProbability * Log(GaussianProbability / ActualProportion, @LogarithmBase) END AS InputForReverseKL,
(ProbabilitySum / CAST(2 AS float)) AS ProbabilitySumDividedBY2
FROM (SELECT Value, ActualProportion, GaussianProbability, ProbabilityDifference,
(2 * ActualProportion) / CAST(ProbabilitySum AS float) AS DivisionOperationOnProbability1,
(2 * GaussianProbability) / CAST(ProbabilitySum AS float) AS DivisionOperationOnProbability2, ProbabilitySum
FROM (SELECT Value, ActualProportion, GaussianProbability, ActualProportion - GaussianProbability AS ProbabilityDifference, ActualProportion + GaussianProbability AS ProbabilitySum
FROM (SELECT Value, ActualProportion, Calculations.NormalDistributionCDFFindProbabilityFunction(ActualValueLag, Value, @Mean, @StDev) AS GaussianProbability
FROM (SELECT Value, ActualProportion, Lag(Value, 1, Value - 0.1) OVER (ORDER BY Value ASC) AS ActualValueLag
FROM (SELECT Value, ValueCount / CAST(@Count as float) AS ActualProportion
FROM (SELECT Column2 AS Value, Count(*) AS ValueCount
FROM Physics.HiggsBosonTable
GROUP BY Column2) AS T1) AS T2) AS T3) AS T4) AS T5) AS T6) AS T7
-- VALIDATION - uncomment to debug
--SELECT SUM(CAST(ActualProportion AS float)), SUM(CAST(GaussianProbability AS float))
--FROM @EntropyTable
--SELECT * FROM @EntropyTable
-- if this count is not 0, then the distributions are not absolutely continuous and the KL Divergence is not defined
SELECT @IsAbsolutelyContinuous = CASE WHEN Count(*) > 0 THEN 0 ELSE 1 END
FROM @EntropyTable
WHERE GaussianProbability = 0
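-- the divergences themselves are just aggregates over the per-row inputs; the two entropies are negated sums, per Shannon's formula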
SELECT @ShannonsEntropy1 = SUM(InputForEntropy1) * -1,
@ShannonsEntropy2 = SUM(InputForEntropy2) * -1,
@JensenShannonDivergence = 0.5 * (SUM(InputForKDivergence1) + SUM(InputForKDivergence2)),
@JensenDifference = ABS(SUM(InputForJensenDifference)),
@JeffreysDivergence = Sum(InputForJeffreys),
@ReverseKullbackLeiblerDivergence = Sum(InputForReverseKL) * -1,
@KDivergence= SUM(InputForKDivergence1),
@TopsoeDistance = SUM(InputForTopsoe)
FROM @EntropyTable
SELECT @IsAbsolutelyContinuous AS IsAbsolutelyContinuous,
@ShannonsEntropy1 AS ShannonsEntropy1,@ShannonsEntropy2 AS ShannonsEntropy2,
@JeffreysDivergence AS JeffreysDivergence,
@ReverseKullbackLeiblerDivergence AS ReverseKullbackLeiblerDivergence,
@KDivergence AS KDivergence,
@TopsoeDistance AS TopsoeDistance,
@JensenShannonDivergence AS JensenShannonDivergence,
@JensenDifference AS JensenDifference
Figure 2: Results from the First Two Float Columns of the Higgs Boson Dataset
…………Figure 1 is lengthy because I used descriptive variable names and went out of my way to demonstrate how each measure is calculated instead of resorting to shortcuts, but it’s not particularly hard to follow. All I did was take the code from the last tutorial and tack on the declarations for the new variables, plus new summation inputs to the @EntropyTable, which is where most of the calculations take place. There are also a couple of extra SELECTs in the INSERT, in order to bubble up certain intermediate values that are plugged into some of the formulas at different points. Once again, I’m deriving the first probability from the actual proportions from the same Higgs Boson dataset I’ve been using for practice purposes for the last few tutorial series,[9] then using the mean, standard deviation and values of the same float column to derive an approximation based on the normal distribution. For the sake of brevity, the function I used to derive the second probabilities is not depicted, but I can post it if anyone needs it; we could derive probabilities from all sorts of other distribution formulas and other techniques like likelihood estimation and random number generation, but this suffices to get the point across.
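For readers who would like to experiment without waiting for me to post my version, the sketch below illustrates one way such a function might be written; it merely returns the approximate normal probability mass between the lagged value and the current value, using the venerable Abramowitz & Stegun polynomial approximation of the error function. Keep in mind that this is just an illustration of the general idea, not the actual function used to generate Figure 2, so validate it before trusting the numbers it returns (the Calculations schema is assumed to already exist):

CREATE FUNCTION Calculations.NormalDistributionCDFFindProbabilityFunction
(@LowerValue float, @UpperValue float, @Mean float, @StDev float)
RETURNS float
AS
BEGIN
-- hypothetical stand-in for the probability function referenced in Figure 1; returns the
-- approximate probability mass a normal distribution assigns to the interval between
-- @LowerValue and @UpperValue, i.e. CDF(@UpperValue) - CDF(@LowerValue)
DECLARE @Z1 float = (@LowerValue - @Mean) / @StDev, @Z2 float = (@UpperValue - @Mean) / @StDev
DECLARE @CDF1 float, @CDF2 float, @X float, @T float, @Erf float, @Sign float
-- Abramowitz & Stegun 7.1.26 approximation of the error function, applied to the first z-score
SET @Sign = CASE WHEN @Z1 < 0 THEN -1 ELSE 1 END
SET @X = ABS(@Z1) / SQRT(2)
SET @T = 1 / (1 + (0.3275911 * @X))
SET @Erf = 1 - (((((1.061405429 * @T - 1.453152027) * @T) + 1.421413741) * @T - 0.284496736) * @T + 0.254829592) * @T * EXP(-1 * @X * @X)
SET @CDF1 = 0.5 * (1 + (@Sign * @Erf))
-- same approximation applied to the second z-score
SET @Sign = CASE WHEN @Z2 < 0 THEN -1 ELSE 1 END
SET @X = ABS(@Z2) / SQRT(2)
SET @T = 1 / (1 + (0.3275911 * @X))
SET @Erf = 1 - (((((1.061405429 * @T - 1.453152027) * @T) + 1.421413741) * @T - 0.284496736) * @T + 0.254829592) * @T * EXP(-1 * @X * @X)
SET @CDF2 = 0.5 * (1 + (@Sign * @Erf))
RETURN ABS(@CDF2 - @CDF1)
END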
…………The Jeffreys Divergence is the same as the Kullback-Leibler, except that we multiply the results of the LOG operation by the difference between the first and second distributions. The K-Divergence is also the same as the KL, except that the LOG operation is performed on twice the ActualProportion divided by the sum of the two probabilities. This same division operation is also plugged into the Topsøe and Jensen-Shannon, hence the insertion of another couple of subqueries in the INSERT to bubble it up to the individual divergence calculations. The Topsøe, Jensen-Shannon and Jensen Difference use both this measure and a similar one in which the second probability takes the place of the first. These latter three look a lot more complex on paper and require more CASE logic in the INSERT, but they really just represent a more detailed arrangement of the same building blocks. They all require summations over similar LOG operations on the two probability distributions, albeit on different levels. The last two SELECTs take up a lot of lines, but are really just trivial matters of performing some aggregates on the @EntropyTable and then returning the results to the user.
Validating the Entropic Distances
The @ReverseKullbackLeiblerDivergence variable is included merely for validation, not because there is actually such a standard measure; it is simply the KL Divergence with the two probabilities swapped (and negated, thanks to the multiplication by -1 in Figure 1), which ought to yield the KL Divergence of the first direction when summed with the Jeffreys Divergence[10]. Figure 2 depicts values of 0.893438982055162 and -0.331423755462311 respectively, which sum to 0.562015226592851, i.e. off by just one in the sixteenth decimal place from the result we received for the KL Divergence in the last tutorial. My main validation concern is with the K-Divergence, but it is at least below the upper bound of 1 when we use nats instead of bits, which yields 0.719948562354324. The Jensen-Shannon Divergence is 0.511712949639355 if we use the same yardstick based on the natural logarithm (users can easily switch back and forth by uncommenting parts of the first declaration), which is also below its boundary of 1. As expected, the Jensen-Shannon Divergence is likewise half of the Topsøe Distance, save for the fact that the final decimal place is missing; in practice, we can probably just calculate the Topsøe by multiplying the final Jensen-Shannon result by two, unless we want to incorporate all of the intermediate steps for validation purposes as I’ve done here. In practice the Jensen-Shannon and Jeffreys Divergences seem to be used much more often than the other four, but it costs us very little to compute them all at once, even without these shortcuts to computing the Jeffreys and Topsøe measures. In fact, we could tack last week’s calculations of the Kullback-Leibler Divergence and Cross Entropy on to all this and incur very little extra cost. I omitted them merely to shorten the code, which is mainly for demonstration purposes. The code executed in three seconds, the same time as the routine in the last post, which would probably be several orders of magnitude better on a real server instead of my clunky desktop machine. A lot of fat can be trimmed out of the T-SQL if we’re using it for production purposes instead of demonstration, such as taking the aforementioned shortcuts to calculate the Jeffreys and Topsøe measures; most of the cost is also incurred in the derivation of the fake probabilities, which can be drastically reduced by reading precalculated figures out of tables rather than computing them on the fly. This might be the only means of simplifying the execution plan, whose costs are almost entirely incurred in a single nonclustered index seek and a scan on the same index, which kick off two of the three queries in the batch.
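If you’d rather automate those checks than eyeball Figure 2, a couple of one-line comparisons are enough. In the sketch below, @KullbackLeiblerDivergence is merely a placeholder name for whatever variable holds the last article’s KL result in the same batch; both columns should return zero, give or take floating point rounding error:

-- sanity checks on the two identities mentioned above; run in the same batch as Figure 1,
-- with @KullbackLeiblerDivergence standing in for the variable from the last article's routine
SELECT @JensenShannonDivergence - (@TopsoeDistance / 2) AS TopsoeCheck,
@KullbackLeiblerDivergence - (@JeffreysDivergence + @ReverseKullbackLeiblerDivergence) AS JeffreysCheck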
…………One of the silver linings to this lengthy segment on distances and divergences is that they’re all surprisingly cheap to calculate. Perhaps this is why the algorithms I covered in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering performed better than any of their rivals in SSDM, and why the clustering functionality has worked so well in other data mining and statistical products I’ve tried (I’m trying to slowly wade through the whole marketplace in my occasional Integrating Other Data Mining Tools with SQL Server series). This unexpectedly good performance seems to hold even when the measures are defined in fairly complex equations, on the same order of difficulty as the Jensen Difference; once we’ve incurred the penalties of traversing an entire table and perhaps deriving the probabilities on the fly, it’s relatively painless to tack whatever math we need on at the end. This is basically all that happens in Figure 1, where most of the calculations are performed in the outer rings of SELECTs in the INSERT, then summed in the stand-alone SELECT after it.
…………Some of the distances I’ll cover in the future, like the ordinary Euclidean and Hamming Distances, are based on really simple formulas that any college student can follow; it is much more difficult to explain why the Hamming Distance is a coarse measure that wastes a lot of information. The difficulties consist mainly of interpreting these distances and divergences correctly (which calls for self-discipline in not assigning them incorrect shades of meaning) and matching them up with the right use cases. Since they’re so numerous and I’m not familiar with many of them, I’m not yet sure how I’ll organize this particular segment of the series. We already dispensed with Shared Information Distance a few articles ago in Information Measurement with SQL Server, Part 2.5: Mutual, Lautum and Shared Information. At some point I will have to cover all of the big names seen frequently in the literature, like the Bhattacharyya, Jaccard, Kendall-Tau, Canberra, Chebyshev, Hellinger, Earth Mover’s, Manhattan and Minkowski Distances. Some of the measures discussed in the last two tutorials are members of broader families like the F-, a-Skew, M- and S- Divergences, which are far too numerous to cover exhaustively. Some, like the Bregman and Rényi Divergences (a.k.a. the a-Divergence), subsume various other measures under one umbrella in the same manner that the Rényi Entropy incorporates many of its cousins, as I discussed in Information Measurement with SQL Server, Part 2.2 – The Rényi Entropy and Its Kin. There are many different ways I could organize this, but for the moment I’m leaning towards grouping them according to Sung-Hyuk’s guide. His taxonomy begins with the Minkowski family, which incorporates the simplest and most intuitive measures. Its members include the Euclidean Distance we all use on a daily basis, even if we don’t know it by name, so that may make a good launching pad from the entropic divergences into other information distance metrics.
[1] p. 147, Lin, Jianhua, 1991, “Divergence Measures Based on the Shannon Entropy,” pp. 145-151 in IEEE Transactions on Information Theory, Jan. 1991. Vol. 37, No. 1. Available online at the University of Florida Department of Computer & Information Science & Engineering web address https://www.cise.ufl.edu/~anand/sp06/jensen-shannon.pdf
[2] IBID., p. 148.
[3] Jeffreys, Harold, 1946, “An Invariant Form for the Prior Probability in Estimation Problems,” pp. 453-461 in Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, Sept. 24, 1946. Vol. 186, No. 1007.
[4] p. 79, Kullback, Solomon and Leibler, Richard, 1951, “On Information and Sufficiency,” pp. 79-86 in The Annals of Mathematical Statistics, March 1951. Vol. 22, No. 1. Available online at the West Virginia University Lane Department of Computer Science and Electrical Engineering web address http://www.csee.wvu.edu/~xinl/library/papers/math/statistics/Kullback_Leibler_1951.pdf
[5] p. 146, Lin.
[6] Lin also writes about an L-Divergence in the same paper, which augments the Jeffreys Divergence in the same way. I only became aware of it when writing this column (keep in mind that I’m learning these topics as I go and am therefore not aware of much at all) and will put off coding it for a later date.
[7] Some of the other sources I consulted for this article included:
Abou-Moustafa, Karim T. and Ferrie, Frank P., 2012, “A Note on Metric Properties for Some Divergence Measures: The Gaussian Case,” pp. 1-15 in Workshop and Conference Proceedings of the Asian Conference on Machine Learning, Vol. 25. Available online at the Massachusetts Institute of Technology web address http://jmlr.csail.mit.edu/proceedings/papers/v25/aboumoustafa12/aboumoustafa12.pdf
Crooks, Gavin E., 2008, “Inequalities Between the Jenson-Shannon and Jeffreys Divergences,” technical note 004 available online at the ThreePlusOne web address http://threeplusone.com/Crooks-inequality.pdf
Dragomir, S. S., Šunde, J. and Buse, C., 2000, “Some New Inequalities for Jeffreys Divergence Measure In Information Theory,” pp. 235-243 in Research Group in Mathematical Inequalities and Applications Research Report Collection, Vol. 3, No. 2. Available online at the RGMIA.org web address http://rgmia.org/v3n2.php
Kumar, Pranesh and Hunter, Laura, 2004, “On an Information Divergence Measure and Information Inequalities,” pp. 51-66 in Carpathian Journal of Mathematics, Vol. 20, No. 1. Available online at the University of Northern British Columbia web address http://web.unbc.ca/~kumarp/d7.pdf
Taneja, Inder Jeet; Llorente, Leandro Pardo; González, Domingo Morales; Menéndez, María Luisa, 1989, “On Generalized Information and Divergence Measures and Their Applications: A Brief Review,” pp. 47-73 in Questio, Vol. 13, No. 1. Available online at the Biblioteca Digital Española de Matemáticas web address http://dmle.cindoc.csic.es/pdf/QUESTIIO_1989_13_01-02-03_04.pdf
Ullah, Aman, 1996, “Entropy, Divergence and Distance Measures with Econometric Applications,” pp. 137-162 in Journal of Statistical Planning and Inference, Vol. 49. Available online at the University of California, Riverside Department of Economics web address http://economics.ucr.edu/people/ullah/papers/70.pdf
Yamano, Takuya, 2014, “A Note on Bound for Jensen-Shannon Divergence by Jeffreys,” published Nov. 3, 2014 in The Proceedings of the 1st International Electronic Conference on Entropy and Its Applications. Available online at the SciForum web address http://www.sciforum.net/conference/ecea-1/paper/2630/download/pdf
[8] Sung-Hyuk Cha, 2007, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” pp. 300-307 in International Journal of Mathematical Models and Methods in Applied Sciences, Vol. 1, No. 4.
[9] It is made publicly available by the University of California at Irvine’s Machine Learning Repository. I converted it to a SQL Server table of about 5 gigabytes a few tutorial series ago, as part of a sham DataMiningProjects database.
[10] This relationship to the Jeffreys Divergence and the KL Divergence for the second probability is cited in p. 145, Lin.