by Steve Bolton
…………If all goes according to plan, my blog will return in a few weeks with two brand new series, Using Other Data Mining Tools with SQL Server and Information Measurement with SQL Server. Yes, I will be attempting what amounts to a circus act among SQL Server bloggers, maintaining two separate tutorial series at the same time. As with my series A Rickety Stairway to SQL Server Data Mining, it might be more apt to call them mistutorials; once again I’ll be tackling subject matter I’m not yet qualified to write about, in the hopes of learning as I go. The first series will compare and contrast the capabilities of SSDM against other tools on the market, provided they can be run on data stored in SQL Server. That requirement eliminates, right off the bat, a lot of applications that run only on Linux or otherwise cannot access SQL Server databases. The software packages I survey may include such recognizable names as RapidMiner and WEKA, plus a few I promised others to review long ago, like Autobox and Predixion Software. The latter is the brainchild of Jamie MacLennan and Bogdan Crivat, two former members of Microsoft’s Data Mining Team who were instrumental in the development of SSDM.
…………The second series of amateur tutorials may be updated more frequently, because those posts won’t require lengthy periods of time to evaluate unfamiliar software. This will expand my minuscule knowledge of data mining in a different direction, by figuring out how to code some of the building blocks used routinely in the field, like Shannon’s Entropy, Bayes factors and the Akaike Information Criterion. Not only can such metrics be used in the development of new mining algorithms, but they can also be applied out-of-the-box to answer myriad basic questions about the type of information stored in our SQL Server databases – such as how random, complex, ordered and compressible it might be. No mining company would ever excavate a new quarry without first performing a geological survey of what precious metals might be beneath the surface; likewise, it may be helpful for data miners to have a surface impression of how much useful information might be stored in our tables and cubes before digging into them with sophisticated algorithms that can have formidable performance costs. Scores of such measures are scattered throughout the mathematical literature that underpins data mining applications, so it may take quite a while to slog through them all; while haphazardly researching the topic, I ran across a couple of quite interesting measures of information that seem to have been forgotten. I will try to make these exercises useful to SQL Server users by providing T-SQL, MDX and Common Language Runtime (CLR) functions in Visual Basic, and perhaps even short SSDM plug-in algorithms, as well as use cases for when they might be appropriate. To illustrate these uses, I may test them on freely available databases of interest to me, like the Higgs Boson dataset provided by the University of California at Irvine’s Machine Learning Repository. I may also make use of a tiny 9-kilobyte dataset on Duchenne’s form of muscular dystrophy, which Vanderbilt University’s Department of Biostatistics has made publicly available, and transcriptions of the Voynich Manuscript, an enigmatic medieval manuscript with an encryption scheme so obscure that even the National Security Agency (NSA) can’t crack it. Both tutorial series will use the same datasets, in order to cut down on overhead and make it easier for readers to follow both. When and if I manage to complete both series, the next distant item on this blog’s roadmap will be a tutorial series on how to use various types of neural nets with SQL Server, a topic I had some intriguing experiences with many years ago.
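…………To give a concrete taste of what one of these building blocks might look like in set-based code, here is a minimal sketch of Shannon’s Entropy in T-SQL – not a snippet from the upcoming series, just a rough illustration that assumes a hypothetical table dbo.SampleTable with a single nominal column named Value. The query derives the relative frequency p of each distinct value, then sums -p * log2(p) across them:

-- Rough sketch of Shannon's Entropy over one column; dbo.SampleTable and Value
-- are hypothetical stand-ins, not objects from any of the datasets mentioned above.
WITH Frequencies AS (
    SELECT [Value],
        CAST(COUNT(*) AS float) / SUM(COUNT(*)) OVER () AS p
    FROM dbo.SampleTable
    GROUP BY [Value]
)
SELECT SUM(-1 * p * LOG(p) / LOG(2E0)) AS ShannonEntropy
FROM Frequencies;
-- LOG(p) / LOG(2E0) gives a base-2 logarithm; on SQL Server 2012 and later,
-- the two-argument form LOG(p, 2) would work as well.

…………The same pattern would apply to any discrete column, although on something the size of the Higgs Boson dataset even the GROUP BY can be costly – which is precisely the sort of performance trade-off these tutorials will have to weigh.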