By Steve Bolton
…………In order to divide the Herculean task of describing custom algorithms into bite-sized chunks, I omitted discussion of some plug-in functionality from previous installments of this series of amateur tutorials on SQL Server Data Mining (SSDM). This week’s installment ought to be easily digestible, since it deals with a single method in the AlgorithmBase class that is relatively easy to implement – at least in comparison to the material covered in the last four articles, which is fairly thick. For primers on such indispensable topics as the benefits of the plug-in architecture, the object model, the arrangement of nodes in the mining results and debugging and deployment of custom mining methods, see those previous posts. The Predict method is marked Must Override, which signifies that it must be implemented in order for any SSDM plug-in to compile and run successfully, but it is possible to leave the code blank as we did in these previous tutorials in order to get the bare bones plug-in to return some simple results. Until it is fleshed out, the plug-in sample code we’ve been working with in recent weeks won’t return any results for prediction queries.
…………As I mentioned in those previous posts, I’m writing the sample plug-in code in Visual Basic .Net, in part because there’s a distinct lack of SSDM code written in VB on the Internet, and secondly because it’s my favorite language. I am much more familiar with it than C#, which is what most of the available plug-in tutorials and sample code are written in. Like most of the code I’ve posted so far in this segment of the Rickety series, my implementation of the Predict method originally came from the samples provided by Bogdan Crivat (one of the founding SSDM programmers) in the plug-in software development kit (SDK). After converting to VB using online tools, changing the variable names to my liking, adding implementations of some objects found in the SDK help file and otherwise reworking it, the code is now practically unrecognizable. The design pattern and logical flow are pretty much the same, however, and would not have worked at all without Crivat’s indispensable sample code to guide me. My goal is merely to describe the basic tasks any plug-in programmer typically has to perform in order to return the simplest of results from prediction queries, which is a fairly straightforward task. That does not mean, however, that the Predict method cannot be insanely complex when a particular algorithm requires really advanced functionality; like most of the other plug-in class methods we’ve covered so far, there is plenty of room to add intricate conditional logic when the situation calls for it. An introductory tutorial like this is not one of those situations, because of the risk of clutter, so I’ll just to stick to the basics of what can or must be done in the Predict method. I’ll leave fancier implementations of that functionality up to the imaginations of intrepid plug-in programmers.
…………Most of the plug-in methods we’ve deal with so far are commonly linked to a couple of classes in the object model, but the relationship between the Predict method and the PredictionResult method is even tighter than in many of these cases. SSDM automatically populates the method with MiningCase objects, which we worked with back in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm, plus a single PredictionResult that reflects the structure of the prediction query specified by the end user. For example, the class has a MaxPredictions member that is equivalent to the n parameter of the Predict function, to limit the number of records returned; an IncludeNodeID member to return a string value for the NODE_ID when the user specifies the INCLUDE_NODE_ID flag in the query; and AttributeSortBy and AttributeSortDirection members that describe the arrangement of columns specified in the ORDER BY clause. If the user also used the INCLUDE_STATISTICS flag, then SSDM will also return a NODE_DISTRIBUTION table with the results that is limited to the number of predictions set in the MaxStates property. Any order assigned to this dependent table will likewise be reflected in the StateSortBy and StateSortDirection properties. StateSortBy and AttributeSortBy both take values of the SortBy enumeration, which includes values like SortBy.AdjustedProbability, SortBy.Probability, SortBy.Support, SortBy.Variance and SortBy.None, while the other two properties take either SortDirection.Ascending or SortDirection.Descending values. The good news is that we don’t have to worry about any of these sorting properties, because as Crivat says in the SDK tutorial, “…you do not need to take them into account during the prediction, except if your patterns layout allows you to dramatically optimize the performance. The infrastructure will push the attribute predictions in a heap structure, sorted by probability, support, or adjusted probability as requested in the prediction operation.” In other words, it’s an advanced lesson far beyond the scope of this crude tutorial series. I imagine that in such a scenario, you would add conditional logic to alter the results returned by the query, depending on which sort method was in use. For example, you might implement a Select Case like the one in Figure 1, which could also be performed on a PredictionResult.StateSortBy value:
Figure 1: Sample Conditional Logic for a SortBy Value
Select
Case PredictionResult.SortBy.AdjustedProbability
Case PredictionResult.SortBy.Probability
Case PredictionResult.SortBy.Support
Case PredictionResult.SortBy.Variance
Case PredictionResult.SortBy.None
End Select
…………That begs the question: what belongs in each Case clause? The answer might include any of the remaining operations we’ll perform with PredictionResult in the rest of this tutorial, many of which could be encapsulated in one of these Case statements. A typical implementation of the Predict method, would begin a While loop like the one in Figure 2, which iterates over the OutputAttributes collection of the supplied PredictionResult. The PredictionCounter variable also limits the loop by the value of the MaxPredictions property and is incremented with each iteration just before the End While, which occurs near the end of the function. The AttributeSet.Unspecified variable supplies the index of the currently selected output attribute in the AttributeSet collection; it is zero-based, so we won’t be subtracting one from some of the arrays we’ll deal with in the meat and potatoes of the function, unlike in some other methods we’ve dealt with in the past. I normally prefer to iterate over explicit arrays, but had to use a While loop in this case because AttributeGroup objects like the OutputAttributes member have only two commands to control iteration, Next and Reset. The Select Case that follows the While declaration merely determines whether or not the currently selected attribute is disallowed by the PredictionResult.InputAttributesTreatment property. If not, then we create an AttributeStatistics object to hold the results we’ll be returning.
Figure 2: Complete Sample Code for the Predict Method
Protected Overrides Sub Predict(InputCase As Microsoft.SqlServer.DataMining.PluginAlgorithms.MiningCase, PredictionResult As Microsoft.SqlServer.DataMining.PluginAlgorithms.PredictionResult)
PredictionResult.OutputAttributes.Reset() ‘an AttributeGroup object
Dim PredictThisAttributeBoolean = False
Dim PredictionCounter As UInt32 = 0 ‘for determining whether or not we’ve reached the limit that may be defined in PredictionResult.MaxPredictions
‘I’m not sure yet but I think MaxPredictions is unlimited when = 0
While PredictionResult.OutputAttributes.Next(AttributeSet.Unspecified) = True And PredictionCounter < PredictionResult.MaxPredictions
‘OK if the current attribute is input, or output, matches AttributeSet.GetAttributeFlags(1), but
Select Case PredictionResult.InputAttributeRestrictions
Case PredictionResult.InputAttributesTreatment.ExcludeInputAttributes And Me.AttributeSet.GetAttributeFlags(AttributeSet.Unspecified).ToString Like “*Input*”
PredictThisAttributeBoolean = False
Case PredictionResult.InputAttributesTreatment.PredictOnlyInputAttributes And Me.AttributeSet.GetAttributeFlags(AttributeSet.Unspecified).ToString Like “*Output*”
PredictThisAttributeBoolean = False
Case PredictionResult.InputAttributesTreatment.None ‘this one really belongs in the Case Else, but I want to demonstrate it
PredictThisAttributeBoolean = True
Case Else
PredictThisAttributeBoolean = True
End Select
If PredictThisAttributeBoolean = True Then
Dim AttributeStatsToAdd As New AttributeStatistics()
AttributeStatsToAdd.Attribute = AttributeSet.Unspecified ‘the target attribute
AttributeStatsToAdd.AdjustedProbability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).AdjustedProbability
AttributeStatsToAdd.Max = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Max
AttributeStatsToAdd.Min = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Min
AttributeStatsToAdd.Probability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Probability
AttributeStatsToAdd.Support = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Support
‘test to see if the Missing state ought to be returned (SSDM will filter out data that is not requested in the query, but there’s no point wasting resources letting it get to that point
If PredictionResult.IncludeStatistics = True Then
For I As Integer = 0 To MarginalStats.GetAttributeStats AttributeSet.Unspecified).StateStatistics.Count – 1
If (I <> 0 And I > PredictionResult.MaxStates) Then Exit For ‘exit if we’ve hit the MaxStates, if there is a limit
If MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(I).Value.IsMissing = True And PredictionResult.IncludeMissingState = False Then
‘don’t add this node, because it’s the missing state
Else ‘in all other cases, add it
‘AttributeStatsToAdd.StateStatistics.Add(MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)))
Dim SkewnessState As New StateStatistics()
SkewnessState.ValueType = MiningValueType.Other
SkewnessState.Value.SetDouble(SkewnessKurtosisClassArray(AttributeSet.Unspecified).Skewness)
‘don’t subtract 1 because Unspecified is zero-based
SkewnessState.AdjustedProbability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).AdjustedProbability
SkewnessState.Probability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).Probability
SkewnessState.Support = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).Support
SkewnessState.ProbabilityVariance = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).ProbabilityVariance
SkewnessState.Variance = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).Variance
AttributeStatsToAdd.StateStatistics.Add(SkewnessState)
End If
Next I
End If
If PredictionResult.IncludeNodeId Then
AttributeStatsToAdd.NodeId = AttributeStatsToAdd.NodeId = AttributeSet.GetAttributeDisplayName(AttributeSet.Unspecified, False)
End If
‘ Add the result to the predictionResult collection
PredictionResult.AddPrediction(AttributeStatsToAdd)
PredictionCounter = PredictionCounter + 1 ‘increment it only if a prediction is actually made (which it may not be if a comparison to the Input/Output to InputAttributeRestrictions leads to an exclusion
End If
End While
End Sub
…………This sample (I apologize for the sloppy code formatting, but the WordPress GUI is sometimes more challenging to work with than SSDM Extensibility) merely assigns the values of a column’s MarginalStats object to the new AttributeStatistics object, which ought to be a familiar operation at this point in the tutorial series. It then tests the IncludeStatistics member to see if we ought to include the NODE_DISTRIBUTION table, followed by a test of the IncludeMissingState property, then creates a new StateStatistics object if everything passes. The VALUETYPE of the single NODE_DISTRIBUTION table row this represents is set to MiningValueType.Other and the AdjustedProbability, Probability, Support, ProbabilityVariance (i.e. Standard Deviation, or StDev) and Variance are set to the corresponding values in the MarginalStats object for that column. The most important part is the Value.SetDouble statement, which sets the ATTRIBUTE_VALUE to the skewness statistics for that column, which we computed in previous tutorials. The StateStatistics object is then added to the AttributeStatistics object. Note how that section of code resembles the assignments of state stats in Post 14.4. You could conceivably add multiple rows of StateStatistics for each AttributeStatistics, or multiple AttributeStatistics for each column you return, depending on the needs of your algorithm. The next order of business is to add conditional logic to deal with instances in which the IncludeNodeID is True; in this case, I’m simply returning the full name of the output column. The newly populated PredictionResult is then added to the collection of predictions that will be returned by the method, the PredictionCounter variable I created is incremented, then the While loop either makes another pass or the method returns.
…………The logic you implement in the Predict method might be complex but the basics of its operation are not, in part because most of the PredictionResult properties are set by the query writer and therefore read-only. Debugging of the Predict method is far easier than with the AlgorithmNavigationBase class we discussed in the last post, which is in turn far less of a hassle than troubleshooting the other AlgorithmBase members and several orders of magnitude easier than dealing with errors in the AlgorithmMetadataBase class. One of the few issues I ran into initially were errors with text including the word “GetResultFromAttStats,” which might belong on Jeannine Takaki’s incredibly valuable thread on undocumented SSDM stored procedures. Presently, it is so undocumented that it only returns two hits on Google. One of these is an MSDN thread titled problem with managed code plug-in to analysis services which had no replies. I was fortunate to find this sentence in Gaël Fatou’s invaluable reply at the MSDN thread Managed Plugin-Algorithm: “…if (trainingStats.StateStatistics[0].Value.IsMissing) will not work, as it will always only consider the first state (which is always missing). The current state must be considered instead: if (trainingStats.StateStatistics[nIndex].Value.IsMissing).” This is due to an error in the SDK sample code that has apparently bitten at least one other plug-in developer, which necessitates the replacement of the [0] in the StateStatistics[0].Value.IsMissing statement with the proper state stats index. I never encountered GetResultFromAttStats errors after applying this solution, but first-time plug-in developers ought to be aware of this stumbling block in advance. I also received a couple of “Object reference not set to an instance of an object” errors that referenced GetAttributeFlags, a method of AttributeSet that I was calling incorrectly in my IncludeNodeID code. Another problem I encountered that was equally easy to fix was forgetting to enable the prediction functions I called in my test query, which led SQL Server Management Studio (SSMS) to return error messages with the text “…function cannot be used in the context” every time I ran the queries in Figures 3 and 4. It was a simple matter to edit the code of GetSupportedStandardFunctions in AlgorithmMetadataBase to return the correct SupportedFunction enumeration values. Out of the 37 available values, I selected PredictAdjustedProbability, PredictCaseLikelihood, PredictHistogram, PredictProbability, PredictSupport, PredictStdDev, PredictVariance, PredictScalar and PredictNodeId, which enabled me to return the results in the two figures below. I also enabled PredictTable, CustomFunctionBase and IsInNode for future experimentation. The .chm included with the SDK mentions that other Data Mining Extensions (DMX) functions like DAdjustedProbability, DNodeId, DStdDev, DVariance, DSupport, PredictAssociation and PredictSequence are also dependent on the output of the Predict method, but I didn’t implement some of these because they’re applicable to other types of algorithms, which are beyond the scope of this series. Note that the PredictHistogram function in the top graphic includes a row for the Missing state, while the first row contains the same values as the PredictSupport, PredictProbability, PredictAdjustedProbability, PredictVariance and PredictStDev functions in the bottom graphic. The Predict function at the bottom returns the same skewness value for the InputStat column we saw in last week’s post, as well as the name of the InputStat column for the PredictNodeID function. Most of these columns are standard return values in prediction queries, but in next week’s tutorial, I’ll demonstrate how to use custom mining functions to return more flexible result formats and perform tailor-made processing.
Figures 3 and 4: The Results of Various DMX Prediction Functions