By Steve Bolton
…………In order to divide this segment of my amateur tutorial series on SQL Server Data Mining (SSDM) into digestible chunks, I deferred discussion of some functionality that can be implemented in custom algorithms. In the last installment I explained how to write custom data mining functions, which is a quite advanced but powerful feature of the plug-in architecture – even by the standards of SSDM, which is in turn perhaps the most underrated and least publicized component of SQL Server. This week we’ll be covering a few odds and ends that aren’t absolutely necessary to implement in a bare bones plug-in, yet are more commonly used than custom mining functions and not nearly as challenging. It is really quite easy to code custom mining model flags, for instance, whereas parameters for custom algorithms are only slightly more difficult. In order to kill two birds with one stone, so to speak, we’ll also dispose of the topic of feature selection by implementing custom mining flags and parameters that control it.
…………Mining flags merely require a class-scoped declaration of the MiningModelingFlag object type, as well as the implementation of two optional methods in AlgorithmMetadataBase. The GetSupModelingFlags method simply returns an array of MiningModelingFlag objects, including such enumeration values as MiningModelingFlag.Binary, Invalid, NotNull and Regressor. Returning a value of MiningModelingFlag.Null apparently causes a value of “error” to be listed in the SUPPORTED_MINING_FLAGS column of the DMSCHEMA_MINING_SERVICES schema rowset, which is roughly the equivalent of a relational dynamic management view (DMV). That error can be seen in the screenshot of DMSCHEMA_MINING_SERVICES I posted in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm, and it is corrected by the code depicted in Figure 1. For whatever reason, returning values of MiningModelingFlag.CustomBase rather than references to the flags themselves invariably raised an exception, so be vigilant about this fine distinction when implementing your own flags.
Figure 1: Sample AlgorithmMetadataBase Class Code for Custom Mining Flags
Friend Shared MyCustomMiningFlag1 As MiningModelingFlag = MiningModelingFlag.CustomBase + 1
Friend Shared MyCustomMiningFlag2 As MiningModelingFlag = MiningModelingFlag.CustomBase + 2
Public Overrides Function GetSupModelingFlags() As MiningModelingFlag()
Dim TempMiningFlags As MiningModelingFlag() = New MiningModelingFlag() {MiningModelingFlag.Regressor, MyCustomMiningFlag1, MyCustomMiningFlag2}
Return TempMiningFlags
End Function
Public Overrides Function GetModelingFlagName(TempFlag As MiningModelingFlag) As String
Select Case TempFlag
Case MyCustomMiningFlag1
Return "MyCustomMiningFlag1"
Case MyCustomMiningFlag2
Return "MyCustomMiningFlag2"
Case Else
Throw New System.Exception("Unknown modeling flag: " & TempFlag.ToString())
End Select
End Function
…………I also corrected the declarations of the flags, which should have been Friend Shared in the original version. GetModelingFlagName merely returns string names associated with the flags, which is also trivial to implement. As noted in previous tutorials in this segment of the Rickety series, most of the code I’m providing initially came from the C# samples provided by Bogdan Crivat, one of the original SSDM programmers, in the plug-in software development kit (SDK). I converted it to Visual Basic .Net in part because it’s my favorite language and secondarily because the Internet is practically devoid of SSDM plug-in code written in VB; after that, I made so many changes that the original material is practically unrecognizable. One of the common changes I made was to use Select Cases rather than If statements in order to accommodate wider ranges of values in future projects, as I did in GetModelingFlagName. After replacing these declarations and methods in the sample code I posted for the AlgorithmMetadataBase class a while back, the SUPPORTED_MINING_FLAGS column of DMSCHEMA_MINING_SERVICES should now depict a value of “REGRESSOR, MyCustomMiningFlag1, MyCustomMiningFlag2” rather than “Error.” As you can see in Figure 2, SSDM capitalized the MiningModelingFlag.Regressor value I returned in GetSupModelingFlags – as it does with the other out-of-the-box flag types – but I didn’t stick to this convention when naming my custom flags.
Figure 2: Custom Mining Flags in the SUPPORTED_MINING_FLAGS Column
…………Using parameters in a plug-in algorithm is a little trickier and requires the implementation of three methods, plus invocation code somewhere around the AlgorithmMetadataBase constructor. I’m not sure yet if it is necessary to add a constructor to that class, but it makes an ideal place to instantiate your parameters and add them to a class-scoped MiningParameterCollection object, as depicted in Figure 3. Crivat encapsulates declaration code of this kind in the constructor by calling a DeclareParameters method he added to the base class, which is probably a good idea in many circumstances; I opted not to add methods of my own devising to this class for now, just to avoid causing any confusion with other methods that are marked Must Override or optional in the base class. As Figure 3 also shows, MiningParameter objects have several properties that can be used to perform basic tasks like assigning defaults and descriptions, as well as identifying the parameters as hidden or required. The expected return type and name are specified in the same line of code as the parameter declaration.
Figure 3: Parameter Declaration Code for the AlgorithmMetadataBase Class
Protected MyMiningParameterCollection As New MiningParameterCollection 'instantiate the collection here so the constructor below can populate it
'#2 declare the members of the collection – possibly in AlgorithmMetadataBase.New or a method called by it
Public Sub New()
Dim MiningParameter1 As New MiningParameter("MyFeatureSelectionParameter", GetType(System.Int32))
Dim MiningParameter2 As New MiningParameter("MyTrainingParameter", GetType(System.String))
MiningParameter1.DefaultValue = "0"
MiningParameter1.Description = "A Description for My Only Feature Selection Parameter"
MiningParameter1.IsExposed = True
MiningParameter1.IsRequired = False
MyMiningParameterCollection.Add(MiningParameter1)
MiningParameter2.DefaultValue = "My Default Value"
MiningParameter2.Description = "A Description for My Only Training Parameter"
MiningParameter2.IsExposed = True
MiningParameter2.IsRequired = False
MyMiningParameterCollection.Add(MiningParameter2)
End Sub
…………We must also implement three other methods, all of which are essentially trivial. The GetParametersCollection method returns either the MiningParameterCollection we declared at the beginning of the class, or Nothing if there are no parameters and therefore no such collection. There’s nothing particularly difficult about the ParseParameterValue method either, although it does require a few more lines of code. My implementation is pretty close to Crivat’s original sample code, in part because there are only so many ways the simple task of parsing the parameter values supplied by the user can be done. The parameter types are different and the exception handling syntax varies a little, with the major change being the Select Case. The logic is easy to grasp: when the user supplies the parameter at the ordinal index of the zero-based collection, we merely cast the supplied value to the data type we originally declared for that parameter and return it. In his original code, Crivat notes that the parameters must be cast to the exact type, which I think means that they won’t be implicitly cast. All of the code we’ve added so far in this tutorial belongs in AlgorithmMetadataBase, but the GetTrainingParameterActualValue method is part of the AlgorithmBase class. It is no more difficult to implement, however, because all we have to do is return any default value if the user declined to specify one. As Crivat notes, special logic can be added here to substitute “The best value automatically (heuristically) detected by the algorithm for the current training set.” In a related vein, the .chm help file included with the SDK says that “The actual value may be different from what was requested by the user in situations like the following: the value specified by the user indicates some ‘auto-detect’ mechanism, that computes an optimized value based on the input data.” It also says that “implementing this function usually requires persisting the values of the parameters in the SaveContent implementation of your algorithm,” but we already demonstrated how to persist variable values of any kind in Post 14.2.
Figure 4: Three Methods Relevant to Parameters in AlgorithmMetadataBase and AlgorithmBase
Public Overrides Function GetParametersCollection() As Microsoft.SqlServer.DataMining.PluginAlgorithms.MiningParameterCollection
Return MyMiningParameterCollection
End Function
Public Overrides Function ParseParameterValue(ParameterIndex As Integer, ParameterValue As String) As Object
Dim ReturnValue As Object
Select Case ParameterIndex
Case 0
ReturnValue = System.Convert.ToInt32(ParameterValue)
Case 1
ReturnValue = ParameterValue
Case Else
Throw New System.Exception("An error occurred when casting the parameter values.")
End Select
Return ReturnValue
End Function
Protected Overrides Function GetTrainingParameterActualValue(ParameterOrdinal As Integer) As System.Object
If MyMiningParameterCollection.Item(ParameterOrdinal).Value Is Nothing Then
Return MyMiningParameterCollection.Item(ParameterOrdinal).DefaultValue
Else
Return MyMiningParameterCollection.Item(ParameterOrdinal).Value
End If
End Function
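…………As a purely hypothetical illustration of that “auto-detect” mechanism, the variant of GetTrainingParameterActualValue below substitutes its own value whenever the user passes a sentinel of -1 for the first parameter. The sentinel convention, the fallback value of 1 and the comments are my own assumptions rather than anything prescribed by the SDK, so treat this as a sketch of where such logic would go rather than a working heuristic:
Protected Overrides Function GetTrainingParameterActualValue(ParameterOrdinal As Integer) As System.Object
Dim SuppliedValue As Object = MyMiningParameterCollection.Item(ParameterOrdinal).Value
If SuppliedValue Is Nothing Then
'no user-supplied value, so fall back on the declared default
Return MyMiningParameterCollection.Item(ParameterOrdinal).DefaultValue
ElseIf ParameterOrdinal = 0 AndAlso CInt(SuppliedValue) = -1 Then
'hypothetical "auto-detect" sentinel: a real implementation would derive this value
'from statistics gathered on the training set, as the .chm suggests
Return 1
Else
Return SuppliedValue
End If
End Function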
…………The ValidateAttributeSet method of AlgorithmMetadataBase is also an ideal place to validate mining flags for incorrect or missing mandatory values. SSDM automatically passes it a reference to the AttributeSet, which can be parsed to make sure that the user passed valid values; for example, you’d obviously want to iterate over the AttributeSet and verify that at least one attribute is marked as an input column. The code for this method is even more trivial, in the sense that it doesn’t even have a return type; you merely iterate over your AttributeSet and custom flags and throw an exception if a bad value is encountered, which halts training. On the other hand, the .chm states “that the attribute set here is not fully constructed, so only a limited set of methods can be invoked: GetAttributeCount, GetAttributeFlags, IsFunctionallyDependant and IsCaseAttribute.” I have yet to verify this, but this may preclude you from using it to parse parameter values as well. Normally, however, they would be checked in ParseParameterValue or GetTrainingParameterActualValue. Regardless of how these methods are implemented, the code in the constructor should be sufficient to expose the parameters to the end user, which can be verified by running a query like the one in Figure 5 on the DMSCHEMA_MINING_SERVICE_PARAMETERS schema rowset. Note that the values depicted for the PARAMETER_NAME, PARAMETER_TYPE, IS_REQUIRED and DESCRIPTION columns correspond to the property values we assigned in the constructor.
Figure 5: DMSCHEMA_MINING_SERVICE_PARAMETERS
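…………Returning briefly to ValidateAttributeSet, the checks described above might look something like the following sketch. It assumes the override is a Sub that receives the attribute set (adjust the signature, access modifier and parameter name to match the declaration in your version of the base class) and confines itself to the GetAttributeCount and GetAttributeFlags calls that the .chm says are safe at this stage:
Public Overrides Sub ValidateAttributeSet(TempAttributeSet As AttributeSet)
Dim InputCount As Integer = 0
For I = 0 To TempAttributeSet.GetAttributeCount() - 1
'count the columns flagged as inputs
If (TempAttributeSet.GetAttributeFlags(I) And AttributeFlags.Input) = AttributeFlags.Input Then
InputCount = InputCount + 1
End If
Next I
If InputCount = 0 Then
'throwing here halts training, as noted above
Throw New System.Exception("At least one attribute must be marked as an input column.")
End If
End Sub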
…………The only difficulty I encountered when implementing parameters was that SSDM never called my GetTrainingParameterActualValue code as expected, which forced me to declare a public MiningParameterCollection in AlgorithmBase as well. I then added code to the CreateAlgorithm method of AlgorithmMetadataBase to make the two collections equivalent; there may be some kind of performance penalty for this workaround, but since AlgorithmMetadataBase is only called when the algorithm is instantiated, I have yet to see one. I then iterate over AlgorithmBase’s version of the collection in the InsertCases method and call GetTrainingParameterActualValue once for each constituent parameter. As Crivat suggests in his code examples, the beginning of InsertCases is an ideal place to handle training parameters; in my updated sample code for AlgorithmBase, I left a placeholder If…End If block to encapsulate conditional processing based on these parameter values. Since we’re using a class-scoped parameter collection, though, we can reference the parameters in many other routines. We could, for example, provide end users more control over model processing by implementing new training phases, or performance parameters that change the way SSDM performs calculations in the ProcessCase method, depending on the available system resources.
…………I had originally planned to use parameters to illustrate how to use ProgressNotification objects, which update users on the status of model processing. As we have seen throughout this series, processing can be quite resource-intensive in terms of CPU, RAM and I/O, as well as time-consuming, thereby necessitating regular communication between algorithms and their users so that the latter can cancel processing or move on to some other task while they wait. The ProgressNotification object is designed to perform this task by providing messages to the users that are displayed in the Visual Studio GUI, SQL Server Management Studio (SSMS) or Profiler. As relational DBAs know, polling events in Profiler too frequently can also impact performance, so it is necessary to strike a fine balance between collecting too much and too little performance information. It is technically possible to expose parameters that allow users to control how often the algorithm emits its own developer-defined messages, such as at intervals based on the case count in ProcessCase. Yet to date, I have not been able to use any of the Start, Progress or EmitSingleTraceNotification methods of SSDM’s ProgressNotification object without triggering a System.AccessViolationException. I can instantiate ProgressNotification variables and manipulate their other properties and members without incident, but the three methods that actually fire notifications invariably trigger exceptions that can’t even be caught in standard Try…Catch blocks. The concept is really simple: you merely set a class-scoped ProgressNotification object equal to the output of a Model.CreateTaskNotification method call, then use properties like Format, Total and Current to specify a message and track the intervals at which it fires. Strangely, the exact same code works in one plug-in project but triggers the fatal exception in another, even when the code from the first is directly cut and pasted into the second. I’m considering filing a bug report at TechNet, but can’t dismiss the possibility that there’s an overlooked error in my project settings or something.
Either way, I’ll hold off on posting code that might be buggier than usual until I get to the bottom of it.
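…………Returning to the GetTrainingParameterActualValue workaround described above, a minimal sketch of it might look like the fragment below. The member and method names are hypothetical, the collection would be assigned from within the existing CreateAlgorithm override of AlgorithmMetadataBase, and the helper would be called at the top of InsertCases before any training logic runs:
'in AlgorithmBase: a public copy of the parameter collection, assigned in
'AlgorithmMetadataBase.CreateAlgorithm immediately after the algorithm object is instantiated
Public MyAlgorithmBaseParameterCollection As MiningParameterCollection
'a hypothetical helper, called at the beginning of InsertCases, that resolves each
'parameter's actual value once per processing run
Private Sub ResolveTrainingParameters()
Dim ActualParameterValue As Object
For I = 0 To MyAlgorithmBaseParameterCollection.Count - 1
ActualParameterValue = GetTrainingParameterActualValue(I)
If ActualParameterValue IsNot Nothing Then
'placeholder for conditional processing based on the parameter values,
'such as altering training phases or feature selection behavior
End If
Next I
End Sub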
…………The code I wrote for the three Overridable methods relevant to feature selection is also buggy, but the good news is that it is irrelevant. HasFeatureSelection is the easiest one, because it is another one of those plug-in methods that behaves more like a property, which in this case is merely a Boolean choice. If you are going to perform feature selection, you merely return a value of True; if not, you can return False or not implement it at all. If you wish, you can also add trivial conditional logic to turn it on and off, depending on the values of any feature selection parameters you’ve implemented, or any other variable for that matter. Instead of using my example parameter in this way, I check its value in GetAttributeFeatureSelectionInfo and return an AttributeFlags object that identifies the attribute as an Input column that participates in feature selection, if the value the user supplied matches the ordinal index passed by SSDM as the TempAttribute unsigned integer. The increment in the conditional logic merely addresses the fact that MiningParameterCollection objects are one-based, while the model’s AttributeSet object is zero-based. As the .chm points out, there are really only four values that this method can return: zero for attributes that feature selection isn’t applied to; AttributeFlags.Input; AttributeFlags.Output; and AttributeFlags.Input|AttributeFlags.Output for those columns which are feature selected both ways. Another optional method, GetAllFeatureSelectionInfo, returns both the set of attributes which were feature selected and the attribute flags associated with them. In this case, the latter are limited to values of AttributeFlags.Input, AttributeFlags.Output and a combination of both. This method is on the tricky side, since the return values are passed through output parameters. Output parameters are a native part of the syntax of the C family of languages; in Visual Basic, we have to work around this by adding <Out()> attributes and passing the arguments ByRef. The sample implementation in Figure 6 is a little more convoluted than GetAttributeFeatureSelectionInfo, because we need to create an array of attributes with smaller dimensions than the AttributeSet of all attributes, so their indexes won’t match; hence the need for the ArrayCounter variable. This method performs pretty much the same function as GetAttributeFeatureSelectionInfo, except that it populates an array of every attribute that is feature selected along with the correct flags, instead of just returning the flags one attribute at a time.
Figure 6: Feature Selection Code
Protected Overrides Function HasFeatureSelection() As Boolean
Return True
End Function
Protected Overrides Function GetAttributeFeatureSelectionInfo(TempAttribute As UInteger) As AttributeFlags
If MyMiningParameterCollection.Item(0).Value = TempAttribute Then
Return AttributeFlags.Input
Else
Return CType(0, AttributeFlags) 'zero signals that this attribute is not feature selected
End If
End Function
Protected Overrides Sub GetAllFeatureSelectionInfo(<Out()> ByRef TempAttributes() As UInteger, <Out()> ByRef TempAttributeFlags() As AttributeFlags)
Dim ArrayCounter As UInteger
'allocate the output arrays before populating them; they are resized to the actual count below
ReDim TempAttributes(CInt(AttributeSet.GetAttributeCount()) - 1)
ReDim TempAttributeFlags(CInt(AttributeSet.GetAttributeCount()) - 1)
For I = 0 To AttributeSet.GetAttributeCount() - 1
If MyMiningParameterCollection.Item(0).Value = I Then
TempAttributes(ArrayCounter) = I
TempAttributeFlags(ArrayCounter) = AttributeFlags.Input
ArrayCounter = ArrayCounter + 1
End If
Next I
'trim the arrays down to the number of feature selected attributes
ReDim Preserve TempAttributes(CInt(ArrayCounter) - 1)
ReDim Preserve TempAttributeFlags(CInt(ArrayCounter) - 1)
End Sub
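…………As an aside, the “trivial conditional logic” mentioned earlier for switching feature selection on and off could look something like this hypothetical variant of HasFeatureSelection, which only enables it when the first parameter has been given a non-negative value; the sentinel convention is my own assumption, not something required by the plug-in architecture:
Protected Overrides Function HasFeatureSelection() As Boolean
'assumes parameter 0 is the feature selection parameter declared in the constructor;
'a missing or negative value switches feature selection off entirely
If MyMiningParameterCollection.Item(0).Value Is Nothing Then
Return False
ElseIf CInt(MyMiningParameterCollection.Item(0).Value) >= 0 Then
Return True
Else
Return False
End If
End Function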
…………The problem with the code above is that these methods appear to be designed to be triggered internally by msmdsrv.exe, but I have yet to see Analysis Services hit the breakpoints I set in these routines, with one exception: when rendering Predictive Model Markup Language (PMML) representations of mining models, which we will discuss in the next installment of this series. I could work around this by calling them directly in another routine, like I did with the methods relevant to parameters in InsertCases, but there’s not much of a point in doing that. They don’t seem to do much of anything relevant to processing that can’t be implemented by adding our own methods to AlgorithmBase and calling them in other routines, so we’re not losing much of value by skipping over these methods. Nor does the plug-in architecture appear to perform any special processing based on the values returned by them; it’s not like we can feed them values and have SSDM calculate results based on the feature selection methods included in most of the nine out-of-the-box algorithms discussed much earlier in this series, like the Interestingness Score, Shannon’s Entropy, Bayesian with K2 Prior and Bayesian Dirichlet Equivalent with Uniform Prior. Nor is there apparently any relevant information out there on these relatively irrelevant functions; there are no implementations of them in the sample SDK code (at least based on a Windows search of the files) and nothing in the tutorial I found online titled Microsoft Business Intelligence with Numerical Libraries: A White Paper by Visual Numerics, Inc. In fact, Google returns virtually no hits for any of them. The best I could do was find references to them in the old MSDN tutorial A Tutorial for Constructing a Plug-in Algorithm, written by Max Chickering and Raman Iyer back in August of 2004, but that was designed for users working with COM in C++. My sample code may be buggy, but it is the only code I’ve seen published anywhere on the planet to date for any of these methods, in Visual Basic, C# or any other language. As I’ve said a few times throughout this series and my earlier series An Informal Compendium of SSAS Errors (which remains one of the few centralized sources of information for certain arcane Analysis Services exception messages), troubleshooting SSAS is a bit like blasting off into space, since you’re basically “boldly going where no man has gone before” thanks to the lack of documentation. This is doubly true of SSDM, which has even less documentation thanks to its status as SQL Server’s most underrated and under-utilized component, despite the fact that it can be put to such productive uses. Plug-in algorithms are in turn the cutting edge of SSDM, so it should be no surprise that we’re entering a vacuum – and just as in deep space, we can do without a lot of the material we’re encountering now.
…………We’ve already traversed the most challenging topics in this segment of the Rickety series, after passing the apex of difficulty in A Rickety Stairway to SQL Server Data Mining, Part 14.3: Debugging and Deployment. Now the series is winding down to the point where we’re addressing features that aren’t of any great necessity in plug-in development. In the next installment, we’ll tackle SQL Server’s implementation of PMML, a powerful XML-based means of making mining models interchangeable with other products. Unfortunately, SQL Server is still stuck on the PMML standard promulgated back in March of 2003.
Despite the lack of attention from the top brass at Microsoft in that time period, SSDM remains one of the best data mining tools available, but PMML support is the one area in which the usefulness of SSDM has been noticeably reduced by this neglect. In contrast, the product’s support for custom data mining viewers is one of its greatest unsung features, as we shall see in a few weeks as we close out the Rickety series.