Starting with SQL Server 2016, Microsoft has put a lot of effort into integrating various languages used in machine learning and statistical analysis into their database engine. SQL Server 2016 introduced integration of the R language, and SQL Server 2017 added Python.
Both these languages give you the advantage of bringing analytics directly to the data, making it possible to perform near real-time predictions using a variety of machine learning algorithms. But what methods can we use to leverage this analytical power inside SQL Server? As it turns out, there are a total of three options, some limited to specific languages like R, others directly integrated into the T-SQL language.
In this article we will look at the three methods we can use to perform in-database analytics and sum up their advantages, disadvantages and limits. Before we can start predicting we need a trained model, so let's start off by building our machine learning model and storing it directly inside SQL Server!
Building a machine learning model directly in SQL Server
For all the actions I am going to perform in this article I used the Automobile dataset that is available through the UC Irvine Machine Learning Repository here: https://archive.ics.uci.edu/ml/datasets/automobile.
I cleaned up the data a bit and loaded it into a database table called “automobile” inside my SQL Server 2017 instance, ending up with the table you can see in the figure below:
As a next step I performed a split on the original table, copying 90% of the data inside a training table (automobile_train) to train the model, and 10% into a test table (automobile_test). The contents of the test table will be used to test the performance of the machine learning model later on.
-- Select 10% of the rows into a new test table
SELECT TOP 10 PERCENT *
INTO Automobile_test
FROM Automobile

-- Place the remaining 90% into a train table
SELECT *
INTO Automobile_train
FROM Automobile
EXCEPT
SELECT TOP 10 PERCENT *
FROM Automobile
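One thing to be aware of: TOP 10 PERCENT without an ORDER BY returns an arbitrary set of rows, so the two TOP queries are not guaranteed to pick the same rows. It is worth sanity-checking the split afterwards; a quick check could look like this (assuming the two tables were created as above):

```sql
-- Sanity-check the split: train and test row counts should add up to the original table
SELECT
    (SELECT COUNT(*) FROM Automobile)       AS total_rows,
    (SELECT COUNT(*) FROM Automobile_train) AS train_rows,
    (SELECT COUNT(*) FROM Automobile_test)  AS test_rows;
```

Adding an ORDER BY on a key column to both TOP queries makes the split repeatable.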
Now that we have the data sorted out, we need a place to store our trained model inside SQL Server. For this I created a model table that holds the name of the model, the version and, most importantly, the trained model itself stored as a VARBINARY(MAX).
-- Create a table to hold our model
CREATE TABLE models
    (
    model_name nvarchar(100) NOT NULL,
    model_version nvarchar(100) NOT NULL,
    model_object varbinary(max) NOT NULL
    )
GO
Now that we have a place to store our trained model, let's move on to the next step: actually training the model. You can take different paths to get a trained model inside a SQL Server table. For instance, you can load in a pre-trained model you built inside RStudio, or you can train the model directly in SQL Server, which is the path I took for this article. Having the option to import a pre-trained model directly into SQL Server is very useful for many scenarios. For instance, your data scientists can build and optimize the model in their own workspace inside RStudio, then import it into SQL Server when the model is ready for operationalization.
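As a sketch of that import path, a data scientist could serialize a model in RStudio and write it into the models table over ODBC. The connection string, local data frame and column names below are placeholders for your own environment, and this assumes the RevoScaleR and RODBC packages are installed:

```r
# Train and serialize a model locally in RStudio
# (requires RevoScaleR, e.g. via Microsoft R Client)
library(RevoScaleR)
library(RODBC)

local_model <- rxLinMod(price ~ horsepower + curb_weight, data = automobiles_local)
raw_model   <- rxSerializeModel(local_model, realtimeScoringOnly = FALSE)

# Write the serialized model into the SQL Server models table
# by building a 0x... binary literal from the raw vector
conn <- odbcDriverConnect("Driver=SQL Server;Server=MYSERVER;Database=MyDB;Trusted_Connection=yes")
sqlQuery(conn, paste0(
  "INSERT INTO models (model_name, model_version, model_object) ",
  "VALUES ('automobiles.linmod', 'v1', 0x",
  paste(raw_model, collapse = ""), ")"))
odbcClose(conn)
```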
Before we can train our model, we have to make sure we enabled the external scripts feature. This feature allows us to write R or Python code directly inside a SQL Server query.
EXEC sp_configure 'external scripts enabled', 1
RECONFIGURE WITH OVERRIDE
GO
After the feature is enabled we can use R or Python to train the model and store it inside the model table we created earlier. The code below does just that using the R programming language.
DECLARE @model VARBINARY(MAX)

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        automobiles.linmod <- rxLinMod(price ~ wheel_base + length + width + height +
                                               curb_weight + engine_size + horsepower,
                                       data = automobiles)
        model <- rxSerializeModel(automobiles.linmod, realtimeScoringOnly = FALSE)',
    @input_data_1 = N'SELECT * FROM Automobile_Train',
    @input_data_1_name = N'automobiles',
    @params = N'@model varbinary(max) OUTPUT',
    @model = @model OUTPUT

INSERT INTO models (model_name, model_version, model_object)
VALUES ('automobiles.linmod', 'v1', @model)
So what’s going on with the code above?
First of all, we declare the @model variable, which holds the model until we insert it into the models table we created earlier.
Using the R language inside the sp_execute_external_script procedure, we train a machine learning model using linear regression, since we want to predict the price of a car based on various numerical features. In this case I use the rxLinMod algorithm, which is part of the RevoScaleR machine learning library. It isn't necessary to use a RevoScaleR algorithm; we could have used the built-in R lm() function instead. However, some prediction functions integrated inside SQL Server only support RevoScaleR algorithms, and for that reason I went for the RevoScaleR version of a linear model.
As input for the model we define a T-SQL query in the @input_data_1 parameter. We then give that dataset a name through @input_data_1_name so we can reference it inside the R code as the automobiles data frame.
Finally, we use the rxSerializeModel function to serialize the model. We need to perform this step to save the model in a raw format that we can store inside a SQL Server table. We declare the @model parameter, map it to the model variable inside the R code and return it as an output of the sp_execute_external_script procedure. Again, we could have used the built-in R serialize() function as well, but just like the choice of a RevoScaleR algorithm, some in-database prediction methods expect a model that has been serialized using the RevoScaleR function.
The last bit of code grabs the @model variable and inserts it into the models table we created together with a model name and version.
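For comparison, here is a sketch of what the same training step would look like with base R's lm() and serialize(). A model stored this way can only be unserialized and scored through sp_execute_external_script, not through the real-time scoring methods discussed later:

```r
# Base-R alternative: train with lm() instead of rxLinMod
automobiles.lm <- lm(price ~ wheel_base + length + width + height +
                             curb_weight + engine_size + horsepower,
                     data = automobiles)

# serialize() with connection = NULL returns a raw vector,
# which can be stored in a VARBINARY(MAX) column just like before
model <- serialize(automobiles.lm, connection = NULL)
```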
Now that we trained our machine learning model and stored it inside the models table we are ready to look at the various methods available inside SQL Server 2016 and 2017 to perform in-database predictions!
Predicting using sp_execute_external_script
For the first in-database prediction method we are going to use the same procedure we used for training our machine learning model in the previous step: sp_execute_external_script. This procedure gives us a lot of flexibility and is not dependent on RevoScaleR algorithms or functions. The code below grabs the serialized model from the models table and performs a prediction on the test data we stored inside the Automobile_test table.
DECLARE @lin_model_raw VARBINARY(MAX) =
    (SELECT model_object FROM models WHERE model_name = 'automobiles.linmod')

EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        model = rxUnserializeModel(lin_model);
        automobiles_prediction = rxPredict(model, automobiles_test)
        automobiles_pred_results <- cbind(automobiles_test, automobiles_prediction)',
    @input_data_1 = N'
        SELECT wheel_base, length, width, height, curb_weight,
               engine_size, horsepower, price
        FROM Automobile_Test',
    @input_data_1_name = N'automobiles_test',
    @output_data_1_name = N'automobiles_pred_results',
    @params = N'@lin_model varbinary(max)',
    @lin_model = @lin_model_raw
WITH RESULT SETS ((
    "wheel_base" FLOAT, "length" FLOAT, "width" FLOAT, "height" FLOAT,
    "curb_weight" FLOAT, "engine_size" FLOAT, "horsepower" FLOAT,
    "price" FLOAT, "predicted_price" FLOAT))
In the code above we select some of the numerical columns from the Automobile_test table and use the rxPredict function to score them against the linear model we stored earlier. I then use the R function cbind to add the predicted labels to the input dataset and return the merged dataset as output.
The results of the code above are displayed in the figure below:
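Because sp_execute_external_script returns a regular result set, you can also persist the predictions with INSERT ... EXEC and evaluate the model in plain T-SQL. A sketch of that idea (the automobile_predictions table name is my own choice):

```sql
-- Capture the scored rows so we can evaluate the model in plain T-SQL
CREATE TABLE automobile_predictions (
    wheel_base FLOAT, length FLOAT, width FLOAT, height FLOAT,
    curb_weight FLOAT, engine_size FLOAT, horsepower FLOAT,
    price FLOAT, predicted_price FLOAT
);

DECLARE @lin_model_raw VARBINARY(MAX) =
    (SELECT model_object FROM models WHERE model_name = 'automobiles.linmod');

INSERT INTO automobile_predictions
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        model = rxUnserializeModel(lin_model);
        automobiles_prediction = rxPredict(model, automobiles_test)
        automobiles_pred_results <- cbind(automobiles_test, automobiles_prediction)',
    @input_data_1 = N'
        SELECT wheel_base, length, width, height, curb_weight,
               engine_size, horsepower, price
        FROM Automobile_Test',
    @input_data_1_name = N'automobiles_test',
    @output_data_1_name = N'automobiles_pred_results',
    @params = N'@lin_model varbinary(max)',
    @lin_model = @lin_model_raw;

-- Root mean squared error of the predictions
SELECT SQRT(AVG(SQUARE(price - predicted_price))) AS rmse
FROM automobile_predictions;
```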
Using sp_execute_external_script has some clear advantages. We can use both R and Python, and we have access to native R or Python functions, like the cbind function shown in the example code. It is not limited to RevoScaleR algorithms, and it even allows you to use imported libraries like ggplot2 or dplyr if you want to. It does require you to enable the in-database R Services feature during SQL Server 2016 setup, or the in-database Machine Learning Services feature for SQL Server 2017.
Predicting using sp_rxPredict
sp_rxPredict is a CLR-based stored procedure that you can load inside a database; it is optimized for very fast, real-time predictions.
Before we can use sp_rxPredict we have to enable CLR integration inside SQL Server using the code below.
sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO
After CLR integration is enabled, we need to take some additional steps to load the code of the sp_rxPredict library into our database. All the steps required for this are documented in this MSDN article: https://docs.microsoft.com/en-us/sql/advanced-analytics/r/how-to-do-realtime-scoring#realtime-scoring-with-sprxpredict
Some important things to keep in mind when you intend to use sp_rxPredict are that you will need to enable CLR integration and mark your database as TRUSTWORTHY. Both of these configuration changes can have an impact on the security of your system.
Unlike sp_execute_external_script, the sp_rxPredict procedure requires a pre-trained model to be available inside a table within SQL Server. Even though we used the same serialized model for the sp_execute_external_script method, that method does not require a pre-trained model; as a matter of fact, we could have trained and tested the model in the same R code if we wanted to.
The code below shows you an example of using sp_rxPredict using the model we trained earlier and stored inside our model table.
DECLARE @lin_model_raw VARBINARY(MAX) =
    (SELECT model_object FROM models WHERE model_name = 'automobiles.linmod')

EXEC sp_rxPredict
    @model = @lin_model_raw,
    @inputData = N'
        SELECT wheel_base, length, width, height, curb_weight,
               engine_size, horsepower, price
        FROM Automobile_Test'
As you can see, there is a lot less code required than in the sp_execute_external_script method. We basically tell the sp_rxPredict procedure where our serialized model is and supply the procedure with the data we would like to score.
The results of sp_rxPredict are a bit different than the previous method, though. Since we can't use the full power of R or Python inside the sp_rxPredict procedure, we cannot add the predictions to the input dataset like we did in the sp_execute_external_script example; only the predicted labels are returned. For the use-cases where you want to use the sp_rxPredict method, mainly real-time predictions, this would probably be fine, since you could fire off the method using a SQL trigger and store the predicted label directly with the inserted data.
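A rough sketch of that trigger idea is shown below. The staging and scored table names are my own, and sp_rxPredict must already be registered in the database. Note that the query passed to sp_rxPredict cannot see the inserted pseudo-table, so this sketch scores the staging table itself:

```sql
-- Hypothetical: a staging table for new rows and a table for their scores
CREATE TABLE dbo.Automobile_scored (predicted_price FLOAT);
GO

CREATE TRIGGER trg_score_automobile
ON dbo.Automobile_incoming
AFTER INSERT
AS
BEGIN
    DECLARE @lin_model_raw VARBINARY(MAX) =
        (SELECT model_object FROM models WHERE model_name = 'automobiles.linmod');

    -- Capture the predicted labels returned by sp_rxPredict
    INSERT INTO dbo.Automobile_scored (predicted_price)
    EXEC sp_rxPredict
        @model = @lin_model_raw,
        @inputData = N'
            SELECT wheel_base, length, width, height, curb_weight,
                   engine_size, horsepower, price
            FROM dbo.Automobile_incoming';
END
```

In a production setup you would likely filter the staging table down to the newly inserted rows, for instance by tracking a key or timestamp column.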
The main advantage of sp_rxPredict is that it is blazingly fast; in my test it was on average 20 times faster than sp_execute_external_script. It does come with quite a few disadvantages though, like the fact that you need to enable CLR integration and set your database to TRUSTWORTHY. Next to that, you are also limited in algorithm choice, since only RevoScaleR and MicrosoftML algorithms can be used with it. Just like sp_execute_external_script, it also requires you to enable the in-database R Services feature during SQL Server 2016 setup or the in-database Machine Learning Services feature for SQL Server 2017.
Predicting using PREDICT
SQL Server 2017 introduced a new T-SQL function called PREDICT. PREDICT gives you an easy-to-use method of integrating predictions directly into T-SQL syntax, and as such removes the need to write R or Python code as we did with sp_execute_external_script, or to deal with a limited return dataset as with sp_rxPredict. The big drawback is that it is only available in SQL Server 2017 and only works with RevoScaleR algorithms. A big compensation for that fact is that, unlike the other methods, it works without the in-database Machine Learning feature enabled!
Just like sp_rxPredict, the PREDICT method requires a pre-trained model stored using rxSerializeModel. When we have the model stored, we can very easily perform in-database predictions using T-SQL:
DECLARE @lin_model_raw VARBINARY(MAX) =
    (SELECT model_object FROM models WHERE model_name = 'automobiles.linmod')

SELECT a.*, p.*
FROM PREDICT(MODEL = @lin_model_raw, DATA = dbo.Automobile_test AS a)
WITH ("price_Pred" FLOAT) AS p;
Again, we start by grabbing the serialized model from the models table and placing it inside a variable. We then use the PREDICT function inside a T-SQL statement, pointing it at the model variable and supplying the name of the table that holds our test data. Using the regular SELECT statement we join the predicted label to the test dataset and return the original contents of the test table together with the predicted label, as you can see in the image below.
For people who are not very familiar with R or Python, PREDICT seems to be the perfect method of performing predictions inside the familiar T-SQL syntax!
Pros and Cons summary
Now that we looked at the various methods available in SQL Server 2016 and 2017 and went through a couple of examples, let’s summarize the pros and cons of each method.
sp_execute_external_script
+ Supports both R and Python
+ Very flexible, supports native R/Python code and does not depend on RevoScaleR or MicrosoftML algorithms
– Has the most overhead since multiple external processes are involved
– Requires in-database R services (2016) / Machine Learning (2017) feature enabled
sp_rxPredict
+ Fastest method available
– Only supports R
– Requires CLR integration and database set to TRUSTWORTHY
– Only supports RevoScaleR and MicrosoftML algorithms
(rxLinMod, rxLogit, rxBTrees, rxDTree, rxDForest, rxFastTrees, rxFastForest,
rxLogisticRegression, rxOneClassSvm, rxNeuralNet, rxFastLinear)
– Expects a pre-trained model serialized using rxSerializeModel
– Only returns predicted values
– Requires in-database R services (2016) / Machine Learning (2017) feature enabled
PREDICT
+ Supports both R and Python
+ Very fast
+ Easy integration into the familiar T-SQL syntax
+ Does not require in-database R services (2016) / Machine Learning (2017) feature enabled
– Only available in SQL Server 2017
– Expects a pre-trained model serialized using rxSerializeModel
– Only supports RevoScaleR algorithms
(rxLinMod, rxLogit, rxBTrees, rxDTree, rxDForest)
Summary
In this article we looked at the various methods available in both SQL Server 2016 and 2017 to perform in-database predictions. For every method we looked at a demo that used a serialized model we created in this article and we summarized the pros and cons of each method.