In previous articles we have shown where and how SQL Server R Services can be used, how to plot and visualize the data, and what the benefits are. This article tackles a topic that many in the SQL community will find even more interesting: how a database administrator can profit from using R Services on the SQL Server they administer on a daily, weekly, or monthly basis. Some approaches might be unusual, some quite obvious, but it is the combination of these approaches that really brings the benefit.
While my previous two articles focused on the methodology for analyzing sales data, the focus of this article will not be so much on methodology or algorithms, but on tasks that a DBA may face. I will not go into a discussion of what the DBA tasks are, because they differ from DBA to DBA, and work at a particular company might include or exclude particular tasks. Rather, I will focus on tasks that every enterprise needs to monitor, to have in operation, and to know when and what action to take when particular events are triggered or alarms are raised.
So the typical and much-needed tasks that require attention are:
- database growth and available free disk space
- queries that take a large amount of resources to execute
- statistics that can be gathered around these items.
Therefore, we will focus on these tasks, bringing in R and statistics to give the DBA additional insight. Before exploring this, there are a couple of ground rules to point out: what a baseline is, and how to create and measure one. Thresholds are also important, as they trigger particular events: automatic ticket creation, escalation to different people, stopping processes, or any other kind of business rule. Besides the baseline, there are always thresholds for benchmarks and monitoring. These approaches are very important, as they usually provide very good ways of handling exceptions or unwanted events in your system.
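As a simple illustration of the idea (the numbers and the two-standard-deviation rule here are purely hypothetical), a threshold can be derived from baseline measurements in R like this:

# A sketch only: the baseline values and the "2 standard deviations" rule are hypothetical
baseline <- c(5120, 5200, 5180, 5260, 5230)       # e.g. UsedSpaceKB measured during normal operation

upper_threshold <- mean(baseline) + 2 * sd(baseline)

new_measurement <- 5900
new_measurement > upper_threshold                  # TRUE -> raise a flag, create a ticket, escalate ...

Anything above the threshold is then treated as an exception, and the business rule attached to it fires.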
Having said that, in this article I will focus on the following cases where we can use R Services with SQL Server:
- Disk space usage
- Clustering queries
- Plotting a heatmap of executed queries based on query statistics
Let me know your feedback, so we can also cover the following three topics in the next article: parameter sniffing, stale statistics, and heap tables.
All of these cases will use the R language for statistical analysis, data visualization, and predictive analysis.
Disk space usage
Authors, database administrators, developers, MVPs, and others have written about storage, volumes of data, and monitoring capacities over time. Systems have requirements and businesses have needs in order for applications to work normally. I am sure we have all found ourselves in a "DBA by accident" situation, when disk usage went sky high during the night and nobody knew why. This is why it is important to have a baseline captured. A baseline is like a control group that tells you whether the situation in your test group is unusual or normal. In IT, the baseline is normally the reference state, measured and recorded at a planned time or in a planned environment. With it, the DBA can spot unplanned and unwanted activities or unusual system behavior, especially when comparing against the baseline. In this manner, raising a red flag results in fewer false positives.
In addition, we always come across disk size problems. Based on this, we will demo database growth, store the measurements, and then run predictions against the collected data so that, in the end, we can predict when the DBA can expect disk space problems.
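To make the "when" part concrete, here is a minimal sketch of the idea in plain R, assuming usage measurements have been collected over time into a hypothetical data frame; the model we actually build later in this article predicts space from operations rather than from time:

# Hypothetical measurements: day of measurement and UsedSpaceKB collected into a data frame
usage <- data.frame(
   day         = 1:10
  ,UsedSpaceKB = c(1200, 1350, 1490, 1660, 1800, 1940, 2120, 2250, 2400, 2560)
)

fit <- lm(UsedSpaceKB ~ day, data = usage)        # simple linear growth trend
capacity_kb <- 8192                               # size of the fixed-size demo database

# capacity = intercept + slope * day  ->  solve for the day the capacity is reached
days_to_full <- (capacity_kb - coef(fit)[1]) / coef(fit)[2]
ceiling(days_to_full)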
To illustrate this scenario, I will create a small database of 8 MB with no possibility of growth, and two tables: one will serve as the baseline (DataPack_Info_SMALL), and one will serve as a so-called everyday log, where everything is stored for unexpected cases or undesired behavior (DataPack_Info_LARGE).
First, create a database:
USE [master];
GO

CREATE DATABASE FixSizeDB
 CONTAINMENT = NONE
 ON PRIMARY
( NAME = N'FixSizeDB',
  FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FixSizeDB.mdf',
  SIZE = 8192KB, FILEGROWTH = 0 )
 LOG ON
( NAME = N'FixSizeDB_log',
  FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FixSizeDB_log.ldf',
  SIZE = 8192KB, FILEGROWTH = 0 )
GO
ALTER DATABASE [FixSizeDB] SET COMPATIBILITY_LEVEL = 130
GO
ALTER DATABASE [FixSizeDB] SET RECOVERY SIMPLE
GO
We will use the following table to generate inserts and, later, deletes:
CREATE TABLE DataPack
  (
   DataPackID BIGINT IDENTITY NOT NULL
  ,col1 VARCHAR(1000) NOT NULL
  ,col2 VARCHAR(1000) NOT NULL
  )
Populating the DataPack table will be done with the following simple WHILE loop:
DECLARE @i INT = 1;
BEGIN TRAN
  WHILE @i <= 1000
  BEGIN
    INSERT dbo.DataPack(col1, col2)
      SELECT
         REPLICATE('A',200)
        ,REPLICATE('B',300);
    SET @i = @i + 1;
  END
COMMIT;
GO
Instead of using the sp_spaceused procedure, we will capture disk space changes with the following query:
SELECT
   t.NAME AS TableName
  ,s.Name AS SchemaName
  ,p.rows AS RowCounts
  ,SUM(a.total_pages) * 8 AS TotalSpaceKB
  ,SUM(a.used_pages) * 8 AS UsedSpaceKB
  ,(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM sys.tables AS t
INNER JOIN sys.indexes AS i
  ON t.object_id = i.object_id
INNER JOIN sys.partitions AS p
  ON i.object_id = p.object_id
  AND i.index_id = p.index_id
INNER JOIN sys.allocation_units AS a
  ON p.partition_id = a.container_id
LEFT OUTER JOIN sys.schemas AS s
  ON t.schema_id = s.schema_id
WHERE
  t.NAME NOT LIKE 'dt%'
  AND t.is_ms_shipped = 0
  AND i.object_id > 255
  AND t.Name = 'DataPack'
GROUP BY t.Name, s.Name, p.Rows
The Log table will be filled along with the DataPack table in order to gather immediate changes in disk space:
DECLARE @nof_steps INT = 0

WHILE @nof_steps < 15
BEGIN
  BEGIN TRAN
    -- insert some data
    DECLARE @i INT = 1;
    WHILE @i <= 1000 -- step is 100 rows
    BEGIN
      INSERT dbo.DataPack(col1, col2)
        SELECT
           REPLICATE('A',FLOOR(RAND()*200))
          ,REPLICATE('B',FLOOR(RAND()*300));
      SET @i = @i + 1;
    END

    -- run statistics on table
    INSERT INTO dbo.DataPack_Info_SMALL
    SELECT
       t.NAME AS TableName
      ,s.Name AS SchemaName
      ,p.rows AS RowCounts
      ,SUM(a.total_pages) * 8 AS TotalSpaceKB
      ,SUM(a.used_pages) * 8 AS UsedSpaceKB
      ,(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
      ,GETDATE() AS TimeMeasure
    FROM sys.tables AS t
    INNER JOIN sys.indexes AS i
      ON t.object_id = i.object_id
    INNER JOIN sys.partitions AS p
      ON i.object_id = p.object_id
      AND i.index_id = p.index_id
    INNER JOIN sys.allocation_units AS a
      ON p.partition_id = a.container_id
    LEFT OUTER JOIN sys.schemas AS s
      ON t.schema_id = s.schema_id
    WHERE
      t.NAME NOT LIKE 'dt%'
      AND t.is_ms_shipped = 0
      AND t.name = 'DataPack'
      AND i.object_id > 255
    GROUP BY t.Name, s.Name, p.Rows

    WAITFOR DELAY '00:00:02'
  COMMIT;
END
This will serve as our baseline when comparing the results. When querying the DataPack_Info_SMALL table, the results are the following:
We can use R to check how the RowCounts and UsedSpaceKB columns correlate. To do this, we will use the external procedure sp_execute_external_script:
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(Hmisc)
                 mydata <- InputDataSet
                 all_sub <- mydata[2:3]
                 c <- cor(all_sub, use="complete.obs", method="pearson")
                 t <- rcorr(as.matrix(all_sub), type="pearson")
                 c <- data.frame(c)
                 OutputDataSet <- c'

DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT
                     TableName
                    ,RowCounts
                    ,UsedSpaceKB
                    ,TimeMeasure
                   FROM DataPack_Info_SMALL'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
     RowCounts VARCHAR(100)
    ,UsedSpaceKB VARCHAR(100)
    ));
GO
As a result, we get a strong, positive correlation between the RowCounts and UsedSpaceKB columns.
This is easily interpreted: when the value of RowCounts goes up, the value of UsedSpaceKB goes up as well. This is the only logical explanation; it would be rather strange to see a negative correlation here. Now we will simulate random deletes and inserts and observe similar behavior with the following code:
DECLARE @nof_steps INT = 0

WHILE @nof_steps < 15
BEGIN
  BEGIN TRAN
    -- insert some data
    DECLARE @i INT = 1;
    DECLARE @insertedRows INT = 0;
    DECLARE @deletedRows INT = 0;
    DECLARE @Rand DECIMAL(10,2) = RAND()*10

    IF @Rand < 5
    BEGIN
      WHILE @i <= 1000 -- step is 100 rows
      BEGIN
        INSERT dbo.DataPack(col1, col2)
          SELECT
             REPLICATE('A',FLOOR(RAND()*200)) -- pages are filling up differently
            ,REPLICATE('B',FLOOR(RAND()*300));
        SET @i = @i + 1;
      END
      SET @insertedRows = 1000
    END

    IF @Rand >= 5
    BEGIN
      SET @deletedRows = (SELECT COUNT(*) FROM dbo.DataPack WHERE DataPackID % 3 = 0)
      DELETE FROM dbo.DataPack
      WHERE DataPackID % 3 = 0 OR DataPackID % 5 = 0
    END

    -- run statistics on table
    INSERT INTO dbo.DataPack_Info_LARGE
    SELECT
       t.NAME AS TableName
      ,s.Name AS SchemaName
      ,p.rows AS RowCounts
      ,SUM(a.total_pages) * 8 AS TotalSpaceKB
      ,SUM(a.used_pages) * 8 AS UsedSpaceKB
      ,(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
      ,GETDATE() AS TimeMeasure
      ,CASE WHEN @Rand < 5  THEN 'Insert'
            WHEN @Rand >= 5 THEN 'Delete'
            ELSE 'meeeh' END AS Operation
      ,CASE WHEN @Rand < 5  THEN @insertedRows
            WHEN @Rand >= 5 THEN @deletedRows
            ELSE 0 END AS NofRowsOperation
    FROM sys.tables AS t
    INNER JOIN sys.indexes AS i
      ON t.object_id = i.object_id
    INNER JOIN sys.partitions AS p
      ON i.object_id = p.object_id
      AND i.index_id = p.index_id
    INNER JOIN sys.allocation_units AS a
      ON p.partition_id = a.container_id
    LEFT OUTER JOIN sys.schemas AS s
      ON t.schema_id = s.schema_id
    WHERE
      t.NAME NOT LIKE 'dt%'
      AND t.is_ms_shipped = 0
      AND t.name = 'DataPack'
      AND i.object_id > 255
    GROUP BY t.Name, s.Name, p.Rows

    WAITFOR DELAY '00:00:01'
  COMMIT;
END
We have added a DELETE statement, as well as the number of affected rows, so that the demo is not so straightforward.
By calculating the correlation coefficient again:
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(Hmisc)
                 mydata <- InputDataSet
                 all_sub <- mydata[2:3]
                 c <- cor(all_sub, use="complete.obs", method="pearson")
                 c <- data.frame(c)
                 OutputDataSet <- c'

DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT
                     TableName
                    ,RowCounts
                    ,UsedSpaceKB
                    ,TimeMeasure
                   FROM DataPack_Info_LARGE'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
     RowCounts VARCHAR(100)
    ,UsedSpaceKB VARCHAR(100)
    ));
GO
We can see that we still get a very strong, positive correlation:
We will now compare our LARGE test with the baseline by computing the same correlation coefficient on different datasets: first on our baseline (DataPack_Info_SMALL) and then on our test table (DataPack_Info_LARGE):
DECLARE @RScript1 nvarchar(max)
SET @RScript1 = N'library(Hmisc)
                  mydata <- InputDataSet
                  all_sub <- mydata[4:5]
                  c <- cor(all_sub, use="complete.obs", method="pearson")
                  c <- data.frame(c)
                  OutputDataSet <- c'

DECLARE @SQLScript1 nvarchar(max)
SET @SQLScript1 = N'SELECT
                      TableName
                     ,RowCounts
                     ,TimeMeasure
                     ,UsedSpaceKB
                     ,UnusedSpaceKB
                    FROM DataPack_Info_SMALL
                    WHERE RowCounts <> 0'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript1
    ,@input_data_1 = @SQLScript1
WITH RESULT SETS ((
     RowCounts VARCHAR(100)
    ,UsedSpaceKB VARCHAR(100)
    ));

DECLARE @RScript2 nvarchar(max)
SET @RScript2 = N'library(Hmisc)
                  mydata <- InputDataSet
                  all_sub <- mydata[4:5]
                  c <- cor(all_sub, use="complete.obs", method="pearson")
                  c <- data.frame(c)
                  OutputDataSet <- c'

DECLARE @SQLScript2 nvarchar(max)
SET @SQLScript2 = N'SELECT
                      TableName
                     ,RowCounts
                     ,TimeMeasure
                     ,UsedSpaceKB
                     ,UnusedSpaceKB
                    FROM DataPack_Info_LARGE
                    WHERE NofRowsOperation <> 0
                      AND RowCounts <> 0'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript2
    ,@input_data_1 = @SQLScript2
WITH RESULT SETS ((
     RowCounts VARCHAR(100)
    ,UsedSpaceKB VARCHAR(100)
    ));
GO
Here are the results:
The results are very interesting. The baseline shows practically no correlation between UsedSpaceKB and UnusedSpaceKB (-0.049), whereas our test shows a more than twice as strong negative correlation (-0.109). A couple of words on this: it shows that UsedSpaceKB is negatively correlated with UnusedSpaceKB, which is still far too weak to draw any concrete conclusions, but it does show how a slight change in workload can make a difference in a simple correlation.
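If you want to check whether such weak coefficients mean anything at all, the rcorr() function from the Hmisc package (already loaded in the scripts above) also returns p-values. A minimal sketch, assuming the same input data frame (mydata) as in the R scripts above:

# Assumes mydata <- InputDataSet, as in the scripts above
library(Hmisc)

r <- rcorr(as.matrix(mydata[, c("UsedSpaceKB", "UnusedSpaceKB")]), type = "pearson")
r$r   # the correlation coefficients
r$P   # the p-values; a large p-value means the weak correlation may well be noise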
You can gather disk space usage information with T-SQL, with PowerShell, by implementing a .NET assembly, by creating a SQL Server Agent job, or in any other way. The important part, and the biggest advantage of using R on the collected data, is that you are no longer only monitoring and reacting to past data; you are also able to predict what will happen.
Let's go a step further and work with the following query and dataset taken from the sample we created:
SELECT
   TableName
  ,Operation
  ,NofRowsOperation
  ,UsedSpaceKB
  ,UnusedSpaceKB
FROM dbo.DataPack_Info_LARGE
We will predict the size of UsedSpaceKB based on historical data. Our input will be TableName, Operation, and NofRowsOperation for a given number of rows to predict on. I will be using a generalized linear model (the GLM algorithm) for predicting used disk space.
Before you all jump in and say this is absurd and cannot be done because of DBCC caching, page breaks, indexes, stale statistics, and many other parameters, I would point out that all of this information can be added to the algorithm to make the prediction even better. Since my queries are very simple INSERT and DELETE statements, you should also know what kind of queries you are predicting on. In addition, such an approach can also be useful for code testing, unit testing, stress testing before deployment, and so on.
With the following code we can start creating predictions:
-- GLM prediction
DECLARE @SQL_input AS NVARCHAR(MAX)
SET @SQL_input = N'SELECT
                     TableName
                    ,CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
                    ,NofRowsOperation
                    ,UsedSpaceKB
                    ,UnusedSpaceKB
                   FROM dbo.DataPack_Info_LARGE
                   WHERE NofRowsOperation <> 0';

DECLARE @R_code AS NVARCHAR(MAX)
SET @R_code = N'library(RevoScaleR)
                library(dplyr)
                DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB,
                                data = DataPack_info, family = Gamma)
                df_predict <- data.frame(TableName=("DataPack"), Operation=(1),
                                         NofRowsOperation=(451), UnusedSpaceKB=(20))
                predictions <- rxPredict(modelObject = DPLogR, data = df_predict, outData = NULL,
                                         predVarNames = "UsedSpaceKB", type = "response",
                                         checkFactorLevels = FALSE);
                OutputDataSet <- predictions'

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @R_code
    ,@input_data_1 = @SQL_input
    ,@input_data_1_name = N'DataPack_info'
WITH RESULT SETS ((
     UsedSpaceKB_predict INT
    ));
GO
With this code, I predict the size of UsedSpaceKB based on the following data:
As presented in R, the code is:
df_predict <- data.frame(TableName=("DataPack"), Operation=(1), NofRowsOperation=(451), UnusedSpaceKB=(20))
Of course, there are a couple of things to clear up. The R code with sp_execute_external_script would work much better as a stored procedure with input parameters for the TableName, Operation, NofRowsOperation, and UnusedSpaceKB columns. Furthermore, to avoid unnecessary computation time for model building, it is usual practice to store a serialized model in a SQL table and simply deserialize it when running predictions. Lastly, since this was just a demo, make sure the numbers used in the predictions make sense; as seen in our example, UsedSpaceKB would be much better predicted as an absolute, per-operation value rather than a cumulative one, with the cumulative value calculated only afterwards.
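As a rough sketch of that serialization idea (the exact plumbing, such as an OUTPUT parameter or a hypothetical dbo.R_Models table, is up to you), the R side could look like this:

# DPLogR is the model trained in the script above; the storage mechanism is hypothetical
model_raw <- as.raw(serialize(DPLogR, connection = NULL))   # raw vector, maps to VARBINARY(MAX)

# ... write model_raw into a VARBINARY(MAX) column, for example through an
#     OUTPUT parameter of sp_execute_external_script ...

# at prediction time, read the VARBINARY value back and rebuild the model:
DPLogR_restored <- unserialize(model_raw)

This way the model is trained once, and each prediction call only deserializes and scores.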
To sum up this rather long demo, let's create a procedure and run a couple of predictions to get a feeling for how this performs. The stored procedure is:
CREATE PROCEDURE Predict_UsedSpace
(
  @TableName NVARCHAR(100)
 ,@Operation CHAR(1) -- 1 = Insert; 0 = Delete
 ,@NofRowsOperation NVARCHAR(10)
 ,@UnusedSpaceKB NVARCHAR(10)
)
AS
DECLARE @SQL_input AS NVARCHAR(MAX)
SET @SQL_input = N'SELECT
                     TableName
                    ,CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
                    ,NofRowsOperation
                    ,UsedSpaceKB
                    ,UnusedSpaceKB
                   FROM dbo.DataPack_Info_LARGE
                   WHERE NofRowsOperation <> 0';

DECLARE @R_code AS NVARCHAR(MAX)
SET @R_code = N'library(RevoScaleR)
                DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB,
                                data = DataPack_info, family = Gamma)
                df_predict <- data.frame(TableName=("'+@TableName+'"), Operation=('+@Operation+'),
                                         NofRowsOperation=('+@NofRowsOperation+'),
                                         UnusedSpaceKB=('+@UnusedSpaceKB+'))
                predictions <- rxPredict(modelObject = DPLogR, data = df_predict, outData = NULL,
                                         predVarNames = "UsedSpaceKB", type = "response",
                                         checkFactorLevels = FALSE);
                OutputDataSet <- predictions'

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @R_code
    ,@input_data_1 = @SQL_input
    ,@input_data_1_name = N'DataPack_info'
WITH RESULT SETS ((
     UsedSpaceKB_predict INT
    ));
GO
Now running the procedure two times in a row:
EXECUTE Predict_UsedSpace
    @TableName = 'DataPack'
   ,@Operation = 1
   ,@NofRowsOperation = 120
   ,@UnusedSpaceKB = 2;
GO

EXECUTE Predict_UsedSpace
    @TableName = 'DataPack'
   ,@Operation = 1
   ,@NofRowsOperation = 500
   ,@UnusedSpaceKB = 12;
GO
We get these results:
Both predictions of used disk space are based on our demo data. Of course, some baseline statistics could also be included to make the predictions even better. And with every model, we also need to test how good the predictions are.
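One minimal way to do that, assuming the same input data frame (DataPack_info) used in the scripts above, is to hold out part of the data, predict on it, and compare the predicted against the actual values:

# Assumes DataPack_info is the data frame passed in via @input_data_1_name, as above
library(RevoScaleR)

set.seed(2906)
idx   <- sample(nrow(DataPack_info), size = floor(0.7 * nrow(DataPack_info)))
train <- DataPack_info[idx, ]
test  <- DataPack_info[-idx, ]

m    <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB,
              data = train, family = Gamma)
pred <- rxPredict(m, data = test, predVarNames = "PredictedKB", type = "response")

sqrt(mean((test$UsedSpaceKB - pred$PredictedKB)^2))   # RMSE: lower means a better fit on unseen data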
Clustering queries
In previous articles, we have talked about clustering sales data. I have decided to include clustering here because the visualization technique is relatively easy to read and makes it quick to find the queries that might take a larger amount of resources. Clustering the executed queries helps detect where problems are piling up.
For this demo, I will use the WideWorldImportersDW database and create some random reads and IO. First, we will clean the buffers and cache:
-- clean buffer
DECLARE @dbid INTEGER
SELECT @dbid = [dbid]
FROM master..sysdatabases
WHERE name = 'WideWorldImportersDW'

DBCC FLUSHPROCINDB (@dbid);
GO
and then create some random load:
USE WideWorldImportersDW;
GO

-- Arbitrary query:
SELECT
  cu.[Customer Key] AS CustomerKey, cu.Customer,
  ci.[City Key] AS CityKey, ci.City,
  ci.[State Province] AS StateProvince, ci.[Sales Territory] AS SalesTeritory,
  d.Date, d.[Calendar Month Label] AS CalendarMonth, d.[Calendar Year] AS CalendarYear,
  s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product, s.Color,
  e.[Employee Key] AS EmployeeKey, e.Employee,
  f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS cu
  ON f.[Customer Key] = cu.[Customer Key]
INNER JOIN Dimension.City AS ci
  ON f.[City Key] = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s
  ON f.[Stock Item Key] = s.[Stock Item Key]
INNER JOIN Dimension.Employee AS e
  ON f.[Salesperson Key] = e.[Employee Key]
INNER JOIN Dimension.Date AS d
  ON f.[Delivery Date Key] = d.Date;
GO 3

-- FactSales query
SELECT * FROM Fact.Sale
GO 4

-- Person query
SELECT * FROM [Dimension].[Customer]
WHERE [Buying Group] <> 'Tailspin Toys'
   OR [WWI Customer ID] > 500
ORDER BY [Customer],[Bill To Customer]
GO 4

-- Purchase query
SELECT * FROM [Fact].[Order] AS o
INNER JOIN [Fact].[Purchase] AS p
  ON o.[Order Key] = p.[WWI Purchase Order ID]
GO 3
And we will collect query statistics for this purpose:
SELECT
  (total_logical_reads + total_logical_writes) AS total_logical_io
 ,(total_logical_reads / execution_count) AS avg_logical_reads
 ,(total_logical_writes / execution_count) AS avg_logical_writes
 ,(total_physical_reads / execution_count) AS avg_phys_reads
 ,SUBSTRING(st.text,
            (qs.statement_start_offset / 2) + 1,
            ((CASE qs.statement_end_offset
                WHEN -1 THEN DATALENGTH(st.text)
                ELSE qs.statement_end_offset
              END - qs.statement_start_offset) / 2) + 1) AS statement_text
 ,*
INTO query_stats_LOG_2
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY total_logical_io DESC
Once we have the data collected, we can use it in the R script:
CREATE PROCEDURE [dbo].[SP_Query_Stats_Cluster]
AS
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(cluster)
                 All <- InputDataSet
                 image_file <- tempfile()
                 jpeg(filename = image_file, width = 500, height = 500)
                 d <- dist(All, method = "euclidean")
                 fit <- hclust(d, method = "ward.D")
                 plot(fit, xlab = " ", ylab = NULL, main = NULL, sub = " ")
                 groups <- cutree(fit, k = 3)
                 rect.hclust(fit, k = 3, border = "DarkRed")
                 dev.off()
                 OutputDataSet <- data.frame(data = readBin(file(image_file, "rb"), what = raw(), n = 1e6))'

DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT
                     [total_logical_io]
                    ,[avg_logical_reads]
                    ,[avg_phys_reads]
                    ,execution_count
                    ,[total_physical_reads]
                    ,[total_elapsed_time]
                    ,total_dop
                    ,[text]
                    ,ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS ID
                   FROM query_stats_LOG_2
                   WHERE Number IS NULL';

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((Plot varbinary(max)));
GO
Executing the procedure generates a graph, which tells us that the queries group together as follows:
Group1: 5,20,8 and 33;
Group2: 12 and 25
Group3: all others
We could rename the queries beforehand and give them proper names, but when running a larger number of queries this might be somewhat time consuming. On my test run, it looks something like this:
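If you do want readable names on the dendrogram leaves, a small (hypothetical) tweak to the R script above is to use a truncated piece of the query text as row names before computing the distances:

# Hypothetical tweak: truncated query text as row names so the dendrogram leaves are readable
All <- InputDataSet
row.names(All) <- make.unique(substr(as.character(All$text), 1, 30))  # assumes the text column is in the input
num_only <- All[, sapply(All, is.numeric)]                            # keep only the numeric statistics
d   <- dist(num_only, method = "euclidean")
fit <- hclust(d, method = "ward.D")
plot(fit, xlab = " ", ylab = NULL, main = NULL, sub = " ")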
Now let's query the log table for the queries in Group 1 and Group 2:
SELECT *
FROM
 (
  SELECT
     [total_logical_io]
    ,[avg_logical_reads]
    ,[avg_phys_reads]
    ,execution_count
    ,[total_physical_reads]
    ,[total_elapsed_time]
    ,total_dop
    ,[text]
    ,ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS ID
  FROM query_stats_LOG_2
  WHERE Number IS NULL
 ) AS x
WHERE x.ID IN (5, 20, 8, 33, 12, 25)
From here you can further analyze where your problems are and which queries cause resource, stalling, or performance issues. Of course, you also have to be very careful about what kind of statistics and data you feed the clustering algorithm. This will certainly affect how the grouping is done, since all these statistics act as attributes and we group based on the similarities between them.
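One thing to watch out for is scale: the statistics have very different magnitudes (logical reads versus elapsed time versus degree of parallelism), so it is often worth standardizing them before computing distances. A minimal sketch of that preprocessing step, using the same All data frame as in the script above:

# Standardize the numeric attributes so no single statistic dominates the distances
All_num <- All[, sapply(All, is.numeric)]   # numeric attributes only
All_std <- scale(All_num)                   # zero mean, unit variance per column

d   <- dist(All_std, method = "euclidean")
fit <- hclust(d, method = "ward.D")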
Plotting a Heatmap of Executed Queries
Clustering helped us very quickly group queries that belong together based on selected attributes. With a heatmap, we get an overview of the relationships between those attributes. Since a heatmap is a great presentation of multiple correlations among several different observations (in our case, several queries), I have decided to add it to this article.
To start, we will prepare another demo:
USE AdventureWorks;
GO

-- Turn on the Query Store
ALTER DATABASE AdventureWorks SET QUERY_STORE = ON;
GO
We will take advantage of the Query Store and the information it collects and stores. We will execute the following queries to generate demo statistics in the Query Store.
-- run a bunch of queries :-)
SELECT * FROM AdventureWorks.[Production].[ScrapReason];
SELECT * FROM AdventureWorks.[HumanResources].[Shift];
SELECT * FROM AdventureWorks.[Production].[ProductCategory];
SELECT * FROM AdventureWorks.[Purchasing].[ShipMethod];
SELECT * FROM AdventureWorks.[Production].[ProductCostHistory];
SELECT * FROM AdventureWorks.[Production].[ProductDescription];
SELECT * FROM AdventureWorks.[Sales].[ShoppingCartItem];
SELECT * FROM AdventureWorks.[Production].[ProductDocument];
SELECT * FROM AdventureWorks.[dbo].[DatabaseLog];
SELECT * FROM AdventureWorks.[Production].[ProductInventory];
SELECT * FROM AdventureWorks.[Sales].[SpecialOffer];
SELECT * FROM AdventureWorks.[Production].[ProductListPriceHistory];
SELECT * FROM AdventureWorks.[Person].[Address];
SELECT * FROM AdventureWorks.[Sales].[SpecialOfferProduct];
SELECT * FROM AdventureWorks.[Production].[ProductModel];
SELECT * FROM AdventureWorks.[Person].[AddressType];
SELECT * FROM AdventureWorks.[Person].[StateProvince];
SELECT * FROM AdventureWorks.[Production].[ProductModelIllustration];
SELECT * FROM AdventureWorks.[Production].[ProductModelProductDescriptionCulture];
SELECT * FROM AdventureWorks.[Production].[BillOfMaterials];
SELECT * FROM AdventureWorks.[Sales].[Store];
SELECT * FROM AdventureWorks.[Production].[ProductPhoto];
SELECT * FROM AdventureWorks.[Production].[ProductProductPhoto];
SELECT * FROM AdventureWorks.[Production].[TransactionHistory];
SELECT * FROM AdventureWorks.[Production].[ProductReview];
SELECT * FROM AdventureWorks.[Person].[BusinessEntity];
SELECT * FROM AdventureWorks.[Production].[TransactionHistoryArchive];
SELECT * FROM AdventureWorks.[Production].[ProductSubcategory];
SELECT * FROM AdventureWorks.[Person].[BusinessEntityAddress];
SELECT * FROM AdventureWorks.[Purchasing].[ProductVendor];
SELECT * FROM AdventureWorks.[Person].[BusinessEntityContact];
SELECT * FROM AdventureWorks.[Production].[UnitMeasure];
SELECT * FROM AdventureWorks.[Purchasing].[Vendor];
SELECT * FROM AdventureWorks.[Person].[ContactType];
SELECT * FROM AdventureWorks.[Sales].[CountryRegionCurrency];
SELECT * FROM AdventureWorks.[Person].[CountryRegion];
SELECT * FROM AdventureWorks.[Production].[WorkOrder];
SELECT * FROM AdventureWorks.[Purchasing].[PurchaseOrderDetail];
SELECT * FROM AdventureWorks.[Sales].[CreditCard];
SELECT * FROM AdventureWorks.[Production].[Culture];
SELECT * FROM AdventureWorks.[Production].[WorkOrderRouting];
SELECT * FROM AdventureWorks.[Sales].[Currency];
SELECT * FROM AdventureWorks.[Purchasing].[PurchaseOrderHeader];
SELECT * FROM AdventureWorks.[Sales].[CurrencyRate];
SELECT * FROM AdventureWorks.[Sales].[Customer];
SELECT * FROM AdventureWorks.[HumanResources].[Department];
SELECT * FROM AdventureWorks.[Production].[Document];
SELECT * FROM AdventureWorks.[Sales].[SalesOrderDetail];
SELECT * FROM AdventureWorks.[Person].[EmailAddress];
SELECT * FROM AdventureWorks.[HumanResources].[Employee];
SELECT * FROM AdventureWorks.[Sales].[SalesOrderHeader];
SELECT * FROM AdventureWorks.[HumanResources].[EmployeeDepartmentHistory];
SELECT * FROM AdventureWorks.[HumanResources].[EmployeePayHistory];
SELECT * FROM AdventureWorks.[Sales].[SalesOrderHeaderSalesReason];
SELECT * FROM AdventureWorks.[Sales].[SalesPerson];
SELECT * FROM AdventureWorks.[Production].[Illustration];
SELECT * FROM AdventureWorks.[HumanResources].[JobCandidate];
SELECT * FROM AdventureWorks.[Production].[Location];
SELECT * FROM AdventureWorks.[Person].[Password];
SELECT * FROM AdventureWorks.[dbo].[Orders];
SELECT * FROM AdventureWorks.[Sales].[SalesPersonQuotaHistory];
SELECT * FROM AdventureWorks.[Person].[Person];
SELECT * FROM AdventureWorks.[Sales].[SalesReason];
SELECT * FROM AdventureWorks.[Sales].[SalesTaxRate];
SELECT * FROM AdventureWorks.[Sales].[PersonCreditCard];
SELECT * FROM AdventureWorks.[Person].[PersonPhone];
SELECT * FROM AdventureWorks.[Sales].[SalesTerritory];
SELECT * FROM AdventureWorks.[Person].[PhoneNumberType];
SELECT * FROM AdventureWorks.[Production].[Product];
SELECT * FROM AdventureWorks.[Sales].[SalesTerritoryHistory];
Information on these executions will be returned with the following query:
-- Collect the data from the Query Store
SELECT
   qsq.*
  ,query_sql_text
INTO QS_Query_stats_bck
FROM sys.query_store_query AS qsq
JOIN sys.query_store_query_text AS qsqt
  ON qsq.query_text_id = qsqt.query_text_id
WHERE Query_id >= 41
ORDER BY qsq.query_id
Once we have this, we can look at the statistics available in the Query Store with the following query:
-- for R table
SELECT
   LEFT(query_sql_text,70) AS Query_Name
  ,last_compile_batch_offset_start
  ,last_compile_batch_offset_end
  ,count_compiles
  ,avg_compile_duration
  ,avg_bind_duration
  ,avg_bind_cpu_time
  ,avg_optimize_duration
  ,avg_optimize_cpu_time
  ,avg_compile_memory_kb
FROM QS_Query_stats_bck
WHERE LEFT(query_sql_text,70) LIKE 'SELECT * FROM AdventureWorks.%'
ORDER BY Query_Name
As you can see, I have cleaned up the output by taking only the queries relating to AdventureWorks tables. In addition, I have chosen only the Query Store statistics concerning durations. With all this data, we simply run this R script:
DECLARE @SQL_heat NVARCHAR(MAX)
SET @SQL_heat = 'SELECT
                   LEFT(query_sql_text,70) AS Query_Name
                  ,last_compile_batch_offset_start
                  ,last_compile_batch_offset_end
                  ,count_compiles
                  ,avg_compile_duration
                  ,avg_bind_duration
                  ,avg_bind_cpu_time
                  ,avg_optimize_duration
                  ,avg_optimize_cpu_time
                  ,avg_compile_memory_kb
                 FROM QS_Query_stats_bck
                 WHERE LEFT(query_sql_text,70) LIKE ''SELECT * FROM AdventureWorks.%''
                 ORDER BY Query_Name';

DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'library(d3heatmap)
                 All <- InputDataSet
                 image_file <- tempfile()
                 jpeg(filename = image_file, width = 500, height = 500)
                 # sort data
                 All <- All[order(All$avg_compile_duration),]
                 # row names
                 row.names(All) <- All$Query_Name
                 All <- All[,2:10]
                 All_QS_matrix <- data.matrix(All)
                 heatmap(All_QS_matrix, Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
                 dev.off()
                 OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6))'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQL_heat
WITH RESULT SETS ((Plot VARBINARY(MAX)));
GO
And again, we get a nice presentation of the attributes (statistics from the Query Store) across all the specified queries. Now we can see which queries are similar to or different from one another, or with respect to a particular attribute. By ordering the values of a particular attribute, you can predefine the view and make it easier to read later.
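A hypothetical variation on the same script is to plot the correlation matrix of the statistics themselves, which makes the attribute-to-attribute relationships explicit:

# All_QS_matrix is the matrix built in the script above
corr_matrix <- cor(All_QS_matrix, use = "complete.obs", method = "pearson")
heatmap(corr_matrix, Rowv = NA, Colv = NA, col = heat.colors(256),
        scale = "none", margins = c(10, 10))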
And do not forget to turn off your Query Store if you no longer need it:
ALTER DATABASE [AdventureWorks] SET QUERY_STORE = OFF;
Conclusion
Using SQL Server R Services for any kind of DBA task – as we have shown here – is not always about hard-core statistics or predictive analytics; it can also be a simple statistical understanding of the connections and relations between queries, gathered statistics, indexes, and other attributes. But we have also seen that predicting events that are usually only monitored can be a huge advantage for a DBA and a very welcome feature for core systems.
With R integrated into SQL Server, such daily, weekly, or monthly tasks can be automated to an extent not possible before, and as such can give new insight to DBAs and to the people responsible for system maintenance.
Author: Tomaz Kastrun (tomaz.kastrun@gmail.com)
Twitter: @tomaz_tsql