December 7, 2015 at 10:41 am
Hi Grant, thank you for your eagle eyes! Well spotted: there were mistakes in all of these cases, namely the missing RowCount, the missing closing quote in read.csv(), and the requirement for the ggplot2 library.
Many thanks!
Nick
December 7, 2015 at 10:45 am
Hi Barnaby, thank you for putting it all together in one script. I think you have pretty much answered everyone's questions in one great post! 🙂 Do you think it is worth adding a line to explicitly close the ODBC connections?
Cheers,
Nick
December 7, 2015 at 11:11 am
nick.dale.burns (12/7/2015)
Hi Barnaby, thank you for putting it all together in one script. I think you have pretty much answered everyone's questions in one great post! 🙂 Do you think it is worth adding a line to explicitly close the ODBC connections? Cheers,
Nick
No worries. It is my first time using R so wanted to create the script for myself. I haven't worked out how to close the connection yet 😀 but yeah, probably a good idea!
December 7, 2015 at 11:32 am
Your first time? Wow - you are flying! This should close your connection:
odbcClose(dbhandle)
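For the full pattern, something along these lines should work (the connection string and query here are only placeholders):
## placeholder connection string - adjust server, database and authentication to suit
library(RODBC)
dbhandle <- odbcDriverConnect("driver={SQL Server};server=MYSERVER;database=MYDB;trusted_connection=true")
results <- sqlQuery(dbhandle, "SELECT TOP 10 name FROM sys.tables")
odbcClose(dbhandle)    ## release the connection when you are done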
Regards,
Nick
December 7, 2015 at 2:33 pm
Great post. R is an outstanding language and tool. I changed the query as shown below:
NOTE: see the line comment about hidden internal tables, which may occasionally provide some useful information.
select
db_name() as database_name
, object_name(ops.object_id) as [Table Name]
, idx.name as [Index name]
, idx.type_desc as description
, sysidx.rowcnt as row_count
, sum(ops.range_scan_count) as [Range Scans]
, sum(ops.singleton_lookup_count) as [Singleton Lookups]
, sum(ops.row_lock_count) as [Row Locks]
, sum(ops.row_lock_wait_in_ms) as [Row Lock Waits (ms)]
, sum(ops.page_lock_count) as [Page Locks]
, sum(ops.page_lock_wait_in_ms) as [Page Lock Waits (ms)]
, sum(ops.page_io_latch_wait_in_ms) as [Page IO Latch Wait (ms)]
from
sys.dm_db_index_operational_stats(null,null,NULL,NULL) as ops
inner join
sys.indexes as idx
on idx.object_id = ops.object_id
and
idx.index_id = ops.index_id
inner join
sys.sysindexes as sysidx
on idx.object_id = sysidx.id
where
ops.object_id > 100
and
sysidx.indid not in (1, 2) -----get rid of hidden system tables
group by ops.object_id, idx.name, idx.type_desc, sysidx.rowcnt
order by row_count desc
Also, in my query I am not getting decimal numbers < 1 as you have gotten in your first chart. Am I doing something wrong?
Bill Bergen
December 7, 2015 at 2:45 pm
Hi Bill, nice changes adding in additional identifying data such as database name and index name. The exact values you get will obviously vary from one database to another, but I also suspect that I might have normalised this data (scaled it so that all the measures were directly comparable). I would have to double check that.
But ultimately, this is an attempt to guide further investigation towards critical bottlenecks and exclude non-critical tables. It is perhaps best described as an exploratory method and certainly not set in stone.
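For what it's worth, a minimal sketch of that kind of scaling in R, assuming the raw metrics sit in a data frame called index.stats (a placeholder name) with the table names in the first column:
## centre each metric and divide by its standard deviation so the
## measures end up on a directly comparable scale
scaled.data <- as.data.frame(scale(index.stats[, -1]))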
Regards,
Nick
December 7, 2015 at 2:56 pm
We've just started charting performance data with ggplot in R using the RODBC library to pull the data directly into RStudio from SQL Server. Of course you can chart in Excel but it's soooo much more laborious.
Plotting time-series of performance data is proving useful, but finding problems using machine learning techniques is a fantastic idea. Thank you!
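Roughly, the plotting end of it looks something like this (the data frame and column names are placeholders):
## assuming 'perf' is a data frame pulled back via sqlQuery(), with columns
## sample_time and cntr_value (placeholder names)
library(ggplot2)
ggplot(perf, aes(x=sample_time, y=cntr_value)) +
    geom_line(colour="steelblue") +
    theme_bw()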
December 7, 2015 at 3:02 pm
Hi Matthew - thank you for your response, glad you like the article. Love your idea of pulling out time-series metrics for performance! What a great opportunity to use R for real-time anomaly detection!!!
Cheers,
Nick
December 7, 2015 at 3:59 pm
matthew.x.greer (12/7/2015)
We've just started charting performance data with ggplot in R using the RODBC library to pull the data directly into RStudio from SQL Server. Of course you can chart in Excel but it's soooo much more laborious. Plotting time-series of performance data is proving useful, but finding problems using machine learning techniques is a fantastic idea. Thank you!
If you end up having any examples of your machine learning techniques, I'm sure they would be welcomed by the community. That's really what monitoring software should be doing - learning from baselines and real-world scenarios.
December 7, 2015 at 5:23 pm
if you don't have the ggplot2 library installed, you can install it in RStudio like this:
> install.packages("ggplot2")
(source: http://ggplot2.org/)
I believe the intended code execution for
# load GGPLOT library(ggplot2) plot.data <- data.frame(pca$x[, 1:2]) g <- ggplot(plot.data, aes(x=PC1, y=PC2)) + geom_point(colour=alpha("steelblue", 0.5), size=3) + geom_text(label=1:102, colour="darkgrey", hjust=1.5) + theme_bw() print(g)
is
> # load GGPLOT
> library(ggplot2)
> plot.data <- data.frame(pca$x[, 1:2])
> g <- ggplot(plot.data, aes(x=PC1, y=PC2)) + geom_point(colour=alpha("steelblue", 0.5), size=3) + geom_text(label=1:102, colour="darkgrey", hjust=1.5) + theme_bw()
> print(g)
if you get this error: Error in do.call("layer", list(mapping = mapping, data = data, stat = stat, :
could not find function "alpha"
then load the "scales" library: library(scales)
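For example, if scales isn't installed yet:
> install.packages("scales")
> library(scales)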
Once I did all that, I got this error:
Error: Incompatible lengths for set aesthetics: colour, hjust, label
which is where I find myself stuck
:pinch:
EDIT:
Just saw the post made by barnaby.self. That works!
:hehe:
December 7, 2015 at 6:57 pm
danechambrs (12/7/2015)
Once, I did all that, I got this error:
Error: Incompatible lengths for set aesthetics: colour, hjust, label
which is where I find myself stuck
I hit those errors with the original code as well. I just removed the formatting and it worked out. I'd love to know how to fix it and keep the formatting.
So all this is great, as I said, but now my question is, how the heck do I figure out which algorithm to use? I've got some ideas for how to put this to work in some different places, but I'm a little stuck.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 7, 2015 at 7:30 pm
That is the hardest part for me as well: figuring out which models to use. I can't really answer that question beyond some old-fashioned trial and error. However, I did take a class on some basic data modeling which helped explain some of these concepts to me. Slides can be found here: http://www.cogsci.ucsd.edu/~desa/109/
Some slides even contain pictures! Note that any code you find on this site is probably in Matlab.
I hope this helps guide the googling. Aside from that, there are two books I'd recommend:
Pattern Recognition and Machine Learning http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738
This book is very math-heavy (read: collegiate knowledge of linear algebra and statistics is required) but has pretty much all of the "technical" information about these topics.
Data Mining: Practical Machine Learning Tools and Techniques http://www.amazon.com/Data-Mining-Practical-Techniques-Management/dp/0123748569
This book takes a more practical approach and encourages you to try out different algorithms to see which one fits your needs.
You can also watch Andrew Ng's videos on Machine Learning here: https://youtu.be/UzxYlbK2c7E?list=PLJ_CMbwA6bT-n1W0mgOlYwccZ-j6gBXqE
I was once told by a graduate student at my university that "if you want to know about machine learning, just watch these videos and follow along in Bishop's book. No other classes or lectures really compare to Andrew's video series". This was his opinion, but one that I personally respect.
Good luck!
December 7, 2015 at 7:33 pm
Hi Grant,
So glad to hear there are lots of other people out there as excited about the marriage of R with SQL Server as I am! The following will plot table names alongside the points:
## Assuming that the table names are stored in a variable 'tables':
g <- ggplot(plot.data, aes(x=PC1, y=PC2, colour=clusters)) +
geom_point() +
geom_text(label=tables, colour="darkgrey", hjust=1.5) +
theme_bw()
print(g)
There are some tricks around GGPLOT naming and scoping (for example, using 'color' vs. 'colour', and whether to wrap things in aes()), but the above will hopefully work fine. I wasn't sure what you meant by "which algorithm to use". Did you mean specifically whether to use PCA or KMeans? Or, more generally, other methods? I have very roughly described both PCA and KMeans below (as they were used here anyway).
Cheers,
Nick
Principal Components Analysis (PCA): rotates high-dimensional data onto a smaller number of dimensions while preserving as much of the variation in the data as possible. Often used to visualise high-dimensional data in two dimensions - which is really why we used it here.
KMeans: groups "similar" observations together. For KMeans, "similar" simply means "objects that are near to one another".
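As a very rough sketch of how the two fit together in R (again assuming your numeric metrics sit in a data frame called index.stats, a placeholder name):
pca <- prcomp(index.stats, scale.=TRUE)                      ## rotate onto principal components
plot.data <- data.frame(pca$x[, 1:2])                        ## keep the first two components for plotting
clusters <- factor(kmeans(index.stats, centers=3)$cluster)   ## group similar tables (3 clusters chosen arbitrarily)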
December 7, 2015 at 7:43 pm
Hi All, re. "incompatible lengths for label, colour and hjust":
I just had a realisation: in the original post I had simply removed the table names for privacy. But the original data had 102 tables, and I replaced their names with an index from 1 to 102. The number of labels (1, 2, ..., 102) has to match the number of tables you have in your data. If you use the table names themselves, this is easiest (and most informative!):
So... ORIGINAL:
g <- ggplot(plot.data, aes(x=PC1, y=PC2, colour=clusters)) +
geom_point(size=3) +
geom_text(label=1:102, colour="darkgrey", hjust=1.5) +
theme_bw()
print(g)
Becomes:
## where 'tables' is a list with your table names
g <- ggplot(plot.data, aes(x=PC1, y=PC2, colour=clusters)) +
geom_point(size=3) +
geom_text(label=tables, colour="darkgrey", hjust=1.5) +
theme_bw()
print(g)
Hope that is really clear! 🙂
Nick
December 7, 2015 at 8:18 pm
Hi Nick,
Yeah, it's making that call on what calculation against the data is going to result in... anything.
For example, if you have two columns of data, KMeans would let you get an idea of how much that data... groups? Basically, how much of a relationship there is between the two columns. For example, Salesman and Territory have a high correlation, but Customer and SaleValue have a low correlation (most of the time, although that example could open a can of worms).
Anyway, it's understanding how to apply the right algorithm that seems like the hardest part. Most of the rest seems to be reading documentation on how to apply methods.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning