October 7, 2018 at 1:13 am
Comments posted to this topic are about the item What Should You Look for in a SQL Monitoring Tool?
Aunt Kathi Data Platform MVP
Author of Expert T-SQL Window Functions
Simple-Talk Editor
October 7, 2018 at 6:10 am
Even more important than the completeness of the monitoring is the impact the monitoring system has on the instances; I've seen huge differences, where some tools induce several percent of overhead whilst others don't have a noticeable impact.
😎
Understanding how the tools work is important; things like how the wait stats are collected, how the actual statements are captured, etc. can help in choosing the right tool.
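For instance (just a rough sketch, and the dbo.WaitStatsSnapshot table is a name I've made up for illustration), many tools do little more than snapshot sys.dm_os_wait_stats on a schedule and diff consecutive snapshots; how often they do that, and what tracing they layer on top of it, is where the overhead differences come from.

CREATE TABLE dbo.WaitStatsSnapshot
(
    capture_time        datetime2    NOT NULL,
    wait_type           nvarchar(60) NOT NULL,
    waiting_tasks_count bigint       NOT NULL,
    wait_time_ms        bigint       NOT NULL,
    signal_wait_time_ms bigint       NOT NULL
);

-- Run on a schedule (every minute, every few seconds, etc.); the tool's
-- overhead depends largely on how often it does this and what else it traces.
INSERT dbo.WaitStatsSnapshot
SELECT SYSDATETIME(), wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
  FROM sys.dm_os_wait_stats
 WHERE wait_time_ms > 0;

-- The deltas between consecutive snapshots are what actually gets charted.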
October 7, 2018 at 11:15 am
I've found that a whole lot of people use monitoring tools because they don't know how to monitor things themselves. Along with that lack of knowledge, they've also heard that "taking a baseline" is a good thing (and it IS), but the big problem with that is <drum roll please>... they don't actually know what the baseline is telling them, so they also don't know whether the baseline is good or not. In other words, their system actually has many performance issues and, because they don't know what to look for, they don't know they actually have a baseline that sucks. 😉
One great example of this is what seemed to be a bit of a focus in this article: long-running queries. I've found that those aren't usually the real problem. What is a problem is the "death by a thousand cuts" queries that everyone thinks are normal, which become the baseline and, once that happens (or even before), people just accept the problems as "normal".
Now... don't get me wrong. I think a monitoring tool that's smart enough to auto-magically identify that there's going to be a CPU-intensive job (for example) that runs every day from 2:00 to 2:30 PM, and what the expected CPU usage profile for that job is, so that you don't receive alerts about "expected" CPU usage, would be great. But I've found that most don't actually do that without having to use the ol' Mark I Mod I eyeball to verify such a thing. There's also the fact that most of them don't actually lead you to the real performance issues, which are buried in the "death by a thousand cuts" code that's continuously running all day, every day.
As for waits... IMHO, waits are a symptom and not a cause. That makes them pretty useless for finding the cause, and they won't actually lead you to all the little stuff that's the real culprit. It's like tolerance buildup in physical hardware assemblies: it might be OK for a part to be off by 1/10,000th of an inch but, if you have 10,000 such parts stacked in a series, you suddenly have an inch of error. Code is very similar but much worse because, instead of just 10,000 parts stacked up, you may have millions stacked up per hour. People tend to look at code and say things like, "Well, that runs from the GUI and it only takes 100ms to run, so that shouldn't be a problem," without realizing that it runs (as in a real case for me) tens of millions of times per hour because someone made a serious mistake in the GUI.
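To put rough numbers on that last example: 10,000,000 executions per hour × 0.1 seconds each is 1,000,000 seconds of work per hour, which is roughly 278 hours of work burned every wall-clock hour, and not one of those executions will ever show up on a "long running queries" report.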
If you want to write some good monitoring software, then it needs to find such "death by a thousand cuts" queries and report on them and, again, these are NOT the longest-running queries. Usually, they're the shortest.
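As a minimal sketch of what I mean (plan cache only, so it misses anything that isn't cached, and the TOP (20) is arbitrary), ranking by cumulative cost and execution count instead of single-run duration floats those queries straight to the top:

SELECT TOP (20)
       qs.execution_count,
       qs.total_worker_time / 1000                      AS total_cpu_ms, -- total_worker_time is in microseconds
       qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1)            AS statement_text
  FROM sys.dm_exec_query_stats AS qs
 CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
 ORDER BY qs.total_worker_time DESC; -- cumulative, so the cheap-but-constant queries can't hide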
Of course, including that in monitoring software might also be a total waste of time because you actually have to have someone who will do something about such "cuts", and they need to be in a shop where the culture recognizes the value (there are many) of fixing such things. It would also be cool if the reporting were a bit analytical for the proverbial biggest bang for the buck. Instead of simply reporting on the top 100 worst queries for each server across hundreds of servers (who in the hell would read such a report every morning?), it would be really cool if it said "here are the top 10 worst queries in the enterprise for each four hours of the day over a week, and the first two live on 75% of your servers" (for example).
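Purely hypothetical, since it assumes the tool already ships every server's query stats into one central table (dbo.CollectedQueryStats, server_name, and capture_time are names I've invented for illustration), but the kind of cross-server rollup I mean would look something like this:

SELECT TOP (10)
       query_hash,
       COUNT(DISTINCT server_name) AS servers_affected,
       COUNT(DISTINCT server_name) * 100.0
           / (SELECT COUNT(DISTINCT server_name) FROM dbo.CollectedQueryStats) AS pct_of_servers,
       SUM(total_worker_time) AS total_cpu
  FROM dbo.CollectedQueryStats
 WHERE capture_time >= DATEADD(DAY, -7, SYSDATETIME())
 GROUP BY query_hash
 ORDER BY SUM(total_worker_time) DESC;
-- Bucketing by four-hour windows would just be another DATEPART-based column in the GROUP BY.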
Shifting gears a bit, I agree that monitoring is a form of measuring something, and any scientist will tell you that if you measure something, you've changed it, so you must be careful to change it as little as possible. With that, I agree with Eirikur... I've seen monitoring systems interfere not only with performance but also with things like backups and replication. Then there are those (sometimes the same "those") who feel it necessary to set trace flags to enable such monitoring at the beginning of a sample and then turn them off after the sample is taken. All of that junk gets recorded in the default trace, special traces, and the SQL Server logs and, sometimes, in the Windows logs. I don't need that junk in my logs when I'm trying to troubleshoot, and I sure as hell don't need it filling up the default trace and causing it to cycle so often.
To summarize my take on a lot of the monitoring software I've seen: I don't want to install software written by amateurs for amateurs. 😉 Give me something useful, trainable, and maybe even self-trained with overrides allowed. I don't need yet another "tool" that was built by Murphy's half brother, Sum Ting Wong. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
October 7, 2018 at 11:33 am
p.s. While performance may not be THE most important thing (and I agree that you did touch on its importance), I'd like to stress that it's very important and runs a very, very close second only to accuracy.
--Jeff Moden
Change is inevitable... Change for the better is not.