Background
The SQL Server error log has a wealth of information contained inside.
Sometimes it is too much, sometimes it is not enough. And then situations occur
when you do not know if there is enough in the log or not until well after you
have a resolution. In my opinion, things are even more complex with 2005, as
there seems to be a lot of messages that end with "This is an
informational message only; no user action is required." If that is the
case, then I may or may not want to be alerted about this, right? User action
may not be "required", but what if it is actually
"desired"?
Like many shops, we developed an in-house system for monitoring our database
servers. Built over the past two years, the system is called DBA_Perform. One of the very first items we incorporated
into DBA_Perform was the error log scan. Code for all
of the tables and stored procedures for the solution can be found in the
downloadable zip files. There is also a readme file
included to assist with your implementation. I have tried to keep things as
general as possible, but some additional modifications may be necessary based
upon your specific environment.
The Need To Know
I believe that many shops have processes in place to scan the logs for
review. Some shops are small enough (less than ten total servers) that no
process is really necessary because they can simply review the logs by hand as
needed. However, I would argue that using an automated process to scan the log
periodically is the smart thing to do, so that you can be alerted should
something occur. That is exactly what we set out to build over two years ago.
What we found was that setting up a process to scan the error log is very
easy to do. There were lots of examples available from community sites. We
sampled some and found that while they did work, there was always something
missing. What one solution would have, another solution would not. So, we took
the best of everything we could find, added in our own ideas, and ended up with
the solution being presented. What set ours apart from others at the time was
the ability to include (or exclude) certain keywords that would appear in the
error log.
This was very important for our group as the number of servers we managed
went from around ten to over seventy in less than fifteen months. The
combination of production, test, and development servers also meant that the
errors we would be concerned about varied. For example, our production servers
are set to log any failed logins, but test and development are not. Any solution
we would roll out to a new server had to work regardless of environment. But do
we want to be alerted by email each and every time someone fails to log in? No,
probably not. But what if that failed attempt was by someone trying to login as
"sa", or "admin", or
"root"? That could be something that we want to know about quickly,
as it could be an attack or a virus.
The Magic
First, get the code. This has been tested with SQL Server 2000 and SQL Server 2005.
We needed to build a process that would scan and dump the contents of the
error log, clean up the contents based upon keywords we wanted to include
and/or exclude, and then send us an email if necessary. The procedure we
currently use is named Usp_error log_Notification,
and is included the zip files. The procedure makes use of the xp_readerror log extended stored procedure in order to
gather all details from the log into a temporary table as an initial starting
point. The process does a check to see what version of MS SQL Server you are
running, because Microsoft changed the output of the xp_readerrorlog
extended stored procedure in SQL 2005.
While checking the version, we also verify that the reading of the error log
results in a valid date that will ultimately be inserted before our emails are
sent. This is essential because the very next step does a calculation of the
maximum date, so we need to make certain the date is valid in order for the max() function to return a proper result. If this is the
first time a scan has been run, then a default date is assigned. Otherwise, it
will retrieve the max date from an earlier run of the process from the error
tables and that date will be used to filter out rows from the temporary tables
later in the process.
Before that cleanup happens, a cursor is built to go through the entries in
the temporary table and filter out the rows that do not have a keyword that we
want to scan for. The keywords for this part are found in the ErrorlogScanKeyword table, and include words and phrases
such as 'stack', 'is full', and 'error'. If a row exists in the #Errors
temporary table that do not have any of the words
listed in this table, then they are filtered out.
After filtering the temporary table, leaving behind only rows that have
keywords we are scanning for, we then cleanup the table by using the calculated
date mentioned in the previous paragraph. As our #Errors table gets smaller, we
next need to force the inclusion of specific words. We take the current rows in
#Errors and insert them into a separate temporary table named #ErrorsHOLD. At this point we remove rows from #ErrorsHOLD that do not have words and phrases found in the ErrorlogForceInclude table. In our experience, we found
that is not sufficient to only exclude certain words, there also needs to be
the forced inclusion of keywords as well. By forcing the inclusion of certain
keywords (for example, the word 'admin' would be a forced include), we can tell
the system to be more flexible. This is the magic. We can tell the system to
not bother us every time the phrase "Login failed for user", but to
make certain we know whenever "Login failed for user 'sa'"
appears in the error log.
After setting those rows off to the side, we do one final scan on the #Errors
temporary table to exclude certain keywords outright (for example, certain
error numbers above 50000, and this is where we would exclude the generic
'Login failed for user'). From there, it is a quick reinsert into the #Errors
from #ErrorsHOLD and we have the desired result set. The
last step is to generate an email and notify our team that there is something
in the error log that may be worth further investigation.
Create the Job
The last thing we needed to do was to put together a job that would run
every ten minutes. We created a job that simply calls the Usp_error
log_Notification procedure (also included in the zip
file). Feel free to modify the job as you see fit, changing the frequency of
the schedule or whatever you desire.
Conclusions
As your environment gets more and more complex, you may find that the number
of alerts increase substantially. Some of the processes that you build while
you have a small shop may not be as effective as more and more servers come
under your control. If you start with a customizable toolbox you can save
yourself some headaches later. With the above solution, you can decide for
yourself what you want to be notified about during an error log scan, and not
have to worry about filtering through the white noise that we all seem to be
swimming in these days.