March 31, 2014 at 9:47 am
As a recent addition to the IT staff at the organization I currently work at, and as far as I've been told, the first time there has been an official DBA, I've ended up in an environment that has a lot of problems and a lot of projects. Projects that I'm having trouble devoting time to because of a recurring problem that is difficult to locate the root cause of.
Environment: 122 remote office sql server 2000 hosts. Each of these sql server hosts also provides file sharing services for the local users, hosting outlook .pst files, documents, etc. These user file shares are hosted on the same disk volumes as the sql server data and log files. The sql server also provides print services for the local users.
These servers are virtual, hosted on hardware that also hosts a terminal services server in a separate vm, and I believe another vm that provides access to a DC for the remote environment. So one hardware installation that provides three separate vm servers. Some areas have additional vm servers that run terminal services.
Problem: Sometimes, and the frequency varies, some regional offices get connection timeout messages in the client application, and the messages are recorded in the Failsafe application log. Some offices never have the issue, others have the issue infrequently, while a few are having their operation severely impacted by the issue.
When the users report the issue in a service ticket, the programmer assigns the ticket to the dba (me,) and requests that I check the indexes to fix the problem. (The same person explained to me that this caused by the "indexes fighting." But I've never heard of this phenomenon so I don't know what it means, hopefully someone here can shed some light on it.)
Index fragmentation levels vary when I inspect them. Sometimes the most fragmented index I can locate is around 6% fragmented. Other times the index fragmentation might range up to 40 or 50 %. No matter what the fragmentation level, if anything above 0% is discovered, this serves as a proof that the indexes are the root of the problem and we need to rebuild the indexes. I've seen other regional offices with indexes that are 40-50% fragmented, where users don't report timeout errors, and in fact, when inspected no timeout errors have been logged in the application's Failsafe log on the regional server either.
The first time I heard of this was when it came up five weeks ago in what I'll call region 1(I'd only been here for 8 weeks at the time and the issue had not come up.)
Region 1 was having constant problems with timeout errors(the users are complaining they can't get any work in the application done,) and the programmer insists that the problem is indexes, and they need rebuilding. We rebuild the indexes, and that action made NO difference in the timeout errors. The issue is ongoing and recurring.
Two weeks ago region 2 had one error message, and the users opened a ticket that got assigned to me to check indexes. The most fragmented index was 6% fragmented. Nothing was done, and the problem didn't recur until one error message was logged two weeks later.
I'm sort of spinning wheels on this problem (instead of getting my projects done as my manager has now pointed out on two occasions,) since reindexing doesn't necessarily seem to be closely linked to the issue.
The reasoning of my manager and the programmer is that, one time, an unspecified amount of time ago, they were having this problem on another server, and they used the maintenance wizzard to run a reindex job, and this "fixed" the problem on that server, in the sense that the connection timeout issues stopped for awhile. I'm skeptical because the reindexing might just be coincidental, and something else might have been going on that they stopped but weren't aware of.
Though I keep thinking this is not necessarily an index issue, it's pretty hard to prove a negative result. Especially since it appears that sometimes index rebuilding has helped the problem in the past, so that has instilled a sense that this MUST be the issue. I've checked cpu and memory utilization, both are typical, even when the errors are happening. I've checked for blocking, and either found no blocking, or minimal and brief blocking is occurring. No spids are deadlocked. When I run sp_who2 or examine sysprocesses, most or all of the spids are reported with a status of sleeping when listed in the output.
So has anyone:
-seen this problem and
-isolated the issue
-and if so, would they be willing to share their solution here. It would be much, much appreciated by me.
Or if you can give me some help on how to prove whether the problem lies with sql server or not, that would be again, much appreciated.
Thanks in advance for any help or advice I might receive on dealing with this problem.
March 31, 2014 at 9:57 am
It is pretty tough to offer much advice here because we really dont have any details to look at. Do you know what query(s) are timing out? Have you examined those queries? Are they SARGable? Do you have stale statistics. Have you examined the actual execution plan for those queries? At the very least we would need to see actual execution plans. Table definitions and indexes would be a big plus.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
March 31, 2014 at 10:38 am
Thanks, but the problem description I get doesn't have anything specific, typically just a "dropped connections - you need to check the indexes on the server" request. The client application log has the vb6 subroutine, but this doesn't have any indication of what it was executing on the sql server. The users are proxied in via the same sql credentials, and they're running from a number of terminal services servers, so even if I knew which user was having an issue, I'm not sure I would be able to pick them out in a trace. When I've checked, the auto-update statistics option is set. Doesn't rebuilding indexes also update statistics? Or have I grossly misunderstood that?
March 31, 2014 at 10:48 am
tresiqus (3/31/2014)
Thanks, but the problem description I get doesn't have anything specific, typically just a "dropped connections - you need to check the indexes on the server" request. The client application log has the vb6 subroutine, but this doesn't have any indication of what it was executing on the sql server. The users are proxied in via the same sql credentials, and they're running from a number of terminal services servers, so even if I knew which user was having an issue, I'm not sure I would be able to pick them out in a trace.
How can anybody expect you to figure out what is happening with no details? Surely there is some kind of information in these tickets. What screen they were on, what they were trying to do etc...
These things generally happen because some of the sql is poorly written. From you description it sounds like the sql probably lives inside the VB6 app as pass through sql. UGH!!!
If they can't give you any details about what/when this happens there is no chance you have of fixing it. How can you fix a problem when you don't even know where it comes from?
Having a developer "inform" the dba of what he needs to do screams of ignorance. The developer should be coming to the dba and saying "hey we are experiencing timeout in process x, what can we do to fix it?". This opens up a dialog so you can determine between you what is causing it and how to fix it.
I can assure you that no amount of rebuilding indexes is going to solve this issue in the long run. To me it sounds like really poorly written sql and a developer demanding that you throw some duct tape at it instead of fixing the real issue. I don't envy you for this challenge.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
March 31, 2014 at 11:35 am
I'm afraid the description I get on the service ticket is "server N is timing out, you need to check the indexes on the server." Sometimes with a screenshot of a vb6 error dialog that shows a connection timeout message. I usually don't get a response when I inquire for more information, one response I did get was because I checked all the indexes in one database reported no fragmentation, and they decided I should have included checking the indexes in a different database.
I'm afraid I'm expected to resolve the issue, because I'm the dba and it's a known fact here that this is a database server problem with the indexes, that's solved by rebuilding indexes.
As far as the sql goes, a lot of it is in stored procedures. Which doesn't make it more performant necessarily, I know, but I have been able to rewrite some procedures that could be identified as problematic, and shorten the duration from minutes to seconds.
But I don't think this is a realistic solution, due to the number of servers, databases and programmers and the fact that it's not an immediately actionable solution when a problem is reported, since I have to re-write the sql, test, explain my work and submit the changes to a programmer for approval.
(Of course, since I have to get approval from the users/programmer/management before I do anything that could affect service level and performance, reindexing isn't really an immediately actionable response either.)
Again, thanks.
March 31, 2014 at 11:52 am
Hope you don't mind but here is my suggestion; check all the possible bottlenecks and possible points of failure, this may not even be a db issue. It only takes one bad nic on a network to cause such a problem, torrenting or other such activities may contribute. Recently saw an issue with a client where streaming from the winter Olympics lead to a service failure. Just make certain you know where it is broken before trying to fix it.
😎
March 31, 2014 at 12:03 pm
Thanks, I agree that it might not be a sql server issue. I've tried to check a variety of things, but there's a wide variety of possible issues within sql server and epic amounts of possible issues outside of sql server. File and print services for all the users in a regional office are stored on the same volume that the data and log files are, for example, or a terminal server that has performance issues, etc.
March 31, 2014 at 12:50 pm
You can use tools like WireShark[/url], nmap[/url] or other similar to do some diagnostics, I often use BackTrack[/url] to analyze this kind of problems.
Back in the days of SQL 7/2000 / VB5-6, I designed, developed and ran quite few distributed systems, often on a hardware less capable than a modern smartphone, on 256kb modem connections. In my opinion, very robust and never ever ever was any kind of indexing on the sql server a problem.
Hope this helps (and excuse my ranting) 😎
March 31, 2014 at 1:12 pm
tresiqus (3/31/2014)
Thanks, but the problem description I get doesn't have anything specific, typically just a "dropped connections - you need to check the indexes on the server" request. The client application log has the vb6 subroutine, but this doesn't have any indication of what it was executing on the sql server. The users are proxied in via the same sql credentials, and they're running from a number of terminal services servers, so even if I knew which user was having an issue, I'm not sure I would be able to pick them out in a trace. When I've checked, the auto-update statistics option is set. Doesn't rebuilding indexes also update statistics? Or have I grossly misunderstood that?
If the client application log tells you what routine failed and you have access to the source code (or can get access or can have your developer help you out), then can't you find exactly what code, and likely what routine/query on the database is causing the problem?
If the problem consistently happens in the same routine then maybe there's something going on, but given how "random" it is, it doesn't sound like it's a problem with the server or the code there.
Is this routine something to do with reporting? Could you possibly be hitting up against the "parameter sniffing" problem? That would give you inconsistent behavior like this maybe, would it not? Here's just one link about that, but you can search for it yourself for more info. http://blogs.msdn.com/b/turgays/archive/2013/09/10/parameter-sniffing-problem-and-workarounds.aspx
In any case, I think you need to insist on more information from your dev or dev team -- such as the exact queries or calls that are timing out. Better to solve the problem once and for all rather than trying random "fixes" that may only treat the symptoms or only seem to fix the problem.
March 31, 2014 at 1:43 pm
I don't have access to this application's vb6 code.
I might be able get it, but to be honest, I've already been assigned maintenance programming responsibility for one vb6 application (a different vb6 application than the failing app,) and I'm not sure that's really compatible with what my other responsibilities already. In my experience, once I start debugging or troubleshooting devs code and application issues to prove it's not sql server, it's all I end up doing. In the past, devs pointed to sql server, and then I had to prove it wasn't sql server so I had to find the binary file they didn't deploy to the application server, or the permissions they didn't set on a file share, or the directory they tried to use in their application code but forgot to create in the filesystem. After enough of that, EVERY application fault was assigned to me from the start for me to do troubleshooting on it, and since I was spending so much time tracking down application faults unrelated to sql server, my boss asked me come up with a sheet of instructions so his secretary could be the dba.
Thanks
March 31, 2014 at 2:16 pm
I'm not saying that you should debug the code, I'm saying that you should *insist* that the developers give you the queries being executed when a timeout occurs. Is there any reason the devs can't give this to you, or will they just not do it?
If you can establish that "rebuilding indexes" doesn't resolve the issue (and it seems you have, though maybe you haven't documented it sufficiently), then why the heck does anyone still think it'll work? And if it does work, why does it work? And how the heck can you be expected to do your job (as DBA) if you're not given sufficient information to do it?
In any case, I think it's going to be hard for anyone here to give you more than vague generalities unless you can somehow get more information. And if you can't effectively challenge your coworkers to get that information, well, that's a whole other problem (not attacking you there, I mean if your organization doesn't allow for an effective challenge process, but also maybe you need to "exert your authority" if you can and if you have appropriate support to do so 🙂 ).
March 31, 2014 at 2:29 pm
dmbaker (3/31/2014)
I'm not saying that you should debug the code, I'm saying that you should *insist* that the developers give you the queries being executed when a timeout occurs. Is there any reason the devs can't give this to you, or will they just not do it?If you can establish that "rebuilding indexes" doesn't resolve the issue (and it seems you have, though maybe you haven't documented it sufficiently), then why the heck does anyone still think it'll work? And if it does work, why does it work? And how the heck can you be expected to do your job (as DBA) if you're not given sufficient information to do it?
In any case, I think it's going to be hard for anyone here to give you more than vague generalities unless you can somehow get more information. And if you can't effectively challenge your coworkers to get that information, well, that's a whole other problem (not attacking you there, I mean if your organization doesn't allow for an effective challenge process, but also maybe you need to "exert your authority" if you can and if you have appropriate support to do so 🙂 ).
In fact, debugging connection sensitive code from VB6 era is not likely to give / produce the right picture. Unless you have extensive knowledge of the underlying libraries, it is kind of a lost case. 😎
March 31, 2014 at 2:36 pm
Eirikur Eiriksson (3/31/2014)
dmbaker (3/31/2014)
I'm not saying that you should debug the code, I'm saying that you should *insist* that the developers give you the queries being executed when a timeout occurs. Is there any reason the devs can't give this to you, or will they just not do it?If you can establish that "rebuilding indexes" doesn't resolve the issue (and it seems you have, though maybe you haven't documented it sufficiently), then why the heck does anyone still think it'll work? And if it does work, why does it work? And how the heck can you be expected to do your job (as DBA) if you're not given sufficient information to do it?
In any case, I think it's going to be hard for anyone here to give you more than vague generalities unless you can somehow get more information. And if you can't effectively challenge your coworkers to get that information, well, that's a whole other problem (not attacking you there, I mean if your organization doesn't allow for an effective challenge process, but also maybe you need to "exert your authority" if you can and if you have appropriate support to do so 🙂 ).
In fact, debugging connection sensitive code from VB6 era is not likely to give / produce the right picture. Unless you have extensive knowledge of the underlying libraries, it is kind of a lost case. 😎
Quite so. And if the OP is working in a defective environment, well, that only compounds the problem.
March 31, 2014 at 2:37 pm
Thanks. I've asked for more information, but it's not been forthcoming.
The dev just doesn't investigate, and they have management support. Maybe since they've been here longer than I have. Remember I am the FNG.
I've tried to point out the logical problem of sometimes an index rebuild seems to solve the issue, but this notably does not dependably solve the issue. Sometimes nothing solves the issue, sometimes indexing doesn't solve the issue, and sometimes it does. Which makes me think that it's
a) not the same problem each time
b) several problems occurcing simulataneously
c) not an index related problem
And there it is, I just got now an email re-assigning me the support ticket with instructions to rebuild indexes tonite to fix the client errors.
Which oddly haven't occurred for a week, though nothing has been done except examine the indexes for fragmentation.
March 31, 2014 at 3:24 pm
It may be a good thing your are the FNG. At least that means your resume is current. That place sounds like a nightmare and one to avoid. :w00t:
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply