Not everyone has the opportunity to call Microsoft Premier Support. For those who have not yet had this experience, I’ll document my most recent, non-critical, call. Critical calls are a different critter entirely.
We were experiencing a very odd error on one server in our system. When a particular procedure was run with a particularly large set of data, it would produce an error, but only when called from the application. The exact same call with exactly the same data run from SSMS did not produce an error. We went through a very extensive set of tests and were unable to fix the problem on the server. After moving the production system that was experiencing the issue to a different server, we decided to contact PSS.
8:48 AM, Tuesday: I made the initial call and went through the process of validating my identity, our company, etc. I had to explain everything we had done, the versions of all the software involved, etc. This took 10 minutes. Since I had called early in the day and this was a non-critical call, although labeled as important since we are talking about a production system, I had to wait for the MS guys to come in.
9:47 AM, Tuesday: I received an email from PSS asking me for the SQL Server logs. I zipped up the most recent log with the error and sent it in to the guy. This took me about five minutes to get everything together and zip it up for him.
10:53 AM, Tuesday: I received my first phone call on the process. I re-explained the issue to the support person. They suggested that we set up a server-side trace, get the query to error out when run from the app and to run successfully from Management Studio, and see what differences there might be, if any, in the basic trace. It took me about 30 minutes to set up the server-side trace, test it, generate the errors, run the query successfully, and then zip it all up with sample code and sample data to ship off to MS.
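If you've never scripted one of these, here's roughly the shape of that first, basic trace. This is a minimal sketch, not our actual script: the output path, file size, and column choices are placeholders, but the event IDs are the standard ones for RPC:Completed (10), SQL:BatchCompleted (12), and SQL:BatchStarting (13).

-- Minimal server-side trace sketch. The file path and max size are placeholders.
DECLARE @TraceID INT, @MaxFileSize BIGINT, @On BIT;
SET @MaxFileSize = 50; -- MB per rollover file
SET @On = 1;

-- Option 2 = TRACE_FILE_ROLLOVER; SQL Server appends .trc to the file name.
EXEC sp_trace_create @TraceID OUTPUT, 2, N'C:\Traces\PSSRepro', @MaxFileSize, NULL;

-- Columns: 1 = TextData, 10 = ApplicationName, 12 = SPID, 14 = StartTime, 31 = Error
EXEC sp_trace_setevent @TraceID, 10, 1, @On;  -- RPC:Completed
EXEC sp_trace_setevent @TraceID, 10, 10, @On;
EXEC sp_trace_setevent @TraceID, 10, 12, @On;
EXEC sp_trace_setevent @TraceID, 10, 14, @On;
EXEC sp_trace_setevent @TraceID, 10, 31, @On;
EXEC sp_trace_setevent @TraceID, 12, 1, @On;  -- SQL:BatchCompleted
EXEC sp_trace_setevent @TraceID, 12, 10, @On;
EXEC sp_trace_setevent @TraceID, 12, 12, @On;
EXEC sp_trace_setevent @TraceID, 12, 14, @On;
EXEC sp_trace_setevent @TraceID, 12, 31, @On;
EXEC sp_trace_setevent @TraceID, 13, 1, @On;  -- SQL:BatchStarting
EXEC sp_trace_setevent @TraceID, 13, 10, @On;
EXEC sp_trace_setevent @TraceID, 13, 12, @On;
EXEC sp_trace_setevent @TraceID, 13, 14, @On;

EXEC sp_trace_setstatus @TraceID, 1; -- 1 = start
SELECT @TraceID AS TraceID;          -- hang on to this so you can stop and close the trace later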
Does it feel like this is going to take a long time? You’re right. It continues.
12:10 PM, Tuesday: I get an email back with some initial thoughts. Unfortunately, it looks like we don't have complete information. I respond in the email, providing more.
12:26 PM, Tuesday: Another email back from MS, confused about the trace. They had me use their default trace template, which captures RPC:Completed, SQL:BatchStarting and SQL:BatchCompleted. They claimed that the app wasn't run and must have errored before calling the database. But that's because the app errored, which means no RPC:Completed, and we weren't capturing RPC:Starting. I pointed this out in a response. Nothing back yet.
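For reference, the fix on the trace side is a one-event change. Assuming the placeholder trace sketched above, it looks something like this (the trace ID here is hypothetical; it's whatever sp_trace_create handed back):

-- Add RPC:Starting (event 11) so the call shows up even when the app errors
-- out before an RPC:Completed event can fire.
DECLARE @TraceID INT, @On BIT;
SELECT @TraceID = 2, @On = 1; -- 2 is a placeholder trace ID

EXEC sp_trace_setstatus @TraceID, 0;           -- events can only be added while the trace is stopped
EXEC sp_trace_setevent  @TraceID, 11, 1, @On;  -- RPC:Starting: TextData
EXEC sp_trace_setevent  @TraceID, 11, 10, @On; -- ApplicationName
EXEC sp_trace_setevent  @TraceID, 11, 12, @On; -- SPID
EXEC sp_trace_setevent  @TraceID, 11, 14, @On; -- StartTime
EXEC sp_trace_setstatus @TraceID, 1;           -- start it back up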
1:32 PM, Tuesday: I get my response back: am I sending a new trace? Nuts. I knew, when I realized what they were looking for, that I should have restarted the trace. There's an hour shot.
2:35 PM, Tuesday: I've finished rebuilding our trace, retesting, and sending everything off for evaluation, again. It took longer for me this time because I was waiting on a developer who was away from his desk. PSS calls require everyone to be available, all the time. When we do critical calls, we just jam everyone into the same room. It makes it easier.
6:39 PM, Tuesday: The data wasn't sufficient; it doesn't show when the error occurs. PSS sends me a new trace template to try out. Unfortunately, I'm not at work anymore and, what's more, the developers whose support I need are long gone. I can't do anything with this until morning.
DAY 2
8:08 AM, Wednesday: I'm setting up the trace to see what they're having me gather while I wait for the developers to show up. Odin's all-seeing eye! They're capturing quite a few events here. Just about every single event in "Errors and Warnings." OLEDB Errors. Auto Stats, Degree of Parallelism, Showplan All and Showplan XML Statistics Profile from "Performance." Server Memory Change. PreConnect:Completed and PreConnect:Starting from "Sessions" (I've never even seen these before). RPC:Completed, RPC:Starting, SP:Completed, SP:Recompile, SP:Starting, SP:StmtCompleted (I knew that one was coming), and SP:StmtStarting from "Stored Procedures." Finally, SQL:BatchCompleted, SQL:BatchStarting, SQL:StmtCompleted, SQL:StmtRecompile, and SQL:StmtStarting from "TSQL." A full load. This should be fun.
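One practical note if you ever have to turn a template like that into a server-side trace script: sp_trace_setevent only speaks numeric IDs, so you end up looking up the number for every one of those events. A quick sketch of that lookup, using the sys.trace_events and sys.trace_categories catalog views; the category names here match the template above, but you'd still pick the individual events by hand:

-- Map the template's event names to the numeric IDs a server-side trace needs.
SELECT  tc.name AS category,
        te.trace_event_id,
        te.name AS event_name
FROM    sys.trace_events AS te
        JOIN sys.trace_categories AS tc
            ON te.category_id = tc.category_id
WHERE   tc.name IN (N'Errors and Warnings', N'Performance', N'Sessions',
                    N'Stored Procedures', N'TSQL')
ORDER BY tc.name, te.name;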
I'm converting it to a script and adding a filter to only capture data from the database I'm having trouble with. I'm also converting it to a script for another reason, which I'll cover in a separate blog post.
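The filter piece is a single call to sp_trace_setfilter. Again, a sketch with placeholders: the trace ID and database name here are made up.

-- Restrict the trace to one database. Column 35 = DatabaseName,
-- logical operator 0 = AND, comparison operator 0 = equal.
DECLARE @TraceID INT;
SET @TraceID = 2; -- placeholder; use the ID sp_trace_create returned

-- Filters, like events, can only be set while the trace is stopped, so in a
-- script this goes before the sp_trace_setstatus call that starts the trace.
-- Make sure column 35 is also included in the sp_trace_setevent calls for the
-- events you want filtered.
EXEC sp_trace_setfilter @TraceID, 35, 0, 0, N'ProblemDatabase';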
11:11 AM, Wednesday: The joys of developing software. The developer I was working with yesterday wasn't in, so I got a different developer to help out. He didn't know how to run the code the way the other guy ran it, so it was run in a different way. I captured everything and shipped it off to Microsoft. While waiting for their response, I read through the data gathered by the trace. No error. In fact, everything ran successfully. We did a bunch of tests and found that, as long as we were running the big trace, we didn't get the error. It was almost like the error was caused by having too many resources.
12:00 PM, Wednesday: The PSS person comes back and says that the data contained no error. Yep. They don't have a single suggestion. All the indications are that the faster, more powerful server is causing the problem.
DAY 3
I get a response back from PSS. They're asking if I'm suggesting that I'm supposed to run the trace all the time. This is the point where I decide to bail on PSS. I'm leaving the case open, but I've been talking to some other resources and have a troubleshooting plan from them that we're going to try out.
More when I know more.