In the Real World - Lost Data Chains
It was late one Thursday when I got a call about a SQL Server being pegged. Sure enough when I logged in and checked the Task Manager from Terminal Services, the CPU was in the 95-98% range constantly for the SQL Server Process.
OK, no big deal, submit a change request, notify users, and reboot the server. It was after 4:00pm, so most likely people were bailing and the load was decreasing anyway. Didn't think much about it.
Until Friday. I got a message that the application that used this server was hanging and could I look at the server. Oh by the way, I'm raising the severity on the call to a 1.
OK, I'm not too impressed, everyone thinks their issue deserves a 1. So I check. The server is busy, but not pegged. However the app is hanging? Hmmm. I install the app, something I'm loath to do, but I hadn't closed the call and having a Sev 1 means I need to be working on it.
I install, get my login squared away, and then let the application startup. I'm looking at a class scheduling system and the issue is with the "students" section. A quick check in EM shows there is indeed a "students" table. I start up a trace and set a filter to my login and then click the "students" button.
...
Nothing happening, nothing on the trace beyond a select from sysobjects. Yep, the application hangs. I try to click another button and I'm hung. Hmmmm, I find my spid, kill it and bomb the app. Let's try this again. But why didn't I get any trace events?
I alter my trace to include the statement starting event and repeat the procedure.
Whoa!!! Lots of events now. In fact, a ton of "select * from students" scrolling on my trace screen. I kill the app again and go back in the trace. Why so many statements? This is really not making sense.
I check the students table, 150k rows, not small, but not huge. I select some rows, top 100, some specific ones, all those with a last name of "jones" (suspect I'll find a few here), and everything works fine.
So what's going on? I call the vendor and they swear that the app should check the rows and return a warning if there are more than 1000 rows. I retest with the Courses table. Sure enough I see a select count(*) from this table. Hmmmm, I'm getting confused.
In speaking with the vendor, there are a bunch of meta data tables that drive the app, so we start looking through them, copy them from a test server (I'm doing this on a 2nd test db, not production), same problem occurs.
Now it's Monday and users a ticked. There are some workarounds, but there are still issues. I send a copy of the db to the vendor. We use Litespeed so my 9GB db, which backs up to 1.2GB is only a 200MB file to ftp.
Get a call Tuesday. They don't have a spare 9GB drive for this on their server. Huh? This is tech support? OK, I understand and we're not the only customer. Still 9GB is not too many $$, but I'm in a hurry. Rather than argue or suggest they wander to CompUSA and buy a drive, I delete some data, shrink the db files, and then run another backup. We're now a 5GB db, 140MB backup. ftp it up and cross my fingers.
Wednesday. I ping the tech and get an answer back around lunch time. Can't reproduce the issue. Can I try restoring on a different server. Now I'm a little confused and more than a little annoyed. I've restored this a few times, granted on the same test server, but I'll humor him. I restore to a different server altogether and observe the same problem.
They suggest at the end of the day we roll back to the last known working backup from a week ago. I ask the client, knowing the answer. No ###^%%$$% way is what I hear as I move the phone a bit further from my ear. I understand. If someone asked me I'd probably drop a few four letter bombs as well.
I go to my manager, looking for direction on how I escalate this with the vendor. We start calling people only to find it's just been escalated on my side and directors are calling wondering about the status.
Whew it's been a long week so far. Take a moment, grab another soda and relax. I know that's what I needed, but I settled for going home to relax and get away and tackling it in the morning. I had a good excuse since the vendor was in an earlier time zone and had gone home already.
Thursday
OK, I talk to the vendor and their developers are wondering if we can roll back to the last time it worked to get things working smoothly and then we can "work" on the problem.
Ha! That's seven days ago. I'd be laughed out of the room for presenting that. But I agree since we don't have anything better. I know this is a developer's way of relieving pressure. And perhaps sweeping it under the rug. "If you need to get working ASAP, rollback and reconcile data". A nice developer's trick I've used as well, usually when I have no clue what's going on.
So I call the client, get a somber "no" (better than being laughed at) and go back to work.
I'm ready to call Microsoft, but I need more information before I do so. Don't want to spend the first hour gathering the outer parameters of the problem for them and checking configs.
The first thing I do is expand my trace and get all TSQL:Stmt starting and completed events. Going through this, I see something like:
SQL:StmtStartingSelect A.name from sysobjects B, syscolumns A where (B.type = 'U' or B.type = 'V') AND A.id = B.id and B.name = 'Students' SQL:BatchCompletedSelect A.name from sysobjects B, syscolumns A where (B.type = 'U' or B.type = 'V') AND A.id = B.id and B.name = 'Students' SQL:StmtStartingset FMTONLY ON SQL:BatchCompletedset FMTONLY ON SQL:StmtStartingSELECT * FROM Students
with the SELECT * continuing on for awhile. That looks a little funny. What's SET FMTONLY do?
I notice one other interesting thing as I'm thinking. The trace is still running. And the SELECT * FROM STUDENTS continues to be submitted as a new statement.
Well that's weird. So I bop over to Query Analyzer and run the SET and then the SELECT. Similar result. QA hangs and I don't get a result. I look up SET FMTONLY and find out this returns the outline or structure of a SQL result set without actually returning the results. Kind of like a SET ROWCOUNT 0. BOL says that this statement prevents the query from executing. I'll buy that, but why doesn't it work?
A little further digging showed that this is the type of thing that is usually executed in the ODBC or OLE-DB driver at a low level, like when you issue a "refresh" for the parameters collection. It appears to be (I'm guessing here) that ODBC uses this to setup the recordset structure which it then populates when it reruns the same query (without the SET FMTONLY ON) to get the results.
Makes sense, but it doesn't explain things. I try some other tables, courses, student_groups, etc. and they all work fine.
I'm starting to suspect a SQL Server issue at this point. I decide to experiment a little.
select top 100 * from students
I run this and get 100 rows back.
select top 1000 * from students
This works too. Next I dig in and start incrementing by 10,000 and get to 80,000. 90,000, however, doesn't return results. At least not in the 5 minutes I let it chug. I back down to 85,000 and get results in about 80 seconds. I'm not terribly interested in narrowing this down further. I know now I've got some table issue. I try a few more tests looking for specific rows (single rows) and they all work.
I next check the indexes. Lots of them, 10 in fact, but no clustered index. Since I've got a test system I decide to experiment. First take a backup. ALWAYS take a backup. Next I add a clustered index, randomly selecting a field. Now I run a test:
SET FMTONLY ON select top 100 * from students
That works, course, that worked before. Now I try
SET FMTONLY ON select * from students
Success!!!!
I remove the clustered index and it still works. I submit a change request, repeat in production and I'm gold. Things are working and most people are happy. I let the vendor know and recommend clustered indexes. The tech says most of the developers aren't DBAs, so they have "suggested" indexes which we are free to change.
So what happened?
Well, I can only guess, but here's what I think. There was some corruption in the page pointer that link the data pages together. Somewhere in the 85,000-90,000 page range, a pointer is corrupt or there's a circular reference or something. Since the non-clustered indexes don't traverse the clustered index, instead dropping inside it, they weren't affected. This explains why some parts of the application that referenced the Students table still worked.
By building a clustered index (or rebuilding if there had been one), the page pointers were rebuilt and things were fixes. How did SQL rebuild the pointers if a select * didn't work? I have no idea, I only know that it did work.
A couple more notes. This issue spanned a weekend, a time when we run dbcc checkdb and also rebuild existing indexes. That completed without an issue and no errors were thrown. I suspect if there had been a clustered index, then the problem would have miraculously "corrected" itself. A good reason to have clustered indexes.
Conclusions
I wrote this for a couple reasons. One of which was to share this problem in case anyone else experiences something similar. But also to let you know how I go about troubleshooting problems. Hopefully this will save some reader a bunch of time down the road.
As always, I appreciate comments and feedback.
Steve Jones
©dkRanch.net May 2003