TempDB and SQL Error 845

  • I talked with our server team and they said that J and P are essentially physical drives on the SAN dedicated to SQL Server (although they may have been keeping things simple for my sake since I'm not familiar with SANs). They are going to take the server down tonight and double-check all the physical connections (fiber, cards, etc.) to make sure something didn't get hosed during the last physical relocation of the server. I guess that makes sense, since the last physical relocation occurred at about the same time that the errors first started...although I just attributed my problems to the fact that SQL had to be restarted.

    Anyhow, if we get this resolved, I'll post the results.

  • Did they say what RAID level? Or are we talking just 2 physical disks with no redundancy of any form?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/26/2009)


    Did they say what RAID level? Or are we talking just 2 physical disks with no redundancy of any form?

    RAID 5.

    BTW: I feel sheepish on this one...but how can I check the initial size of TempDB? I tried sp_helpdb and sp_spaceused, but both indicate a ~3.3GB database. Yet I swear it was only ~2.9GB with 10% growth only a day or so ago.
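
    One way to check, for what it's worth - a sketch; as I understand it, sys.master_files records the size tempdb is recreated at on startup, while tempdb.sys.database_files shows the current, possibly autogrown, size:

    ```sql
    -- Size tempdb will be recreated at on the next restart (size is in 8 KB pages)
    SELECT name, size / 128 AS startup_size_mb
    FROM   sys.master_files
    WHERE  database_id = DB_ID('tempdb');

    -- Current size of the tempdb files (after any autogrowth)
    SELECT name, size / 128 AS current_size_mb
    FROM   tempdb.sys.database_files;
    ```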

  • The physical server was put back online, and TempDB was 9MB by the time I checked it (not sure what its initial size was). I ran one of the large jobs that's been failing with the Buffer Latch Type 2 error and it still failed; TempDB was 150MB by this point. I restarted the job and when I checked it in the morning, it had completed successfully...TempDB was about 4GB by then....but hey, at least the job finished this time.

    HOWEVER, the next normally scheduled run started a few hours later and failed pretty quickly with the Buffer Latch Type 2 error. I've tried numerous times to get it to run, but it keeps failing (same error). Still trying....the last attempt has been running for an hour or so.

    But I noticed something weird...TempDB's MDF file is 5GB but the LDF is now 21GB...WTH?
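
    As a quick way to keep an eye on how full that log really is (just a sketch; DBCC SQLPERF reports log size and percent in use for every database, tempdb included):

    ```sql
    -- Log file size and percentage in use for all databases, including tempdb
    DBCC SQLPERF (LOGSPACE);
    ```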

  • Ok, from that I'm going to guess that the starting size for tempDB is the default. 2MB. Not good when it's getting large.

    Use the Alter Database command, or view the properties of TempDB in Management Studio's Object Explorer, and set the starting size of TempDB to, say, 4GB and the log to 1GB.
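
    Something along these lines, as a sketch only (tempdev and templog are the default logical file names for tempdb, so check yours first with sp_helpfile run in tempdb):

    ```sql
    -- Set the starting sizes tempdb will be recreated at on the next restart
    ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 4096MB);
    ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 1024MB);

    -- A fixed growth increment instead of 10% is also worth considering, e.g.
    -- ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILEGROWTH = 512MB);
    ```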

    A log that size tells me that you're using a lot of temp tables (a lot!). Have a look through the job that's failing and see if there are any optimisations that can be done to the queries.
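
    To see what is actually chewing through tempdb while the job runs, a rough sketch using the tempdb space-usage DMV (SQL 2005 and later; it tracks tempdb allocations):

    ```sql
    -- Cumulative tempdb allocations per session, in MB (pages are 8 KB);
    -- user objects = temp tables / table variables,
    -- internal objects = sorts, hashes, spools
    SELECT  session_id,
            user_objects_alloc_page_count     / 128 AS user_objects_mb,
            internal_objects_alloc_page_count / 128 AS internal_objects_mb
    FROM    sys.dm_db_session_space_usage
    ORDER BY internal_objects_alloc_page_count DESC;
    ```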

    Do you have the perfmon counters?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I didn't think I'd ever see a log file quadruple the size of the data file, especially in less than a 24-hour period. That's what shocked me. I've NEVER seen that under normal circumstances.

    Optimization - yeah, it's pretty ugly. We've inherited a beast here and have been instructed to simply appease the beast until we migrate to a new system. But to give you an idea - the job calls sprocs that call views that are based on views ad nauseam. The entire process involves a string of stage tables...i.e. physical tables that get cleared and repopulated with each iteration within each daily run. The underlying tables weren't indexed (I was afraid the indexes would only make the I/O problems worse), but they are now (based on feedback from a previous topic). That seemed to help a little at first.

    The sprocs are really where the insanity begins....no temp tables or table variables...don't recall any cursors either, but there are a LOT of nested subqueries and derived tables...A LOT...spread throughout a dozen sprocs or so that get executed in order....some derived tables nested 4 or 5 levels deep and peppered with subqueries.....in short, it's very inefficient and, worse, very unreadable.

    The server team still thinks there is a hardware problem and are looking to do some more maintenance tonight. I'll try to get them to resize the TempDB beforehand....or let me do it.

    I asked about the PerfMon results and they said it looked pretty insane, but I'm not sure what they meant...whether it showed a lot of activity or a lot of latency. I'll see if I can get actual numbers. I think they were commenting on activity.

  • I've seen a job that 'rebuilds' indexes cause the log files to grow pretty big. Not quite to 4x the size of the data file, however.

  • Well, I'm guessing the problem has been resolved. Unfortunately, none of the recommendations I made were entertained. TempDB is still initialized to something less than 10MB with 10% autogrowth and no max (I recommended something more like 5GB, 500MB autogrowth, and no max, based on previous sizes under heavy stress). Also, all databases' data files are still on the same "drive" and all log files are on another (they said they'd entertain the idea of moving them if the problem didn't get resolved).....although I'll have to admit that I'm still not clear on the actual SAN configuration, other than that it is RAID 5. All in all, the only thing our server team needed to do was move some hardware to a new "enclosure"...and maybe (judging by the logs I can see through SSMS) remove DiskKeeper 2009.

    I'm still a bit skeptical, because the other factor to consider is that the data behind the failing process, the one that was causing the error, resets itself on the 28th day of each month. That means today (when the job is working fine) it's dealing with less data than it was yesterday (when the failure was occurring)......5+M records yesterday but only 500K today. I guess the real test will be closer to the end of the month when the data grows to 5+M again.

    I'm not sure if the moral of this story is that my company has a rock-star server team or that I just need to go back to my cave and write code....or maybe a little of both.

  • Uncle Moki (5/28/2009)


    I guess the real test will be closer to the end of the month when the data grows to 5+M again.

    I'd be quite willing to bet that the problem will be back next month.

    I'm not sure if the moral of this story is that my company has a rock-star server team or that I just need to go back to my cave and write code....or maybe a little of both.

    From the sound of things, not the first.

    More like burying one's head in the sand and hoping that the problem goes away sometimes actually does work (at least temporarily)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/28/2009)


    I'd be quite willing to bet that the problem will be back next month.

    In a way, I hope you are right....even if only in the name of forward-progress.

    GilaMonster (5/28/2009)


    I'm not sure if the moral of this story is that my company has a rock-star server team or that I just need to go back to my cave and write code....or maybe a little of both.

    From the sound of things, not the first.

    More like burying one's head in the sand and hoping that the problem goes away sometimes actually does work (at least temporarily)

    Yeah, my fear is that we are replacing the "If it ain't broke, don't fix it" mantra with, "If it ain't right, don't fix it until the failure is catastrophic." On the positive side, our server team hasn't let us down yet so at least I know I can trust them to be there when we need them.

  • Uncle Moki (5/28/2009)


    GilaMonster (5/28/2009)


    I'd be quite willing to bet that the problem will be back next month.

    In a way, I hope you are right....even if only in the name of forward-progress.

    GilaMonster (5/28/2009)


    I'm not sure if the moral of this story is that my company has a rock-star server team or that I just need to go back to my cave and write code....or maybe a little of both.

    From the sound of things, not the first.

    More like burying one's head in the sand and hoping that the problem goes away sometimes actually does work (at least temporarily)

    Yeah, my fear is that we are replacing the "If it ain't broke, don't fix it" mantra with, "If it ain't right, don't fix it until the failure is catastrophic." On the positive side, our server team hasn't let us down yet so at least I know I can trust them to be there when we need them.

    From what I have read on this thread, I don't think you can trust your server team or DBA much. These problems would never have occurred if either had been doing their jobs before this point. Many things are misconfigured, and clearly there wasn't much monitoring being done, or things wouldn't have gotten so bad in the first place! 😎

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Yeah, it's one of those situations where a lot has happened in a short amount of time...reorg, reloc, lay-offs, new vision, new roles, blah blah. Not making excuses, just providing a little perspective. But yeah, if there were people monitoring the SQL servers regularly (not just when they have problems), this wouldn't have happened. Also, if there was oversight on the quality of the dbo's and SSIS, this wouldn't have happened. For that matter, NONE of this would have happened if we had good dedicated DBAs...but we don't. We have "Server" folks that perform some DBA functions (permissions, creating the databases, backups, moving hardware around, etc.)...and "Developers" (like me) that try to control what goes into the databases...emphasis on "try", since there are times when the "Server" folks create DBs for end-users who know just enough to be dangerous (catastrophic, even) and upload SSIS/sprocs that really stress the server. This topic has been a perfect example of the problem with splitting DBA roles between non-DBA groups.

    Hopefully we will find a happy middle-ground .... or better yet, get some dedicated DBAs.

  • Did the one guy mention RAID 5...is that just for reads, with RAID 1 for the writes?

    Can someone explain the LUNs part...is there usually some number of drives at, say, 300GB each?

    If that's carved up, are you only getting a certain number of IOPS for the drive itself? How many IOPS can you get per disk, and if you split it between P and J, do you only get half the IOPS on each? How many LUNs per disk?

    For IO contention, use perfmon like Gail mentioned...that will give you some indication of whether you've got an IO bottleneck.

    So, has the buffer latch problem gone away? I'm still interested to see what this is too.

    Great thread.

  • The problem is gone, but I don't know if the problem is fixed. The reason it's difficult to troubleshoot is that the error only occurs when one of our jobs (a very inefficiently designed "system" of tables and sprocs and SSIS) has a full reporting period of data to deal with (20M records or so, staged across 20 tables or so). The job resets itself on the 28th day of the month (after the reporting period is over), so the workload is light right now. To duplicate the problem at this point, I either have to simulate a full load of data on the production box or wait until the normal production load fills up again....probably around the 20th or so...just to see if the Buffer Latch error returns.

    Two new things I learned:

    1. The server team was seeing some driver errors for the SAN...and the problem started after they moved everything to a new "enclosure". The errors went away after they moved to the original enclosure. So maybe the problem IS fixed.

    2. The SAN consists of some virtual drives. Not sure what this means exactly...unless it's analogous to a Virtual PC's virtual drive....in which case, I'd think it'd perform poorly compared to a real HDD.

    So, yeah...it's hard to tell if the problem is gone. I could stress the server tonight as a "test for failure"...but I'd hate to set off alarms over the weekend in case I successfully bring the server down again. At this point, I'll just have to wait and see what happens as the data builds up for this reporting cycle. I'll update this thread when I find out more.

  • Gila,

    Probably IO contention.

    What do the perfmon counters Physical Disk: Avg sec/read, Avg sec/write and % idle time look like for the drives with database files?

    Is that the _Total instance or the individual disks that I should select?
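
    If mapping the perfmon counters to the right disks is awkward, a rough per-file view from inside SQL Server might help. This is only a sketch, using sys.dm_io_virtual_file_stats (SQL 2005 and later); the stall counters are cumulative since the last restart.

    ```sql
    -- Approximate I/O latency per database file (cumulative since last restart)
    SELECT  DB_NAME(vfs.database_id)        AS database_name,
            mf.physical_name,
            vfs.num_of_reads,
            vfs.num_of_writes,
            CASE WHEN vfs.num_of_reads = 0 THEN 0
                 ELSE vfs.io_stall_read_ms / vfs.num_of_reads END   AS avg_read_ms,
            CASE WHEN vfs.num_of_writes = 0 THEN 0
                 ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    sys.master_files AS mf
            ON mf.database_id = vfs.database_id
           AND mf.file_id     = vfs.file_id
    ORDER BY avg_read_ms DESC;
    ```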
