DBCC CHECKDB causing application timeouts?

  • We generally run DBCC CHECKDB nightly on all our databases. On one particular database, we've run into problems about 3 times now in the past 5+ months. The application sees timeouts at the exact time that CHECKDB runs on their database. The database is pretty small (< 2GB), and typically CHECKDB runs in under 2 minutes. When we see issues, CHECKDB runs 5-8 minutes. We also have replication running, and we can see that run considerably longer as well. I got lucky and captured 1 event in a perfmon trace, and I can see the data drive is busy, which is expected. This in itself is not unusual to see when CHECKDB runs. Here's what's striking me as odd. I can see a few minutes where transfers/sec drops, as does bytes/sec, but current queue length flat lines. After a few minutes of this, the server returns to normal.

    I haven't been able to determine if there's a disk issue, or some sort of issue within SQL. I know that my SQL processes that are timing out have a wait type of IO_COMPLETION. One thing that's odd to me is that the SQL error log doesn't have the errors indicating slow IO; "SQL Server has encountered 20 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file .....". The only reason why I can think of that this wouldn't be logged, is if SQL's having issues writing to the error log, which is on the same logical drive as our data files (the drive that's slow).

    Does anyone have any thoughts on this? Has anyone ever seen anything like this before?

    My initial thought is that it's a disk subsystem issue (SQL is clustered with SAN on the back-end), or an issue with the HBA card, driver, etc. The OS guy and the storage guy don't see anything amiss though.

    Let me know if there's more info I can provide, or if you've got any ideas of what I can look for.

    Much appreciated,

    Adam

  • This is what I can find on the subject:

    http://support.microsoft.com/kb/919155

    DBCC CHECKDB does take a lock on tables. The above is about DBCC itself recording a timeout, but in a system with slow I/O, it would make sense that it could cause queries to timeout while they wait, even without the exact situation the article is about.

    Does that help?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • -do you experience (dead) locks during your troublesome periodes ?

    - have you got deadlock reporting activated in your sqlserver errorlog ?

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • I do not have any deadlocking information in SQL (no deadlocking trace info from a SQL trace, or a trace flag). However, in my experience if a process fails as a deadlock victim, the error that the app sees reflects that. In this case, the app is just failing with a timeout error.

    That KB article is interesting. I've been looking into the IO_COMPLETION wait type to try to find out exactly what type of IO this might indicate, and BOL says this is not associated with a data page. This KB article would support that, since the KB article is talking specifically about IAM pages. Perhaps I'm running into contention on my IAM page(s).

    This error has only happened 3 times that I know of in the past few months. 2 of which I know occured after the server was rebooted (server was rebooted at ~10:30PM and CHECKDB runs at 12:10AM). I'm wondering if this might be related to the fact that the bpool would have been empty after the reboot, and still repopulating when CHECKDB ran......

    Thanks for the quick responses!!

  • GSquared (2/16/2010)


    This is what I can find on the subject:

    http://support.microsoft.com/kb/919155

    DBCC CHECKDB does take a lock on tables. The above is about DBCC itself recording a timeout, but in a system with slow I/O, it would make sense that it could cause queries to timeout while they wait, even without the exact situation the article is about.

    Does that help?

    Gus, I don't think CHECKDB takes out any locks in 2005 and above because it is actually run on an internal database snapshot. From what I remember, if SQL Server cannot create this internal database snapshot - then it will try to perform the integrity check on the actual database and to do that it needs to lock tables.

    Am i missing something?

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Books Online DBCC CHECKDB uses an internal database snapshot for the transactional consistency needed to perform these checks. This prevents blocking and concurrency problems when these commands are executed. For more information, see Understanding Sparse File Sizes in Database Snapshots and the DBCC Internal Database Snapshot Usage section in DBCC (Transact-SQL). If a snapshot cannot be created, or TABLOCK is specified, DBCC CHECKDB acquires locks to obtain the required consistency. In this case, an exclusive database lock is required to perform the allocation checks, and shared table locks are required to perform the table checks.

    Note that it says, "If a snapshot cannot be created....", it will take locks. That's what I suspect is going on here.

    The point that it may have been while the server was getting it's feet back under it after a restart might be connected. That's certainly worth digging into.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I think that from the SQL error log we can conclude that SQL is able to create the snapshot, so I don't believe the TABLOCK is the issue. Here's the pertinent info from the error log from when we saw the issue (I changed the database names to protect the innocent):

    2010-02-11 00:11:44.68 spid61 DBCC CHECKDB (systemDatabase) WITH no_infomsgs executed by <login name> found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 6 seconds.

    2010-02-11 00:11:46.70 spid61 6 transactions rolled forward in database 'userDatabase' (8). This is an informational message only. No user action is required.

    2010-02-11 00:11:51.48 spid61 0 transactions rolled back in database 'userDatabase' (8). This is an informational message only. No user action is required.

    2010-02-11 00:18:03.11 spid61 DBCC CHECKDB (userDatabase) WITH no_infomsgs executed by <login name> found 0 errors and repaired 0 errors. Elapsed time: 0 hours 6 minutes 18 seconds.

    You can see CHECKDB finished on systemDatabase at 00:11:44. Then CHECKDB moves on to userDatabase. The first thing it should to is create the internal snapshot and run recovery on it. From the next two messages, I think we can safely assume that the snapshot was created, and recovery was successfully run and finished around 00:11:51. But then CHECKDB didn't finish on userDatabase until 00:18:03. Typically CHECKDB takes < 2 minutes to run on userDatabase.

    Thanks,

    Adam

  • what type of replication is configured? And how about running dbcc between latency, if you know that in idle condition it not take more than 2 min?

    I am assuming that dbcc checkdb taking time on the same user database which you have configured for replication.

    ----------
    Ashish

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply