January 14, 2011 at 10:09 am
Recently getting error below and would like to know the LEAST INVASIVE way to troubleshoot/fix it. This happened 2 months ago and finally shut down the DB. A contractor (DB admin/programmer) who designed/installed many of the systems here did not want to restore from a backup or repair with DBCC CHECKDB. He instead built a new database and moved tables and data into it, because he was afraid that whatever was corrupted would also be corrupted in the backups. I don't question his methods because he has much more experience than I do, but he doesn't work here full time and I am ulitmately responsible. Regardless, we are getting this warning again. So, I'd like to find out what is causing it and how to prevent and fix it.
According to the event log, it has happened on 4 different dates(not consecutive) since the DB was rebuilt (I've been remiss in checking the event log). Coincidentally, each time it occurs within a few seconds after an AGent job runs in the middle of the night. The agent job reindexes 3 tables with this syntax:
dbcc dbreindex(tblCustomer) (table name is different for each one). I was wondering if the reindexing of these files might be contributing to the problem since they all occur within seconds after the reindex job runs. I disabled the job yesterday and did got get the errors this morning, but they were happening somewhat sporadically, anyway, so I can't say the reindex was the cause.
I am not familiar with DBCC CHECKDB. I have researched it and read all about it, but am afraid to use it for repair, because it suggests NOT doing it. I DID run it with PHYSICAL_ONLY and it came back with no errors. If I run DBCC CHECKDB (just like that with no options), will it give me more info but not try to repair anything?
I ran: DBCC checkdb (SFIdatacorruped) with physical_only,
DBCC UPDATEUSAGE (sfidatacorrupted),
DBCC CheckConstraints (SfiDataCorrupted)
on the old currupted database, because I wasn't afraid to break it. It found some errors and said to run UPDATEUSAGE which I did, but said to run the CHECKCONSTRAINTS afterward. When I did that, it said the database wasn't there- check the catalog (don't know how to do that yet). It would work with PHYSICAL_ONLY, but came back with errors about TempDB being out of spaceor a system table is inconsistent. I know none of this matters concerning the production DB, but it was good experience.
ANY guidance would be greatly appreciated! Sorry for the epistle, but I figured the history might be useful. Thanx!
Log Name: Application
Source: MSSQLSERVER
Date: 1/13/2011 2:00:50 AM
Event ID: 825
Task Category: Server
Level: Information
Keywords: Classic
User: SFI\Administrator
Computer: VSQL2.sfi-wfc.com
Description:
A read of the file 'D:\SqlData\SfiData.mdf' at offset 0x0000008ad60000 succeeded after failing 1 time(s) with error: incorrect pageid (expected 1:284336; actual 14854:3866640). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
Wallace Houston
Sunnyland Farms, Inc.
"We must endeavor to persevere."
January 16, 2011 at 6:56 pm
If you can, I'd restore the database to another server and run DBCC CHECKDB and see what errors it may find. This would be the first step in identifying the problem.
January 16, 2011 at 10:40 pm
An 825 error requires no repairs.
825 means that an IO operation failed, but then succeeded on the 2nd, 3rd, 4th or 5th retry. Hence why it's classified a severity 10. The database is (currently) intact and undamaged.
What this does mean is that there is something wrong with the IO subsystem, that it's intermittently either failing requests or returning incorrect data. This needs investigating and rectifying before that 825 becomes an 824 or 823 (which does indicate damaged database)
You should be running checkDB on a regular basis to ensure that if the DB does get corrupted, you know about it early. This isn't checkDB with repair, it's either checkDB with no options or
DBCC CHECKDB (<Database Name>) WITH NO_INFOMSGS, ALL_ERRORMSGS
Don't run Physical_only unless you're limited on time as it will miss errors (because it doesn't do all checks)
p.s. Your contractor's 'solution' was potentially overkill. The only time corruption warrants the recreation of a new DB and copying of data is when you have irreparable corruption and no good backups.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 18, 2011 at 6:15 am
On Fri 1/14 we got the Event 824 message. We renamed the most current full backup so it wouldn't get copied over and ran a new full backup. We do full backups every night and logs every hour.
Thanx for the input!
Wallace Houston
Sunnyland Farms, Inc.
"We must endeavor to persevere."
January 18, 2011 at 6:25 am
Caruncles (1/18/2011)
On Fri 1/14 we got the Event 824 message.
Now you do have a problem.
Run the following, post the full and complete results and get some diagnostics run on that IO subsystem before it gets worse.
DBCC CheckDB(<Database name>) WITH No_INFOMSGS, ALL_ERRORMSGS
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 18, 2011 at 6:35 am
I ran DBCC CheckDB(SFIData) WITH No_INFOMSGS, ALL_ERRORMSGS
and got 'Command(s) Completed Successfully'. Don't that beat all?
Wallace Houston
Sunnyland Farms, Inc.
"We must endeavor to persevere."
January 18, 2011 at 6:57 am
No. It could still mean that the IO subsystem is sometimes returning incorrect information and that the retries aren't always enough any longer.
GET SOME DIAGNOSTICS RUN ON THAT IO SUBSYSTEM!!!
An IO subsystem that returns incorrect data on some reads has a problem. You need to identify the cause and fix it before something breaks for good. Get the DB onto some other storage if you can.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 18, 2011 at 8:27 am
I set up a data collector on the Reliability & Performance Monitor last week. I didn't know what I was doing, but I do get results indicating low memory. It says we only have 4 gigs and 156mb available, which is surprising since the person who installed this told me it had a lot more. The system is just a year old this month. We are going to allocate more today.
Trying to figure out how to show your results, but this one sounds significant -
The system is experiencing excessive paging
Cause: Available memory on the system is low.
Details: The total physical memory on the system is not capable of handling the load.
Resolution: Upgrade the physical memory or reduce system load
Related: Memory Diagnosis
--------------------------------------------------------------------------------
Severity: Warning
Warning: High rate of 32 split I/O per second was detected. This represents 71 percent of total I/O processes. Consider size of I/O processes compared to disk format size and defragment or reformat any disks with a high split I/O rate.
Memory
Utilization: 96 %
Memory: 3999 MB
Disk
Top Disk by IO Rate: 0
IO/sec: 4
Wallace Houston
Sunnyland Farms, Inc.
"We must endeavor to persevere."
January 18, 2011 at 9:15 am
Warning: The average disk queue length is 6. The disk may be at its maximum transfer capacity due to throughput and disk seeks
I assume this is related to the disk format message- but I don't know.
Wallace Houston
Sunnyland Farms, Inc.
"We must endeavor to persevere."
January 18, 2011 at 10:08 am
Low memory won't cause corruption, poor IO performance won't cause corruption. A disk queue length of 6 is not high. There is something WRONG somewhere in the IO subsystem.
Windows reliability is not likely to find it. Check the RAID logs, check the SAN logs, run diagnostics that came with whatever the IO subsystem is. Check the windows event log. Make sure drivers and firmware are updated.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 18, 2011 at 11:01 am
I figured the memory issue couldn't break anything . We've checked the RAID controller and saw no errors. Checked driver updates and they are all current. I'm sure there is something else I can check on the disk/IO side. What about this error:
Severity: Warning
Warning: High rate of 32 split I/O per second was detected. This represents 71 percent of total I/O processes. Consider size of I/O processes compared to disk format size and defragment or reformat any disks with a high split I/O rate
Thanx for your help!
Wallace Houston
Sunnyland Farms, Inc.
"We must endeavor to persevere."
January 18, 2011 at 11:29 am
It's not an error, it's listed as a warning. If I'm not mistaken, that indicates that the disk partitions aren't aligned properly (google disk alignment). Again, not an error, not going to cause 825 and 824 errors. Something is resulting in the IO subsystem returning incorrect data some times.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 19, 2011 at 11:45 am
I would ask your server(SAN) team to create a separate LUN and create a new drive for you. Request that they create the LUN from a totally separate set of drives of from the original. Depending on how they have the SAN setup, they may or may not have another resource pool availabe, but they can tell you this.
Move a copy of the database over to the new LUN and make it the production copy. Monitor it for any errors. Run DBCCs on the original copy to check for any disk issues.
March 3, 2011 at 12:35 pm
Since I AM the team, and am not really a DBA, I took the least-invasive measure first, which was to set up a weekly defrag. I haven't seen the error since and it's been over a month. Another measure I took was to modify a SQL Agent job which was reindexing 2 of the largest and most-used tables every night. I changed that to weekly. This was set up by another DBA after a single incident of an indexing problem. I've since read how frequent re-indexing is not a good thing. Another measure which has been discontinued at my SERIOUS request, was to stop a user ( who was the DBA before I got here) from stopping and restarting SQL service everytime a job got blocked. That's like cutting a birthday cake with a chain saw. I just view the blocks with WHo_is and kill the blocker.
Wallace Houston
Sunnyland Farms, Inc.
"We must endeavor to persevere."
March 3, 2011 at 5:09 pm
Caruncles (3/3/2011)
Since I AM the team, and am not really a DBA, I took the least-invasive measure first, which was to set up a weekly defrag.
Defrag/rebuild does not fix corruption. At best it makes it temporarily stop being detected.
Error 825 says there is an intermittent IO failure. The disk is occasionally failing to return data or returning incorrect data. It's not something you can just ignore. Leave it and that is very likely to become permanent corruption and cause real serious problems.
If you don't know how to investigate the IO subsystem, get some help from your vendor, hardware supplier, etc.
I just view the blocks with WHo_is and kill the blocker.
Why? What if the process you kill is running your payroll?
The solution for severe blocking is to fix the cause, not just kill processes without analysis. That's just making the symptoms disappear.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply