November 22, 2010 at 10:39 am
This may be a dumb question but does anyone know if there's any formula for calculating even an estimated time to complete for the DBCC command CHECKDB or even CHECKTABLE? Even if its just a very rough estikmate that can be off by +/- %25 ?
I'm investigating some "incorrect checksum" messages and because of teh size of the DB "over 200GB" I'd like to get a rough idea of how long it may take to run these. I am lucky in that if CheckTable can find the problem that CHECKDB would then I can use it instead for I am %90 certain the probelm is specific to one or more tables (of about a dozen in a DB with over 2,000 tables) since the error can be seen occurring everytime a specifc process we have executes and it uses only a dozen of the tables in the DB.
Then again if this is a hit or miss, something not consistently reproducable then using Checktable may be a waste. I just need to do this in as little time as I can and with a 200GB+ DB there is no such thing as quick.
Thanks
Kindest Regards,
Just say No to Facebook!November 22, 2010 at 10:56 am
Nope.
If you don't know what the problems are, do a full checkDB. Two reasons
1) If you post here asking for help, it's the first thing I'll ask you to do.
2) If you know you have problems and don't know what they are, there's no way you can be sure by running checktable that you've found all the problems. You may have missed the worst.
p.s. 200GB is really not a large database. Unless you're got poor IO throughput things shouldn't take forever.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 22, 2010 at 11:25 am
GilaMonster (11/22/2010)
Nope.If you don't know what the problems are, do a full checkDB. Two reasons
1) If you post here asking for help, it's the first thing I'll ask you to do.
2) If you know you have problems and don't know what they are, there's no way you can be sure by running checktable that you've found all the problems. You may have missed the worst.
p.s. 200GB is really not a large database. Unless you're got poor IO throughput things shouldn't take forever.
Thanks for the link Gail. I already had started a DBCC CHECKDB before I posted my question. THe system I'm working on is not a production server byt a test/dev one and so it can take a while. For example the CHECKDB 'WITH PHYSICAL ONLY' option took over an hour. Maybe thats not that long and I'm justr being unreasonable inexpectation of time to complete.
I do have one addition question on this. Assuming that the DB you are working with has not been moved or restored to and is unchanged (the mdf & ldf files are in teh same place and have not changed in size) and you executing a process that queries a specifdic set of tables raises the error
SQL Server detected a logical consistency-based I/O error: incorrect checksum
then is it logical to assume the execution of that process again should result in the same error if the problem is within the data file as opposed to being something hardware related?
Kindest Regards,
Just say No to Facebook!November 22, 2010 at 11:37 am
A logical consistency-based I/O error is an invalid page in the data file (further proven by the incorrect checksum). A physical I/O error is one where the OS was asked for a page and threw an error. That's the OS/hardware one.
The checksum error could have been caused by faulty hardware
The checkDB option you want, for diagnosis of existing corruption is
DBCC CHECKDB (<Database Name>) WITH NO_INFOMSGS, ALL_ERRORMSGS
With an unknown amount of existing problems you don't want to run the quickest, you want to run the most thorough. You need to find the errors (all of them) so that you can make an educated decision on fixing.
Do you have a backup that was taken before the corruption occurred?
As for how long the checkDB takes, how long does it usually run? (You are running it regularly aren't you?)
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 22, 2010 at 1:33 pm
Gail, Thru some testing, while waiting for a restore of the DB I can DBCC against (takes 3 hours plus or minus an hour to restore the DB ) and I found the actual query that would trigger this error msg. I captured the T-SQL this process runs as well as what criteria and I ran it from SSMS on my desktop and 3 times in a row the thing produced the same error at the same query (there are 1 queries in total this process runs for each identity; ex WHERE TABLE.ID = 1 ).
Here's the kicker, a minute later I started removing the queries form this batch of 11 that were not failing till I had just the one culprit and the error stopped. I then reverted back and now I cannot reproduce the error with the same set of code I just got through using. Nothing has changed with the DB in that time as far as file growth or location. I am no guru but that sounds to me like its a hardware issue and not a data one for if it were data it should be consistently reproducible.
Thoughts?
Thanks for your help
Kindest Regards,
Just say No to Facebook!November 22, 2010 at 1:44 pm
Run a checkDB, see the output.
Checksum error is a damaged page, it's a page that has been modified by something since SQL wrote it out to the filesystem. The main cause of 'disappearing' corruptions is the page getting deallocated (deletes or index rebuilds)
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 22, 2010 at 1:59 pm
GilaMonster (11/22/2010)
Run a checkDB, see the output.Checksum error is a damaged page, it's a page that has been modified by something since SQL wrote it out to the filesystem. The main cause of 'disappearing' corruptions is the page getting deallocated (deletes or index rebuilds)
"Run a CHECKDB"
As soon as my DB restore is done that is first on the list. It however will most likly be late before I see any info.
Thanks again
Kindest Regards,
Just say No to Facebook!November 22, 2010 at 10:41 pm
GilaMonster (11/22/2010)
Run a checkDB, see the output.Checksum error is a damaged page, it's a page that has been modified by something since SQL wrote it out to the filesystem. The main cause of 'disappearing' corruptions is the page getting deallocated (deletes or index rebuilds)
I finsihed the DBCC CHECKDB run and it found 0 allocation errors and 12 consistency errors in one table and thats it. The minimu repair level is of course repair_allow_data_loss. Needless to saythis is frustarting. i do have backups, tons of backups. We have full backups twice a day and log backups (REOCVERY MOEDL = FULL) every 15minuutes.
I did have something else odd happen though and it may be nothing but I'm gonna ask anyway. I ran CheckDB again onm the same DB but later on when there was few if any users at all logged on. It returned 0 allocatyyions erros and only 8 consistency errors on the same table. Is this a possible norm with CHECKDB, that it returns less erros with sucessive runs when no data fix has been executed between the 2 calls to CHECKDB?
I have another question to with CHECKDB. I restored a copy of this DB that was done 1 hour after the error that started all of this was redcorded in the event log. The rror occurred at 705PM and the backup is from 830PM. When I ran CHECKDB against this 830PM backup copy of the DB it reported 0 errors. Shouldn't it have reported at least one error? The same error or problme that caused the checksum error in the event log?
Thanks
Kindest Regards,
Just say No to Facebook!November 22, 2010 at 10:53 pm
YSLGuru (11/22/2010)
GilaMonster (11/22/2010)
I finsihed the DBCC CHECKDB run and it found 0 allocation errors and 12 consistency errors in one table and thats it. The minimu repair level is of course repair_allow_data_loss. Needless to saythis is frustarting.
Do you want advise or just commiserations? If advice, post the errors (all of them).
Is this a possible norm with CHECKDB, that it returns less erros with sucessive runs when no data fix has been executed between the 2 calls to CHECKDB?
As I said earlier
"The main cause of 'disappearing' corruptions is the page getting deallocated (deletes or index rebuilds)"
Shouldn't it have reported at least one error? The same error or problme that caused the checksum error in the event log?
As I said earlier
"The main cause of 'disappearing' corruptions is the page getting deallocated (deletes or index rebuilds)"
One point. If this was my database, I'd be running some pretty intensive diagnostics on that drive right about now. All corruptions are typically caused by a faulty IO subsystem. If you don't find the cause, it can happen again and again and again....
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 23, 2010 at 11:32 am
Gail,
Of course I'm not just looking for commiserations. Because of an NDA with the vendor (the software app that uses this DB) I have to check with legal anytime I post something that explcitly references the DB.
Here is the results returned from the command DBCC CHECKDB ('ACCT_LIVE') WITH NO_INFOMSGS, ALL_ERRORMSGS
Msg 8928, Level 16, State 1, Line 1
Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data): Page (1:20567271) could not be processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data), page (1:20567271). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 29493257 and -4.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data). Page (1:20567271) was not seen in the scan although its parent (1:20324078) and previous (1:20567270) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data). Page (1:20567272) is missing a reference from previous page (1:20567271). Possible chain linkage problem.
Msg 8928, Level 16, State 1, Line 1
Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data): Page (1:20974218) could not be processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data), page (1:20974218). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 63047689 and -4.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data). Page (1:20974218) was not seen in the scan although its parent (1:20920382) and previous (1:20974217) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data). Page (1:20974219) is missing a reference from previous page (1:20974218). Possible chain linkage problem.
Msg 8928, Level 16, State 1, Line 1
Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data): Page (1:21153927) could not be processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data), page (1:21153927). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 79824905 and -4.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data). Page (1:21153927) was not seen in the scan although its parent (1:21038218) and previous (1:21153926) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 594101157, index ID 1, partition ID 72057597575036928, alloc unit ID 72057601783365632 (type In-row data). Page (1:21153928) is missing a reference from previous page (1:21153927). Possible chain linkage problem.
CHECKDB found 0 allocation errors and 12 consistency errors in table 'TRANS_FIN' (object ID 594101157).
CHECKDB found 0 allocation errors and 12 consistency errors in database 'ACCT_LIVE'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (ACCT_LIVE).
I have restored multiple copies of this DB from different backup points b/c the first one I tested (when I ran CHECKDB against it) foudn no errors even though it was a copy of the DB done an hour after the event log recorded the first CHECKSUM error.
There may be no relation but just in case there is, the live server is a 64 bit OS where as the Test/dev server I'm restoriung the backups to is 32 bit.
Kindest Regards,
Just say No to Facebook!November 23, 2010 at 11:55 am
Odd. The checkDB shows 3 pages with errors, all in the clustered index.
Could something have deleted rows from that table (and resulted in pages deallocated) before the backup that you're restoring?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 23, 2010 at 12:31 pm
GilaMonster (11/23/2010)
Odd. The checkDB shows 3 pages with errors, all in the clustered index.Could something have deleted rows from that table (and resulted in pages deallocated) before the backup that you're restoring?
Because of the vendor (I can't say any names for legal reasons) I'd say anything is possible. I've seen some incredibly bad code used by this app, so bad that it was a wonder it even worked. In a few cases we had to report teh problem and even include the possible fix if we wanted to get an update that had not just working code but better code (i.e. a fix that used actual set based approach instaed of a cursor whcih sometimes seems to be the only method the developrs know that work this app).
The table in question is huge, the largest in the DB coming in at 32+ million rows with a Used Table Space of 33,246,296 KB and Used Index Space of 21,896,752 KB. Normally this table doesn't get any DELETEs executed against it but it can happen.
I checked the Transaction log, the log file the app updates whenever an item is deletd using the program which is the only way any transactiond ata should be deleted and the last deletion was on 11/18. The first sign of trouble poppe up late Sunday eveneing on 11/21.
Let me know if I missed something or did not giove enough detail.
UPDATE:
While I was replying to your question my most recent run of CHECKDB against a restored copy of my database complted and the results are rather suprising.
The copy of the DB I just ran CHECKDB against is from last night at 830PM and it showed no errors at all.
I ran CHECKDB against the live database last night starting at 22:50 CST. That run of CHECKDB showed 8 errors. I ran CHECLKDB again in the live DB about 2 hours later and it retruned the 12 errors that I posted in my last post.
Whats weird is it seems as if when the DB is on my test server CHECKDB returns no errors and when its the live DB I get varying results with CHECKDB and I may or may not get CHECKSUM errors when running the same quqery that produced teh first error msg on 11/21.
Does this sound to you like its a hardware problem?
Kindest Regards,
Just say No to Facebook!November 23, 2010 at 1:22 pm
As I mentioned earlier, any corruption is an indication of problems in the IO subsystem.
Corrupt pages don't usually appear and disappear, unless the pages are getting deallocated (index rebuilds, deletes)
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 23, 2010 at 2:56 pm
GilaMonster (11/23/2010)
As I mentioned earlier, any corruption is an indication of problems in the IO subsystem.Corrupt pages don't usually appear and disappear, unless the pages are getting deallocated (index rebuilds, deletes)
Something that does happen, and a lot, with this table is the moving around of data. This is hard to explain especially if you are used to working with decent to well-designed tables which I assume most are. I know everyone has horror stories about DB design but. I am going to try and explain in as brief a manner as I can how the table does inserts which will then (hopefully) articulate what I mean by INSERTS of new data causing a lot of moving around.
In summary, the Primary Key is clustered but its not auto-incrmeenting and so the the assignment of each new PK value as rows are adeed to the table is not simply the Max(PK) + 1 but is Max(PK) within a subset of a range of allowed values for each type of transaction. This reulst in most new rows added to the table being inserted between existing data.
DETAILED EXPLINATION:
The table has over 200 columns (I know that’s crazy) so for brevity here is an abbreviated DDL:
CREATE TABLE dbo.TRANS_FIN ( hID NUMERIC (18, 0) NOT NULL,
iType NUMERIC (18, 0) NULL,
sAmount NUMERIC (21, 2) NULL,
sAsOfDate DATETIME NULL,
CONSTRAINT PK_TRANS_FIN PRIMARY KEY CLUSTERED ( hID ASC )
WITH (PAD_INDEX=OFF, STATISTICS_NORECOMPUTE=OFF, IGNORE_DUP_KEY =OFF,
ALLOW_ROW_LOCKS=ON, ALLOW_PAGE_LOCKS=ON, FILLFACTOR =90
) ON [PRIMARY]
) ON [PRIMARY]
This table holds every kind of financial transaction from payments to charges to Bills to Checks. Every transaction type has its own unique set of numbers that start at 1 and increment by 1. This control number that each transaction has directly correlates to its Primary Key value but that relationship is a formula. The formula is:
PrimaryKey = Base Number * Transaction Type + Next Number
Base Number = 100000000
Transaction Type: A value that represents the transaction type. Charge = 1, Payment = 5, Bill or invoice = 6 and Check (payment of Bill/Invoice) = 10.
Next Number: This is first unused number within BaseNumber * Transaction Type.
So the first Charge created would have a Primary Key value of (100000000 * 1) + 1 aka 100000001. The next charge created would have a PK value of 100000002.
The first Payment transaction created would have a primary key value of (100000000 * 5) + 1 aka 500000001.
First Bill/Invoice created would have a PK value of 600000001
The first Check writtern would have a PK value of 1000000001.
So if you looked at the table after creating 2 Charges, Payments, Bills and Checks the PK values would look like this:
100000001
100000002
500000001
500000002
600000001
600000002
1000000001
1000000002
Now imagine this PK value assignment scheme when the table has millions of rows with a fairly equal distribution across the transaction types. When you create a new Charge or Payment its primary key is insretd within the clustered list and not added at the end as would be the case for normalized tables.
The tables Indexes are ALL set to %90 fill and there is only 1 file for the DB, no partitioning.
Hopefully that properly explains the workings of the table without being verbose and helps you see how a lot of moving around can occur.
Kindest Regards,
Just say No to Facebook!November 24, 2010 at 11:41 am
Gail - Hopefully your still watching this thread. I debated creating a new thread about this (what I discovered today that may be game changer on this investigation) but decided to reply to this thread first. I have to make a decision about what to do by tomorrow and since its the Holidays I didn’t; know if you would be reading the threads again till after the holiday.
This morning my IT guys told me that the server that host our SQL Server instance and our live DB has not rebooted in 10 days. It is scheduled to reboot daily and it is setup so that it is a forced shutdown and so its not rebooting for over a week is definitely more than just not normal.
They are checking into this but my question is this, could it be that there is no actual data corruption (as reported by CHECKDB) and that this is just a hardware problem? Every copy of the DB that I have restored form a backup (including those backups done of the live DB after I have run CHECKDB on live and it show errors) show no errors by CHECKDB. I still get errors though when I run CHECKDB on the live DB (after hours).
My first response would be to say it must be a hardware problem but we use DPM to do backups and it does not do backups like native SQL Server backups are done. I'm not familiar with other third party backup *& recovery apps so I don't know if they are like native SQL Server backups or like DPM and so the fact we use DPM may be very relevant to this.
DPM backups up the changed bits and not the DB and or log file. At least this is what I've been told. Assuming that is correct then the fact that restored copies of the DB not showing errors may not necessarily mean that the errors reported in the live DB are not actual there and just hardware related.
If the bits that contain the corrupt pages have not changed since DPM last backed them up then any corruption of those would not be pulled into the backup and therefore not included in restores. Does that make sense?
My current plan of action is to reboot the system first and run CHECKDB again. If no errors are reported then I turn this over to IT to let them go over the hardware and make sure there is not something in need of repair. I'm thinking this is hardware related and that whatever is causing these errors reported by CHECKDB is also preventing the scheduled reboots.
Should CHECKDB show errors after the reboot then I will restore a recent copy of teh DB to our live system, verify CHECKDB shows no errors and then drop the old DB.
Kindest Regards,
Just say No to Facebook!Viewing 15 posts - 1 through 15 (of 17 total)
You must be logged in to reply to this topic. Login to reply