Faultfinding possible I/O issues

jblovesthegym

Ten Centuries

Points: 1048
April 12, 2012 at 5:51 am

Hi all,
Over the last few weeks, 3 of our secondary (log-shipped) DBs have been marked 'Suspect', requiring a drop and restore. I've been advised to check the I/O and try to faultfind.
What practices/native tools exist for SS2K to get started on the investigation? BTW, if the initial diagnosis involves creating non-temp tables/objects, I would rather avoid this as even making slight changes involves having to raise an RFC.
Also, would you recommend checking I/O on both Primary + Secondary servers?

Gail Shaw SSC Guru Points: 1004504 · Answer 1

If the secondary has gone suspect and the primary is fine, then it's the secondary's IO subsystem that's the problem.

Start with the windows error log, any RAID logs, SAN logs. If you can, stop SQL on there and run SQLIOSim (I wouldn't run it with SQL running, too much load)

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

jblovesthegym Ten Centuries Points: 1048 · Answer 2

Hi Gail,

I've done some Perfmon analysis covering the 100 seconds after log shipping runs (every 15 minutes, on the hour). Only the logical disk today (physical tomorrow), but the results for the W: drive, to which the logs are copied (and from which they are restored), are as follows. I presume the values are milliseconds:

Avg Disk Bytes/Read:

- Avg = 18,199

- Max = 26,021

Avg Disk Bytes/Transfer:

- Avg = 40,651

- Max = 65,536

Avg Disk Bytes/Write:

- Avg = 53,696

- Max = 65,536

Gail Shaw SSC Guru Points: 1004504 · Answer 3

Perfmon is not the place to look, you don't have disk performance problems, you have disk stability problems.

And no, the figure for bytes/write is not milliseconds. It's bytes. It shows the average number of bytes written per write operation.
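To put the Perfmon figures above in more familiar units, here is a minimal Python sketch that converts them from bytes to KiB. The numbers are copied from the earlier post; the dictionary layout and the `to_kib` helper are illustrative names only, not part of any tool.

```python
# Perfmon "Avg Disk Bytes/..." values quoted in the post above.
# These are sizes in bytes per I/O operation, not times in milliseconds.
counters = {
    "Avg Disk Bytes/Read": {"avg": 18_199, "max": 26_021},
    "Avg Disk Bytes/Transfer": {"avg": 40_651, "max": 65_536},
    "Avg Disk Bytes/Write": {"avg": 53_696, "max": 65_536},
}


def to_kib(n_bytes):
    """Convert a byte count to KiB (1 KiB = 1024 bytes)."""
    return n_bytes / 1024


for name, stats in counters.items():
    avg_kib = to_kib(stats["avg"])
    max_kib = to_kib(stats["max"])
    print(f"{name}: avg {avg_kib:.1f} KiB, max {max_kib:.1f} KiB")
```

The 65,536-byte maximum works out to exactly 64 KiB per operation. None of this says anything about whether the disk is stable; it only describes the size of each I/O.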

jblovesthegym Ten Centuries Points: 1048 · Answer 4

GilaMonster (4/12/2012)
Perfmon is not the place to look, you don't have disk performance problems, you have disk stability problems.

Agreed, but I don't have a lot of immediate avenues of investigation left, so I was reaching. The event logs (Application/System) showed nothing suspicious around or immediately before the initial failure. We don't have the SAN/RAID guys in until Monday, and stopping the SQL service, even temporarily on the Secondary, will require a bunch of form-filling. OK, actually swapping the disk out is not a lengthy procedure, but I need to make a business case for the switch, and thus need proof that the disk is not quite stable.

Gail Shaw SSC Guru Points: 1004504 · Answer 5

Nothing in any of the error logs?

jblovesthegym Ten Centuries Points: 1048 · Answer 6

Hunted around but couldn't make much sense of it...

Error: 5180, Severity: 22, State: 1

Could not open FCB for invalid file ID 0 in database 'XXXXXXXXXXXXX'
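For anyone who wants to pull entries like this out of a large log without creating any objects on the server, here is a minimal Python sketch. It assumes the SQL Server error log has been copied or exported to plain text and that each error header looks like the one quoted above, with the message on the next non-blank line. The function name and the severity cut-off of 20 (the level from which SQL Server treats errors as fatal) are my own choices for illustration.

```python
import re

# Matches the header line logged for each error,
# e.g. "Error: 5180, Severity: 22, State: 1".
ERROR_RE = re.compile(r"Error:\s*(\d+),\s*Severity:\s*(\d+),\s*State:\s*(\d+)")


def high_severity_errors(lines, min_severity=20):
    """Return (error, severity, state, message) tuples at or above min_severity.

    The message is taken from the next non-blank line after the header,
    which is an assumption based on the excerpt quoted in this thread.
    """
    lines = list(lines)
    hits = []
    for i, line in enumerate(lines):
        match = ERROR_RE.search(line)
        if not match:
            continue
        error, severity, state = (int(g) for g in match.groups())
        if severity < min_severity:
            continue
        # Take the first non-blank line after the header as the message text.
        message = next((l.strip() for l in lines[i + 1:] if l.strip()), "")
        hits.append((error, severity, state, message))
    return hits


# Example using the two lines quoted above.
sample = [
    "Error: 5180, Severity: 22, State: 1",
    "",
    "Could not open FCB for invalid file ID 0 in database 'XXXXXXXXXXXXX'",
]
for hit in high_severity_errors(sample):
    print(hit)
```

To run it against a real file, pass the open file object (or a list of its lines) instead of `sample`; the path to the exported log is whatever you saved it as.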

Gail Shaw SSC Guru Points: 1004504 · Answer 7

What about the windows event logs?

jblovesthegym Ten Centuries Points: 1048 · Answer 8

GilaMonster (4/13/2012)
What about the windows event logs?

Zip. The app log filled up with informational messages and doesn't stretch back that far. However, I DID check it on the morning in question (the 11th) and found nothing. The only other 'critical' error was in the System log: a virtual disk service error, logged about 8 hours before and again after the restore job failed:

"Unexpected failure. Error code: 2@0200001D"

jblovesthegym Ten Centuries Points: 1048 · Answer 9

April 24, 2012 at 10:47 am

Any further thoughts, anyone?

Gail Shaw SSC Guru Points: 1004504 · Answer 10

Both of the errors you've listed indicate there's some form of disk problem. Maybe contact the SAN vendor (assuming it's a SAN) and get them to check it out.

TheSQLGuru SSC Guru Points: 134017 · Answer 11

SQLIOSIM is the tool to use to validate that an IO subsystem will properly handle SQL Server IO-style workloads.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

MVDBA (Mike Vessey) SSC-Insane Points: 21797 · Answer 12

http://support.microsoft.com/default.aspx?scid=kb;en-us;815183

The error "Could not open FCB for invalid file ID %d in database '%.*ls'." is known to cause data corruption, thread errors and runtime errors.

Are there different service pack versions on the shipper and shippee (primary and secondary)?

MVDBA

Gail Shaw SSC Guru Points: 1004504 · Answer 13

TheSQLGuru (4/25/2012)
SQLIOSIM is the tool to use to validate that an IO subsystem will properly handle SQL Server IO-style workloads.

But SQL needs to be stopped when running that. The aim is to validate the IO subsystem, not slaughter it.

MVDBA (Mike Vessey) SSC-Insane Points: 21797 · Answer 14

By the way, FCB stands for File Control Block, the physical file structure SQL uses to write data to and read data from storage.

I've had these before when I defragged a database and the log shipping made the same changes on the target.

MVDBA
