SHRINKFILE errors

Question

SHRINKFILE errors

CJR

SSC Enthusiast

Points: 108
More actions
October 23, 2007 at 10:42 pm

#376008

I am trying to shrink two large files that have way too much free space.
Several other files I had no trouble shrinking. These two I got these errors:
Server: Msg 845, Level 17, State 1, Line 1
Time-out occurred while waiting for buffer latch type 4 for page (4:694053), database ID 7.
-and-
Server: Msg 8966, Level 16, State 2, Line 1
Could not read and latch page (7:19851028) with latch type SH. Latch failed.
I did a CHECKDB and that reported:
CHECKDB found 0 allocation errors and 0 consistency errors in database 'PRODDNCDB'.
Any help on what the next course of action might be?
Thanks,
CJR

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

Paul Randal One Orange Chip Points: 29544 More actions · Answer 1

It looks like you're usking SQL Server 2000 - I changed that message in SS2005 to say 'Unable to' instead of 'Could not'.

Which build of SS2000 are you using? There were several bugs in shrink in SS2000.

Could you do the following please:

DBCC TRACEON (3604)

GO

DBCC PAGE (procdncdb, 7, 19851028, 3)

GO

and tell me the value of m_type in the page header output? I think it's an allocation bitmap page (type 8 or 11) that couldn't be latched internally to update the allocation status of a page or extent. That'll help me lookup and see if this is a known issue.

Thanks

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

Paul Randal One Orange Chip Points: 29544 More actions · Answer 2

One other thing - is the system *really* busy while the shirnk is going on? Are the disk queue lengths really long? This could be an IO that never completes - that can sometimes lead to an 845 error.

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

VaiydeyanathanVS Hall of Fame Points: 3053 More actions · Answer 3

This is a simple task through the wizard for shrinking a database file. Sometimes, if more connections is being sent to database for accesing the objects, this may lead to slow down the process of your actions.

For more info :- better try this....

http://msdn2.microsoft.com/en-us/library/ms190488.aspx

Paul Randal One Orange Chip Points: 29544 More actions · Answer 4

That's not relevant to the discussion here - posting a link to a BOL topic saying how to run shrink isn't going to help figure out the errors posted here.

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 5

I've been having a similar thing recently while running CheckDB and have attributed it to overloading the IO subsystem. I've know for some time that I'm straining the disks.

What I keep seeing is a set of "SQL has detected x IOs exceeding 15 seconds on file..." followed shortly by a latch timeout, sometimes continuing to wait, sometimes giving up.

To see what your IO system is doing, try running perfmon with the following counters while you're trying to shink.

Physical disks: transfers/sec

Physical disks: avg sec per read (NB, not reads/sec)

Physical disks: avg sec per write

Physical disks: percentage idle time

Physical disks: average disk queue length (may not have much meaning if you're on a SAN)

Set for intervals of 15-30 sec and check the result for any spikes or dips that correspond to the time of the errors.

If you can, run SQLIOSim on the server in an idle period and see if that picks up any errors. See http://support.microsoft.com/kb/231619 for details on IOSim, including how to run it.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

CJR SSC Enthusiast Points: 108 More actions · Answer 6

Thanks all. I have been suspicious of this IO subsystem and still am.

I am the only one on the server so I don't think heavy load is an issue.

Here is @@version

I had re-run the SHRINKFILE since the post, and now a different page is referrenced in the same error message. I ran the trace/PAGE against the currently referrenced page:

[font="Courier New"]PAGE: (7:18729706)

------------------

BUFFER:

-------

BUF @0x01039BC0

---------------

bpage = 0x4704E000 bhash = 0x00000000 bpageno = (7:18729706)

bdbid = 7 breferences = 0 bstat = 0x9

bspin = 0 bnext = 0x00000000

PAGE HEADER:

------------

Page @0x4704E000

----------------

m_pageId = (7:18729706) m_headerVersion = 1 m_type = 2

m_typeFlagBits = 0x0 m_level = 0 m_flagBits = 0x0

m_objId = 2084918499 m_indexId = 1 m_prevPage = (7:18729395)

m_nextPage = (7:2694408) pminlen = 11 m_slotCnt = 310

m_freeCnt = 4066 m_freeData = 6927 m_reservedCnt = 0

m_lsn = (19847:87350:28) m_xactReserved = 0 m_xdesId = (0:0)

m_ghostRecCnt = 0 m_tornBits = 980739258

Allocation Status

-----------------

GAM (7:18404352) = ALLOCATED

SGAM (7:18404353) = NOT ALLOCATED

PFS (7:18723720) = 0x40 ALLOCATED 0_PCT_FULL DIFF (7:18404358) = CHANGED

ML (7:18404359) = NOT MIN_LOGGED

(310 row(s) affected)

Here is the first of the 310 rows from the grid:

7187297060071872939447319823

[/font]

The last field being my primary key.

I have nonclustered indexes on filegroups other than the datafile that the base table is on. I thought I read somewhere in my research this might be a clue. Maybe I should drop the nonclustered indexes and try again. This is my development server. It has 5 250GB AMCC 9590SE SATA drives. When I first got it, I had trouble getting parallel performance with my production server. Finally I found an option in the disk options "Enable write caching". I turned that on and got parallel performance with production. I wonder if that is a problem, or if maybe these are flakely disks.

I will run the PERFMON tests as suggested and reply. I get various "page latch" errors sporatically, only when I run long running INSERT INTO or SELECT INTO's (70 to 120 million rows). I seeem to have gotten around those by breaking my jobs into sprocs that I group 1 mil into a transaction at a time.

Thank you very much.

CJR

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 7

Hmmm....

Are those disks in a RAID array, or are they single disks?

Be careful with write caches and SQL server. Not so much of an issue, since its a dev server and, I would assume, not too critical.

If the write cache is not on battery backup you can get consistencey problems if you have a power failure and a page was written into the cache but not to disk. Also torn pages, stale reads and other similar fun issues can be caused by a cache that's not quite up to standard.

There are some articles up somewhere on microsoft.com on IO requirements for SQL server. Will see if I can find them

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 8

Found em.

SQL 2000 IO basics

IO Basics chapter 2

Back to the subject of shrink. Have you tried several smaller shrinks instead of one large? If you're trying to shrink the file by 100 MB (as an eg) then try 5 20MB shrinks. May get you around the problems, may not.

p.s. Once you've done the shrink, be sure to check the fragmentation of your indexes in the file you shrank. Will probably be quite bad.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

CJR SSC Enthusiast Points: 108 More actions · Answer 9

Thanks, I will read up on that. Some additional info for now. Appears that maybe there is a conflict between DBCC, LAZY WRITER, and CHECKPOINT. I was watching sysprocesses and just before it got the latch failure, I looked like this:

[font="Courier New"]spidcmdblockedcpuphysical_iowaittimelastwaittypewaitresource

1LAZY WRITER 000657SLEEP

6CHECKPOINT 610127116985PAGEIOLATCH_EX7:7:3557766

61DBCC 13926256479691141110PAGEIOLATCH_SH7:7:14839331[/font]

Looks like DBCC is blocked by LAZY WRITER and CHECKPOINT is blocked by DBCC.

I have the stats from perfmon on this now but need to analyze, I see the avg disk queue length was 16.4 with a max of 116.

A while back I was trying to diagnose and fix these latch errors (weeks ago) and I set the recovery_interval to 3 minutes. Seemed to help but frankly its a shot in the dark on my part.

The database is set to SIMPLE recovery model.

Thanks,

CJR

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 10

CJR (10/24/2007)
I have the stats from perfmon on this now but need to analyze, I see the avg disk queue length was 16.4 with a max of 116.

Wow. Is that on a single disk or a raid array? If an array, now many disks?

p.s. I forgot to mention, the disk counters should have been ust on the drive that contains the filegroup you're trying to shrink, not the total of all drives. Is easier for isolating problems.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

VaiydeyanathanVS Hall of Fame Points: 3053 More actions · Answer 11

Kindly consider on the Isolation Levels and implement the same

Paul Randal One Orange Chip Points: 29544 More actions · Answer 12

VAIYDEYANATHAN.V.S (10/24/2007)
Kindly consider on the Isolation Levels and implement the same

Please stop posting irrelevant info.

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

Paul Randal One Orange Chip Points: 29544 More actions · Answer 13

CJR (10/24/2007)
Thanks, I will read up on that. Some additional info for now. Appears that maybe there is a conflict between DBCC, LAZY WRITER, and CHECKPOINT. I was watching sysprocesses and just before it got the latch failure, I looked like this:
[font="Courier New"]spidcmdblockedcpuphysical_iowaittimelastwaittypewaitresource
1LAZY WRITER 000657SLEEP
6CHECKPOINT 610127116985PAGEIOLATCH_EX7:7:3557766
61DBCC 13926256479691141110PAGEIOLATCH_SH7:7:14839331[/font]
Looks like DBCC is blocked by LAZY WRITER and CHECKPOINT is blocked by DBCC.
I have the stats from perfmon on this now but need to analyze, I see the avg disk queue length was 16.4 with a max of 116.
A while back I was trying to diagnose and fix these latch errors (weeks ago) and I set the recovery_interval to 3 minutes. Seemed to help but frankly its a shot in the dark on my part.
The database is set to SIMPLE recovery model.
Thanks,
CJR

I definitely think there's something up with your IO subsystem. Can you run SQLIOSim to see what it turns up?

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

CJR SSC Enthusiast Points: 108 More actions · Answer 14

I think the "hardware guys" set up RAID 5, I have an email out to confirm. I see 5 physical disks (besides C: and D: which are OS/Other Data file logical partitions on one disk) F:, G;, H:, I:, and J: I have one or two SQL Server database files allocated on each physical disk, in all cases one file to one filegroup. I have logs on F: and tempdb on J:, and put all my tables/indexes on G: H: and I:, although I've started borrowing some of that extra space on F: and J: for work tables etc.

I ran the simulator but don't know what to make of the results.

I don't want to post the whole thing because its so big - are there certain pieces you can advise me on, or that I should post?

Thanks.

CJR