October 23, 2007 at 10:42 pm
I am trying to shrink two large files that have way too much free space.
Several other files I had no trouble shrinking. These two I got these errors:
Server: Msg 845, Level 17, State 1, Line 1
Time-out occurred while waiting for buffer latch type 4 for page (4:694053), database ID 7.
-and-
Server: Msg 8966, Level 16, State 2, Line 1
Could not read and latch page (7:19851028) with latch type SH. Latch failed.
I did a CHECKDB and that reported:
CHECKDB found 0 allocation errors and 0 consistency errors in database 'PRODDNCDB'.
Any help on what the next course of action might be?
Thanks,
CJR
October 23, 2007 at 11:23 pm
It looks like you're usking SQL Server 2000 - I changed that message in SS2005 to say 'Unable to' instead of 'Could not'.
Which build of SS2000 are you using? There were several bugs in shrink in SS2000.
Could you do the following please:
DBCC TRACEON (3604)
GO
DBCC PAGE (procdncdb, 7, 19851028, 3)
GO
and tell me the value of m_type in the page header output? I think it's an allocation bitmap page (type 8 or 11) that couldn't be latched internally to update the allocation status of a page or extent. That'll help me lookup and see if this is a known issue.
Thanks
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
October 23, 2007 at 11:26 pm
One other thing - is the system *really* busy while the shirnk is going on? Are the disk queue lengths really long? This could be an IO that never completes - that can sometimes lead to an 845 error.
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
October 23, 2007 at 11:59 pm
This is a simple task through the wizard for shrinking a database file. Sometimes, if more connections is being sent to database for accesing the objects, this may lead to slow down the process of your actions.
For more info :- better try this....
October 24, 2007 at 12:04 am
That's not relevant to the discussion here - posting a link to a BOL topic saying how to run shrink isn't going to help figure out the errors posted here.
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
October 24, 2007 at 1:06 am
I've been having a similar thing recently while running CheckDB and have attributed it to overloading the IO subsystem. I've know for some time that I'm straining the disks.
What I keep seeing is a set of "SQL has detected x IOs exceeding 15 seconds on file..." followed shortly by a latch timeout, sometimes continuing to wait, sometimes giving up.
To see what your IO system is doing, try running perfmon with the following counters while you're trying to shink.
Physical disks: transfers/sec
Physical disks: avg sec per read (NB, not reads/sec)
Physical disks: avg sec per write
Physical disks: percentage idle time
Physical disks: average disk queue length (may not have much meaning if you're on a SAN)
Set for intervals of 15-30 sec and check the result for any spikes or dips that correspond to the time of the errors.
If you can, run SQLIOSim on the server in an idle period and see if that picks up any errors. See http://support.microsoft.com/kb/231619 for details on IOSim, including how to run it.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 24, 2007 at 6:24 am
Thanks all. I have been suspicious of this IO subsystem and still am.
I am the only one on the server so I don't think heavy load is an issue.
Here is @@version
Microsoft SQL Server 2000 - 8.00.2039 (Intel X86) May 3 2005 23:18:38 Copyright (c) 1988-2003 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)
I had re-run the SHRINKFILE since the post, and now a different page is referrenced in the same error message. I ran the trace/PAGE against the currently referrenced page:
[font="Courier New"]PAGE: (7:18729706)
------------------
BUFFER:
-------
BUF @0x01039BC0
---------------
bpage = 0x4704E000 bhash = 0x00000000 bpageno = (7:18729706)
bdbid = 7 breferences = 0 bstat = 0x9
bspin = 0 bnext = 0x00000000
PAGE HEADER:
------------
Page @0x4704E000
----------------
m_pageId = (7:18729706) m_headerVersion = 1 m_type = 2
m_typeFlagBits = 0x0 m_level = 0 m_flagBits = 0x0
m_objId = 2084918499 m_indexId = 1 m_prevPage = (7:18729395)
m_nextPage = (7:2694408) pminlen = 11 m_slotCnt = 310
m_freeCnt = 4066 m_freeData = 6927 m_reservedCnt = 0
m_lsn = (19847:87350:28) m_xactReserved = 0 m_xdesId = (0:0)
m_ghostRecCnt = 0 m_tornBits = 980739258
Allocation Status
-----------------
GAM (7:18404352) = ALLOCATED
SGAM (7:18404353) = NOT ALLOCATED
PFS (7:18723720) = 0x40 ALLOCATED 0_PCT_FULL DIFF (7:18404358) = CHANGED
ML (7:18404359) = NOT MIN_LOGGED
(310 row(s) affected)
Here is the first of the 310 rows from the grid:
7187297060071872939447319823
[/font]
The last field being my primary key.
I have nonclustered indexes on filegroups other than the datafile that the base table is on. I thought I read somewhere in my research this might be a clue. Maybe I should drop the nonclustered indexes and try again. This is my development server. It has 5 250GB AMCC 9590SE SATA drives. When I first got it, I had trouble getting parallel performance with my production server. Finally I found an option in the disk options "Enable write caching". I turned that on and got parallel performance with production. I wonder if that is a problem, or if maybe these are flakely disks.
I will run the PERFMON tests as suggested and reply. I get various "page latch" errors sporatically, only when I run long running INSERT INTO or SELECT INTO's (70 to 120 million rows). I seeem to have gotten around those by breaking my jobs into sprocs that I group 1 mil into a transaction at a time.
Thank you very much.
CJR
October 24, 2007 at 6:39 am
Hmmm....
Are those disks in a RAID array, or are they single disks?
Be careful with write caches and SQL server. Not so much of an issue, since its a dev server and, I would assume, not too critical.
If the write cache is not on battery backup you can get consistencey problems if you have a power failure and a page was written into the cache but not to disk. Also torn pages, stale reads and other similar fun issues can be caused by a cache that's not quite up to standard.
There are some articles up somewhere on microsoft.com on IO requirements for SQL server. Will see if I can find them
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 24, 2007 at 6:50 am
Found em.
Back to the subject of shrink. Have you tried several smaller shrinks instead of one large? If you're trying to shrink the file by 100 MB (as an eg) then try 5 20MB shrinks. May get you around the problems, may not.
p.s. Once you've done the shrink, be sure to check the fragmentation of your indexes in the file you shrank. Will probably be quite bad.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 24, 2007 at 7:43 am
Thanks, I will read up on that. Some additional info for now. Appears that maybe there is a conflict between DBCC, LAZY WRITER, and CHECKPOINT. I was watching sysprocesses and just before it got the latch failure, I looked like this:
[font="Courier New"]spidcmdblockedcpuphysical_iowaittimelastwaittypewaitresource
1LAZY WRITER 000657SLEEP
6CHECKPOINT 610127116985PAGEIOLATCH_EX7:7:3557766
61DBCC 13926256479691141110PAGEIOLATCH_SH7:7:14839331[/font]
Looks like DBCC is blocked by LAZY WRITER and CHECKPOINT is blocked by DBCC.
I have the stats from perfmon on this now but need to analyze, I see the avg disk queue length was 16.4 with a max of 116.
A while back I was trying to diagnose and fix these latch errors (weeks ago) and I set the recovery_interval to 3 minutes. Seemed to help but frankly its a shot in the dark on my part.
The database is set to SIMPLE recovery model.
Thanks,
CJR
October 24, 2007 at 7:49 am
CJR (10/24/2007)
I have the stats from perfmon on this now but need to analyze, I see the avg disk queue length was 16.4 with a max of 116.
Wow. Is that on a single disk or a raid array? If an array, now many disks?
p.s. I forgot to mention, the disk counters should have been ust on the drive that contains the filegroup you're trying to shrink, not the total of all drives. Is easier for isolating problems.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 24, 2007 at 8:01 am
Kindly consider on the Isolation Levels and implement the same
October 24, 2007 at 9:02 am
VAIYDEYANATHAN.V.S (10/24/2007)
Kindly consider on the Isolation Levels and implement the same
Please stop posting irrelevant info.
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
October 24, 2007 at 9:04 am
CJR (10/24/2007)
Thanks, I will read up on that. Some additional info for now. Appears that maybe there is a conflict between DBCC, LAZY WRITER, and CHECKPOINT. I was watching sysprocesses and just before it got the latch failure, I looked like this:[font="Courier New"]spidcmdblockedcpuphysical_iowaittimelastwaittypewaitresource
1LAZY WRITER 000657SLEEP
6CHECKPOINT 610127116985PAGEIOLATCH_EX7:7:3557766
61DBCC 13926256479691141110PAGEIOLATCH_SH7:7:14839331[/font]
Looks like DBCC is blocked by LAZY WRITER and CHECKPOINT is blocked by DBCC.
I have the stats from perfmon on this now but need to analyze, I see the avg disk queue length was 16.4 with a max of 116.
A while back I was trying to diagnose and fix these latch errors (weeks ago) and I set the recovery_interval to 3 minutes. Seemed to help but frankly its a shot in the dark on my part.
The database is set to SIMPLE recovery model.
Thanks,
CJR
I definitely think there's something up with your IO subsystem. Can you run SQLIOSim to see what it turns up?
Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005
October 24, 2007 at 11:42 am
I think the "hardware guys" set up RAID 5, I have an email out to confirm. I see 5 physical disks (besides C: and D: which are OS/Other Data file logical partitions on one disk) F:, G;, H:, I:, and J: I have one or two SQL Server database files allocated on each physical disk, in all cases one file to one filegroup. I have logs on F: and tempdb on J:, and put all my tables/indexes on G: H: and I:, although I've started borrowing some of that extra space on F: and J: for work tables etc.
I ran the simulator but don't know what to make of the results.
I don't want to post the whole thing because its so big - are there certain pieces you can advise me on, or that I should post?
Thanks.
CJR
Viewing 15 posts - 1 through 15 (of 19 total)
You must be logged in to reply to this topic. Login to reply