October 20, 2009 at 2:34 am
Sorry, I got a duplicate post so I cleared it out.
December 3, 2009 at 1:15 pm
We are having the same issue with FULL backups as well as differentials. It occasionally recurs on the same databases, but mostly it seems random.
TCP Provider: The semaphore timeout period has expired. [SQLSTATE 08S01] (Error 121) Communication link failure [SQLSTATE 08S01] (Error 121). The step failed.
I seem to recall reading a post somewhere saying that a check disk with repair fixed the issue. Our drives are multi-terabyte on production systems, so it's hard to get that kind of downtime to attempt the check disk.
--------------------------
Zach
December 3, 2009 at 3:16 pm
DBADave. Well, I tried everything (you can see my earlier post). Then I tried the recommended changes to the registry to extend the timeout period on just the server and then both the server and backup device, to no avail. All that did for me was just extend the time before failure occurred. Very aggravating.
However, I seem to have SOLVED MY PROBLEM! I hope I don't jinx things here, but I haven't had a failed backup since Nov 6, nearly a month (I do a full on 2 servers every night, and logs every 30 minutes, so that's a lot of backups!). What did I do? I found that I had been using Cat5 patch cables on my gigabit Ethernet network (that's a no-no, and I knew better, even though the NICs reported a good-enough signal). I replaced the cables with at least Cat5e, and Cat6 where I had it. That improved things, but didn't correct the problems altogether.
I've always suspected switch issues, and they were the first thing I tried, but I could never resolve the problems no matter what I did with them. I use Netgear GS605 gigabit switches on both of my segments between the servers and backup devices. I had them stacked with the Netgear router (i.e., the switch just sitting on top of the router; they are both small, uncooled devices, but get pretty warm at times). It wasn't until I completely separated the two devices, giving them plenty of breathing room (top, bottom, sides), that my problems went away.
This is the only thing I can point to that solved my network timeout problems. Maybe it will last, and perhaps this post will help somebody else out there. Keep those SoHo-class routers and switches cool.
December 3, 2009 at 3:26 pm
We had the problem occur again last week for the first time in a few months. In our case we believe the issue is related to a disk bottleneck, but we are not certain. We have some significant disk issues with our current SAN that are causing us to purchase a new SAN. If this particular problem still occurs after implementing the new SAN, then we know disk can be ruled out, but for now we are leaning in that direction.
Good luck.
Dave
December 3, 2009 at 3:46 pm
Hi DBADave
If you also face these timeout errors and haven't tried the registry settings mentioned above, please do. In my case I have not seen any errors since I implemented them on my four-node x64 SQL 2005 cluster.
//SUN
December 3, 2009 at 4:01 pm
We reviewed the Microsoft article regarding the registry setting and believe our symptoms are a little different from those stated in the doc. We've checked the network load on several occasions and have yet to see the network being hit hard. However, if the problem occurs again after implementing the new SAN, then I will suggest we contact Microsoft to discuss the registry settings in more detail. I appreciate the input.
Thanks, Dave
February 2, 2010 at 12:34 am
Hi
One of the things you can check, if using a SAN, is the Fibre Channel cards. FC cards have more than one path (route) going to the SAN. One of the routes could be a problem, or the card itself. A driver update does not always help. Check the SQL logs: if you see something like "SQL Server has encountered I/O requests taking longer than 15 seconds", usually when doing backups or big jobs, look at the disk / SAN / FC card.
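If you want to look for those messages without scrolling the error log by hand, a quick (unofficial) way is sp_readerrorlog, an undocumented but widely used procedure whose parameters are the log number, the log type (1 = SQL Server log), and a search string. A minimal sketch:
-- Search the current error log (0) for the long-I/O warning
EXEC master.dbo.sp_readerrorlog 0, 1, 'I/O requests taking longer';
-- And the previous log (1), in case the server was restarted recently
EXEC master.dbo.sp_readerrorlog 1, 1, 'I/O requests taking longer';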
thanks
Mohamed Ismail
February 3, 2010 at 9:42 am
I've been randomly getting this error during backups, update_stats, and updates (database grooming). The timings of this work do not overlap, so it's not that. We are connected to a SAN and using Windows Server 2003 64-bit.
Our file servers are also getting this error, and they are connected to another SAN. We're pretty sure the response times on the SAN are not meeting the OS's tolerance, so the OS is throwing this error. An MS tech was also here and couldn't ferret out the problem.
In the next couple of weeks, I'm moving the database server (SQL 2005 64-bit) over to a 2-node cluster of HP DL785s (128 GB RAM, 8 quad-core processors) with Windows Server 2008 64-bit and DAS (MSA 2000s). No more SANs for our database servers.
I'll post a periodic update once the move is complete.
Good luck, people.
February 3, 2010 at 10:23 am
Hi Steve
We are on the same SAN, no changes there, and since applying these reg fixes I have not seen any issues. We have, by the way, also applied them on our Win 2003 file server clusters, which also saw "cluster fucks" where the cluster ended up split-brain and completely dead, and the fixes have solved the issues there too, although I know more reg settings were changed in that case.
Best regards
Søren Udsen Nielsen
March 11, 2010 at 3:40 am
Hi all
I can confirm the solution provided by Søren.
This is my scenario:
2 servers, Windows 2003 EE 64-bit (Microsoft Cluster)
SQL Server 2008 64-bit
Scheduled jobs on SQL Agent every night
The error occurred when some scheduled jobs ran; the error displayed was:
TCP Provider: The semaphore timeout period has expired. [SQLSTATE 08S01] (Error 121) Communication link failure [SQLSTATE 08S01] (Error 121)
Solution:
Insert/modify this key:
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpMaxDataRetransmissions (type REG_DWORD) = 30 (decimal value)
Insert/modify this key:
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveInterval (type REG_DWORD) = 25000 (decimal value)
and disable TCP Chimney Offload from the command line:
Netsh int ip set chimney DISABLED
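If you want to read the values back afterwards without leaving Management Studio, a sketch along these lines should work (xp_regread is undocumented and needs the appropriate permissions, so treat it as a convenience check only):
-- Read back the two TCP/IP parameters changed above; each call returns a small result set
EXEC master.dbo.xp_regread 'HKEY_LOCAL_MACHINE',
     'SYSTEM\CurrentControlSet\Services\Tcpip\Parameters',
     'TcpMaxDataRetransmissions';
EXEC master.dbo.xp_regread 'HKEY_LOCAL_MACHINE',
     'SYSTEM\CurrentControlSet\Services\Tcpip\Parameters',
     'KeepAliveInterval';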
regards
May 6, 2010 at 6:14 am
Hi there,
We have the same problem and are willing to try this solution.
But does this fix need a reboot afterwards?
/Morten
May 6, 2010 at 10:57 am
Hi Morten,
I can confirm that this fix procedure needs a reboot.
June 4, 2010 at 4:26 am
Hi All
Interestingly enough, we have begun seeing these timeouts again, so it seems it really is an IP timeout issue that went away when we increased the timeouts, but now the new limit seems to have been reached. I'll investigate this further; I just wanted to warn people that this fix seems to be dependent on server workload and timing.
Best regards
Soren Udsen Nielsen
June 4, 2010 at 5:37 am
We haven't had any issues since moving off of the SAN and to direct attached storage. It has been 3 months since moving to this new configuration.
Steve
June 28, 2010 at 10:28 am
I am observing this on 3 of my SQL 2005 x64 clusters. We only see it while the weekend maintenance job runs, and it succeeds after 3-4 retry attempts. The error message is this:
Msg 121, Sev 16, State 1: TCP Provider: The semaphore timeout period has expired. [SQLSTATE 08S01]
Msg 121, Sev 16, State 1: Communication link failure [SQLSTATE 08S01]
Msg 16389, Sev 16, State 1: Communication link failure [SQLSTATE 08S01]
When going through the error log I found this I/O error every time around the job failure. This was seen on all problematic servers, but with different counters.
SQL Server has encountered 100 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\SYSTEM\MSSQL.1\MSSQL\DATA\tempdb7.MDF] in database [tempdb] (2). The OS file handle is 0x00000000000009F0. The offset of the latest long I/O is: 0x000000107ca000
So, I zeroed out the SAN issue and spread my tempdb across 2 drives. I haven't seen the error resurface in 3 weeks. We are going to replace our SQL system drives where tempdb resides and hope that will fix it. One thing was common on all servers: the system drive (20 GB) where tempdb was residing had less than 500 MB of free space left. I don't know why that would cause any error, as none of the files are set to autogrow and the number of tempdb files is the same as the number of CPUs.
Anyway, spreading tempdb across different drives has worked so far for me. I will re-evaluate when I replace my 20 GB system SAN drive with a new 60 GB system SAN drive.
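In case it helps anybody wanting to do the same, here is a minimal sketch of checking the current tempdb layout and then adding a data file on a second drive. The file name, the I: path, and the sizes are just placeholders; size the files for your own system.
-- See where the tempdb files currently live and how big they are (size is in 8 KB pages)
SELECT name, physical_name, size * 8 / 1024 AS size_mb
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');

-- Add a second data file on another drive (takes effect immediately, no restart needed)
ALTER DATABASE tempdb
ADD FILE (
    NAME = tempdev2,
    FILENAME = 'I:\SQLData\tempdb2.ndf',
    SIZE = 2048MB,
    FILEGROWTH = 0
);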