October 20, 2009 at 2:34 am
Sorry, I got a duplicate post so I cleared it out.
December 3, 2009 at 1:15 pm
We are having the same issue with FULL backups as well as differentials. It occasionally recurs on the same databases, but mostly it seems random.
TCP Provider: The semaphore timeout period has expired. [SQLSTATE 08S01] (Error 121) Communication link failure [SQLSTATE 08S01] (Error 121). The step failed.
I seem to recall reading a post somewhere saying that a check disk with repair fixed the issue. Our drives are multi-terabyte on production systems, so it's hard to get that kind of downtime to attempt the check disk.
--------------------------
Zach
December 3, 2009 at 3:16 pm
DBADave. Well, I tried everything (you can see my earlier post). Then I tried the recommended changes to the registry to extend the timeout period on just the server and then both the server and backup device, to no avail. All that did for me was just extend the time before failure occurred. Very aggravating.
However, I seem to have SOLVED MY PROBLEM! I hope I don't jinx things here, but I haven't had a failed backup since Nov 6, nearly a month (I do a full on 2 servers every night, and logs every 30 minutes, so that's a lot of backups!). What did I do? I found that I had been using Cat5 patch cables on my gigabit Ethernet network (that's a no-no, and I knew better, even though the NICs reported a good-enough signal). I replaced the cables with at least Cat5e, and Cat6 where I had it. That improved things, but didn't correct the problems altogether.
I've always suspected switch issues, and they were the first thing I tried, but I could never resolve the problems no matter what I did with them. I use Netgear GS605 gigabit switches on both of my segments between the servers and backup devices. I had them stacked with the Netgear router (i.e., the switch just sitting on top of the router; they are both small, uncooled devices, but get pretty warm at times). It wasn't until I completely separated the two devices, giving them plenty of breathing room (top, bottom, sides), that my problems went away.
This is the only thing I can point to that solved my network timeout problems. Maybe it will last, and perhaps this post will help somebody else out there. Keep those SoHo-class routers and switches cool.
December 3, 2009 at 3:26 pm
We had the problem occur again last week for the first time in a few months. In our case we believe the issue is related to a disk bottleneck, but we are not certain. We have some significant disk issues with our current SAN that are causing us to purchase a new SAN. If this particular problem still occurs after implementing the new SAN, then we know disk can be ruled out, but for now we are leaning in that direction.
Good luck.
Dave
December 3, 2009 at 3:46 pm
Hi DBADave
If you also face these timeout errors and haven't tried the registry settings mentioned above, please do. In my case I have not seen any errors since I implemented them on my four-node x64 SQL 2005 cluster.
//SUN
December 3, 2009 at 4:01 pm
We reviewed the Microsoft article regarding the registry setting and believe our symptoms are a little different from those stated in the doc. We've checked the network load on several occasions and have yet to see the network being hit hard. However, if the problem occurs again after implementing the new SAN, then I will suggest we contact Microsoft to discuss the registry settings in more detail. I appreciate the input.
Thanks, Dave
February 2, 2010 at 12:34 am
Hi
One of the things you can check, if using a SAN, is the Fibre Channel cards. FC cards have more than one path (route) going to the SAN. One of the routes could be a problem, or the card itself. A driver update does not always help. Check the SQL logs: if you see something like "SQL Server has encountered I/O requests taking longer than 15 seconds", usually when doing backups or big jobs, look at the disk / SAN / FC card.
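If you want to look for those messages without scrolling the error log by hand, a quick (unofficial) way is sp_readerrorlog, an undocumented but widely used procedure whose parameters are the log number, the log type (1 = SQL Server log), and a search string. A minimal sketch:
-- Search the current error log (0) for the long-I/O warning
EXEC master.dbo.sp_readerrorlog 0, 1, 'I/O requests taking longer';
-- And the previous log (1), in case the server was restarted recently
EXEC master.dbo.sp_readerrorlog 1, 1, 'I/O requests taking longer';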
thanks
Mohamed Ismail
February 3, 2010 at 9:42 am
I've been randomly getting this error during backups, update_stats, and updates (database grooming). The timings of this work do not overlap, so it's not that. We are connected to a SAN and using Windows Server 2003 64-bit.
Our file servers are also getting this error, and they are connected to another SAN. We're pretty sure the response times on the SAN are not meeting the OS's tolerance, so the OS is throwing this error. An MS tech was also here and couldn't ferret out the problem.
In the next couple of weeks, I'm moving the database server (SQL 2005 64-bit) over to a 2-node cluster of HP DL785s (128 GB RAM, 8 quad-core processors) with Windows Server 2008 64-bit and DAS (MSA 2000s). No more SANs for our database servers.
I'll post a periodic update once the move is complete.
Good luck, people.
February 3, 2010 at 10:23 am
Hi Steve
We are on the same SAN, no changes there, and since applying these reg fixes I have not seen any issues. We have, by the way, also applied them on our Win 2003 file server clusters, which also saw "cluster fucks" where the cluster ended up split-brain and completely dead, and the fixes have solved the issues there too, although I know more reg settings were changed in that case.
Best regards
Søren Udsen Nielsen
March 11, 2010 at 3:40 am
Hi all
I can confirm the solution provided by Søren.
This is my scenario:
2 servers, Windows 2003 EE 64-bit (Microsoft Cluster)
SQL Server 2008 64-bit
Scheduled jobs on SQL Agent every night
The error occurred when some scheduled jobs ran; the error displayed was:
TCP Provider: The semaphore timeout period has expired. [SQLSTATE 08S01] (Error 121) Communication link failure [SQLSTATE 08S01] (Error 121)
Solution:
Insert/modify this key:
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpMaxDataRetransmissions (type REG_DWORD) = 30 (decimal value)
Insert/modify this key:
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveInterval (type REG_DWORD) = 25000 (decimal value)
and disable TCP Chimney Offload from the command line:
Netsh int ip set chimney DISABLED
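If you want to read the values back afterwards without leaving Management Studio, a sketch along these lines should work (xp_regread is undocumented and needs the appropriate permissions, so treat it as a convenience check only):
-- Read back the two TCP/IP parameters changed above; each call returns a small result set
EXEC master.dbo.xp_regread 'HKEY_LOCAL_MACHINE',
     'SYSTEM\CurrentControlSet\Services\Tcpip\Parameters',
     'TcpMaxDataRetransmissions';
EXEC master.dbo.xp_regread 'HKEY_LOCAL_MACHINE',
     'SYSTEM\CurrentControlSet\Services\Tcpip\Parameters',
     'KeepAliveInterval';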
regards
May 6, 2010 at 6:14 am
Hi there,
We have the same problem and are willing to try this solution.
But does this fix need a reboot afterwards?
/Morten
May 6, 2010 at 10:57 am
Hi Morten,
I can confirm that this fix procedure needs a reboot.
June 4, 2010 at 4:26 am
Hi All
Interestingly enough, we have begun seeing these timeouts again, so it seems it really is an IP timeout issue that went away when we increased the timeouts, but now the new limit seems to have been reached. I'll investigate this further; I just wanted to warn people that this fix seems to be dependent on server workload and timing.
Best regards
Soren Udsen Nielsen
June 4, 2010 at 5:37 am
We haven't had any issues since moving off of the SAN and to direct attached storage. It has been 3 months since moving to this new configuration.
Steve
June 28, 2010 at 10:28 am
I am observing this on 3 of my SQL 2005 x64 clusters. We only see it while the weekend maintenance job runs, and it succeeds after 3-4 retry attempts. The error message is this:
Msg 121, Sev 16, State 1: TCP Provider: The semaphore timeout period has expired. [SQLSTATE 08S01]
Msg 121, Sev 16, State 1: Communication link failure [SQLSTATE 08S01]
Msg 16389, Sev 16, State 1: Communication link failure [SQLSTATE 08S01]
When going through the error log I found this I/O error every time around the job failure. This was seen on all problematic servers, but with different counters.
SQL Server has encountered 100 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\SYSTEM\MSSQL.1\MSSQL\DATA\tempdb7.MDF] in database [tempdb] (2). The OS file handle is 0x00000000000009F0. The offset of the latest long I/O is: 0x000000107ca000
So, I zeroed out the SAN issue and spread my tempdb across 2 drives. I haven't seen the error resurface in 3 weeks. We are going to replace our SQL system drives where tempdb resides and hope that will fix it. One thing was common on all servers: the system drive (20 GB) where tempdb was residing had less than 500 MB of free space left. I don't know why that would cause any error, as none of the files are set to autogrow and the number of tempdb files is the same as the number of CPUs.
Anyway, spreading tempdb across different drives has worked so far for me. I will re-evaluate when I replace my 20 GB system SAN drive with a new 60 GB system SAN drive.
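In case it helps anybody wanting to do the same, here is a minimal sketch of checking the current tempdb layout and then adding a data file on a second drive. The file name, the I: path, and the sizes are just placeholders; size the files for your own system.
-- See where the tempdb files currently live and how big they are (size is in 8 KB pages)
SELECT name, physical_name, size * 8 / 1024 AS size_mb
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');

-- Add a second data file on another drive (takes effect immediately, no restart needed)
ALTER DATABASE tempdb
ADD FILE (
    NAME = tempdev2,
    FILENAME = 'I:\SQLData\tempdb2.ndf',
    SIZE = 2048MB,
    FILEGROWTH = 0
);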