September 17, 2007 at 4:15 am
Hi all,
Am hoping that you can help me with a problem that seems to be quite popular on the web but has very few definitive fixes.
Server/SQL Background:
2 nodes with Windows Server 2003 Enterprise Edition SP1 and 16GB RAM. Memory is fixed at 14GB, 2Gb reserved for the OS. Default instance.
SQL Server 2005 Standard Edition (Patch Level 9.0.2153).
Active/Passive Cluster configuration.
Application Log Errors:
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] OnlineThread: QP is not online.
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]Communication link failure
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]TCP Provider: An existing connection was forcibly closed by the remote host.
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 40; message = [Microsoft][SQL Native Client]Communication link failure
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 40; message = [Microsoft][SQL Native Client]TCP Provider: The specified network name is no longer available.
Cluster Log Errors:
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]TCP Provider: An existing connection was forcibly closed by the remote host.
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]Communication link failure
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] OnlineThread: QP is not online.
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
00000f00.00001e08::2007/09/15-11:45:08.874 ERR SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
Have tried a few things that have been suggested. The ones that spring to mind are:
- Ensured that the cluster service account has sufficient SQL access to run 'SELECT @@Servername'
- Fixed memory size
- Disabled SynAttackProtect in the registry
- Upgraded the HP iLO network card drivers
- Have followed the MS guidelines for cluster configuration
Any further suggestion would be great and much appreciated.
The error occurs completely randomly, around 4-5 times in any 24 hour period. Also, failover does not occur, the cluster stays up and running, it just disappears (from a SQL point of view) off the network for a couple of seconds....
Cheers,
Dave.
April 22, 2008 at 2:07 pm
Hi Grasshopper
Did you ever get an answer to this, as I'm facing the exact same problem.
Thanks in advance!
April 23, 2008 at 4:34 am
Hi,
Nope, never did! I've done just about everything that I can do - except for opening a call with MS or rebuilding the cluster. I will be opening a all with MS this week though and will report anything back here.
Cheers,
Dave.
April 23, 2008 at 10:49 am
I am experiencing some issues with SQL Server 2005 installation.
Task Manager on the active node of the server shows Total Physical Memory :16776172 and Available Physical Memory : 15915488. Does this mean that SQL is not utlizing any memory? Also, SQL server agent log has events like "8 processor(s) and 4096 MB RAM detected.
Any help is very much appreciated.
Prakash B
May 1, 2008 at 11:58 am
I have exactly the same issure on my cluster, any positive comments now?
May 1, 2008 at 2:48 pm
Disable TCP Chimney, TCPA and RSS
HKLM-->System-->CurrentControlSet-->Services--TCPIP-->Parameters
Change the following entries to 0 and reboot
EnableTCPChimney
EnableTCPA
EnableRSS
Alternatively, set the KeepAlive setting via SQL Configuration Manager to a higher setting than the application side connect pool timeout setting.
May 7, 2008 at 2:47 pm
so, guys do you have answer to this issue?
May 7, 2008 at 3:22 pm
So after talking with MS and EMC for over two weeks, we've gotton down to what our problem was\is.
When we upgrade our machine from 6gb to 36gb of memory, this allowed us to keep all our tables in memory, but the side effect was that now when running our maintance job, the in-memory tables now are flooding the disks with Write requests and it's overloading our disks with the number of iops per sec, which once the (San 4gb write cache) disk cache fills up on the san, then io comes to a screeching hault on the machine as it has to start retrying and it has to deal with the large amount of things waiting in queue on the server. This causes our node to become less than fully responsive as it pegs the first cpu (Which ironically in a defult setup is the only cpu that handles network requests). So for us the "Fix" was a few things"
1. We turned off the write cache on the lun that the db was attached to which in turn made the disks slower so that the host did not throw all the thousands of write requests onto the lun and it throttled better.
2. We also appplied a registery change as requested by MS that allows the network load to be assigned to any open processor rather than rely on one proc (I'll paste the article number and some more info below). This change by it's self may be one of the root causes of your cluster "Network" errors.
3. We also tuned the query to reduce the io load as well.
4. We change the lun from a raid 5 to a raid 10 (With a few extra disks for io performance) and turned back on the write cache on the san.
I'd check to see if your getting any io requests taking longer than 15 seconds as well, as this also points to a disk issue as well. As a test lowering our sql server memory settings back down to 6gb elimited the issue as well for us, as the disks had to read which slowed down the write issue.
So while we really improved performance for our users by adding in the additional ram, we didn't take into account the additional load that was now being able to be placed on the disks.
Fun stuff.
-Greg
Ms info:
The processor load is not distributed across multiple processors on a computer that is running Windows Server 2003, Windows 2000 Server, or Windows NT 4.0
System error 64 has occurred. The specified network name is no longer available.
http://support.microsoft.com/kb/892100
Procs HEX BIN
2 0x3 0b11
3 0x7 0b111
4 0xF 0b1111
8 0XFF 0b11111111
Also, some additional notes from our MS case:
Apply MANDATORY Microsoft Hotfix 946448, required for all STORport driver installations (for Windows 2003 SP1/2) - http://support.microsoft.com/kb/946448
May 7, 2008 at 3:26 pm
I called Microsoft and was on phone with them for 6 hours yesterday. They were not able to give me an explanantion. They turned around and said that, this is the way, SQL 2005 is designed to work.
I still do not agree with Microsoft because, one of my friend who works for another IT firm do have SQL 2005 and he says that SQL 2005 should allocate all the available memory. If maximum 14 gigs of memory is configured on SQL SERVER, SQL should utlize every bit of it.
If any one has corrected the issue, please let me know, i shall be eager to know the solution.
Prakash B
May 7, 2008 at 3:43 pm
Sql 2005 will not immediatly allocate all the memory, It will just grab what is configured as it's min and grow from there. But once it's grown it wont release it unless called for by the os. Are you also using the "lock pages in memory" option? Definatly leave at least 2gb for the os outside of Sql though.
http://blogs.technet.com/askperf/archive/2008/03/25/lock-pages-in-memory-do-you-really-need-it.aspx
May 7, 2008 at 3:46 pm
yes, the service account that sql server services run on does have lock memory pages rights...
Prakash B
May 28, 2008 at 2:04 pm
This is a tough problem I remember one of this should fix it. Change the network card speed from Auto to the speed of switch specifically and also make sure network card driver is updated one.
June 20, 2008 at 2:26 am
We too have the following error on our win 2k3 X64 cluster system using SQL2005 . Here is the cluster log leading to cluster failure as pointed out by earlier posters on this thread .
000008d0.00000610::2008/06/19-08:56:49.653 ERR SQL Server : [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
We plan to change following para in registry and eliminate the write cache on storage .
HKLM-->System-->CurrentControlSet-->Services--TCPIP-->Parameters
Change the following entries to 0 and reboot
EnableTCPChimney
EnableTCPA
EnableRSS
Will this solve the problem.
We have also increased the isalive and lookslive values from 5000 and 60000 to 6500 and 80000.
Does this looks ok .
We use VNC , at time RDP and performance logs monitoring . Further SQL memory is et to utilize max physical memory (physical memory - 8GB) .
October 18, 2008 at 12:43 pm
We have resolved this issue on a couple of Clusters by setting the following in Registry on the nodes(After having updated service packs, driver, firmware and disabling TCP/IP offload)
1. Set TcpMaxDataRetransmissions to 30 (decimal);
http://technet2.microsoft.com/WindowsServer/en/library/7dac9001-3e55-4e9c-b0fa-52841ece2fdd1033.mspx
2. Set KeepAliveInterval to 25000 (decimal).
http://technet2.microsoft.com/WindowsServer/en/library/734570a2-06d6-450e-b765-ccfa7530af491033.mspx
It worked for us you mileage may.
October 21, 2008 at 4:30 am
Hi Thomas,
Many thanks for that. I'll certainly give it a try.
Cheers,
Dave.
Viewing 15 posts - 1 through 15 (of 32 total)
You must be logged in to reply to this topic. Login to reply