May 4, 2020 at 3:09 pm
Hello, lately we've been experiencing some odd connection and/or authentication errors while connecting to our SQL instances. These issues have been very intermittent and hard to troubleshoot, as they are not easy to reproduce. They have also occurred on a variety of SQL Server versions, from 2014 through 2017.
Symptoms:
Occasionally Agent jobs fail logging in to a local SQL instance. Usually when this happens we see something like the following in the Agent error log: Unable to connect to SQL Server ''. [298] SQLServer Error: 258, TCP Provider: Timeout error [258]. [SQLSTATE 08001] [298] SQLServer Error: 258, Unable to complete login process due to delay in prelogin response [SQLSTATE 08001]. The next run, a minute later, is successful.
Also we see in the agent jobs themselves, "Unable to connect to SQL Server 'instance name'"
Also, we are seeing connection troubles from an SSIS package on remote servers. The errors logged in the SSISDB for that run were "Client unable to establish connection - TCP Provider" and "An existing connection was forcibly closed by the remote host." The next run, one minute later, is successful.
We've also seen connections from SSMS reporting the following: "Connection Timeout Expired. The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement. This could be because the pre-login handshake failed or the server was unable to respond back in time. The duration spent while attempting to connect to this server was [Pre-Login] initialization=35375; handshake=0; The wait operation timed out."
We are seeing errors in the connection ring buffers on the source systems with an OSERROR of -2146893008 which appears to be related to SSPI.
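For anyone chasing a similar OSERROR value: the number in the ring buffer is a signed 32-bit rendering of a Windows HRESULT, and the hex form is much easier to search for. A minimal Python sketch of the conversion follows; the function name is made up for illustration, and the field layout is the standard HRESULT one.

```python
def decode_hresult(signed_value: int) -> dict:
    """Split a signed 32-bit HRESULT into its standard fields."""
    unsigned = signed_value & 0xFFFFFFFF       # reinterpret as unsigned 32-bit
    return {
        "hex": f"0x{unsigned:08X}",
        "failure": bool(unsigned >> 31),        # severity bit set means failure
        "facility": (unsigned >> 16) & 0x7FF,   # 11-bit facility code
        "code": unsigned & 0xFFFF,              # low 16 bits: the error code
    }

print(decode_hresult(-2146893008))
```

The value above comes out as 0x80090330 with facility 9, which is the SSPI/security facility. As far as I can tell, 0x80090330 is SEC_E_DECRYPT_FAILURE, which would point at the encryption layer of the handshake rather than at a failed login, but verify that against the Windows error code listings before relying on it.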
I think this is some sort of intermittent AD issue, but my Server/AD admins are saying they are not seeing any issues. Does anyone have any thoughts as to what might be going on?
Also, is anyone aware of any documentation on what exactly occurs during a SqlClient connection attempt? We are seeing pre-login handshake timeouts. I'd like specifics and documentation to point my network team to so they can help get to the root of the problem.
Thanks,
-Luke.
May 4, 2020 at 8:53 pm
My first thought when I start getting timeout errors is that it is something in the network stack.
If you are connecting to a SQL alias instead of the full server\instance name, it could be a delay translating that. It might not hurt to try connecting to the full server\instance name instead of the alias and see if it helps.
It could be the server is under heavy load and cannot respond to the connect request.
Things to watch for are:
1 - does this happen at a predictable time? If so, you have a nice window to watch resources. If not, you may need to capture resource usage long term until it happens again.
2 - when it fails, are there other things running on the server? If so, maybe try spacing things out more so not as much is running at the same time.
3 - are there enough free resources to allow a new connection? I am referring to BOTH the SQL instance having enough memory (Max Memory setting) and the OS (both CPU and memory) and the network. With the network, if, for example, you are doing a large data move on that server, you may not have bandwidth for the SQL Server. Network bandwidth is my least likely theory, but depending on how your IT team does QoS on it, you may be having some odd bottleneck there.
To me, this doesn't sound like an AD issue as it sounds like the timeouts are happening connecting to the SQL Instance; not to AD. Depending on the frequency of these, and your SQL configuration, you could try running a short term test of using a SQL account instead of an AD account for your jobs. This would rule out an AD account issue. That being said, if it was an AD issue, I would expect you could see that in the logs (either the AD log if the authentication failed or in the login event log on the server hosting SQL).
Another thing to look at would be the server logs; not just the SQL logs. The application and system logs may (likely do) have some useful things in them.
To summarize the above: check the server logs (application, system), check server resources at time of failure (CPU, memory, network), if you can, remove as many constraints as you can (AD for example) from the equation of the problem.
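Since the failures are intermittent, one way to capture them is a small probe that repeatedly times a bare TCP connection to the SQL Server port and records only the failed or slow attempts. The sketch below is a rough illustration under assumptions: the function names, thresholds, and the host/port in the usage comment are hypothetical. It only exercises the TCP handshake, not the TDS pre-login or TLS negotiation that follows it, so a clean result here does not clear the whole connection path.

```python
import socket
import time

def probe_tcp(host: str, port: int, timeout: float = 5.0) -> tuple:
    """Attempt one TCP connection; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:  # covers timeouts, refusals, and name resolution failures
        ok = False
    return ok, time.monotonic() - start

def probe_loop(host, port, attempts=10, interval=1.0, slow_after=1.0):
    """Probe repeatedly and return only the attempts that failed or were slow."""
    problems = []
    for i in range(attempts):
        ok, elapsed = probe_tcp(host, port)
        if not ok or elapsed > slow_after:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            problems.append((stamp, ok, round(elapsed, 3)))
        if i < attempts - 1:
            time.sleep(interval)
    return problems

# Hypothetical usage; substitute your own server name and port:
# print(probe_loop("sqlhost.example.com", 1433, attempts=60, interval=5.0))
```

Port 1433 in the usage comment is the default-instance port; named instances commonly listen elsewhere. If the probe stays clean while the pre-login timeouts continue, that suggests the problem sits above plain TCP connectivity.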
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
May 5, 2020 at 1:31 pm
Unfortunately, there was never anything of use in the Windows Event Logs or even the SQL logs. The only things we could tie back to the issue were the failed jobs (and job output log files) and entries in the ring buffer with a type of 'RING_BUFFER_CONNECTIVITY' noting the errors. My initial thought is network related as well, but our network folks are saying everything is ok. At this point I'm looking for additional avenues to examine or to help pinpoint the issue.
It does not happen on a set time and is intermittent. There are always other things running on the servers in question. The issue is more likely to occur on instances on a multi instance, multi node FCI implementation, with hundreds of databases spread over the instances. The connectivity errors were shown while connecting to all three instances but at various times on various days. We've also seen sporadic errors from other systems not related to this cluster, so we're looking at the entire environment.
When the issues occur on the large cluster, there is about 500GB of free memory per node (out of 3TB) and CPU utilization is normal for the instances, with more than 40% CPU available and average CPU queue length < 0.20.
May 5, 2020 at 1:42 pm
Could be the Windows version, or even AV misbehaving.
Have a look at https://support.microsoft.com/en-ie/help/2919863/you-receive-time-out-error-messages-when-you-connect-to-a-sql-server-2
May 5, 2020 at 3:04 pm
If these are VMs, it could be how the VM host is handling the memory. Ballooning (in VMware), I think, can cause odd issues.
It is weird that nothing is showing up in the logs. As a thought, it MIGHT not be an error or warning message popping up in the log, but an information message. I would check all messages in the system and application logs as well as any TCP and network logs during the time it fails.
If they are physical servers, with 500 GB of memory free, Memory pressure or CPU pressure doesn't sound like the culprit. How is the disk I/O? How is the network bandwidth during that time?
To track down potential network issues, I would try running a tracert from the server to a machine name, to your DNS server, and to your domain controller. At my work, we have multiple sites worldwide and I've had it bounce out to a different city (same country) and back while running tracert before. If you are getting some weird hops, it could be it is hopping to a server that isn't there anymore or a server that is incredibly far away, causing timeouts.
In the Windows event log, do you have any of the TCP/IP logging turned on? I think it is off by default if I remember right, but it may have something useful in it.
With the servers having the issue, are there any commonalities? What I mean are things like:
- Same OS version
- Same database version
- Same patch levels
- Same hardware
- Same VM cluster
- Same network switch(s)
- Same SAN
- etc
And then if there are any servers that don't exhibit this behavior, what is different between them?
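The comparison above can be done mechanically once you have an inventory. Here is a minimal Python sketch; the server names and attribute values are invented for illustration, not taken from any real environment. It intersects the attributes of the affected servers and then drops anything a healthy server also has.

```python
def shared_attributes(servers: dict) -> dict:
    """Return the attribute/value pairs that every server in `servers` shares."""
    inventories = [set(attrs.items()) for attrs in servers.values()]
    if not inventories:
        return {}
    return dict(set.intersection(*inventories))

# Hypothetical inventory, for illustration only.
affected = {
    "sql-a": {"os": "Windows Server 2016", "core_switch": "core-1", "san": "san-1"},
    "sql-b": {"os": "Windows Server 2012", "core_switch": "core-1", "san": "san-2"},
}
healthy = {
    "sql-c": {"os": "Windows Server 2016", "core_switch": "core-2", "san": "san-1"},
}

common = shared_attributes(affected)
# Keep only the shared traits that no healthy server also has.
suspects = {k: v for k, v in common.items()
            if all(h.get(k) != v for h in healthy.values())}
print(suspects)
```

With this made-up data the only trait left standing is the shared core switch. Real inventories are messier, but even a rough version of this narrows down where to look first.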
Frederico_fonseca's suggestion though seems to be spot on for the error message. I would hope that your server has that patch applied (came out in 2014), but doesn't hurt to check for patches.
To add to Frederico's comment, your Antivirus is configured to NOT scan your database files, correct?
If your network guys say it isn't them, I think you will need to rule out everything else. Sometimes it is more ruling out everything it isn't and the only thing left has to be the problem. You can only control, check, and test what you have permissions to.
May 5, 2020 at 6:06 pm
Frederico's suggestion is good, but these are Windows Server 2016 and/or 2012 servers with up-to-date patches as of April 2020.
The large cluster is running on physical machines. Network usage is minimal and they are within the same datacenter with 10Gb fiber networking. Disk I/O is within normal ranges, generally 5-10ms for read and write latency.
Unfortunately, there is not much in common between the affected servers. They all go through the same core switches, but I have no idea if there are issues there. The network team initially said no, but now they want to replace some SFP modules for one of the machines that they believe is sending a number of retransmits. I don't think this one machine could be causing that much of an impact to the others, but we'll see what happens after the maintenance window tonight.
Thanks for your input,
-Luke.
May 5, 2020 at 6:50 pm
Too bad it wasn't a Windows update patch, but it's also good that your servers are all fully patched.
As a thought - when did this problem start? It could be that a Windows or SQL update caused the issue too.
A bad SFP module could cause odd behavior so I would not be too surprised if the problem goes away after replacing that module.
Keep us posted!
May 12, 2020 at 4:46 pm
The issue started in earnest in the middle of April. The SFP swap didn't work and our network folks are still seeing packet discards on the switch ports, so they are running that down with driver/firmware updates etc. It's still ongoing, so we'll see where this leads. I'll update this again once we have more information.
Thanks,
-Luke.
May 12, 2020 at 5:04 pm
Hopefully fixing the network issues helps resolve the problem. My expectation is still that it will.
Packet discards can be tricky to diagnose, but they could be caused by overloaded switches (i.e. too much traffic for the switch to process) or misconfiguration. Since the problem started in the middle of April, I'd be looking at what changed around that time. Did you have any driver/firmware updates around that time? Did you get more servers or workstations or printers or any network-enabled devices? Did somebody install a cheap D-Link at their desk?
Hopefully the driver/firmware updates help. If they are Cisco managed switches, they have a lot of logging built in and they have a LOT of options you can configure on them. Could be a misconfiguration of the switch too.
Thanks for the update!
May 12, 2020 at 5:24 pm
This is an enterprise-class data center, with fully redundant multi-core 10Gb fiber network paths back to the core switches... The network has been in a rather large state of flux as they are working to upgrade the infrastructure to support 40Gb fiber later this year. I have no idea how many changes they are making to support that initiative. As far as driver/firmware updates, that's not something that was done during that timeframe. The only changes to the SQL servers were normal MS security patches and enabling SQL Audit on some of them, but I think we've already proved it wasn't SQL Audit. We turned it off and still saw the issues.
I'm hopeful the driver/firmware will resolve the issue for this particular server, but I'm not certain that is the root cause as we're seeing it in other places as well.
Thanks,
-Luke.
May 12, 2020 at 7:19 pm
If they know which switch specifically is having the issues (which is likely), it might be a configuration issue, especially if you were upgrading things. Now, I am not a networking guy by trade, but I've played around a bit with Cisco switches. I actually wrote up a GUI for connecting to Cisco switches, as the Java-based one that Cisco had was pretty crashy. Mind you, this was a good 10 years ago and the GUI was very limited. It was designed for basic configuration that could be done by a Tier 1 level technician (assigning VLANs to ports, getting port configs, checking port status, etc). The worst damage you could do was to shut off a port or a set of ports. You could pick a set of ports and put them all on the same VLAN... It was a fun tool to write. It was just making an SSH connection to the switch and then firing off commands to the switch, so nothing super fancy.
But from writing that tool, I did learn a lot about Cisco switches which I have since mostly forgotten. One thing I did remember though is that they have a lot of different logging and config you can do on them. Not to blame the network guys for a bad config, but maybe they have something configured incorrectly and more traffic than intended is going through a single switch. As it is an enterprise class data center, I am guessing that they have a lot of switches and if one of them is configured wrong, it can be a huge pain in the butt trying to narrow down which one is causing the problem especially since you can't really just disconnect one and see if the traffic improves. BUT, I expect they are Cisco or equivalent switches, in which case you should be able to check logs to see which one or ones are problematic...
I have also learned not to poke the bear when they are already grumpy. My advice for you right now is to sit tight and wait for the network team to fix things and hope they get that resolved soon as it likely is causing other problems too...
June 29, 2020 at 12:32 pm
Hi,
facing the same kind of issues (intermittent connection issues) on a comparable environment.
Haven't checked yet though because of project and other constraints but maybe you'll find it useful
Chrs
June 29, 2020 at 1:01 pm
Yes, this was the issue. We implemented the fix in the middle of the month and have been monitoring since, with no recurrence. We ended up disabling the DH ciphers on both the source and target servers. Unfortunately, MS didn't release the patch for Windows Server 2012, so there was a mismatch between what the two servers were using to communicate.
As long as there are plenty of other ciphers to use, this change is pretty safe. Our server admins did want a restart of the Server 2012 box, because the down-level PowerShell environment on that system meant the change had to be made manually, so there was some downtime involved, but we just paired it with patching.
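Before making a change like this, it is worth confirming which enabled suites are actually DHE and that non-DHE suites remain on both ends. A small Python sketch of that check follows; the suite list is an illustrative sample, not an export from any of the servers discussed here, and the function name is hypothetical. It matches the TLS_DHE_ prefix only, so ECDHE suites are deliberately left alone.

```python
def split_dhe_suites(suites):
    """Partition cipher suite names into (dhe, others).

    Matches the TLS_DHE_ prefix only, so ECDHE suites stay in `others`.
    """
    dhe = [s for s in suites if s.startswith("TLS_DHE_")]
    others = [s for s in suites if not s.startswith("TLS_DHE_")]
    return dhe, others

# Illustrative list only; export the real one from each server.
enabled = [
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_DHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_DHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_RSA_WITH_AES_256_GCM_SHA384",
]
to_disable, remaining = split_dhe_suites(enabled)
print(to_disable)
print(remaining)
```

Running the same check against the lists from the source and the target, and comparing the two `remaining` lists, shows whether the servers will still have a suite in common after the DHE ones are gone.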
Thanks,
-Luke.
June 29, 2020 at 2:04 pm
Thanks for the feedback Luke, much appreciated!