February 4, 2022 at 2:39 pm
Hi,
Our infrastructure topology is pretty simple. We have a DMZ that needs access to some instances on the LAN. This is what we have and are observing:
A mixture of standalone instances and AG instances
Connecting to any instance type from SSMS on the DMZ device is always successful - however the icon is the blue question mark
Expanding the Databases list on the standalone instances is successful
Expanding the Databases list on the AG instances fails, with Semaphore timeout message
Expanding the Security node is successful on all instances types
Doesn't matter if the SQL User Login used to connect to this instance is a sysadmin or not - most weirdly in some cases we've seen a non-sysadmin user able to "see" more than a sysadmin user
In T-SQL, we can do USE <dbanme> just fine but when querying a table, get the same Semaphore timeout issue
Running the T-SQL can be seen in Profiler, so the statements are at least hitting the target instance
We have essentially an "allow all" rule from DMZ > LAN for testing. Makes no difference
On the target instances OS, we have the Windows Firewall enabled with inbound rules to allow 1433, 1434, and 5022 for AGs
All instances are Named Instances but have static port assignment (set to 1433)
Absolutely no issues, ever, connecting to instances from SSMS on the LAN
Just cannot understand why the AG instances even struggle to list the databases when the standalone instances do it just fine. Clearly it is the DMZ but have no idea what we can sort of half connect to instances but not do a lot with them.
Any suggestions massively welcome.
Thanks!
February 4, 2022 at 3:20 pm
PS: Running:
Test-NetConnection -Port 1433 -ComputerName <SQLHostname>
from the device in the DMZ comes back successful
February 4, 2022 at 3:32 pm
Just to confirm something - what does the network traffic look like on the AG instances?
My thought here is that your non-AG instances seem to be working fine, but the AG ones are having issues, I am thinking it is a resource constraint issue. I may be going down the wrong rabbit hole, but I would start by looking at the server-side metrics - CPU, memory, disk I/O, and network. If those all look good (ie not saturated) I would look at your SQL instance configs. Is the Max Memory setting set properly? Does the server have enough memory to accommodate all of the Max memory values? That is if you have 10 instances on a single machine, all set to 32 GB max memory, you would need at LEAST 320 GB of memory on the server. Otherwise your server and instances MAY be memory starved. If you have multiple instances on the machine AND you never configured max memory, you MAY have 1 instance using up all of the memory on the server and the others are memory starved.
What I would do is try to replicate the environment on a test/dev environment and try with only 1 SQL instance running. If that works, slowly add more until things fail.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
February 4, 2022 at 5:37 pm
This is me guessing, but according to this article the SQL Browser Service uses UDP on port 1434. So maybe you're not allowed to use UDP on that port?
February 4, 2022 at 6:01 pm
Thanks for both those replies.
Yes SQL Browser 1434 is allowed through the Windows firewall (and the DMZ rules are currently also in "allow all" mode from the DMZ device to anywhere on the LAN)
Yes, Max Memory is done properly. I am properly particular about this on EVERY install I do so understand the principle very well; All instances Windows OS has 40GB and SQL is throttled to 32GB, leaving 8GB for OS
We don't use stacked instances. Every instance is on dedicated OS, so there should be no instance contention for either networking, CPU or RAM.
As I say what I find most weird is that I can always connected to the instances just fine (even no delay in connecting to them in SSMS). It's only when I try to expand Databases or T-SQL query them. What networking is required to just expand Database list that is different to initially connecting to the instance.
February 4, 2022 at 7:12 pm
Use either extended events or profiler to monitor the session from the DMZ. If everything works from the LAN with no issues - but you see issues like this from the DMZ, my first thought would be network routing.
Even if the firewall rules are all set correctly - if the routes from the server back to the DMZ is not set correctly then any data being sent back to the client could be timing out due to latency. That would show up as being able to connect, but as soon as you request any data you would get timeouts.
The semaphore timeout is telling you that there is most likely a network issue - not a SQL Server issue. So - something in the path to/from the DMZ to that SQL Server instance is having issues. It could even just be a faulty component somewhere in the path.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
February 7, 2022 at 9:03 am
I agree 100% is seems network related.
But why don't standalone instances have this issue. Why is it only AG instances. It's not like it's even hit and miss - all standalone instances are fine 100% of the time and all AG instances are broken 100% of the time
I'll have a look at extended events.
February 7, 2022 at 9:19 am
In this document, it mentions a related windows task when setting up an AG of setting the HostRecordTTL . I'm guessing that that's a setting that could influence the stability of a connection if not set as specified.
February 7, 2022 at 1:55 pm
I retract my comment about HostRecordTTL being able to influence the stability of a connection. I saw TTL and somehow immediately assumed it was related to network TTL. Which was silly, because the highest value for that is 255, of course.
The MS link didn't clarify what HostRecordTTL is, but this Oracle white paper about setting up a MSSQL server AG in their cloud actually does a better job of describing that:
By default, clients cache the Cluster DNS records for 20 minutes. This means that if the active
node in your availability group goes offline, it can take that long for the clients to receive the IP
address of the newly promoted active node. You can reduce this time by changing the Host
Record Time To Live (HostRecordTTL) setting. The following example changes it to 5 minutes.
You can specify a lower amount, but that might result in increased traffic to the domain name
servers.
February 7, 2022 at 2:28 pm
As an experiment, I'm assuming you are using Windows Auth to connect. Try SQL authentication and see if the issue goes away.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
February 7, 2022 at 5:50 pm
So the issue is the blue question mark on the server in SSMS?
That is down to there being no WMI response when SSMS asked for the service status so it doesn’t know if it’s up, down, in transition etc.
Its not a problem just means you cannot send a net stop, net start via SSMS to control the services
February 8, 2022 at 9:34 am
Using SQL auth
Blue question mark is 'part' of the problem, but not the only (biggest) problem. Though WMI is something I will look at.
February 8, 2022 at 5:37 pm
Using SQL auth
Blue question mark is 'part' of the problem, but not the only (biggest) problem. Though WMI is something I will look at.
If you were using Windows authentication, SQL needs to connect to the domain controllers first. Without looking it up or checking my firewall rules, this communication occurs on a variety of ports. Since you are using SQL auth, that does not apply. I suggest you download and run Wireshark to pinpoint exactly what is failing with network communication.
The blue question mark may be that the WMI service is not running or the SQL service account cannot access it. What kind of service account are you using? Are you using a domain account?
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
February 8, 2022 at 5:39 pm
Yes, all SQL Services are using domain accounts (and those accounts have all the SPNs for Kerberos)
February 8, 2022 at 7:37 pm
Just to step back a bit - you indicated that you have 1 SQL instance per VM and the ones without AG's are working fine, but everything in an AG is misbehaving. Network configuration has been ruled out by essentially disabling firewalls and/or ensuring the config of the firewall is the same with AG instances and non-AG instances.
Has this ever worked? If so, what changed to cause it to stop working? If it has never worked, are you certain that the AG is configured correctly?
Another thought - can said you run TSQL queries, can you run "SELECT * FROM sys.databases" from the master database?
What I am thinking is something is causing slow access to the user database files which is resulting in the timeout. If you can access system databases, login will succeed, but expanding the databases I would expect to give an error when SQL thinks things are ONLINE but it can't access the database file in a timely manner. USE <DATABASE> should work fine as master thinks things are online and working so changing databases it can do without a problem, but when you go to look at the data, it can't read the mdf file fast enough and you are getting that error.
If this is correct, I would be looking at the AV configuration and if that seems fine, I'd be checking the disk throughput on the disk that holds your MDF and LDF file.
I am thinking that the timeout for loading the MDF/LDF for user databases is longer than the timeout for SSMS to give you the semaphore timeout. SO SQL is able to load the files, but is just incredibly slow to request data from them.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
Viewing 15 posts - 1 through 15 (of 23 total)
You must be logged in to reply to this topic. Login to reply