May 3, 2022 at 4:53 pm
Hi,
I know this error and why it happens - very well in fact.
Usually this is my config (simplified):
2 SQL hosts, SQL01 and SQL02
WSFC between them, SQLWSFC01
WSFC Security Group, WSFCSG01, contains SQLWSFC01
WSFCSG01 security group is given permission in DNS to be able to manage the Listener DNS record
WSFCSG01 security group object is given permissions to the OU where SQLWSFC01 resides:
...to allow it to 'Create/Delete' computer objects on 'This and descendant objects'
...to allow it to have Full Control to computer objects
This ALWAYS works, I've even checked current deployment against a previous one I did and it's the same.
What am I missing... where can I look?
Thanks
May 3, 2022 at 5:51 pm
You say you know this error very well, but sometimes it helps to go back to the basics and just double check the VERY basic configuration.
Have you come across this article before:
It has steps to check and things to do to try to resolve the error code you mentioned. Not trying to second guess your work, but usually when I hit odd snags like this, going back and double checking the "simple" stuff can lead me in the right direction.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
May 4, 2022 at 9:02 am
Yes I've read that article before. By "very well", I mean the "flow" of things in terms of what happens, i.e. the WSFC computer object is what needs access to create the AG objects, not the current user creating the AG etc. etc.
I've asked the AD team to pre-stage everything anyway to test - even though this isn't needed since the right objects have the right permissions to create things in AD/OU/DNS etc.
May 4, 2022 at 9:08 am
Weirdly, last week the AG Wizard was successful in creating the AG, but failed as above with the Listener. Yesterday I manually created the Listener and it worked (nothing was pre-staged ever).
Today, a new AG was created but again the Listener failed as part of the Wizard. AND failed when doing it manually after the Wizard. Towards the end of March (during testing), the AG Wizard was successful in both the AG and the Listener - not even a blink of an issue.
It's not my AD environment and I have no control over AD so am feeding back to the AD team. They have AD auditing tools and comparing end of March to now, nothing has changed as far as the WSFC object is concerned, for permissions to DNS and AD/OU.
So it seems hit and miss.
May 4, 2022 at 9:09 am
PS: The permissions to AD/OU/DNS are exactly the same I have set in previous environments, and there it worked 100% of the time. I had to build about 20 AGs.
May 4, 2022 at 9:11 am
"Not trying to second guess your work, but usually when I hit odd snags like this, going back and double checking the "simple" stuff can lead me in the right direction."
Didn't think you were, and I know what you mean about the "basic" stuff 🙂
May 5, 2022 at 9:24 pm
I personally find that the wizards are hit and miss and I prefer to set things up myself where possible. I've had the wizards say "everything is great" only to discover that one step failed, but it kept going. Or it'll say it failed and roll the whole thing back when it is a simple thing to correct.
With intermittent issues like that, my first thought is usually the network. I've seen faulty ethernet cables cause all sorts of strange issues, for example. I've seen faulty patch cables pass along enough good packets for most simple tests to succeed, but as soon as a load is put on them, they fail. I'm not 100% sure what was wrong with the cable, as it would pass the gigabit tests with our cable tester, but when trying to stream video across it, it would have TONS of packet loss.
Might not be the network, but since nothing changed on your end and it just suddenly worked at one point, I would be thinking it is either some intermittently faulty hardware OR someone is changing configurations on something (firewall for example) and not letting you know.
What may be interesting to do is a constant ping between all boxes participating in the AG and let that run for a few hours and then cancel it and look how much packet loss you had. If you see a lot of packet loss, I'd bring in the network team.
The main reason I don't think you are doing anything wrong is that you changed nothing, just tried it again, and it sometimes succeeds. To me this indicates the problem is NOT with the setup or configuration you have in place, but is something likely related to the network (in this case).
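To make the packet loss check above a bit less manual, here is a minimal Python sketch. It assumes you saved the output of a long-running ping (something like `ping -n 10000 SQL02` on Windows) to a text file per host and just want the loss figure pulled out of the summary line. The function names and the 1% threshold are made up for illustration, and the regex only covers the English-language Windows and Linux summary formats.

```python
import re


def packet_loss_percent(ping_output):
    """Return the packet-loss percentage from a ping summary.

    Handles the Windows form "(25% loss)" and the Linux/macOS form
    "25% packet loss". Raises ValueError if no summary is found.
    """
    match = re.search(r"(\d+(?:\.\d+)?)%\s*(?:packet\s+)?loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def hosts_over_threshold(outputs, threshold=1.0):
    """Given {host: ping_output}, return the hosts whose loss exceeds
    the threshold (in percent), sorted by name."""
    return sorted(
        host
        for host, text in outputs.items()
        if packet_loss_percent(text) > threshold
    )
```

Feed it the saved text for each box participating in the AG; any host it returns is one I'd hand to the network team along with the raw ping log.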
August 17, 2022 at 4:45 pm
Hi!
Thanks for the replies.
Have an update on this...
created 2 new servers (VMs)
added Failover Clustering
created Cluster
installed SQL
enabled AlwaysOn in Configuration Manager
boom - creating an AG with an AGL in the wizard worked with no issues at all. The new setup has exactly the same configuration/permissions in AD and DNS to create the Listener computer object and DNS record.
Is there anything I can check on the original Windows hosts themselves? And is there a way to "undo" all the AG config, including things like removing Endpoint URLs, so the SQL instances are like they never had an AG attempted on them before?
Thanks again
August 17, 2022 at 4:47 pm
The hosts the VMs are running on are connected to the same network switches as the current physical SQL servers. Cluster.log is giving false information as well since it's complaining that the chosen Listener IP is a duplicate when it's 100% not!
I'm really stumped with this one!
August 17, 2022 at 4:48 pm
I'm planning to trash all the Windows Failover Cluster config. (none of the SQL is using it yet) and start again, but would ideally like to know the root cause.
August 22, 2022 at 3:40 pm
Another update:
We have stacked instances on the Windows hosts. I've just tried with what was the first named instance to be installed and it worked perfectly first time.
So I then tried on a different instance - a different one to what I've been trying so far - and it has the same error as the instance we've previously tried.
So it seems like any instances installed after the first instance have this issue.
Each instance service port is 100% unique (set statically and manually) - and is the same port across both hosts
Each listener port we're trying to use is 100% unique for that instance
Each instance endpoint ports are unique to that instance
Each other instance fails with the same error in cluster.log ("duplicate IP address for listener")
There is definitely not duplicate IPs though!
August 22, 2022 at 4:23 pm
Actually - I've just done this last test:
The AGL IP that worked in INSTANCE1 was x.x.x.100
The AGL IP that didn't work in INSTANCE2 was x.x.x.101
x.x.x.101 does not respond to a ping.
If I use .100 in INSTANCE2, it works. If I use .101 in INSTANCE1, it fails. So it's actually looking like a duplicate IP after all. How does SQL/Windows Clustering determine that an IP is a duplicate?
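A note on the ping test: no reply to a ping does not prove an address is free, because a device can drop ICMP and still answer ARP for that IP. As far as I understand it (treat this as an assumption, not something confirmed in this thread), the duplicate check happens at the Windows networking layer via ARP when the IP resource comes online, not via ping. A rough way to check from a host on the same subnet is to ping the address and then look at `arp -a`. The Python sketch below parses Windows-style `arp -a` output; the function name and the 10.0.0.x addresses are hypothetical stand-ins for the masked x.x.x.101.

```python
import re

# One row of Windows "arp -a" output: IP, MAC (dash-separated), entry type.
_ARP_ROW = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"((?:[0-9a-fA-F]{2}-){5}[0-9a-fA-F]{2})\s+(\w+)",
    re.MULTILINE,
)


def find_arp_owner(arp_output, ip):
    """Return the MAC address (lowercase) listed for `ip` in the given
    "arp -a" text, or None if the address has no entry."""
    for row in _ARP_ROW.finditer(arp_output):
        if row.group(1) == ip:
            return row.group(2).lower()
    return None
```

If this returns a MAC for the listener IP while the listener is offline, something else on the subnet is answering for that address, and the MAC is what the network team needs to track it down. No entry is weaker evidence, since ARP only sees devices on the same subnet that responded recently.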