May 18, 2015 at 4:09 am
Hi,
I was wondering if anyone could help with a problem I am experiencing with the failback of a SQL Server 2014/Windows Server 2012 R2 FCI that is part of an Availability Group.
Overview of the set up:
3-node WSFC (Windows 2012 R2), 2 nodes within same datacenter and the other node in a second datacenter.
The 2 nodes in the same datacenter host an FCI that is configured as one AlwaysOn AG replica, and the node in the second datacenter is a standalone AG replica. Currently configured as FCI (Primary) and Standalone (Secondary) with manual failover mode and asynchronous-commit availability mode.
Overview of problem:
Failover at an Availability Group level works fine and I can happily switch between primary and secondary and back again using the appropriate commands within SSMS. The problem comes when we want to test running from the second node of the FCI. So within Failover Cluster Manager I've moved the SQL Server role from Node 1 to Node 2 and this works OK; this causes the AG role to fail over as well. When I come to fail back I repeat the same process, but this time moving the SQL Server role from Node 2 back to Node 1, and the AG role does not move this time. I try to manually move it but it fails, saying the selected node is not a possible owner of the AG cluster resource.
If I go into the properties of the AG cluster resource it shows only the current node (Node 2) as a possible owner. If I select Node 1 as well I can then move the resource back. The problem is that this property is reset each time it fails over, so only the current node is selected.
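The symptom described above can be sketched as a toy model. This is only an illustration of the possible-owner check and the reset behaviour reported in this thread, not the WSFC API; the class `ClusterResource`, the exception `NotPossibleOwner` and the node names are hypothetical.

```python
from dataclasses import dataclass, field


class NotPossibleOwner(Exception):
    """Raised when a move targets a node outside the possible-owner list."""


@dataclass
class ClusterResource:
    # Hypothetical stand-in for the AG cluster resource, not a real WSFC type.
    name: str
    owner: str
    possible_owners: set = field(default_factory=set)

    def move(self, target: str) -> None:
        # The cluster refuses a move to a node that is not a possible owner.
        if target not in self.possible_owners:
            raise NotPossibleOwner(
                f"{target} is not a possible owner of {self.name}"
            )
        self.owner = target
        # Behaviour reported in the thread: after the move, the list is
        # reset so that only the current node remains selected.
        self.possible_owners = {target}


ag = ClusterResource("AG-CORE1", owner="Node2", possible_owners={"Node2"})

try:
    ag.move("Node1")            # fails: Node1 is not in the list
except NotPossibleOwner as err:
    print("Move failed:", err)

ag.possible_owners.add("Node1")  # tick Node 1 in the resource properties
ag.move("Node1")                 # now the move is accepted
print(ag.owner, ag.possible_owners)
```

In this model the second move succeeds only because Node 1 was added by hand first, and the list collapses back to a single node afterwards, which matches what the properties dialog shows.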
Is anyone able to help provide insight into why this might be happening? I know you're not supposed to control/configure AG roles through the Failover Cluster Manager as it's not aware of the synchronization state of the AG replicas, but I can't see how else you would control a failover between FCI nodes.
Many thanks....
May 18, 2015 at 4:56 am
MrG78 (5/18/2015)
with a manual failover mode
That is correct, manual failover is the only possible selection when integrating an FCI as a replica.
See more in my Stairway to AlwaysOn series, starting at this link:
http://www.sqlservercentral.com/articles/FCI/107536/
MrG78 (5/18/2015)
The problem comes when we want to test running from the second node of the FCI. So within Failover Cluster Manager I've moved the SQL Server role from Node 1 to Node 2 and this works OK; this causes the AG role to fail over as well.
It won't cause a failover, as that is a manual process, but it will take the AG cluster resource offline. There is a difference.
MrG78 (5/18/2015)
When I come to fail back I repeat the same process, but this time moving the SQL Server role from Node 2 back to Node 1, and the AG role does not move this time. I try to manually move it but it fails, saying the selected node is not a possible owner of the AG cluster resource.
You'll need to provide more detail of the steps you're taking here as it seems that some parts are missing.
MrG78 (5/18/2015)
The problem is that this property is reset each time it fails over, so only the current node is selected.
Not a problem, it's by design. This is all detailed in my stairway series.
MrG78 (5/18/2015)
Is anyone able to help provide insight into why this might be happening? I know you're not supposed to control/configure AG roles through the Failover Cluster Manager as it's not aware of the synchronization state of the AG replicas, but I can't see how else you would control a failover between FCI nodes. Many thanks....
Can you provide any more detail and screenshots of what you're seeing? Error log messages may be helpful too.
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
May 18, 2015 at 11:46 pm
Hi Perry, thanks for your prompt reply. I have read your 'Stairways to AlwaysOn' series and found it extremely helpful in the set up of our AlwaysOn solution.
I have attached a number of files that show the cluster events from both Node 1 and Node 2 of the FCI within our AG configuration.
Two of the files show the events when we move the SQL Server role from Node 1 to Node 2 within Failover Cluster Manager, and show that immediately after moving this role the associated AG role is taken offline and started on Node 2 as well (with no interaction from myself).
Whereas when I move the SQL Server role from Node 2 to Node 1, this role moves over but the associated AG role doesn't move and remains online on Node 2. The error I receive when I try to move the AG role manually to Node 1 is the following:
The operation has failed
The action 'Move' did not complete
The operation failed because either the specified cluster node is not the owner of the group, or the node is not a possible owner of the group
When I add Node 1 to the list of 'possible owners' of the AG cluster resource it moves OK, so I don't understand why, when I move roles from Node 1 to Node 2, both roles move over fine, but when I move from Node 2 to Node 1 the AG role doesn't move (or go offline) until I manually set Node 1 as a 'possible owner'.
How would you manually change between nodes in an FCI that is part of an AG replica? For instance, if you need to apply Windows patches and want to patch the inactive nodes of the FCI first.
May 19, 2015 at 2:10 am
Apologies, here is the attachment.
May 19, 2015 at 3:52 am
Here's some more info showing config of AG and SQL Server cluster roles and resources.
Also showing the steps I took to move the role between nodes of the FCI. Again, just to clarify, this is purely regarding moving between nodes of the FCI (that is also an AlwaysOn replica) and not the failover of AlwaysOn between replicas.
In the examples, Node 1 and Node 2 are the two nodes of the FCI and Node 3 is the standalone secondary replica hosted in a separate DC.
May 19, 2015 at 4:29 am
So from what I can understand you have 2 FCIs and both won't come online on the same node, correct?
Are you using any startup parameters for the instances?
May 19, 2015 at 6:30 am
Yes, that's right. I only specified one in the detail to keep things simple. No startup trace flags/parameters are set.
The SQL Server cluster roles SQL Server (CORE) and SQL Server (NONCORE) move between the two underlying nodes OK. It's just the behavior of the two AG roles, AG-CORE1 and AG-NONCORE1, that differs.
When I move SQL Server (CORE) and/or SQL Server (NONCORE) from Node 1 to Node 2, both these roles and the AG roles move over OK.
When I move SQL Server (CORE) and/or SQL Server (NONCORE) from Node 2 to Node 1, these roles move over OK but the AG roles don't move over and remain online on Node 2. Trying to move them explicitly to Node 1 results in the following error:
The operation has failed
The action 'Move' did not complete
The operation failed because either the specified cluster node is not the owner of the group, or the node is not a possible owner of the group
So I'm not sure why the behavior of the AG roles is different going from Node 2 to Node 1.
May 19, 2015 at 7:41 am
So you have 2 different availability groups, one each for the CORE and NONCORE instances, but they both use the same standalone instance?
May 19, 2015 at 7:48 am
Yes, that's correct.
May 19, 2015 at 8:25 am
And you're using SQL Server 2014?
May 19, 2015 at 8:29 am
That's correct, SQL Server 2014 and Windows Server 2012 R2.
June 6, 2016 at 9:57 am
We encountered the same issue. The primary replica is hosted on an FCI. When we fail over the cluster resource, the AG is automatically moved to the designated node. However, when we fail back, the AG does not move back and is left in the RESOLVING state.
What we can do when we encounter the issue is use SQL Server Management Studio to run the command "Alter Availability group test failover" to fix the problem. But this is not automatic, and it causes extended downtime.
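As a rough sketch of that workaround, the statement could be generated for any AG name like this. The helper `failover_statement` is hypothetical and only builds the T-SQL text; it does not connect to SQL Server. The name `test` is taken from the command above. Whether a plain `FAILOVER` or `FORCE_FAILOVER_ALLOW_DATA_LOSS` is appropriate depends on the replica's commit mode and synchronization state, so treat the flag as an assumption to verify before use.

```python
def failover_statement(ag_name: str, allow_data_loss: bool = False) -> str:
    """Build the T-SQL text for a manual AG failover (hypothetical helper)."""
    if not ag_name:
        raise ValueError("availability group name must not be empty")
    # Bracket-quote the identifier; a literal ']' is escaped by doubling it.
    quoted = "[" + ag_name.replace("]", "]]") + "]"
    action = "FORCE_FAILOVER_ALLOW_DATA_LOSS" if allow_data_loss else "FAILOVER"
    return f"ALTER AVAILABILITY GROUP {quoted} {action};"


# Statement text to run on the instance that should become primary.
print(failover_statement("test"))
```

The resulting text would then be run in SSMS (or through whatever client is in use) while connected to the instance that should take over the primary role.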
MrG78: Do you have a better solution now?
June 7, 2016 at 10:38 am
Finally, it turns out to be a bug. KB 2687741 fixes the issue.
February 8, 2017 at 11:03 am
Did we find a resolution to this issue? I have the exact same problem, and that KB article is for Windows 2008 R2, whereas we are on Windows 2012 R2.
April 28, 2017 at 6:54 pm
I would like to bump this back up as well.
We just experienced the same issue today with our FCI with a single AG. We manually failed over the SQL instance for patching, and the AG resource stayed up and running on the current node and went to RESOLVING. That's the first time I have seen it happen, and I am pretty certain we have fully patched SQL 2014 and Windows 2012 R2.
I would love to know what caused it because when these kinds of things happen, we end up getting asked by 50 different people what happened and why.