SQL Named Instance Corruption During Multi-Site and AG Testing

  • I have "corrupted" a SQL named instance and I think I know how I did it. I'm at a loss on how to recover. Any suggestions would be helpful.

    I have setup a 3 node multi-site windows 2008 R2 cluster with 2 nodes at site A and one node at site B. The 2 nodes at site A have a FCI instance installed named say VirtualName\Violet. The node at site B has a named instance installed on it named say SiteBNode\Violet. The key here I think to my troubles is I used the same "second part" instance name for both instances.

    I started testing mirroring between the two instances thinking that this would be a good first step in testing that I'd be able to set up a 2012 Availability Group. I received some errors during the DB mirroring setup process. I ended up rebooting the site A nodes.

    Unfortunately I did not before hand remove the site B node from the list of failover nodes available for my FCI instance VirtualName\Violet before hand. So during the reboots the FCI instance tried to failover to the site B node. It was unsuccessful and I failed back the FCI instance to the site A node. Unfortunately the site B instance SiteBNode\Violet will not start up.

    The issue seems to be that the instance SiteBNode\Violet now sorta thinks it's the FCI instance. I say that because i get some errors in the event log about that instance trying to start up the virtual name of the FCI instance (in this example VirtualName). I did not do an "Add Node" installation for the FCI instance on the site B node, so this node should not know about the virtual name of the FCI instance.

    I tried the uninstall route with no success. A named instance uninstall fails on the Removal Rules check with a message that it's a clustered instance and can't be removed that way. I tried the remove a cluster node route but the instance is not available as a choice in the dropdown of listed instances. I'd prefer to do an uninstall, but can't seem to make any progress.

    I've looked in the registry. Under HKLM down to the Micorsoft SQL Server folder and can see a "Cluster" folder under the folder MSSQL11.VIOLET which I think shouldn't be there. I also see a value in the PipeName entry further down the reqistry tree that contains a reference to the VirtualName. These are the only differences I'm noticing in the registry between this messed-up instance and another functional named instance on this node.

    I'm leaning towards editing those two registry entries for this named instance on the site B node thinking that the instance might then just "revert" back to normal operations. Although my gut tells me it couldn't possibly be that easy.

    I'd be fine with a manual removal of the named instance on the site B node, although most of my google-ing results seem to assume that the built-in uninstall processes are running, they just don't complete all the way. So all that you have to do is delete your instance folders MSSQL11.VIOLET from the OS and then remove MSSQL11.VIOLET from the registry. Again my gut tells me it's not that simple when you are not running any uninstall processes at all.

    So manual removal? Uninstall? Registry editing? Something else?

    Any suggestions or fresh ideas are welcome.

    ...Thanks!

  • Ted Zatopek (1/14/2014)


    I have "corrupted" a SQL named instance and I think I know how I did it. I'm at a loss on how to recover. Any suggestions would be helpful.

    I have setup a 3 node multi-site windows 2008 R2 cluster with 2 nodes at site A and one node at site B. The 2 nodes at site A have a FCI instance installed named say VirtualName\Violet. The node at site B has a named instance installed on it named say SiteBNode\Violet. The key here I think to my troubles is I used the same "second part" instance name for both instances.

    I've not had any issues with this in the past, no reason to think it would be an issue now

    Ted Zatopek (1/14/2014)


    I started testing mirroring between the two instances thinking that this would be a good first step in testing that I'd be able to set up a 2012 Availability Group.

    Why?

    Just go straight to AO group configuration if that's what you intend to use.

    Ted Zatopek (1/14/2014)


    I received some errors during the DB mirroring setup process. I ended up rebooting the site A nodes.

    What errors did you receive?

    Ted Zatopek (1/14/2014)


    Unfortunately I did not before hand remove the site B node from the list of failover nodes available for my FCI instance VirtualName\Violet before hand. So during the reboots the FCI instance tried to failover to the site B node. It was unsuccessful and I failed back the FCI instance to the site A node. Unfortunately the site B instance SiteBNode\Violet will not start up.

    Hmm, unless someone has messed with the possible owners list for the VNN and the SQL Server rersource you should not have any issue. These are updated during the Add Node wizard.

    Can you confirm that the FCI fails over and starts correctly on both SiteA nodes?

    Can you supply full details of any messages from the windows app log and the sql log for the instance SiteBNode\Violet?

    Ted Zatopek (1/14/2014)


    So manual removal? Uninstall? Registry editing? Something else?

    STOP!

    Get details of any errors first before you attempt to bastardise the instance\registry.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (1/15/2014)


    I've not had any issues with this in the past, no reason to think it would be an issue now

    Good to know, I'll not have to do more redesign and rebuild.

    Perry Whittle (1/15/2014)


    Why?

    Just go straight to AO group configuration if that's what you intend to use.

    Well, I was thinking that since AO is built off of mirroring technology, I'd make sure that technology was working first as a way to see if my basic infrastructure is setup correctly.

    Perry Whittle (1/15/2014)


    What errors did you receive?

    I did not capture the exact message at the time. It was sometime like "cannot establish mirroring on DB" I do all my mirroring setup via t-sql. I got the error on setting up the partner on principal (after first setting up the partner on the mirror). A ran a stmt something like this.

    ALTER DATABASE DbName SET PARTNER =N'TCP://FqdnName:9023';

    Perry Whittle (1/15/2014)


    Hmm, unless someone has messed with the possible owners list for the VNN and the SQL Server rersource you should not have any issue. These are updated during the Add Node wizard.

    Can you confirm that the FCI fails over and starts correctly on both SiteA nodes?

    Can you supply full details of any messages from the windows app log and the sql log for the instance SiteBNode\Violet?

    All these nodes are VM machines. I've corrupted one of the VMs at SiteA, and after talking with the VM guys it's dead and gone. So I evicted it out of the cluster. I now have a one node FCI at SiteA, which is ridiculous as an implementation. Hopefully though still valid for testing the concept. I can bring the FCI instance offline and online on the SiteA node. It seems to be functioning normally.

    Here's the ending portion of the content of the sql errorlog for instance SiteBNode\Violet. I've been using generic names so for in this post. Here's some hopefully helpful name translation.

    SiteA FCI = SCVD\VIOLETDEV

    SiteBNode\VioletDev = CODVLPSQL01\VIOLETDEV

    2014-01-16 11:21:48.18 spid8s Starting up database 'master'.

    2014-01-16 11:21:48.29 spid8s 1 transactions rolled forward in database 'master' (1:0). This is an informational message only. No user action is required.

    2014-01-16 11:21:48.31 spid8s 0 transactions rolled back in database 'master' (1:0). This is an informational message only. No user action is required.

    2014-01-16 11:21:48.85 Server CLR version v4.0.30319 loaded.

    2014-01-16 11:21:50.45 spid8s Resource governor reconfiguration succeeded.

    2014-01-16 11:21:50.45 spid8s SQL Server Audit is starting the audits. This is an informational message. No user action is required.

    2014-01-16 11:21:50.46 spid8s SQL Server Audit has started the audits. This is an informational message. No user action is required.

    2014-01-16 11:21:50.74 spid8s SQL Trace ID 1 was started by login "sa".

    2014-01-16 11:21:50.79 spid8s Server name is 'CODVLPSQL01\VIOLETDEV'. This is an informational message only. No user action is required.

    2014-01-16 11:21:50.79 spid8s The NETBIOS name of the local node that is running the server is 'CODVLPSQL01'. This is an informational message only. No user action is required.

    2014-01-16 11:21:51.17 spid13s A self-generated certificate was successfully loaded for encryption.

    2014-01-16 11:21:52.18 Server Common language runtime (CLR) functionality initialized using CLR version v4.0.30319 from C:\Windows\Microsoft.NET\Framework64\v4.0.30319\.

    2014-01-16 11:21:53.83 spid13s Error: 26024, Severity: 16, State: 1.

    2014-01-16 11:21:53.83 spid13s Server failed to listen on 10.1.3.63 <ipv4> 2163. Error: 0x2741. To proceed, notify your system administrator.

    2014-01-16 11:21:53.83 spid13s Error: 26075, Severity: 16, State: 1.

    2014-01-16 11:21:53.83 spid13s Failed to start a listener for virtual network name 'SCVD'. Error: 10049.

    2014-01-16 11:21:53.84 spid13s Error: 17182, Severity: 16, State: 1.

    2014-01-16 11:21:53.84 spid13s TDSSNIClient initialization failed with error 0x2741, status code 0xa. Reason: Unable to initialize the TCP/IP listener. The requested address is not valid in its context.

    2014-01-16 11:21:53.84 spid13s Error: 17182, Severity: 16, State: 1.

    2014-01-16 11:21:53.84 spid13s TDSSNIClient initialization failed with error 0x2741, status code 0x1. Reason: Initialization failed with an infrastructure error. Check for previous errors. The requested address is not valid in its context.

    2014-01-16 11:21:53.84 spid13s Error: 17826, Severity: 18, State: 3.

    2014-01-16 11:21:53.84 spid13s Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.

    2014-01-16 11:21:53.85 spid13s Error: 17120, Severity: 16, State: 1.

    2014-01-16 11:21:53.85 spid13s SQL Server could not spawn FRunCommunicationsManager thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

    The part that originally caught my eye was how the process was trying to start the virtual name SCVD and it's associated IP adresss.

    Perry Whittle (1/15/2014)


    STOP!

    Get details of any errors first before you attempt to bastardise the instance\registry.

    Very good advice. I feel myself getting desperate to make progress on this project, and when I get desperate I can tend to try things that are overly risky. Of course, then things can get much worse.

  • Ted Zatopek (1/16/2014)


    I did not capture the exact message at the time. It was sometime like "cannot establish mirroring on DB" I do all my mirroring setup via t-sql. I got the error on setting up the partner on principal (after first setting up the partner on the mirror). A ran a stmt something like this.

    ALTER DATABASE DbName SET PARTNER =N'TCP://FqdnName:9023';

    What's in the error logs

    ted zapotek


    All these nodes are VM machines.

    All my test systems are built in vmware esx 😉

    Which replica is currently the AO group primary is it the FCI?

    Do you have a listener configured for the AO group?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (1/16/2014)


    What's in the error logs

    Here's the relevant errors from the log (I think)

    2014-01-10 13:10:22.53 spid34s Error: 1474, Severity: 16, State: 1.

    2014-01-10 13:10:22.53 spid34s Database mirroring connection error 4 'An error occurred while receiving data: '10054(An existing connection was forcibly closed by the remote host.)'.' for 'TCP://CODVLPSQL01:9027'.

    2014-01-10 13:10:42.25 spid32s Error: 1443, Severity: 16, State: 2.

    2014-01-10 13:10:42.25 spid32s Database mirroring has been terminated for database 'TestDB'. This is an informational message only. No user action is required.

    Perry Whittle (1/16/2014)


    Which replica is currently the AO group primary is it the FCI?

    Do you have a listener configured for the AO group?

    The intention for the AO group was that the FCI instance be the primary and the SiteB instance be the replica.

    I did not try to do any Availability Group setup. I didn't think I'd be able to setup much if I didn't have an instance to use for the replica. Woould doing that help with diagnosis?

  • Which replica is currently] the AO group primary, is it the FCI?

    Have you configured a listener for the AO group?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (1/16/2014)


    Which replica is currently] the AO group primary, is it the FCI?

    You've been very helpful and it seems like I'm missing something fundamental in my response to the question. I apologize. I do not have any AO group setup or a listener setup. I'm attempting to setup the sql instances so I can do so. My current problem is that I cannot bring up the instance (SiteBNode\Violet) that will hold the secondary replica. The FCI instance would be where initially the primary replica would be located.

    At one point I did have both SQL instances up and running. And then instead of going straight to setting up an AO AG group, I tried to setup a db that was being mirrored between the two. I ended up deciding to bounce the instance SiteBNode\Violet. I was able to take it offline, but not bring it online.

    That instance seems to be trying to bring up the virtual name of the FCI. Which doesn't make any sense to me. That instance should not be associated to that virtual name.

  • Ah ok no AO yet.

    Has the mirros session ever successfully connected\synchronised?

    What accounts are the sql server services running under?

    Did you use gui or tsql to setup mirroring?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • We decided to punt on this sandbox environment. I could not get the SiteB\Violet instance to come up. We rebuilt the SiteB server and are going to adding it to our existing test cluster. Hopefully, this goes much smoother.

    Thanks for all the help, suggestions and knowledge on this issue. It did help.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply