June 2, 2010 at 7:21 am
While this could work as a temp fix, the SP2 fix we were talking about would be better.
The main danger in this is that when you look at the registry, the files found on Node 1 are registered under the D: drive (or whatever drive you specified).
On Node 2, the registry has them under the C: drive.
In SQL they are registered to the installation drive, which is the D: drive, and that is why it is failing on Node 2.
Copying the folder would work to fix this error. However, I haven't looked into what else may go wrong from having the install registry for the two nodes looking at different locations. Having SQL and the registry looking at two different places could eventually cause a problem; I just haven't looked into this enough to find one.
As you can tell, I took the path of fixing my error by copying over the files.
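If you want to double-check where SQL itself thinks the subsystem DLLs live, the paths SQL Agent uses are stored in the msdb.dbo.syssubsystems table. A quick look (just a sanity check, nothing fancy):

select subsystem, subsystem_dll
from msdb.dbo.syssubsystems

Compare the paths that returns against where the files physically sit on each node, and you'll see the D: vs. C: mismatch I'm describing.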
Twitter: @SQLBalls
Blog: http://www.SQLBalls.com
Channel: https://www.youtube.com/@Tales-from-the-Field
June 2, 2010 at 7:32 am
In COPYING the files I would not be removing them from the old location. So all the links in the registry that point to registered DLL files and such would still be valid.
SP2 may fix the issue during installation, but how does that help us poor saps who already have farms in production? Is it supposed to fix the incorrect links in the msdb database? If so, how does it correct the link when the cluster fails over and the path changes?
I have a feeling MS is going to leave people in our situation out to dry on this one. Like I said, the SP will fix it for future installs, but that doesn't help us one bit.
Since these nodes are in production, ripping SQL out of one of them isn't the best solution in my mind. Maybe Microsoft should let us just "reinstall" these components on our passive nodes to the correct locations... :shrug:
June 2, 2010 at 7:40 am
I understand your concerns; I'm hoping that SP2 will re-register the components in the registry to match the way the cluster is configured.
But I don't think we will know until the SP gets here.
And I agree, I don't know what problems could be caused. I left the folder on the C: drive and copied it to my D:. The registry is pointing at valid items, and SQL is pointing at valid items.
But I'm sure there is something that could go wrong with this. Everything is continuing to work just fine for me, but if something does go wrong I will be taking a hard look at this part of the configuration.
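If you want to check the registry side from inside SQL, xp_regread can pull the value for you. Fair warning: it's undocumented, and I'm going from memory on the exact key and value names here, so verify them in regedit before trusting this:

declare @toolspath nvarchar(512)
exec master.dbo.xp_regread
    @rootkey = N'HKEY_LOCAL_MACHINE',
    @key = N'SOFTWARE\Microsoft\Microsoft SQL Server\100\Tools\ClientSetup',
    @value_name = N'Path',
    @value = @toolspath output
select @toolspath as ToolsPath

Run it on each node while that node owns the instance, and compare it to what msdb.dbo.syssubsystems says.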
Twitter: @SQLBalls
Blog: http://www.SQLBalls.com
Channel: https://www.youtube.com/@Tales-from-the-Field
June 2, 2010 at 8:30 am
I would NOT just copy the files over. In addition to the registry settings being wrong after the copy, there may be COM objects that have their locations registered. There may be unforeseen side effects down the road, and Microsoft will not support just copying the files over. If you're looking for a quick fix without regard to proper process, reproducibility, or support, then by all means hack your server by just copying the files over. But if I had an employee who suggested using that strategy on a production server I'd be thinking of how I could fire that employee so I could hire someone who has more experience.
--Mr. SQL Guy
June 2, 2010 at 8:38 am
DUDE chill out.
There are other considerations than just what you want to do. I completely agree with ejecting the node, rebuilding it, and re-adding it.
I was flat out denied by my management chain. DENIED.
Not everyone has the option to do the best scenario 100% of the time.
I made sure to state the obvious: the registry, which would work hand in hand with the COM objects, is looking at a different location.
BAD THINGS COULD HAPPEN.
BUT, and this is a big one, we can only do what we are allowed. You go into meetings, you preach the risk, you offer to come in on the weekend and fix it, or at 2 am on a weekday if that is when the user volume is low enough to allow it. But if your management chain says NO, then when it blows up all you can say is "this is due to the issues I pointed out previously," and go fix it.
I am not saying this is ideal. I'm just saying it is what it is, and I can't speak for everyone, but even being the Sr. DBA only goes so far when the CIO overrides you.
Twitter: @SQLBalls
Blog: http://www.SQLBalls.com
Channel: https://www.youtube.com/@Tales-from-the-Field
June 2, 2010 at 8:46 am
Mr. SQL Guy,
I appreciate the info you shared with us, as it identified an important issue/bug, although I don't share your opinion that someone who uses this as a solution should be fired. Understand that what is best for you may not be best for someone else. Also, the DLL registrations and COM entries in the registry would still all be VALID, because we're not removing files from the old location.
The only problem I could see is if you update the server in the future with patches or service packs: the copied files would have to be manually copied again, or could end up as dissimilar versions. Not a huge deal in my mind to work with until we see some kind of an official fix from Microsoft.
Something else I thought of that may stand some small chance of working: what if we added C:\program files (x86)\microsoft SQL server\100\tools\binn to the system path on the server?
June 2, 2010 at 8:48 am
Yes, I've worked with management who have imposed the same restrictions, so I feel for you. I wasn't really addressing your post as much as the posts from zaubut. You made it clear it was a temp fix, and in an emergency that's ok. There are possibly other ways of doing this too, for example if you're running SQL Enterprise just add a third node to the cluster, then remove the bad node. That reduces the risk dramatically. If you're running SQL Standard then you could still prep a third server and have it ready to go to add to the cluster as soon as you remove the bad one. That reduces the time and risk of completely removing the SQL Server install from the bad one before you reinstall. In my experience it's pretty hard to remove SQL Server totally--we have complete step-by-step directions for doing uninstalls as well as installs because of the trickiness.
--Mr. SQL Guy
June 2, 2010 at 9:51 am
I have another question, which is: what would the harm be in not fixing this until MS releases the SP? Right now it only seems to be impacting my syspolicy_purge_history job and nothing else, really. As I understand it, this is a job to scan databases for compliance with rules I define. I'm not using this feature, and thus no data should be recorded into the msdb table. So is it even necessary for this job to run?
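If it really isn't needed, I'm tempted to just disable the job for now so it stops logging failures. Something like this should do it (assuming the job still has its default name):

exec msdb.dbo.sp_update_job
    @job_name = N'syspolicy_purge_history',
    @enabled = 0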
June 2, 2010 at 10:13 am
I don't think SP2 will retroactively fix the problem for servers that have already been installed. It will only fix it for future installs. If you've already installed the subsystem features to different locations on your primary and secondary server, I think you'll still need to do the "remove node, uninstall SQL Server, add node" routine to permanently fix your installation.
--Mr. SQL Guy
June 2, 2010 at 10:45 am
Ugh... that is just great. Thanks M.S.
So the active node has it on D:, and that is where I want it. However, I can't install it on D: on additional nodes, from what I have been reading. So what I need to do is fail over to the passive node, rip the active node out of the cluster, and uninstall SQL. Then re-add it to the cluster and reinstall SQL choosing the default location, C:.
What a pain in the butt. I think I'll just use workaround #1 on the passive node:
use msdb
go
delete from msdb.dbo.syssubsystems
exec msdb.dbo.sp_verify_subsystems 1
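(From what I can tell you may also need to restart the SQL Agent service afterwards so it re-reads the subsystem paths, but don't hold me to that.)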
The thing is that I run an instance on each of my 2 servers. One is active on one server, and the other is active on the other server. They fail over to each other. I run SQL Standard, so adding a third server isn't an option, and my SAN only supports connecting to two servers (it only has 2 fiber RAID cards).
So I'm kind of stuck.
Ripping a server out of the cluster and re-adding it is a real PITA, especially since that requires powering nodes down and up during the cluster creation process. These servers are over an hour's drive away in our co-location facility, and they service a 24/7 e-commerce web site. So doing this would not be easy, to say the least.
June 2, 2010 at 11:01 am
yeah, you understand the issues. :crying:
However, from your post you might be forgetting the following:
1. When you re-add node #1 to the cluster, you won't be asked for any paths. It automatically decides all that for you. In fact the re-add/re-install goes very quickly on "add node" because you've already made most of the cluster decisions.
2. Downtime will actually be very minimal. Probably an initial 45 second downtime to move the clustered services to node #2. Then another 45 second outage when you move them back to node #1 after you've added node #1 back in. Of course you want to move them back to node #1 to make sure that node really works. 🙂
--Mr. SQL Guy
June 2, 2010 at 11:08 am
So you're not talking about actually ripping the server itself out of the cluster, just removing the clustered instance of SQL.
So I would want to remove the node that has it on D:, which is my current active node. When I re-add the node it will install to C:, which will match my existing passive node.
I am guessing I would still need to run:
use msdb
go
delete from msdb.dbo.syssubsystems
exec msdb.dbo.sp_verify_subsystems 1
after the process. I can't see the re-add updating the tables in msdb, which would still be pointing to D: when I want them pointing to C:.
June 2, 2010 at 11:34 am
I don't think you need to run that code, because it only rebuilds the path to the subsystems, and after the reinstall those paths should already be pointing to your C: drive. But at step 2 below you may need to run it if you've already changed the path.
Below are the steps I followed; this should work for you. We have 3 clusters here and I did this successfully on all 3:
1. Move ALL clustered resources to node #2. You'll have to use the command line (cluster group "<groupname>" /move) to move the cluster group and the available storage group, because you can't do that through the GUI. Just use "cluster group" at the command line to get a list of all your groups and which node they're currently on.
2. Run the syspolicy_purge_history job on node #2 to make sure it completes successfully (node #2 is the "good" node); if you'd rather start the job from a query window, see the sketch after these steps.
3. Remove node #1 from the cluster.
4. Uninstall ALL of SQL Server from node #1. This is the tricky part. You have to remove ALL components, and I even remove the leftover directories and registry entries after the reboot, just to be sure. We have directions on how to do this cleanly; if you want me to post our directions here, I can do that.
5. Re-add node #1 to the cluster. Then immediately apply whatever service pack or SQL patches you need to make it exactly the same version of SQL as node #2.
6. Perform a SQL Server cluster switchover to node #1.
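For step 2, if you'd rather kick the job off from a query window than from the Agent tree, this should work (assuming the job still has its default name):

exec msdb.dbo.sp_start_job @job_name = N'syspolicy_purge_history'

Check the job history afterwards to confirm it actually succeeded.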
--Mr. SQL Guy
June 2, 2010 at 11:36 am
I just saw that you said you have 2 instances on each node. So I guess you'd need to remove both instances on node #1 from the cluster, then add both back in. We don't run multiple instances so I don't have any experience clustering multiple instances. But the directions should be the same I think.
--Mr. SQL Guy
June 2, 2010 at 12:04 pm
I have to be honest with you, I’m not entirely comfortable with this so far. I am a systems admin, but I play the part of a DBA on the weekends because my company lacks the resources to hire a dedicated DBA. Soooo… Lucky Me.
1. I'm unclear on which node to remove it from. Node 1 is fine and has it on the D: drive. Node 2 is not, and it has it on the C: drive. I'm thinking I remove it from node 1 and reinstall, but that would mean I am trying to fix a problem on node 2 by uninstalling/reinstalling on node 1. Is that right?
2. You say you can't move cluster resources in the GUI. Can't I fire up Cluster Administrator, right-click on the group, and select "Move Group" to move it to the other node? Or is what you're talking about not the same? I believe it is, from my understanding of clustering.
3. When you say remove the server entirely from the cluster, are you talking just about removing SQL as a clustered service, or actually removing the server itself from the cluster? If the latter, that requires powering servers down and up as you gracefully add cluster resources to prevent resource conflicts (according to MS: http://technet.microsoft.com/en-us/library/cc739757(WS.10).aspx). I don't see how that results in only 45 seconds of downtime.