October 13, 2011 at 6:34 pm
We have several very active, high performance on-line applications that I currently replicate, using mirroring and transactional replication, to another server located in a different datacenter for DR purposes. Our datacenters are located on the west coast and the east coast, and the application server farms are already running in active-active mode, but talking to only one of the SQL servers at a time. We want the SQL servers to run in active-active mode as well (application servers talking to the DB nearest them), both to simplify failover for disaster recovery and maintenance periods and to improve application latency. We are currently running SQL 2005 Enterprise fail-over clusters on both sides, but could consider upgrading to 2008 if needed.
At the moment I am lab testing this using merge replication with a 2 second polling interval on the merge agent. I can already see that I will have to set up individual conflict resolvers for each table to match business logic (this is going to be fun in itself).
We have some degree of freedom to make changes to the database schema and application code in some apps, but not in all. There is one app that comes from a third-party vendor.
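To make the per-table resolver idea concrete, here is a minimal sketch in plain Python. It is purely illustrative and is not the SQL Server resolver API; the table names (`UserProfile`, `LoginCounter`), the row layout (dicts with `base`, `value` and `modified` keys) and the function names are all made up for the example.

```python
from datetime import datetime

def last_writer_wins(west, east):
    """Keep whichever row version was modified most recently."""
    return west if west["modified"] >= east["modified"] else east

def sum_deltas(west, east):
    """For counter-like columns: apply both sites' changes to the common base value."""
    merged = dict(west)
    merged["value"] = (
        west["base"] + (west["value"] - west["base"]) + (east["value"] - east["base"])
    )
    merged["modified"] = max(west["modified"], east["modified"])
    return merged

# Hypothetical mapping of table name to the business rule for that table.
RESOLVERS = {
    "UserProfile": last_writer_wins,
    "LoginCounter": sum_deltas,
}

def resolve(table, west, east):
    """Pick the resolver registered for the table, defaulting to last-writer-wins."""
    return RESOLVERS.get(table, last_writer_wins)(west, east)
```

The point of the registry is that each table gets exactly one rule, and any table nobody thought about falls back to a predictable default instead of an arbitrary winner.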
Many, if not most, tables use auto-incrementing IDs.
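The concern with auto-incrementing IDs is that both sites must never hand out the same value. One common way around that is to interleave the values per site, which is what an identity column with a per-site seed and an increment equal to the number of sites would give you. The sketch below only illustrates the arithmetic; `site_ids`, `site_index` and `site_count` are hypothetical names. Merge replication also has its own automatic identity range management, which assigns each node a separate block of values instead.

```python
from itertools import count, islice

def site_ids(site_index, site_count, start=1):
    """Yield start + site_index, then every site_count-th value after it."""
    return count(start + site_index, site_count)

west = site_ids(0, 2)  # odd values when start=1
east = site_ids(1, 2)  # even values when start=1

west_sample = list(islice(west, 5))
east_sample = list(islice(east, 5))
```

Because the two sequences never overlap, inserts on either coast cannot collide on the key, no matter how long replication is down.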
The most typical conflict scenario would be when replication stops for some reason (network outage? replication agent crash?) and then needs to catch up.
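A rough way to picture the catch-up case: while replication is down, each site accumulates its own set of changed rows, and any key that shows up in both sets is a conflict that a resolver has to settle. The snippet below is only a sketch under that assumption; the keys and change descriptions are invented sample data.

```python
def find_conflicts(west_changes, east_changes):
    """Return the keys changed on both sites since the last successful sync."""
    return sorted(west_changes.keys() & east_changes.keys())

# Hypothetical changes accumulated on each site during an outage.
west_changes = {101: "address updated", 102: "limit raised"}
east_changes = {102: "limit lowered", 103: "phone updated"}

conflicts = find_conflicts(west_changes, east_changes)
```

The longer the outage, the larger both sets grow and the more likely they are to overlap, which is why the catch-up window matters more than the normal 2 second polling interval.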
My question is this:
- Has this ever been done before in a similar transactional, high performance scenario? I don't want to be on the bleeding edge here; this is a financial institution and things need to work reliably.
Can anyone share their experience? What worked and what didn't? Is this a totally crazy idea?
October 17, 2011 at 10:59 am
Bump! Anyone? Does anybody at all have anything to say about this? Am I on the total bleeding edge here?
October 17, 2011 at 3:08 pm
I would just like to make sure I am understanding you correctly since you appear to be alluding to performance, business continuity, disaster recovery and high availability all at once.
1. When you say you have "multiple high performance on-line applications", do you mean that the applications you currently host are indeed performing well? You have no empirical performance issues at this time?
2. Your applications are exposed through one data center currently? Your other datacenter is for DR purposes only? Nothing is exposed currently through the DR site?
I can speak more directly about the specific technologies you're asking about, but I need to nail down current conditions first, then I will move on to your goals as you've already stated.
October 17, 2011 at 4:24 pm
1. Correct, no issues, except for the network latency that results from web servers on the east coast having to talk to the DB on the west coast.
2. Incorrect. Web servers on both sites are currently active and talking to the same database, on the east coast or the west coast (if we fail over). The goal is to have a local DB for each server farm and to synchronize the data.
October 18, 2011 at 2:45 pm
Ok, then my next question would be:
3. Are the application tiers on the west coast and east coast exposing the same application? Or is it that some of the apps are exposed on the west coast and some on the east, and all talk to one data center for data?
October 18, 2011 at 2:57 pm
Yes, it's the same application running east and west, hence talking to the same DBs.
October 18, 2011 at 4:03 pm
4. So how do you decide which of your users get pointed to the west coast app tier, and which get pointed towards the east coast app tier?
October 18, 2011 at 4:32 pm
GCS - global content switch
October 18, 2011 at 6:51 pm
So you have a single routing/load balancing point making this decision?
October 19, 2011 at 11:01 am
No, they are mirrored too. Anyway, there is a full active-active setup across the entire infrastructure, except for the databases.
October 19, 2011 at 11:40 am
Well. Hrm. I guess I'll just ramble then and we'll see what we can see.
Based on what you've told me there is conditional routing of clients to multiple presentation points based on geography of said client. My presumption is that you do this to lessen latency in accessing the presentation tier. In doing so you, for a portion of your users, have simply moved said latency behind the presentation tier to the data tier communications. You've traded milliseconds for other milliseconds. Now you're looking to move those milliseconds once more, into the data tier completely through merge replication.
If you need to do this, you need to do this. I cannot possibly understand your entire business structure, technical structure, policies, SLAs, all of that through this forum, but as a rule I do not suggest this configuration. I am a big fan of there existing somewhere in an environment an authoritative iteration of a data set, and merge replication goes directly against that. Data changing in two environments means that neither environment has an authoritative data set and that you always have data in flight (theoretically). Both sites always know something the other one doesn't, and that makes me edgy.
Is it indeed worth trading latency for data that will now constantly be in flux? Is there a reason a more classical approach couldn't be used? One presentation tier (highly available), one authoritative data tier (highly available), and then appropriate backup and DR configurations to fail the whole thing over if necessary? Are you sure that the latency users would experience out on the public network channels is indeed higher than the latency you have on your private channels? Is the difference enough to justify putting a strain on data integrity?
I'm not anti-complexity by any means. Sometimes a business need demands it and there are no other options, but I just want to be sure that something more normal, mundane, easily dealt-with isn't possibly applicable.
October 19, 2011 at 11:59 am
I understand and agree with the points you made. They were presented to the business unit as a choice between data consistency and availability. The business owners chose availability with eventual consistency. The nature of the data stored in the database is such that consistency at all costs is not paramount. We can live with data being out of sync for 5 seconds, not a problem.
Unfortunately, geographical partitioning of data is not an option here. The same user can be routed to the west coast when using their desktop PC, and to the east coast by their carrier when using their mobile phone (some carriers route traffic through the east coast).
Regarding your point about using traditional methods: we already have all of that. We want to see if we can take it one step further. Traditional fail-over models are prone to failure and not as reliable as we would like. We want to have a setup where, if our west coast datacenter turns into a large smouldering hole in the ground, life just goes on as usual without the need for human intervention and complicated scripts.