Poor performance using linked servers for cluster-to-cluster communication

Question

Poor performance using linked servers for cluster-to-cluster communication

Innerise

Ten Centuries

Points: 1020
More actions
February 18, 2014 at 3:38 am

#282579

Hi All - I'm hoping you can help with an issue we have...
We have 2 environments:
Production
- 2 nodes (Server A, Server B) in the same physical location
- 1 Windows cluster
- 2 single-instance SQL clusters
1. CLUSTER1 (active instance on Server A)
2. CLUSTER2 (active instance on Server B)
- During a particular process, the active instances on each server communicate via linked servers (each pointing to the other SQL cluster name).
Test
- 1 node (Server C)
- No cluster
- 2 SQL instances (both active)
- During the same process, the instances on server C communicate also via linked servers (pointing to the other SQL instance on the same server C).
The databases are identical in both environments and all SQL config is the same. However, the test server has half the amount of RAM and half the number of CPUs in the Production environment.
The issue we have is that when the process is run in the test environment, it completes 32% quicker than Production.
We've been careful to make sure that the tests are consistent, so we're pretty sure it comes down to the fact that Production contains instances on different servers, whilst Test has the instances on the same server.
We performed another test in Production, where we failed over one of the active instances, so that they were both running on the same server (server B). However, this made no difference to the Production run time.
The clusters in the Production environments have been inspected by Microsoft and have a clean bill of health.
Given these results, in cases where there are 2 clusters in communication, could the network overhead (of resolving cluster IPs, etc) account for such a large reduction in performance compared to a similar non-clustered solution?
Thanks very much!

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Grant Fritchey SSC Guru Points: 398916 More actions · Answer 1

It could be the network configuration, but it's hard to say. I'd suggest capturing the query metrics on both machines, the execution plans on both machines, and, most importantly, the wait statistics before and after processing your query on both machines (but especially on the production server). We can make all sorts of guesses as to why things are running slow. But with the wait statistics we'll know why. We can also look at the execution plans to see if the optimizer is making different choices in production (possibly caused by differences in statistics or maybe server settings).

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Innerise Ten Centuries Points: 1020 More actions · Answer 2

Here are the top 4 wait stats right after the process has finished...looks like I need to sort out the formatting...

PRODUCTION

Server A instance:

wait_typewait_time_spctrunning_pct

BACKUPBUFFER417962.2432.1932.19

CXPACKET380097.5 29.2761.46

OLEDB 299701.7423.0884.54

BACKUPIO89200.91 6.8791.41

Server B instance:

wait_type wait_time_spctrunning_pct

BROKER_EVENTHANDLER34052.42 50.9650.96

ONDEMAND_TASK_QUEUE15122.11 22.6373.59

CXPACKET 4665.22 6.9880.57

OLEDB 4513.13 6.7587.32

TEST

SQL Instance 1 (equivalent to Server A instance):

wait_typewait_time_spctrunning_pct

BACKUPBUFFER441707.2 39.6139.61

CXPACKET265058.3 23.7763.37

OLEDB 235125.6521.0884.46

BACKUPIO112384.3210.0894.53

SQL Instance 2 (equivalent to Server B instance):

wait_type wait_time_spctrunning_pct

ONDEMAND_TASK_QUEUE79605.99 88.5988.59

CXPACKET 5888.4 6.5595.14

ASYNC_NETWORK_IO2200.14 2.4597.59

WRITELOG 889.81 0.9998.58

Grant Fritchey SSC Guru Points: 398916 More actions · Answer 3

Problem with only having the wait statistics after the query and not both before and after is that we can't tell what waits were experienced by the query. You need to be able to compare in order to understand what waits were experienced by the query.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Innerise Ten Centuries Points: 1020 More actions · Answer 4

Sorry, I should've mentioned that I cleared the wait stats prior to each run.

Grant Fritchey SSC Guru Points: 398916 More actions · Answer 5

The first server on both systems has pretty similar waits. It's the second servers where things get different, especially with the async_network_io, which doesn't usually indicate network problems. Normally it's an indication of waits on the client to consume data, so, there might be something there considering this is a linked server issue.

Why so much waits on backups though?

And what about the execution plans?

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Innerise Ten Centuries Points: 1020 More actions · Answer 6

The backups take a good couple of hours, as there are several 1TB+ databases on the instance. They backup to a local disk, across multiple backup files.

The process itself is a bit of a beast and contains dozens of SPs. I'll compare the execution plans

between environments and will get back to you.

Thanks for all the help so far!