May 28, 2008 at 4:01 pm
We are having a Production problem where the SQL server in production is giving us half the throughput of Test. Here is a quick summary -
Both Production and Test SQL are at -
a. SQL 2000 SP4
b. W2K3 SP2
IBM confirms that the hardware the test and Production servers are on is exactly identical. In order to eliminate any intereference from the application server side, we ran the test from the same production application server to both test and Production databases. Still test gives us 540 rows per sec (540KB) and Production gives us 267 rows per sec (267KB).
Any ideas?
Thanks in advance!
May 28, 2008 at 4:08 pm
are the databases the same size, containing the same data, Stats up to date, database fragmented? Indexes missing, different? Network bandwidth the same? configuration values all the same? AV?
---------------------------------------------------------------------
May 29, 2008 at 5:29 pm
Does it help if I said the Database from production was backed up and restored in Development and then the test was run from the same application server that connected to the Production database and still the throughput when the app server connected to the same Production db on the test server was much higher than when it connected to the Production db on the Production server?
network bandwidth was confirmed by IBM to be the same .....
I do have an observation ... but do not know how significant it is ...
the Test server Drive F: where the DB resides is a "Spanned Volume" with 5 LUNS under it ... all RAID 5. The corresponding Production disk is a "Simple Volume" with just one big LUN under it.
Thanks in advance for your advice!
Leon
May 29, 2008 at 10:11 pm
The different "disks" being used for the two servers will definitely make a significant difference. Whether it accounts for all of the difference should be able to be determined by asking your SAN administrators to monitor for queuing on the LUN's involved. I would expect that the prod server (with a single LUN) will be experiencing a lot of queuing.
The test server appears to have far higher I/O capacity and is coping with the demands of your test.
May 29, 2008 at 10:30 pm
That is what I thought ... man but IBM is such a pain the butt .. they just discounted that as "insignificant difference" ..... they keep saying Disk on Production is not hitting any queuing.
I am going to ask them to mount 4 different LUNS under a single Logical drive and let me run the test and see if that changes the numbers
Thanks!
Leon
May 30, 2008 at 4:30 am
Yeah - other things that may also be affecting you include the cache that are allocated to the LUN's. There is also the potential that the I/O requests were all being satisfied from data already in cache within the SAN for test but the equivalent I/O requests may need to actually read the data from disks for the prod server.
You probably need to get a detailed understanding of ALL of the components that are actually being used for each server. This includes the specs of the various controller cards, disk specs, whether the disks assigned to a LUN are shared with other applications, cache activity, cache allocated to each LUN, cache on each disk. The list goes on and can be quite extensive. There is quite a bit of information about this sort of this available at http://www.sq-server-performance.com that would be worth looking at.
June 1, 2008 at 2:56 pm
Thanks HappyCat! That helps ... I am going to check this out!
Leon
June 6, 2008 at 11:26 am
HappyCat,
Here is the response I get from IBM for the query on the disk cache -
Leon,
Sorry for the delay I can give you cache hit rates but to answer your questions
1. All the LUNS share a common cache pool on the FastT system.
2. I would not be able to give you any cache sizing as it changes constantly depending on the free memory and workload of the box.
From my memory the cache hit rates on ap331 seem to be on the low side during the servers normal operations but when we were doing the file transfer tests the cache hit rate was in a "good" range.
Also, they are telling me that the Cache hit percentages they are seeing are as follows -
Log drive 97%
DB drive 78%
Isn't 78% low for a cache hit ratio?
Leon
June 6, 2008 at 6:45 pm
If 78% were the cache hit ratio that SQL is reporting (e.g via Performance Monitor), I would be quite concerned. For a system that is performing well I would expect that this should be above 90% - the higher the better.
Whilst I am not an expert with managing SAN's, I would expect that 78% would be a bit low. Can you get any information from IBM about the recommended cache hit ratios for their SAN ?
I don't know what options you have with changing the hardware. As an alternative, have you reviewed the indexes etc - you may be able to get a improvement if there is a better set of indexes that can be used. The "index tuning wizard" may be useful for this.
June 7, 2008 at 9:42 pm
I believe there are a few potential improvements to be achieved by optimizing index definitions ..... but the big question troubling everyone here is when the databases are the same, SQL versions are the same, windows versions are the same, application versions are the same ..... why is that Production is not performing as better as Test?
Leon
June 8, 2008 at 2:41 am
All I can suggest is that the differences that you have identified between the 2 servers (the different number of LUNs) should be investigated further as this seems to be the only clearly identified difference.
There is the potential for other activities (other users etc) using shared infrastructure to be implicated. I assume that you have been able to eliminate this.
Also, is all of the firmware etc the same version on each of the servers etc. It is not unusual for different firmware versions to perform differently.
June 8, 2008 at 8:18 am
Good point ... I'll have that checked
Leon
June 8, 2008 at 10:57 pm
Use PerfMon to measure the Disk response time during your tests on both Test and Production. If production is significantly slower, take the numbers to IBM to confirm that they are not providing an adequate Disk solution for your production environment.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply