November 13, 2012 at 4:41 pm
I'm running out of brain cells trying to figure out what I've overlooked. Hoping someone has a quick checklist of things I can look for...
Scenario: I have 3 SQL 2008 servers, 2 lab/testing, 1 production, with almost identical hardware (the lab servers run 32 GB of RAM while the production server is at 24 GB). Disk configuration involves 4 mirrors (OS, Transactional Data, Warehouse Data, Log). I deployed a process which is executed by 4 jobs. On my lab server, I was able to optimize the performance to process one event record every 55-62 ms. When I pushed that process to the production server, the same syntax processes one record every 550-650 ms. The production server shows no waits or recognizable performance problems in Performance Monitor.
The only thing that points me in any direction is that on the lab server, the Database IO in Activity Monitor runs at 1.5 MB/sec with routine spikes above that, while in the production environment the IO stubbornly runs at 0.3 MB/sec with occasional and dramatic spikes upward to the 4-5 MB/sec range. We tested raw disk performance and it runs at 100 MB/sec easily. I pulled out SQLIO and it shows >6.0 MB/sec in production and only 5.x MB/sec on the lab server. SQLIOSim, however, shows a significant difference: running the standard IO configuration file, the lab server completes the entire run in less than 1/3 the time of the production server.
It just feels like there is some sort of governor capping the IO performance, but I can't think of any particular setting that would do that... especially one that would have been turned off in one place and not the others. Just looking for tips on what else I can look at to uncover why one machine seems to scream while the others are just putt-putting along.
Thanks,
Rob
November 13, 2012 at 9:23 pm
Rob,
I know the end-of-the-rope feeling, so I thought I would at least try to help out. I'm assuming at this point you have double- or triple-checked that the same code is in place; have you also checked the database config and file config of the db in question? Have you checked the query plan in both environments to ensure they are identical? In addition, I would also check the server config for RAM, MAXDOP, etc. Confirm there is still a difference after running DBCC FREEPROCCACHE as well; it could be an old query plan in there.
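For the config comparison, one quick sketch is to dump the server-level settings on each box and diff the output side by side:

```sql
-- Dump all server-level settings so the two machines can be diffed
SELECT name, value_in_use
FROM sys.configurations
ORDER BY name;

-- Flush cached plans to rule out a stale plan
-- (use with care on a busy production server)
DBCC FREEPROCCACHE;
```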
Are the hard drives you're running on different (SSD vs. 15k vs. SATA)? Are they on the same SAN, a different SAN, or local? Even if the hard drives are the same, in the same mirrored configurations, it may be that they are on different SANs with different caching.
Any other information would be helpful if you can provide it.
November 13, 2012 at 10:20 pm
Compare prod vs. test perfmon counters for logical disk reads/sec, writes/sec, avg sec/write, avg sec/read, and current disk queue length. Any discrepancies?
November 14, 2012 at 1:55 am
Someone already asked about drive differences so maybe I can ask something else:
Is the data in both environments the same, or have you rebuilt indexes, refreshed everything, and restarted SQL Server to get a "clean" test in both? Or does production have more data and is not as "clean" as the lab servers? (Though on the production box you may not have that opportunity.)
I think you already checked a SELECT from sys.configurations; if you haven't, you could compare there too, but I'm guessing that's not the place.
Shot in the dark: could your recovery model be different in prod and the test lab?
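If it helps, the recovery model of every database can be checked with a one-liner on each server and compared:

```sql
-- Compare recovery models between prod and lab
SELECT name, recovery_model_desc
FROM sys.databases
ORDER BY name;
```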
--------------------------------------------------
...0.05 points per day since registration... slowly crawl up to 1 pt per day hopefully 😀
November 14, 2012 at 3:00 am
robr-793239 (11/13/2012)
I'm running out of brain cells trying to figure out what I've overlooked. Hoping someone has a quick checklist of things I can look for... Scenario: I have 3 SQL 2008 servers, 2 lab/testing, 1 production, with almost identical hardware (the lab servers run 32 GB of RAM while the production server is at 24 GB). Disk configuration involves 4 mirrors (OS, Transactional Data, Warehouse Data, Log). I deployed a process which is executed by 4 jobs. On my lab server, I was able to optimize the performance to process one event record every 55-62 ms. When I pushed that process to the production server, the same syntax processes one record every 550-650 ms. The production server shows no waits or recognizable performance problems in Performance Monitor.
The only thing that points me in any direction is that on the lab server, the Database IO in Activity Monitor runs at 1.5 MB/sec with routine spikes above that, while in the production environment the IO stubbornly runs at 0.3 MB/sec with occasional and dramatic spikes upward to the 4-5 MB/sec range. We tested raw disk performance and it runs at 100 MB/sec easily. I pulled out SQLIO and it shows >6.0 MB/sec in production and only 5.x MB/sec on the lab server. SQLIOSim, however, shows a significant difference: running the standard IO configuration file, the lab server completes the entire run in less than 1/3 the time of the production server.
It just feels like there is some sort of governor capping the IO performance, but I can't think of any particular setting that would do that... especially one that would have been turned off in one place and not the others. Just looking for tips on what else I can look at to uncover why one machine seems to scream while the others are just putt-putting along.
Thanks,
Rob
Could you supply a little more detail on how the disks are set up and presented to each system?
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
November 14, 2012 at 10:25 am
I appreciate all the responses giving me additional things to consider.
A little more color to fill in the details. The servers both run 4 mirrored SCSI volumes for the OS, PRIMARY filegroup, EVENTDATA filegroup, and log files. Fragmentation rates across all the tables are about equal. All clustered indexes on both servers were rebuilt within the last week. The production server was started up on Friday and has accumulated about 5 million event data records, while the lab server has been in operation for over 3 months and has accumulated 91 million event records.
An interesting event occurred last night in production. I logged on to the production server to do some more poking and found that the Database IO rate was up to 5-6 MB/sec. I was floored, so I looked around and realized that the database backup routine had started and it was performing a backup. This morning when I checked in again it was back to its 0.3 MB/sec throughput.
So, the only glimmer of insight I've had about the differences is that we relatively recently added a 4th volume to hold TempDB. When we did it, I was attempting to address this same performance issue on the lab server, but it seemed that the change did not improve the overall throughput of the event processing. However, in our production system, TempDB is on the same volume as the PRIMARY filegroup. My current tack will be to try to separate those two and see what happens.
Again, I appreciate the feedback and opportunity to see the problem through other eyes.
Rob
November 14, 2012 at 10:50 am
What version of Windows are these running on? Or, more importantly, are you sure the disk partitions are aligned (in both environments)? Nothing leads me to suspect this is your issue, but it is a good thing to be sure of.
http://support.microsoft.com/default.aspx?scid=kb;en-us;929491&sd=rss&spid=3198
-Ken
November 14, 2012 at 11:18 am
Rob,
Do you have SQL Sentry or SQL Monitor in place? Your post regarding backups jogged my memory of a similar situation in one of my production environments where processing a particular cube was causing havoc with I/O and CPU for my other databases. Depending on what kind of event processing is taking place, TempDB residing with PRIMARY could definitely be the culprit, as could file growth on your production system if it is set to autogrow, etc.
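For the file-growth angle, a quick sketch of a query to check current sizes and autogrow settings across all databases (sys.master_files is available on 2008):

```sql
-- Current size and growth setting for every database file
SELECT DB_NAME(database_id) AS database_name,
       name                 AS logical_name,
       type_desc,
       size * 8 / 1024      AS size_mb,   -- size is stored in 8 KB pages
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + ' %'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END                  AS growth_setting
FROM sys.master_files
ORDER BY database_name, type_desc;
```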
Teal Canady
November 14, 2012 at 3:03 pm
robr-793239 (11/14/2012)
However, in our production system, the TempDb is on the same volume as the PRIMARY File Group. My current tack will be to try to separate those two and see what happens.
Hmmmm, that might be it. Post back with the results after you move it.
November 14, 2012 at 6:16 pm
I'm still curious about latency (avg sec/read and avg sec/write for each logical drive).
How many IOPS does SQLIO show for 8K random reads and writes?
5 MB/sec is really slow for backups, especially if the drive can handle 100 MB/sec.
Can you post some wait stats?
What does CPU look like?
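If perfmon is awkward to capture, per-file latency can also be pulled from the virtual file stats DMV. (These numbers are cumulative since the last restart, so read them as averages, not point-in-time values.)

```sql
-- Average I/O latency per database file since the last SQL Server restart
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads  = 0 THEN 0
            ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_write_ms DESC;
```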
November 14, 2012 at 8:25 pm
OK... so here is some new information:
I moved TempDB to a separate eSATA drive on the server. My initial thought was that things seemed a little better, but 5 hours later, if anything, the system is running slower than it was. Jobs that were running in 2 1/2 hours are now taking 2 3/4 hours.
I have some more specific statistics to share.
For SQLIO results, the production server runs 881.66 IOPS at 6.88 MB/sec with 8K blocks over a 5-minute run. The lab server runs 678.28 IOPS for 5.29 MB/sec.
In Performance Monitor, I'm seeing the following:
Avg Disk Sec.   Read    Write   Transfer
Primary         0.017   0.028   0.202
EventData       0.017   0.028   0.028
Log             0.000   0.003   0.003
TempDB          0.000   0.004   0.004
I just observed an event that has given me some pause; maybe this will change the nature of the conversation. What I've been focusing on is a procedure that fires off and moves data from a raw storage table to a normalized data warehouse table. This requires taking data from remote clients, looking up key values, and doing data conversion and other actions such as detecting new data and creating keys, etc. On the lab server I can run 4 instances of this job, offset by 5 minutes each, and together they will clear 180,000 records an hour. The production server, not so much...
However, I happened to be watching the server at the top of the hour when an external agent fires off a sync command to the remote clients and they begin dumping their records into the raw storage table. In this sync process, eight remote clients at a time dump their records in batches of 100 records each. What caught my attention is that during this phase, the Database IO rate shot up to over 100 MB/sec, and the processor, which had been pegged at 50% utilization, was suddenly peaking at 100%. After the sync process completed, the IO went back down to 0.3 MB/sec and the processor dropped back to 50%.
Could there be some odd result because this process is fired from a job? Of course, I'm just thinking out loud now... working my way through the system a piece at a time.
Oh, and to address an earlier question about the structure: the production server was just deployed with fresh scripts from the latest working lab deployment we have, so I'm confident that the procedure logic is identical to what is working extremely well on the lab server.
November 15, 2012 at 12:05 am
This is probably way off base... but is there something external to the database getting in the way? For example, an over-enthusiastic virus checker.
November 15, 2012 at 8:04 am
You could hunt-and-peck on this forum for days or weeks for a problem such as this (and I have seen that happen any number of times) and never find a true cause/resolution. Or you could hire a qualified professional to come on board for a few hours or a day or two and really get things nailed down (while simultaneously mentoring you on their methodology). Seems like a no-brainer to me...
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 15, 2012 at 8:26 am
Are you using SSIS packages in your jobs? If so, make sure the server does not have any resource contention between the SSIS and SQL Server services.
I believe someone has mentioned this already, but have you checked the Max Memory and MAXDOP settings? Make sure you have Max Memory set to something below the total server RAM.
November 15, 2012 at 12:25 pm
Rob,
To summarize your system, here is what I have gathered so far:
        RAM     RAID      Quantity   Breakdown
Lab     32 GB   RAID 1-0  4          OS, Trans, DWH, Logs
Prod    24 GB   RAID 1-0  4          OS, Trans, DWH, Logs
Things we don't know:
• Is it on a SAN?
• Are processor types and cores identical?
• Maximum memory
• Minimum memory
• MAXDOP setting on each system
What I understand of your problem is this:
In the lab you can run four simultaneous processes and the event translation takes around 50-60 ms. In production there are 8 simultaneous processes which take nearly 10x the time (550 to 650 ms). There also appears to be generally high CPU utilization of 50% in production; backups are also causing contention with I/O when they kick off.
Based on that:
SAN:
I'm asking about your SAN because, if you are indeed on a SAN, your I/O stats could be lying to you. I believe SQLIO is based on WMI events and Perfmon counters, so the metrics you have may or may not be accurate.
Question:
Are these disks on a SAN or no?
RAM:
The RAM is different between production and lab; in this case your lab has more. If your production server comes under pressure from too little RAM, it can start paging out, causing I/O issues, CPU issues, etc. So, for instance, with 24 GB in place, maybe you have 20 GB as your max and 8 GB as your minimum setting. This would generally be OK provided you don't have other external programs running on this server that operate outside SQL Server's reserved memory pool. This would include items from within SQL Server such as CLRs and linked server calls.
When SQL Server comes under pressure it will try flipping out what's in memory, but only from its reserved pool; freeing any other memory relies on the OS to release it. Again, this includes linked server calls, CLRs, etc. In earlier editions of SQL Server, such as 2005, I sometimes found it necessary to lock MIN and MAX memory in place with the same values, as it couldn't release the RAM in a timely fashion; but that is neither here nor there.
Question:
What are your minimum and maximum memory settings on your production server? Are there any other external programs running on the server? Are you using CLRs extensively?
Settings can be gathered via the GUI or via sp_configure.
--Allow all options to be visible
EXEC sp_configure 'show advanced options', 1;
GO
--Apply this now, as opposed to at the next restart of the server
RECONFIGURE WITH OVERRIDE;
GO
--View all the settings
EXEC sp_configure;
GO
MaxDOP:
Depending on how many cores you have, and whether this process goes parallel, this setting matters a great deal. If you have MAXDOP set to 0 and the process goes parallel, then even a single processing event could use all the cores on the system.
Question:
How many cores are on your system and what is the Maximum Degree of parallelism that is currently set?
See sp_configure above.
--Taken from http://msdn.microsoft.com/en-us/magazine/cc135978.aspx
SELECT TOP 10
[Wait type] = wait_type,
[Wait time (s)] = wait_time_ms / 1000,
[% waiting] = CONVERT(DECIMAL(12,2), wait_time_ms * 100.0
/ SUM(wait_time_ms) OVER())
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE '%SLEEP%'
ORDER BY wait_time_ms DESC;
Clustered Indexes:
If you're missing clustered indexes / primary keys, it can cause your CPU to go nuts as SQL Server tries to satisfy a query across a heap. It's worth taking a look at the DBs in question, as well as any others on the system, to see if this is the case.
--Missing Primary Keys:
SELECT
    t.TABLE_CATALOG,
    t.TABLE_SCHEMA,
    t.TABLE_NAME,
    t.TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES t
LEFT JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS pk
    ON  t.TABLE_CATALOG = pk.TABLE_CATALOG
    AND t.TABLE_NAME    = pk.TABLE_NAME
    AND t.TABLE_SCHEMA  = pk.TABLE_SCHEMA
    AND pk.CONSTRAINT_TYPE = 'PRIMARY KEY'
WHERE pk.TABLE_NAME IS NULL
  AND t.TABLE_TYPE = 'BASE TABLE'
ORDER BY
    t.TABLE_CATALOG,
    t.TABLE_SCHEMA,
    t.TABLE_NAME;
--Missing Clustered Indexes
SELECT name
FROM sys.objects
WHERE type = 'U'
AND object_id NOT IN
(SELECT object_id FROM sys.indexes WHERE index_id = 1)
*Apologies I forget where I got this from; if someone knows the author or URL I’ll add it
Missing indexes:
This can also be a point of contention, as you're satisfying queries with scans rather than seeks. It might be worth looking into whether you can optimize the tables and queries causing high CPU.
--Count of missing indexes per database
SELECT
    DatabaseName = DB_NAME(database_id),
    [Number Indexes Missing] = COUNT(*)
FROM sys.dm_db_missing_index_details
GROUP BY DB_NAME(database_id)
ORDER BY 2 DESC;

--Top ten missing indexes based on usage since last restart
SELECT TOP 10
    [Total Cost] = ROUND(avg_total_user_cost * avg_user_impact * (user_seeks + user_scans), 0),
    avg_user_impact,
    TableName         = statement,
    [EqualityUsage]   = equality_columns,
    [InequalityUsage] = inequality_columns,
    [Include Columns] = included_columns
FROM sys.dm_db_missing_index_groups g
INNER JOIN sys.dm_db_missing_index_group_stats s
    ON s.group_handle = g.index_group_handle
INNER JOIN sys.dm_db_missing_index_details d
    ON d.index_handle = g.index_handle
ORDER BY [Total Cost] DESC;
*Apologies I forget where I got this from; if someone knows the author or URL I’ll add it
If all this doesn't point you in the right direction or give you a better idea of what could be going on, you have a couple of other options that are more time-consuming, which is why I left them for the end.
Install the trial version of SQL Sentry or SQL Monitor; monitor your lab environment and production and see what is going on. Each of these products provides metrics and a handy GUI to help you better analyze your environment.
Finally, and least appealing to me (because it's the most time-consuming), would be to install SQL 2012 somewhere and hook up the Distributed Replay feature; this would allow you to capture and then replay what is going on in your production environment directly in your lab. I hope this helps.
Teal Canady
Sr. DBA
Lieberman Research Worldwide
*Please excuse any spelling or grammatical mistakes; I'm just a DBA: )