June 20, 2011 at 8:10 am
We recently upgrade our SQL server to new hardware, and our Storage to a new SAN
Server: HP 580 with 4 hex-core CPU's and 128 GB RAM (112 GB Available).
SAN: HP LeftHand SAN 3-node cluster
The Server and SAN communicate via iSCSI.
There are 2 databases hosted on the server.
DB1: 650GB
DB2: 700GB
Each database has multiple data files in multiple filegroups on multple SAN volumes.
Both databases are using SQL Transparent Data Encryption. The issues existed BEFORE TDE was implemented.
We are seeing very long IO times when performing disk operations, such as filegrowth. In one instance it took 5 minutes to add 10 GB to a data file. This caused applicaiton timeouts and downtime until I could manually extend the file. (In actuality I extended the file, disabled autogrowth, and added a new file)
Today we are seeing disk opperation alerts ( 12 ms Reads, 13 ms Writes, etc.) According to my NetOps team, there is nothing wrong with the storage, server, or network configuration.
Admitedly, there is high memory utilization on the server right now. > 95% OS memory. > 90% SQL Memory. A reboot is in order.
I need to tap into the configuration expertise in the group. Where in SQL Server, Windows, Network, or Storage configuration can I look for potential issues?
Thanks,
Greg
Greg Roberts
June 20, 2011 at 9:05 am
what upgrade strategy you use, side-by-side or in-place?
June 20, 2011 at 2:56 pm
side by side.
we installed new hardware with a clean os and SQL server install. Restored a previous backup to the server and ran load tests. Everything appeared OK. After the Load tests we stopped SQL Server on the original server and moved the database files to the new hardware. Same volume and disk configuration as we conducted load tests on.
Original Server: HP 380 Performance
2x 4 Core Xeon CPU
64 GB RAM
2x 10/100/1000 NIC (Teamed @ 1GB each)
New Server: HP 580 G6 Performance
4x 6 Core Xeon CPU
128 GB RAM (112 usable)
6 x 10/100/1000 NICs
- 2x @ 1 GB public netowkr
- 4x @ 1 GB storage network
Greg Roberts
June 20, 2011 at 3:28 pm
what has changed on the new SAN....are you running the same number of spindles.....more hopefully?
what about cache on the SAN, is set to favor READ or WRITE? and that number diff then before
and did the CACHE on the new SAN increase or decrease
and what about disk speeds increase or decrease
and last but not least did you run SQLIO on the new SAN before you turned it on? this should have told you what the SAN was capable off.
June 20, 2011 at 3:49 pm
456789psw (6/20/2011)
what has changed on the new SAN....are you running the same number of spindles.....more hopefully?
Went from 2-node to 3-node cluster
what about cache on the SAN, is set to favor READ or WRITE? and that number diff then before
don't know
and did the CACHE on the new SAN increase or decrease
increase
and what about disk speeds increase or decrease
increase
and last but not least did you run SQLIO on the new SAN before you turned it on? this should have told you what the SAN was capable off.
no. did not have time to run SQLIO
Greg Roberts
June 20, 2011 at 5:47 pm
Greg,
sorry I missed the 12 and 13 ms part. net team is right should have any issues there.
what was the auto growth set to grow at? if it was a low number and you asked for a big growth then that means it had to stop each time before it grew again..makes things very slow!!!
are you having any other issues that are slow? if so, Have you check your wait types best place to start.
as far as memory, not sure what you mean both are using in the 90 percent tile? I would check your SQL Server: Buffer Manager: Page Life Expectancy make sure you are at least above 300....which is 5 min
June 20, 2011 at 11:55 pm
What OS are you on ? ( + sp/hf please )
What's the version info of your sqlinstance ?
Can you elaborate on the topology of your lefthand solution ?
Are the luns of your sqlinstance placed on disks that also host other luns ?
Was there a change in raid level definitions of your luns ?
By enabling DTE on a system that already suffered IO, chances are you've made it worse.
Did you enable instant file initialisation ( windows level grant ! )
You jumped into a dark pit by not running SQLIO.
That thing is produced to predict and prevent IO problems before it hurts you !
IMO Lefthand is more advisable for file system solutions.
Top performance db systems shouldn't be placed on it, but it is a sales argument !
Of course it can host db systems, but how can you control it and make it performing well enough.
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help[/url]
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
June 21, 2011 at 12:16 am
First of all check the SAN Cache and then RAID Groups,If All disk on the same RAID Group that might be a problem
Page-Life-Expect..
as asked by the @Old Hand
For Disk Delays
===============
select DB_NAME(database_id) DB_NAME, di.file_id,df.name,io_stall_read_ms ,num_of_reads
,cast(io_stall_read_ms/(1.0+num_of_reads) as numeric(10,1)) as 'avg_read_stall_ms'
,io_stall_write_ms,num_of_writes
,cast(io_stall_write_ms/(1.0+num_of_writes) as numeric(10,1)) as 'avg_write_stall_ms'
,io_stall_read_ms + io_stall_write_ms as io_stalls
,num_of_reads + num_of_writes as total_io
,cast((io_stall_read_ms+io_stall_write_ms)/(1.0+num_of_reads + num_of_writes) as numeric(10,1)) as 'avg_io_stall_ms'
from sys.dm_io_virtual_file_stats(null,null) di inner join sys.database_files df on df.file_id = di.file_id
where DB_NAME(database_id) = 'your database name'
order by avg_io_stall_ms desc
IO pendings
===========
select
database_id,
file_id,
io_stall,
io_pending_ms_ticks,
scheduler_address
from sys.dm_io_virtual_file_stats(NULL, NULL)t1,
sys.dm_io_pending_io_requests as t2
where t1.file_handle = t2.io_handle
For Memory Presure
==================
select * from sys.dm_os_performance_counters
where counter_name like 'page life%'
For Queries Utilizations
========================
SELECT ST.TEXT,SP.SPID,WAITTIME,LASTWAITTYPE,CPU,PHYSICAL_IO,STATUS,HOSTNAME,PROGRAM_NAME,CMD,LOGINAME FROM SYS.SYSPROCESSES SP
CROSS APPLY SYS.DM_EXEC_SQL_TEXT(SP.SQL_HANDLE) ST
WHERE STATUS !='SLEEPING'
ORDER BY CPU DESC
For CPU Usage
=============
select
scheduler_id,
current_tasks_count,
runnable_tasks_count
from
sys.dm_os_schedulers
where
scheduler_id < 255
For System Info
=============
Select * from sys.dm_os_sys_info
provide these queries result
Regards,
Syed Jahanzaib Bin Hassan
BSCS | MCTS | MCITP | OCA | OCP | OCE | SCJP | IBMCDBA
My Blog
www.aureus-salah.com
June 21, 2011 at 5:15 am
The Server and SAN communicate via iSCSI.
Greg,
before change of servers, you communicate using iSCSI?
June 21, 2011 at 5:52 am
please supply more details of the iSCSI setup, hardware or software?
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
June 21, 2011 at 12:22 pm
I agree about not running SQLIO. Unfortunately the SAN did not arrive until the day before deployment, and the business refused to let us delay to properly configure and test.
Greg Roberts
June 21, 2011 at 12:24 pm
Greg, is it just the file growth that is giving you issue?
June 21, 2011 at 12:29 pm
The SAN is dedicated to this SQL Server. There are no other servers accessing LUNs on the SAN.
As far as SAN config is concerned, I'm a bit in the dark there. I know that it is a 3-node cluster. I don't recall how the array(s) are configured that are hosting the LUNs. My SAN Admin is on vacation,so I can't find out right now.
Another issue I am seeing, but not sure it really is an issue, is the Average disk milliseconds per Read/Write/ or Transfer is hovering in the 10 - 20 range on all SAN volumes.
Regarding the previous server, yes it was running iSCSI. Not my favorite, but not budget for a good fibre channel.
I'll dig into the WAIT stats a bit more and post what I find. Thank you all for your assitance.
Greg Roberts
June 21, 2011 at 12:41 pm
456789psw (6/21/2011)
Greg, is it just the file growth that is giving you issue?
That was the big smashed thumb. We have also noticed a degradation in our backup times, even before TDE. We saw thruput drop from an average of 115MB/sec to 100MB/sec. But that only equated to 15 extra minutes to the backup job. I'm pretty convinced there is a SAN config issue, but our storage admin instists it is working correctly. And when he runs LAN Sped Test from the DB server to the SAN volumes, performance is good.
Regarding the growth settings, the files are set to grow at 5GB. This reduces the number of growth events to approximately 1 / week.
Greg Roberts
June 21, 2011 at 1:06 pm
I would agree pointing more and more to the SAN...he may be right it is working correctly...but it may not be working optimally.( it could be set to run faster or maybe there is less Drives and more RPM resulting in less through put)
shoot us out the Wait stats...maybe that will help
Viewing 15 posts - 1 through 15 (of 35 total)
You must be logged in to reply to this topic. Login to reply