October 28, 2018 at 3:22 pm
Evening Guys,
I am back with another performance enigma (at least for me!).
I have a couple of queries (altering the collation of a column and creating a primary key on an int).
The biggest table is about 500 million rows, a mixture of nvarchar, ints and datetimes. No BLOB this time around.
So we have super fast SSDs that are apparently capable of 10x or more the performance we are seeing. At *most* we're hitting the 10MB/second mark. I am told by multiple people that the disk is not constrained in any way. Disk queue length is pretty much 1 all the time.
So the obvious culprit next is the CPU. But that's not getting above 7% (total), apart from the odd spike to 35%. There are 10 CPUs - virtual Xeon 3.2.
There is no network involved.
Memory is 128GB installed, but SQL is throttled to use 123GB.
I am at a loss as to where the bottleneck is. I would expect the disks either to be thrashing more or to be performing faster than 10MB/second, and I would expect the CPU to be taxed.
Can anyone help, or explain, or point me to other counters to help identify the bottleneck?
Cheers all
Alex
October 29, 2018 at 2:58 am
Assuming you have a VM, look at your host CPU sharing stats and disk IO quota.
Other things that affect performance are power saving settings. For a SQL Server machine, the BIOS, the VM host (where used) and the Windows OS should all have this set to max power.
For the disks themselves, it's going to depend on the HBA used and the disk block size.
But regardless of that, you didn't give any "bad performance" indications. A server doing slow IO does not necessarily mean the machine isn't set up correctly. It may just mean that the processes on that server are not bound by IO or CPU.
Apart from other tests you can do at the IO level, a quick test to see how the disks behave is to do backups/restores of big databases (with data in the data files, not empty ones) and measure the disk activity during the backup/restore operation.
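A minimal sketch of that kind of test (the database name is a placeholder): backing up to the NUL device discards the output but still reads every allocated page, so you can watch the read throughput while it runs.

```sql
-- Throughput test: reads all allocated pages, writes nothing to disk.
-- COPY_ONLY so the differential backup base is left untouched.
-- Only do this on a test box - it's still a full scan of the data files.
BACKUP DATABASE [YourBigDatabase]   -- placeholder name
TO DISK = 'NUL'
WITH COPY_ONLY, STATS = 10;         -- progress message every 10%
```

The STATS messages report MB/sec at the end, which gives you a rough ceiling for sequential read performance on that volume.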
October 29, 2018 at 4:16 am
Hi Frederico,
Thanks for your reply.
That's exactly my problem - if it's not bound by IO or CPU, what could it be bound by? I thought performance boiled down to a bottleneck at the CPU or disk level.
So now I am trying to find out what knowledge I am missing 🙂 as well as whether it's possible to speed this up 🙂
Cheers
Alex
October 29, 2018 at 4:32 am
Alex
Have you analysed your wait stats? That'll help you understand whether it's CPU or disk - or it could be something else, such as blocking.
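If you don't have a script to hand, something along these lines (a rough sketch, filtering out a few benign background wait types) shows the cumulative picture since the instance last started:

```sql
-- Top waits since instance start, ignoring some common benign types
SELECT TOP (10)
       wait_type,
       wait_time_ms / 1000.0        AS wait_s,
       signal_wait_time_ms / 1000.0 AS signal_s,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                        'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
                        'BROKER_TO_FLUSH', 'XE_TIMER_EVENT')
ORDER BY wait_time_ms DESC;
```

High signal_s relative to wait_s points at CPU pressure; high wait_s on PAGEIOLATCH_* points at disk.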
Incidentally, I would recommend allowing the OS more than 5GB on a server with that much memory. Something more like 12GB would probably be more appropriate, or more still if you have other applications or other SQL Server components (such as SSRS) on the same server.
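For what it's worth, max server memory is an sp_configure setting, so adjusting it is a one-liner (120000MB here is just an illustrative figure, leaving roughly 8GB for the OS):

```sql
-- Illustrative: cap SQL Server at ~117GB on a 128GB box
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 120000;
RECONFIGURE;
```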
John
October 29, 2018 at 5:51 am
Hi John,
To me the waits seem non-eventful. Bear in mind this is a brand new instance with nothing else running on there - just me and the scripts above. This is what the top waits look like (using Brent O's script):
WaitType | Wait_S | Resource_S | Signal_S | WaitCount | Percentage |
PAGEIOLATCH_SH | 236745.59 | 235530.64 | 1214.95 | 17157323 | 48.52 |
LCK_M_S | 62642.25 | 62642.25 | 0 | 46 | 12.84 |
BACKUPTHREAD | 46064.03 | 46064.03 | 0 | 47 | 9.44 |
BACKUPIO | 45181.81 | 45119.36 | 62.45 | 7315869 | 9.26 |
CXPACKET | 36576.44 | 36383.51 | 192.93 | 6690586 | 7.5 |
PAGEIOLATCH_EX | 30649.77 | 30556.03 | 93.74 | 2104634 | 6.28 |
PREEMPTIVE_OS_WRITEFILEGATHER | 12882.15 | 12882.15 | 0 | 1385 | 2.64 |
Can I ask what the thinking is behind allowing the OS more memory just because the server has more memory installed?
Cheers!
October 29, 2018 at 5:54 am
Oh, I should add that there is BLOB data this time around - I somehow missed it when I was originally looking at it. So it has 500+ million rows of data, which are a mix of varchars, ints, and 1-byte to 60KB BLOBs.
October 29, 2018 at 6:09 am
From those numbers, it looks as if most of your waits are to do with disk latency, with a little bit of blocking (possibly caused by the disk latency) thrown in for good measure. But that's for the whole time since your instance was last started. Even if there is "nothing" else going on, I would still recommend measuring immediately before and after you do the work, and subtracting. Other things to consider are whether your data and log files are physically separated (if you're not on a VM) and your block size. What are your max degree of parallelism and your cost threshold for parallelism set to?
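The before-and-after measurement can be done by snapshotting the DMV into a temp table and diffing afterwards - a rough sketch:

```sql
-- 1. Snapshot before the work
SELECT wait_type, wait_time_ms, waiting_tasks_count
INTO   #waits_before
FROM   sys.dm_os_wait_stats;

-- 2. ... run the collation change / index build here ...

-- 3. Diff: only the waits accumulated during the work
SELECT a.wait_type,
       (a.wait_time_ms - b.wait_time_ms) / 1000.0     AS wait_s,
       a.waiting_tasks_count - b.waiting_tasks_count  AS waits
FROM   sys.dm_os_wait_stats AS a
JOIN   #waits_before        AS b ON b.wait_type = a.wait_type
WHERE  a.wait_time_ms > b.wait_time_ms
ORDER BY wait_s DESC;
```

That isolates the work you care about from everything that has happened since the last service restart.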
Here's a reference on the max server memory thing.
John
October 29, 2018 at 6:18 am
Hi John
MAXDOP is set to unrestricted (0), while the cost threshold for parallelism is the default of 5.
The logs are not separated but are on a VM on a SAN.
Blocksize = 4096
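For anyone following along, both settings can be read (or changed) with sp_configure:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism';       -- 0 = unrestricted
EXEC sp_configure 'cost threshold for parallelism';  -- default is 5
```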
October 29, 2018 at 6:27 am
Alex
Might be worth changing the block size to 64KB from its default of 4KB. Also, speak to your SAN administrator about whether putting log files on a separate logical drive could improve performance. I suspect (s)he'll say no, but it's worth asking anyway.
John
October 29, 2018 at 6:48 am
With the understanding that it resets when the SQL Server service bounces and that things can be driven out of cache, I'd bet that it's code problems that need to be fixed, and those can be found fairly easily. Open the Object Explorer window in SSMS, right-click on the instance, select Reports, and then look for the performance reports. Fix the top ten and a lot of your performance will show a marked improvement.
I know I'm going against the normal grain, but wait stats don't tell you as much, IMHO.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 30, 2018 at 4:26 am
Jeff, this is an entirely unused server, and it is purely being used for testing the collation change.
What is performing really slow is putting a primary key back on an int field.
October 30, 2018 at 4:27 am
Cheers John for the block size link.
October 30, 2018 at 6:10 am
alex.sqldba - Tuesday, October 30, 2018 4:26 AM
Jeff, this is an entirely unused server, and it is purely being used for testing the collation change. What is performing really slow is putting a primary key back on an int field.
Ah... now I understand...
If you're talking about a Clustered Primary Key then, yes, it can take quite a while. There are ways to speed it up.
If you're using the FULL Recovery Model, then all of it will be logged, and that could also mean substantial log file growth. Since VLFs must be formatted in the log file, that can take a substantial amount of time even in the presence of SSDs.
Don't forget that when you rebuild a Clustered Index on any table over 128 extents (that's only 8MB), the old copy of the index will persist until the new copy is completely built and committed. Only then will the old copy be dropped. That also means that if you have a large table and your MDF file doesn't have the space for another complete copy of the index, the MDF file must grow - and if you don't have instant file initialization turned on, the file must be zeroed out before use, which can also take a substantial amount of time even in the presence of SSDs.
So, the two ways to speed things up a bit (actually, a whole lot) are to:
1. Do the Primary Key build while in the Bulk Logged Recovery Model.
2. Make sure that Instant File Initialization is enabled, with the understanding of the possible security risk: old data isn't actually overwritten when the file is created or expanded.
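Roughly, the sequence looks like this (database, table and column names are placeholders; take a log backup after switching back to FULL, since BULK_LOGGED breaks point-in-time restore for that interval):

```sql
-- Placeholder names throughout
ALTER DATABASE [BigDb] SET RECOVERY BULK_LOGGED;

-- Index creation is a minimally logged operation under BULK_LOGGED
ALTER TABLE dbo.BigTable
    ADD CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (Id);

ALTER DATABASE [BigDb] SET RECOVERY FULL;
-- Take a log backup here to re-establish the point-in-time chain

-- On SQL Server 2016 SP1 and later, check whether IFI is enabled:
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;
```

On older versions, IFI is granted by giving the service account the "Perform volume maintenance tasks" privilege in Local Security Policy.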
--Jeff Moden
Change is inevitable... Change for the better is not.
October 30, 2018 at 6:13 am
p.s. Don't forget about the possible serious problems that can occur with changes to collation in regards to TempDB. I've not personally witnessed such a thing because I've not done such a collation change but there are a lot of people on the web that have identified it as a serious problem.
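A quick way to spot that kind of mismatch is to compare collations directly (the user database name is a placeholder):

```sql
-- Compare the user database's collation with tempdb's
SELECT name, collation_name
FROM sys.databases
WHERE name IN ('tempdb', 'YourDatabase');  -- placeholder name
```

If they differ, joins between temp tables and user tables on string columns can fail with collation-conflict errors unless the columns specify COLLATE explicitly.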
--Jeff Moden
Change is inevitable... Change for the better is not.
October 30, 2018 at 6:23 am
p.p.s. And, yes... I agree with the others. If you're only seeing a 10MB/second transfer rate, something is seriously wrong. We saw such a thing during one of our migrations to another box, but slightly different: we'd see huge transfer rates at the beginning of a file copy, and then it would slow to a crawl after a bit. I don't recall the particulars, but it turned out to be some sort of pipeline problem where it appeared to run like a banshee until a cache was full and then slowed down to about 10MB/second.
--Jeff Moden
Change is inevitable... Change for the better is not.