August 20, 2009 at 2:51 am
Hi, I'm really stumped on this and can't find anything to try and resolve it. Any hints of clues as to what to look for would be really appreciated.
I installed 64bit SQL Server 2005 on a single physical CPU Nehalem Xeon x5570 with HT (so 8 virtual cores are shown to the OS) on a Windows Server 2008 Standard edition Dell box.
I noticed the queries were not running as fast as I expected (around the same speed as a previous generation Xeon box) and, after trying everything else I could think of, I messed with the affinity mask settings. My results found that for CPU bound queries (that do not seem to run in parallel anyway) if the CPU bound part schedules on virtual core 0 they complete about twice as fast as if they schedule on any other core. Actually this is true no matter what the affinity mask is set to. But the only way to ensure they schedule on core 0 is to set the processor affinity mask for that core only.
Obviously this doesn't scale well, but I don't really have much experience of this CPU / system setup, so if anyone knows what I'm doing wrong, or how to address this I'd really appreciate it, as most of the queries for the main system running on this are CPU bound.
Thanks.
UPDATE: this is still the case with Hyper Threading turned off and even with some of the physical cores turned off. If the query schedules on core zero it runs twice as fast as any other core.
Additional info, there's nothing more fancy than some large tables, in-line table valued functions, scalar functions and some other views in the database (ie no user external stored procedures, c# or any other non-SQL stuff).
August 28, 2009 at 5:26 am
Added SQL Server 2005 SP3 but still the same issue. Anyone got any ideas?
August 28, 2009 at 6:57 am
No idea offhand. This might be something worth contacting customer support about. I could understand if the Hyperthreading was enabled, but not when it's disabled.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
August 28, 2009 at 7:09 am
I'm starting to wonder if it's something to do with the non uniform memory architecture of the Xeon x5570. There's 6GB of ram on this box. I don't know how it's physically installed yet (I'll open the box this weekend). Do you know if there are any known issues with this and SQL Server?
August 28, 2009 at 7:26 am
SQL from 2005 onwards handles the NUMA architecture just fine. However, with 4 cores (how many physical processors?) and 6 GB memory, I don't see how the memory can be equal for each NUMA node, unless it's 2 physical processors with 3GB memory each.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
August 28, 2009 at 7:34 am
One physical CPU (quad core). Is there going to be any solution with 6GB, (ie if I find one physical memory module) Or does it sould like I am going to need to change the memory setup?
August 28, 2009 at 7:57 am
I'm not familiar with the x5570 and I'm not as familiar with NUMA as I would like to be. As far as I know, a NUMA node would consist of one physical cpu and a certain amount of memory. Could be mistaken there. I suggest you investigate Intel's technical specs of the x5570 for details on how it behaves and how the NUMA nodes are set.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
September 3, 2009 at 3:38 am
Update.
Memory is 3x2GB Dual Rank UDIMMs 1333Mhz
September 3, 2009 at 3:41 pm
NUMA:
The OS defaults to allocating memory from the current NUMA node that the thread is executing. If the thread changes nodes or runs out of available memory on the current node, it will then allocate memory from a different node.
I know SQL2008 is NUMA aware and tries to take advantage of reduced latency/extra bandwidth of NUMA, but there shouldn't be a huge diff between which node you're on.
The QPI connect between the nodes is 25.6 GB/s and very low latency. It is still additional latency, so you will get reduced performance and possible contention for the other node trying to access it's own data, but it should still be peppy
September 4, 2009 at 8:06 am
1) check the sql log just after startup for NUMA messages. If you don't see them then it isn't doing it I think, at least soft numa. I doubt that a computer with a single CPU would be configured for hardware numa.
2) NUMA can be a collection of CPUs per node, not just one.
3) IIRC it is (or used to be) best to have match pairs of memory modules. You have an odd number. not sure if this is relevant in your architecture or not.
4) I would be on the phone to both hardware manufacturer and Microsoft on this one.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
September 5, 2009 at 9:36 am
Hi Aidan,
If you could run the following and upload the results it would help enormously.
Thanks
Paul
-- scheduler info (inc. NUMA)
select parent_node_id,
scheduler_id,
cpu_id,
[status],
is_online,
is_idle,
preemptive_switches_count,
context_switches_count,
idle_switches_count,
current_tasks_count,
runnable_tasks_count,
current_workers_count,
active_workers_count,
work_queue_count,
pending_disk_io_count,
load_factor,
yield_count,
last_timer_activity,
failed_to_create_worker
-- quantum_length_us
from sys.dm_os_schedulers
where scheduler_id < 256
order by
scheduler_id;
-- useful system info - as seen by SQL Server
select cpu_count,
hyperthread_ratio,
physical_memory_in_bytes,
virtual_memory_in_bytes,
bpool_committed,
bpool_commit_target,
bpool_visible,
stack_size_in_bytes,
os_quantum,
os_priority_class,
max_workers_count,
scheduler_count
from sys.dm_os_sys_info;
-- selected configuration options
select name,
value,
value_in_use,
is_dynamic
from sys.configurations
where name in
(
'max worker threads',
'priority boost',
'affinity mask',
'min server memory (MB)',
'max server memory (MB)',
'lightweight pooling',
'affinity I/O mask'
)
order by
name asc;
-- wait stats (cpu)
select top (10)
wait_type,
waiting_tasks_count,
wait_time_ms,
max_wait_time_ms,
signal_wait_time_ms
from sys.dm_os_wait_stats
order by
signal_wait_time_ms desc;
September 5, 2009 at 10:03 am
Hi Paul
Because of the issues, at the moment the server has cores 2 and 3 disabled in the bios with cores 0 and 1 enabled and the affinity masks set with core 0 for processor and core 1 for I/O. So I'm not sure how useful these results will be. When I get a chance I can re-boot the machine and reset the bios and affinity masks to default (or another setting if you would prefer). But as it currently is here are the results:
parent_node_id scheduler_id cpu_id status is_online is_idle preemptive_switches_count context_switches_count idle_switches_count current_tasks_count runnable_tasks_count current_workers_count active_workers_count work_queue_count pending_disk_io_count load_factor yield_count last_timer_activity failed_to_create_worker
-------------- ------------ ------ ------------------------------------------------------------ --------- ------- ------------------------- ---------------------- ------------------- ------------------- -------------------- --------------------- -------------------- -------------------- --------------------- ----------- ----------- -------------------- -----------------------
0 0 0 VISIBLE ONLINE 1 0 271357 2537913 3041200 12 0 22 11 0 0 15 8259666 708727658 0
0 1 1 VISIBLE OFFLINE 0 1 2 5 792 3 0 4 1 0 0 1 797 708408901 0
64 255 255 VISIBLE ONLINE (DAC) 1 1 4 9 794 2 0 3 1 0 0 1 801 708410336 0
(3 row(s) affected)
cpu_count hyperthread_ratio physical_memory_in_bytes virtual_memory_in_bytes bpool_committed bpool_commit_target bpool_visible stack_size_in_bytes os_quantum os_priority_class max_workers_count scheduler_count
----------- ----------------- ------------------------ ----------------------- --------------- ------------------- ------------- ------------------- -------------------- ----------------- ----------------- ---------------
2 2 6431227904 8796092891136 165568 585924 585924 2093056 40000 128 512 1
(1 row(s) affected)
name value value_in_use is_dynamic
----------------------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------
affinity I/O mask 2 2 0
affinity mask 1 1 1
lightweight pooling 0 0 0
max server memory (MB) 2147483647 2147483647 1
max worker threads 0 0 0
min server memory (MB) 0 0 1
priority boost 1 1 0
(7 row(s) affected)
wait_type waiting_tasks_count wait_time_ms max_wait_time_ms signal_wait_time_ms
------------------------------------------------------------ -------------------- -------------------- -------------------- --------------------
SLEEP_BPOOL_FLUSH 51251 38657 31 11762
WRITELOG 87202 662395 936 5943
ASYNC_NETWORK_IO 929545 1021338 1996 4773
PAGEIOLATCH_EX 96972 74942 546 4290
SLEEP_TASK 432690 1030449 93 4196
LAZYWRITER_SLEEP 726376 1414146146 85544474 1138
PAGEIOLATCH_SH 17446 47127 374 795
PAGELATCH_UP 591 9781 530 234
LATCH_EX 61 1466 109 140
BACKUPBUFFER 8341 21403 327 109
(10 row(s) affected)
September 5, 2009 at 10:19 am
I'll have a look at the results you posted, but the site formatting makes it kinda hard 🙂
It may pay to save the results to a CSV file (right-click and save as) and attach that instead?
It would certainly be better to have the results when everything is reset to default.
One last question: which exact OS are you running?
Thanks anyway!
Paul
edit: It is rarely a good idea to enable the 'priority boost' option - when you reset to default, please don't forget this one.
It would also be cool to get the results of my sys.dm_os_schedulers query when your query is executing on core zero, and then with it running on a problematic core. If you are able to get a third reading when allowing it to run in parallel, that would be extra-awesome!
September 5, 2009 at 11:40 am
I second the priority boost comment. I also note max memory is defaulted - not best practice, but likely not a culprit here.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
September 7, 2009 at 12:58 am
Sorry about the previous text format. I couldn't get IE 8 to upload anything (I'm on another machine using IE 6 and it seems to work from here).
The machine is Windows Server 2008 Standard Edition SP 2
Dell Server R610 X5570 2.93Ghz.
Attached are the files. For the file with the results from Core 0 executing the query, I had to use the affinity mask to make core 0 execute (so I hope that doesn't invalidate the results).
Thanks for your help.
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply