Query execution speed half on anything other than x5570 core 0

  • Hi, I'm really stumped on this and can't find anything to try and resolve it. Any hints of clues as to what to look for would be really appreciated.

    I installed 64bit SQL Server 2005 on a single physical CPU Nehalem Xeon x5570 with HT (so 8 virtual cores are shown to the OS) on a Windows Server 2008 Standard edition Dell box.

    I noticed the queries were not running as fast as I expected (around the same speed as a previous generation Xeon box) and, after trying everything else I could think of, I messed with the affinity mask settings. My results found that for CPU bound queries (that do not seem to run in parallel anyway) if the CPU bound part schedules on virtual core 0 they complete about twice as fast as if they schedule on any other core. Actually this is true no matter what the affinity mask is set to. But the only way to ensure they schedule on core 0 is to set the processor affinity mask for that core only.

    Obviously this doesn't scale well, but I don't really have much experience of this CPU / system setup, so if anyone knows what I'm doing wrong, or how to address this I'd really appreciate it, as most of the queries for the main system running on this are CPU bound.

    Thanks.

    UPDATE: this is still the case with Hyper Threading turned off and even with some of the physical cores turned off. If the query schedules on core zero it runs twice as fast as any other core.

    Additional info, there's nothing more fancy than some large tables, in-line table valued functions, scalar functions and some other views in the database (ie no user external stored procedures, c# or any other non-SQL stuff).

  • Added SQL Server 2005 SP3 but still the same issue. Anyone got any ideas?

  • No idea offhand. This might be something worth contacting customer support about. I could understand if the Hyperthreading was enabled, but not when it's disabled.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I'm starting to wonder if it's something to do with the non uniform memory architecture of the Xeon x5570. There's 6GB of ram on this box. I don't know how it's physically installed yet (I'll open the box this weekend). Do you know if there are any known issues with this and SQL Server?

  • SQL from 2005 onwards handles the NUMA architecture just fine. However, with 4 cores (how many physical processors?) and 6 GB memory, I don't see how the memory can be equal for each NUMA node, unless it's 2 physical processors with 3GB memory each.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • One physical CPU (quad core). Is there going to be any solution with 6GB, (ie if I find one physical memory module) Or does it sould like I am going to need to change the memory setup?

  • I'm not familiar with the x5570 and I'm not as familiar with NUMA as I would like to be. As far as I know, a NUMA node would consist of one physical cpu and a certain amount of memory. Could be mistaken there. I suggest you investigate Intel's technical specs of the x5570 for details on how it behaves and how the NUMA nodes are set.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Update.

    Memory is 3x2GB Dual Rank UDIMMs 1333Mhz

  • NUMA:

    The OS defaults to allocating memory from the current NUMA node that the thread is executing. If the thread changes nodes or runs out of available memory on the current node, it will then allocate memory from a different node.

    I know SQL2008 is NUMA aware and tries to take advantage of reduced latency/extra bandwidth of NUMA, but there shouldn't be a huge diff between which node you're on.

    The QPI connect between the nodes is 25.6 GB/s and very low latency. It is still additional latency, so you will get reduced performance and possible contention for the other node trying to access it's own data, but it should still be peppy

  • 1) check the sql log just after startup for NUMA messages. If you don't see them then it isn't doing it I think, at least soft numa. I doubt that a computer with a single CPU would be configured for hardware numa.

    2) NUMA can be a collection of CPUs per node, not just one.

    3) IIRC it is (or used to be) best to have match pairs of memory modules. You have an odd number. not sure if this is relevant in your architecture or not.

    4) I would be on the phone to both hardware manufacturer and Microsoft on this one.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Hi Aidan,

    If you could run the following and upload the results it would help enormously.

    Thanks

    Paul

    -- scheduler info (inc. NUMA)

    select parent_node_id,

    scheduler_id,

    cpu_id,

    [status],

    is_online,

    is_idle,

    preemptive_switches_count,

    context_switches_count,

    idle_switches_count,

    current_tasks_count,

    runnable_tasks_count,

    current_workers_count,

    active_workers_count,

    work_queue_count,

    pending_disk_io_count,

    load_factor,

    yield_count,

    last_timer_activity,

    failed_to_create_worker

    -- quantum_length_us

    from sys.dm_os_schedulers

    where scheduler_id < 256

    order by

    scheduler_id;

    -- useful system info - as seen by SQL Server

    select cpu_count,

    hyperthread_ratio,

    physical_memory_in_bytes,

    virtual_memory_in_bytes,

    bpool_committed,

    bpool_commit_target,

    bpool_visible,

    stack_size_in_bytes,

    os_quantum,

    os_priority_class,

    max_workers_count,

    scheduler_count

    from sys.dm_os_sys_info;

    -- selected configuration options

    select name,

    value,

    value_in_use,

    is_dynamic

    from sys.configurations

    where name in

    (

    'max worker threads',

    'priority boost',

    'affinity mask',

    'min server memory (MB)',

    'max server memory (MB)',

    'lightweight pooling',

    'affinity I/O mask'

    )

    order by

    name asc;

    -- wait stats (cpu)

    select top (10)

    wait_type,

    waiting_tasks_count,

    wait_time_ms,

    max_wait_time_ms,

    signal_wait_time_ms

    from sys.dm_os_wait_stats

    order by

    signal_wait_time_ms desc;

  • Hi Paul

    Because of the issues, at the moment the server has cores 2 and 3 disabled in the bios with cores 0 and 1 enabled and the affinity masks set with core 0 for processor and core 1 for I/O. So I'm not sure how useful these results will be. When I get a chance I can re-boot the machine and reset the bios and affinity masks to default (or another setting if you would prefer). But as it currently is here are the results:

    parent_node_id scheduler_id cpu_id status is_online is_idle preemptive_switches_count context_switches_count idle_switches_count current_tasks_count runnable_tasks_count current_workers_count active_workers_count work_queue_count pending_disk_io_count load_factor yield_count last_timer_activity failed_to_create_worker

    -------------- ------------ ------ ------------------------------------------------------------ --------- ------- ------------------------- ---------------------- ------------------- ------------------- -------------------- --------------------- -------------------- -------------------- --------------------- ----------- ----------- -------------------- -----------------------

    0 0 0 VISIBLE ONLINE 1 0 271357 2537913 3041200 12 0 22 11 0 0 15 8259666 708727658 0

    0 1 1 VISIBLE OFFLINE 0 1 2 5 792 3 0 4 1 0 0 1 797 708408901 0

    64 255 255 VISIBLE ONLINE (DAC) 1 1 4 9 794 2 0 3 1 0 0 1 801 708410336 0

    (3 row(s) affected)

    cpu_count hyperthread_ratio physical_memory_in_bytes virtual_memory_in_bytes bpool_committed bpool_commit_target bpool_visible stack_size_in_bytes os_quantum os_priority_class max_workers_count scheduler_count

    ----------- ----------------- ------------------------ ----------------------- --------------- ------------------- ------------- ------------------- -------------------- ----------------- ----------------- ---------------

    2 2 6431227904 8796092891136 165568 585924 585924 2093056 40000 128 512 1

    (1 row(s) affected)

    name value value_in_use is_dynamic

    ----------------------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------

    affinity I/O mask 2 2 0

    affinity mask 1 1 1

    lightweight pooling 0 0 0

    max server memory (MB) 2147483647 2147483647 1

    max worker threads 0 0 0

    min server memory (MB) 0 0 1

    priority boost 1 1 0

    (7 row(s) affected)

    wait_type waiting_tasks_count wait_time_ms max_wait_time_ms signal_wait_time_ms

    ------------------------------------------------------------ -------------------- -------------------- -------------------- --------------------

    SLEEP_BPOOL_FLUSH 51251 38657 31 11762

    WRITELOG 87202 662395 936 5943

    ASYNC_NETWORK_IO 929545 1021338 1996 4773

    PAGEIOLATCH_EX 96972 74942 546 4290

    SLEEP_TASK 432690 1030449 93 4196

    LAZYWRITER_SLEEP 726376 1414146146 85544474 1138

    PAGEIOLATCH_SH 17446 47127 374 795

    PAGELATCH_UP 591 9781 530 234

    LATCH_EX 61 1466 109 140

    BACKUPBUFFER 8341 21403 327 109

    (10 row(s) affected)

  • I'll have a look at the results you posted, but the site formatting makes it kinda hard 🙂

    It may pay to save the results to a CSV file (right-click and save as) and attach that instead?

    It would certainly be better to have the results when everything is reset to default.

    One last question: which exact OS are you running?

    Thanks anyway!

    Paul

    edit: It is rarely a good idea to enable the 'priority boost' option - when you reset to default, please don't forget this one.

    It would also be cool to get the results of my sys.dm_os_schedulers query when your query is executing on core zero, and then with it running on a problematic core. If you are able to get a third reading when allowing it to run in parallel, that would be extra-awesome!

  • I second the priority boost comment. I also note max memory is defaulted - not best practice, but likely not a culprit here.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Sorry about the previous text format. I couldn't get IE 8 to upload anything (I'm on another machine using IE 6 and it seems to work from here).

    The machine is Windows Server 2008 Standard Edition SP 2

    Dell Server R610 X5570 2.93Ghz.

    Attached are the files. For the file with the results from Core 0 executing the query, I had to use the affinity mask to make core 0 execute (so I hope that doesn't invalidate the results).

    Thanks for your help.

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply