Query never completes on HAG primary replica but runs in seconds on Secondary

  • Hello everyone,

    We have a HAG with 3 nodes. A query that normally runs in seconds on the primary replica (SRV1)  is taking a very long time to return partial results since yesterday; I have not seen it complete since as I have to kill it before users scream. The same query runs in a few seconds on the secondary read-only replica (SRV3). There is an index reorg procedure that runs daily, on both replicas. Last night, on the primary, it went from a couple of minutes to 2 hours and a half for an audit table, the largest of all tables. That is the only unusual thing to happen yesterday.

    Can someone enlighten me on situations where the one node runs the same query significantly slower on the primary than on the secondary? We do not delete records.

    Thank you very much for your help!

    • This topic was modified 3 years, 6 months ago by  jeancelmoy.
  • Hi, few cases that may occur on primary especially when index rebuild was running are:

    • blocking: you should spot this on
      select * from sys.dm_exec_requests where blocking_session_id>0?

      or even better using sp_WhoIsActive http://whoisactive.com/downloads/

    • resource contention cpu or disk: check on OS level or in monitoring software if you have one

    Most probably it was blocking as it ran immediately on the secondary. Secondary when set to read mode is using snaphost isolation that eliminate most cases of blocking.

  • You stated the process is an index reorg - how is that going to run on a read-only secondary?  Is this really performing a REORGANIZE or is it performing a REBUILD?

    If the former - you really should reconsider that process as it probably isn't doing what you expect.  A REBUILD will be more efficient and can be done online, unless you are using REORG to compress LOB columns.

    Did this process take a long time identifying the index to be processed - or did it take a long time performing the index reorg?  If it took a long time performing the reorg - that part of the process will not run on a read-only secondary so it isn't the same comparison.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Thank you Jeffrey.

    The process is an index maintenance one which reorganizes indices under 10% fragmented and rebuilds 10% or more. In the instance I am talking about it ran a REORGANIZE WITH (LOB_COMPACTION = OFF) . That step is what took 2 hours and 38 minutes. After that day (5/24) it went back to running in its normal time. The index is on the audit table, the largest and busiest on the database. We use the very thorough Olla Allengren scripts which you can find here https://ola.hallengren.com/. Both the primary and the replica are maintained by the same procedures, the index fragmentation is 99% the same on any table on both, except for the busiest table, the audit

    We've made another change to the original query and now it runs fast on the primary. Still, the original one never completes. The execution plan is the same on both copies...Still puzzled. We'll try to look into I/O a bit deeper. All nodes are on the same network segments.

    Thanks again.

     

  • Thank you e4d4. There is no significant blocking when the query is executing. Just the normal SQL Server behavior.

  • You cannot perform an index rebuild or reorganize on a read-only secondary.  It doesn't make sense to compare performance of the reorganize process on a system where that process cannot be run.

    There is nothing special about Ola's utility - other than it wraps the operations to be done so you don't have to schedule or set them up yourself.  Very similar to how the maintenance plan wizard prompts you for what actions to be taken - but you still need to make sure you are performing the operations in a manner that works best for that system.

    On Enterprise Edition - it would be much better to choose an online rebuild instead of reorganize.  The only time you should use reorganize is where you are prevented from rebuilding online - and cannot take that table offline for a rebuild.  There are now only a very few restrictions on performing online rebuilds in Enterprise Editions.

    It isn't surprising that a reorganize takes 2.5 hours on a large table.  Because of the way you have the utility defined - one or more indexes on the audit table were selected to be reorganized, which took a long time to perform.  Now - those indexes are no longer selected so the process is back to 'normal'.

    The utility uses a DMV to inspect the fragmentation - that DMV can take an exceptionally long time on a large table.  It will take a lot less time on a read-only secondary because there are no locks on the object that would interfere - where on the primary normal activity against that table is going to cause the DMV to take longer.

    Finally - audit tables are not good candidates to either rebuild or reorganize and generally don't need either process to be performed, at least not on the clustered index itself.  But that really depends on how you setup the audit table - which should have a clustered index on either a date/time type column or an identity.

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • You can still use sp_WhoIsActive when the query is executing to check what kind of waits are slowing down the query or status whatever it is running/runnable or else in case of no waits this should give you some starting point what to check next.

  • I'll go one step further... stop using REORGANIZE except to compress LOBs and then make sure that you do a REBUILD right after because it also compresses the indexes.  It doesn't follow the Fill Factor like a lot of people say.  Rather, it compresses to the Fill Factor meaning that it will try to a rolling group of 8 pages up to the Fill Factor and cannot make new pages to reduce full pages down to the Fill Factor.

    Also, if you are REBUILDing or REORGANIZEing indexes that have a Fill Factor of "0", that's the same as 100 and both methods will remove almost all of what little freespace there is.  Of course, the reason for the index maintenance was because page splits were causing fragmentation and your index maintenance just removed all vestiges of free space to help prevent page splits and so you've turned some form of fragmentation into massive fragmentation with blocking in most cases except for two types of insert/update patterns where it doesn't actually do anything for ya when it comes to preventing page splits. 😀

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply