HELP! ERROR 17883 Non-Yielding scheduler, SP3 doesnt Work!

  • Help!  We are encountering this error so often now.  It is puzzling and making our management starting to dis-believe SQL server's stability.  I would like to share our problems and see if any of you SQL experts out there have similar problem and how you resolve this!

    For the past few weeks, we got error message as following after our data analyst ran one of their queries from our linked server.

    Process 409:0 (bb8) UMS Context 0x11FE8868 appears to be non-yielding on Scheduler 3.

    Error: 17883, Severity: 1, State: 0

    We would get the above error message every minute.  Even if we have kill the process 409.  SP_WHO2 shows that it is rollingback, but it will stay in that process list forever.  Also we see lots of processes staying around and not killable.  I assume they are all hung up on the scheduler.  The server will still work, but extremely slow and non-responsible.

    If we can fix this without rebooting the server will be great, but for the last few times, we had no choice but reboot. 

    This was the query that crashed it last time.

    Updating a local table on Server1 from joining a table from LINK_SERVER2

    The remote link server table has about 300000 rows and data size is about 3.4G

    UPDATE #Local   SET   field1 = M.field1, 

    field2 = M.field2,  field3 = M.field3  from LINK_SERVER2.TABLE

    AS M  JOIN  #LOCAL  ON   M.ID = #Local.ID

    From the hardware side, SERVER1 is running on a DELL 4XXX serie with TEMPDB on LOCAL drive RAID 1, 1GB allocated autogrowth upto 3GB.  Datafiles are on EMC Clarrion on multiple LUNs.  We don't have any other major CPU usage or I/O bootleneck issues.

    Our datafiles are pretty frameneted on the SAN and we are going to shrink the datafile soon.  Currently 140G allocated and used 140G, but the actual data size is about 80G.  Could autogrowth a problem?  Nevertheless, the above query looks like it uses mostly TEMPDB right?

    From Microsoft support, we applied SP3 our current @@version is at 8.000.847

    Any suggestions where I should start troubleshooting this mysterious problems?  Please ask me if you have any other questions regarding to our system and I will be able provide much more detail info.

    Any other super Guru out there up for this challenge?

    Thanks!

    JON

      

    DBA/OLAP Developer -
    San Francisco, CA

  • MAybe if you save the linked server table locally and then execute the update can run faster.

    SELECT M.Filed1, M.Field2, M.Field3

    INTO #Temp

    from LINK_SERVER2.TABLE AS M 

    JOIN  #LOCAL  ON   M.ID = #Local.ID

    UPDATE #Local   SET   field1 = M.field1, 

    field2 = M.field2,  field3 = M.field3  from #Temp AS M  JOIN  #LOCAL  ON   M.ID = #Local.ID

  • I agree with you.  We can do a lot of improvement on making the query more efficient.  But that is not the only query that caused the ERROR 17883 problem. 

    BTW, what do you mean by storing them locally?  we created the link server so that we don't need to store them locally.

    thanks.

    Jon

    DBA/OLAP Developer -
    San Francisco, CA

  • I agree that you don't need to create a table with data from your linked server, but like I thought that the problem was with that query, looked to my like a good choice.

  • When you query sysprocesses for spid > 50, do you see CXPACKET waittype. if yes then,what is the setting of the advanced parameter max degree of parallelism ? By default the value is 0 and the query would take all available processors (4 in your case) to execute the query. Change this value to 2 or specify a hint in the update statement with maxdop=1.

  • Have you tried running the item that does kill it but watch what is happening during the process with Profiler?

     

    Also, reading the KB articles I don't think SP3 fixes just changes the wording to be more descriptive.

     

    I feel the data amount is large and causing too many blocks to occurr but not sure.

    May want to consider setting up a black box trace in the mean time and doing a mini dump to see what is being pointed at if someone is willing to take a look ad see if they can figure out. However MS might have to be involved and they will want dumps from these occurances. Might be a metadata issue or just data volume to many variables without being hands on unless someone has had this before.

  • 3.4 Gb is huge for 300,000 rows what type of data are you moving? Have you seen this http://support.microsoft.com/default.aspx?scid=kb%3Ben-us%3B810885 and applied those hot fixes?

    Maybe you could simplify the update and do 1 field at a time instead of all 3. I suspect you have some latency somewhere (network, disks...) that is causing the problem and your san is able to write much faster than its receiving your data and this is causing a problem that the above hotfix will fix.

  • SA24, I agree with you regarding to the Maxdop can be set to 1 or 2.  Our default is 0, so it will uses all the processors.  Again, The processor doesn't seems to be an issue. 

    Bnordber, I did look into the microsoft knowledge base. You brought up a good point.  I am also suspecting that it is a disk issue.  However, what do we need to do?  downgrade our disk speed?  How can I capture all the latency issues and pin-point that.  What perf mon counter are we talking about?

    Finally, Antares how do I setup a black box trace?  We will contact Microsoft and send over the error dump.  But unless the problem happens again, I will not be able to use profiler.  In that case, what is your strategy and how would you implement it?

    Thanks fellow SQL DBAs,

    Jon

    DBA/OLAP Developer -
    San Francisco, CA

  • I also have my systems on a fast san, so I applied the hot fix and I haven't seen the problem since. I found other problems related to networking which is probably the same with you. Other recommendations are to either link the server by ip, or add in entry to you lmhosts file so it doesn't need to do a lookup, set both nics to 100mb and full duplex (or gig if you have it - just don't use auto). Good luck

  • Blackbox Trace will be your only hope to catch to the moment it occcurrs but still no guarantee. See "Reporting Errors to Your Primary Support Provider" in BOL (index Blackbox.trc).

  • bnordber  Do you know which hot fix did you apply?

    My @@version is 8.00.847 July 31, 2003

    Thanks, I will talk to the SAN engineer to see what he can do.

    -JON

    DBA/OLAP Developer -
    San Francisco, CA

  • Arghhhhh

    I have had this problem before (17883) It is a thread/scheduler issue, probably because the sql scheduler (internal) is timing out with the link server operation. I remember this happening with a very large transaction at another company. BTW, the error for this is a "hung scheduler". I have also heard that certain raid controllers can make this happen as well.

    They will most likely not fix it unless they reduce the dataset coming from LINK_SERVER2.Table.. how many rows are getting updated in this operation? I would do it in batches, that will fix it for sure.

    I would love to see the query plan for this.


    John Zacharkan

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply