March 24, 2004 at 11:24 am
Help! We are encountering this error so often now. It is puzzling and making our management starting to dis-believe SQL server's stability. I would like to share our problems and see if any of you SQL experts out there have similar problem and how you resolve this!
For the past few weeks, we got error message as following after our data analyst ran one of their queries from our linked server.
Process 409:0 (bb8) UMS Context 0x11FE8868 appears to be non-yielding on Scheduler 3.
Error: 17883, Severity: 1, State: 0
We would get the above error message every minute. Even if we have kill the process 409. SP_WHO2 shows that it is rollingback, but it will stay in that process list forever. Also we see lots of processes staying around and not killable. I assume they are all hung up on the scheduler. The server will still work, but extremely slow and non-responsible.
If we can fix this without rebooting the server will be great, but for the last few times, we had no choice but reboot.
This was the query that crashed it last time.
Updating a local table on Server1 from joining a table from LINK_SERVER2
The remote link server table has about 300000 rows and data size is about 3.4G
UPDATE #Local SET field1 = M.field1,
field2 = M.field2, field3 = M.field3 from LINK_SERVER2.TABLE
AS M JOIN #LOCAL ON M.ID = #Local.ID
From the hardware side, SERVER1 is running on a DELL 4XXX serie with TEMPDB on LOCAL drive RAID 1, 1GB allocated autogrowth upto 3GB. Datafiles are on EMC Clarrion on multiple LUNs. We don't have any other major CPU usage or I/O bootleneck issues.
Our datafiles are pretty frameneted on the SAN and we are going to shrink the datafile soon. Currently 140G allocated and used 140G, but the actual data size is about 80G. Could autogrowth a problem? Nevertheless, the above query looks like it uses mostly TEMPDB right?
From Microsoft support, we applied SP3 our current @@version is at 8.000.847
Any suggestions where I should start troubleshooting this mysterious problems? Please ask me if you have any other questions regarding to our system and I will be able provide much more detail info.
Any other super Guru out there up for this challenge?
Thanks!
JON
DBA/OLAP Developer -
San Francisco, CA
March 24, 2004 at 11:46 am
MAybe if you save the linked server table locally and then execute the update can run faster.
SELECT M.Filed1, M.Field2, M.Field3
INTO #Temp
from LINK_SERVER2.TABLE AS M
JOIN #LOCAL ON M.ID = #Local.ID
UPDATE #Local SET field1 = M.field1,
field2 = M.field2, field3 = M.field3 from #Temp AS M JOIN #LOCAL ON M.ID = #Local.ID
March 24, 2004 at 11:58 am
I agree with you. We can do a lot of improvement on making the query more efficient. But that is not the only query that caused the ERROR 17883 problem.
BTW, what do you mean by storing them locally? we created the link server so that we don't need to store them locally.
thanks.
Jon
DBA/OLAP Developer -
San Francisco, CA
March 24, 2004 at 12:22 pm
I agree that you don't need to create a table with data from your linked server, but like I thought that the problem was with that query, looked to my like a good choice.
March 24, 2004 at 7:02 pm
When you query sysprocesses for spid > 50, do you see CXPACKET waittype. if yes then,what is the setting of the advanced parameter max degree of parallelism ? By default the value is 0 and the query would take all available processors (4 in your case) to execute the query. Change this value to 2 or specify a hint in the update statement with maxdop=1.
March 24, 2004 at 7:43 pm
Have you tried running the item that does kill it but watch what is happening during the process with Profiler?
Also, reading the KB articles I don't think SP3 fixes just changes the wording to be more descriptive.
I feel the data amount is large and causing too many blocks to occurr but not sure.
May want to consider setting up a black box trace in the mean time and doing a mini dump to see what is being pointed at if someone is willing to take a look ad see if they can figure out. However MS might have to be involved and they will want dumps from these occurances. Might be a metadata issue or just data volume to many variables without being hands on unless someone has had this before.
March 25, 2004 at 8:39 am
3.4 Gb is huge for 300,000 rows what type of data are you moving? Have you seen this http://support.microsoft.com/default.aspx?scid=kb%3Ben-us%3B810885 and applied those hot fixes?
Maybe you could simplify the update and do 1 field at a time instead of all 3. I suspect you have some latency somewhere (network, disks...) that is causing the problem and your san is able to write much faster than its receiving your data and this is causing a problem that the above hotfix will fix.
March 25, 2004 at 10:34 am
SA24, I agree with you regarding to the Maxdop can be set to 1 or 2. Our default is 0, so it will uses all the processors. Again, The processor doesn't seems to be an issue.
Bnordber, I did look into the microsoft knowledge base. You brought up a good point. I am also suspecting that it is a disk issue. However, what do we need to do? downgrade our disk speed? How can I capture all the latency issues and pin-point that. What perf mon counter are we talking about?
Finally, Antares how do I setup a black box trace? We will contact Microsoft and send over the error dump. But unless the problem happens again, I will not be able to use profiler. In that case, what is your strategy and how would you implement it?
Thanks fellow SQL DBAs,
Jon
DBA/OLAP Developer -
San Francisco, CA
March 25, 2004 at 11:01 am
I also have my systems on a fast san, so I applied the hot fix and I haven't seen the problem since. I found other problems related to networking which is probably the same with you. Other recommendations are to either link the server by ip, or add in entry to you lmhosts file so it doesn't need to do a lookup, set both nics to 100mb and full duplex (or gig if you have it - just don't use auto). Good luck
March 25, 2004 at 11:04 am
Blackbox Trace will be your only hope to catch to the moment it occcurrs but still no guarantee. See "Reporting Errors to Your Primary Support Provider" in BOL (index Blackbox.trc).
March 25, 2004 at 1:13 pm
Arghhhhh
I have had this problem before (17883) It is a thread/scheduler issue, probably because the sql scheduler (internal) is timing out with the link server operation. I remember this happening with a very large transaction at another company. BTW, the error for this is a "hung scheduler". I have also heard that certain raid controllers can make this happen as well.
They will most likely not fix it unless they reduce the dataset coming from LINK_SERVER2.Table.. how many rows are getting updated in this operation? I would do it in batches, that will fix it for sure.
I would love to see the query plan for this.
John Zacharkan
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply