October 15, 2014 at 12:14 am
Quick back story first: we have SQL Server 2012 and Dynamics CRM 2013 in a client environment. We rolled out about a month ago, and a couple of weeks after our launch, things started getting ugly. Suddenly the CPU would jump to 100%, the web servers would get all sorts of queued requests, and things would go downhill quickly for anywhere from 2 to 10 minutes. No big deal, I figured, but boy was I wrong.
I did the standard items, as I had been here before. I rebuilt indexes, updated stats, looked for missing indexes, and verified all settings at the server level, including setting MAXDOP to 1 server-wide, putting the CRM database into read committed snapshot isolation, and verifying that memory was properly capped to prevent overruns from memory-hungry SQL Server. I even put CPU limits on all users except the web site users, and no change. To make things even stranger, I ended up writing a process that would kill all sessions for the user that CRM was using to connect to the database, and when I ran it, the processor would drop down to its normal 18-23% usage and stay down for some random amount of time, then do the same thing again. I hated issuing a mass kill on all of those connections, as I knew users were out there getting "page could not be displayed" messages, but when the client is on the phone telling you to fix it now, you gotta do what you gotta do.
So at this point I had determined a couple of things: it was definitely the CRM website user causing the issue (all CRM services run under different users, so it was easy to identify that way), and the users were not doing anything crazy.
That being said, it was obvious that something was going on, and since even a restore of production on another server would not reproduce the issue, I started recording CPU usage via a job every minute on prod, and also started traces from a script that I have. From that trace I correlated the statements (particularly their EndTime) with the end times of the CPU spikes, as CRM is one noisy system SQL-wise when you have 700 concurrent users hammering on it.
From that correlation over time, I started picking off the items with the highest CPU numbers and tried to tune them in the limited manner that you can tune queries issued from CRM, by adding indexes and moving files around. Still no luck. After getting down the list a bit, I started noticing some trends, in particular with one query that retrieved email. It was a monster, and when I looked at the query execution plan (estimated and actual, on the server where I restored the database) it looked good. After a couple of days of tracking things, I started looking at the execution times for this query for users, and was seeing wild swings in execution times and CPU usage.
Digging further and recording more events, I found that this query was actually being executed via sp_executesql, with parameters passed in. Okay, that is good, I figured, as it would cache the plan, and it was being executed a lot. So why the wild swings? I finally got permission to execute the query and capture the actual execution plan in production, and it showed a warning for unmatched indexes. A bit of research later, I realized that sp_executesql and parameterized queries over tables with filtered indexes cause problems with plans: since the parameters can change, the optimizer cannot use a filtered index if the filter is on a field that is referred to in the WHERE clause through one of those parameters. Okay, so I went into the section of the execution plan that shows which index was unmatched, and it showed me that it was a filter on the ActivityParty table. Over to the filtered index I went and looked at the settings. It was filtering for where the PartyId was null. That is where it got really weird. The @PartyId parameter was never null, as it was actually being populated with the current user's id. But somehow, there it was in the execution plan, clear as day, and it was just as clear in my traces that the query was never executed without that parameter being populated.
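For anyone who hits this later, here is a minimal sketch of the mechanism as I understand it. The table and index names below are made up for illustration only; the real index in my case is CRM's filtered index on the ActivityParty table, and the real filter is on PartyId being null.

-- Demo table and a filtered index (hypothetical names, not the CRM schema)
CREATE TABLE dbo.DemoActivityParty
(
    ActivityPartyId uniqueidentifier NOT NULL PRIMARY KEY,
    ActivityId      uniqueidentifier NOT NULL,
    PartyId         uniqueidentifier NULL
);

CREATE NONCLUSTERED INDEX IX_DemoActivityParty_Party
    ON dbo.DemoActivityParty (PartyId, ActivityId)
    WHERE PartyId IS NOT NULL;

-- With a literal, the optimizer can prove the filter is satisfied and match the index:
SELECT ActivityId
FROM dbo.DemoActivityParty
WHERE PartyId = '11111111-1111-1111-1111-111111111111';

-- Parameterized through sp_executesql, the plan has to be valid for ANY @PartyId
-- (including NULL), so the filtered index cannot be matched and the plan carries
-- the UnmatchedIndexes warning instead:
EXEC sys.sp_executesql
    N'SELECT ActivityId FROM dbo.DemoActivityParty WHERE PartyId = @PartyId;',
    N'@PartyId uniqueidentifier',
    @PartyId = '11111111-1111-1111-1111-111111111111';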
The fix (at least I hope it is the fix) was pretty simple. I grabbed one of the query executions from the log that took a long time, stuck an OPTION (RECOMPILE) at the end, and presto: no more warnings in the execution plan, and no more reference to an unmatched index.
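In other words, the statement-level fix looked roughly like this, again using the demo table above rather than the real CRM statement text. With the recompile hint, the optimizer compiles for the actual runtime value of @PartyId, so the unmatched-index warning goes away.

EXEC sys.sp_executesql
    N'SELECT ActivityId
      FROM dbo.DemoActivityParty
      WHERE PartyId = @PartyId
      OPTION (RECOMPILE);',
    N'@PartyId uniqueidentifier',
    @PartyId = '11111111-1111-1111-1111-111111111111';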
So now that the backstory is out of the way, the one thing that is still bugging me is how that version of the execution plan ever got cached like that in the first place, and why it took two weeks to start bombing out. I just cannot reconcile it in my mind. I even ran DBCC FREEPROCCACHE and dumped everything, but it would just recompile the same way the next time, and since the user cannot be null (or you cannot get into the website), I cannot imagine how that wackadoo query plan ever got in place to begin with. Any ideas?
Tim Blum
Senior Manager, Outsourced Data Services
Learn2Live at Ureach.com
October 15, 2014 at 4:53 am
It had to have gotten into cache the same way as all other plans. It was called and compiled that way. You'd want to look at the first operator in the part of the plan with the problem to see what the compile time values are. That could help determine why it wasn't able to use the index. It could just be a severe case of bad parameter sniffing. But from the sounds of things, I'm not surprised that once a bad plan goes into cache, it stays there pretty well. Without seeing the plans though, I couldn't hazard a guess as to why the optimizer made the choices it did.
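If it helps, something along these lines will pull the compiled parameter values out of the plan cache so you can compare them to the runtime values in your trace. This is only a rough sketch; on a busy cache you would want to join to sys.dm_exec_sql_text(cp.plan_handle) and filter on the statement in question first.

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
    cp.usecounts,
    qp.query_plan,
    p.value('(@Column)[1]', 'sysname')                        AS parameter_name,
    p.value('(@ParameterCompiledValue)[1]', 'nvarchar(4000)') AS compiled_value
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY qp.query_plan.nodes('//ParameterList/ColumnReference') AS pl(p)
WHERE p.exist('@ParameterCompiledValue') = 1   -- only parameters that were sniffed at compile time
ORDER BY cp.usecounts DESC;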
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 15, 2014 at 5:06 am
Parameter sniffing is almost certainly the cause, and there isn't much you can do about it. Data value skew can cause unbelievably bad performance if you don't do some things differently from "normally-distributed" apps, but with Dynamics you are pretty hamstrung unless you have a rock-star dev who can go in and tweak things. You may be able to use a plan guide and an OPTION (RECOMPILE) hint, but I can't recall off the top of my head if that is one of the things you can OPTION in a plan guide (and I have been up for all but a small handful of the last 50+ hours and changed a total of 12 time zones, so I couldn't find the answer if my life depended on it :hehe:).
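For what it's worth, I believe OPTION (RECOMPILE) is legal in a plan guide's hints, but verify on your build. The shape would be roughly this; the statement text and parameter list have to match what CRM sends byte for byte, so the values below are just placeholders pulled from the earlier demo.

EXEC sys.sp_create_plan_guide
    @name   = N'PG_DemoActivityParty_Recompile',
    @stmt   = N'SELECT ActivityId FROM dbo.DemoActivityParty WHERE PartyId = @PartyId;',
    @type   = N'SQL',
    @module_or_batch = NULL,
    @params = N'@PartyId uniqueidentifier',
    @hints  = N'OPTION (RECOMPILE)';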
I just picked up a large (500GB+, >1K users) Dynamics client a month ago or so. JezUS what a MESS that system is!!! :w00t:
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
October 15, 2014 at 11:05 am
I agree, CRM code is bad. So, since the sp_executesql call was parameterized, one of the parameters was the filter field of the filtered index, the compiled execution plan was using (or should I say attempting to use) that index, and the filtered index was never used in any other plan on file, I am trying out removing the filter from the index. I know this will fix the query execution plan, as it keeps reverting back to the old version, followed almost immediately by the processor spike, but I know I am going to pay for it somewhere if that filtered index ever gets used by something that is not parameterized, or where the parameter isn't on the filter column of the index.
I am going to research your idea also, as I see the same symptom in other plans, and as I get those details I will save off the actual SQL and the filtered index along with the actual query plan and post that here too. It is like Microsoft forgot that you cannot parameterize SQL over the filter column of a filtered index without causing bad things to happen.
Glad to know that you are also seeing the mess in Dynamics CRM; I was beginning to think it was something that only I was seeing.
Tim Blum
Senior Manager, Outsourced Data Services
Learn2Live at Ureach.com
October 15, 2014 at 11:29 am
The team that designed the Dynamics app (especially the SQL Server side) should be taken out behind the wood shed and soundly thrashed!! 😀
There have been a number of flaws, limitations, gotchas, issues, etc. with filtered indexes over the years. It is one of those features that SHOULD be awesome, but wasn't ever fully baked I think. Of course you could say that about a lot of SQL Server features over the last 10 years or so. 🙁
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
October 15, 2014 at 11:57 am
Things are still not much better in CRM 2013. I have been supporting this CRM instance from CRM 4 to now, and just finished a dual-hop upgrade from CRM 4 to CRM 2013.
One big thing is that in CRM 2013 they eliminated the ExtensionBase tables and merged them into the base tables. That helps, as it reduces joins, but the killer is still the fragmentation that happens so quickly on the GUID clustered primary keys. I ended up adapting a script I had to rebuild all indexes when fragmentation reaches 30% and reorganize when it is over 5%. I have it scheduled every other day, and it hits the stats to keep them up to date too.
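The threshold check behind that script boils down to roughly this (a simplified sketch, not the full script), based on sys.dm_db_index_physical_stats:

SELECT
    OBJECT_NAME(ips.object_id) AS table_name,
    i.name                     AS index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count,
    CASE
        WHEN ips.avg_fragmentation_in_percent >= 30 THEN 'REBUILD'
        WHEN ips.avg_fragmentation_in_percent >= 5  THEN 'REORGANIZE'
        ELSE 'LEAVE'
    END AS suggested_action
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id  = ips.index_id
WHERE ips.page_count >= 1000   -- small indexes are usually not worth touching
  AND i.index_id > 0           -- skip heaps
ORDER BY ips.avg_fragmentation_in_percent DESC;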
Personally I think the writes caused by reassigning or renaming things cause most of the performance issues, as all relationships carry both the GUID of the foreign key AND the natural key of the foreign key table (usually the name field). I am seeing constant updates to the activity and activity party tables (which, by the way, is where the filtered index was) as it goes through all activities for whatever thing you renamed and updates the natural key. Yep, a sound thrashing is in order just for that design, let alone the decision to use parameterization with filtered indexes....
Tim Blum
Senior Manager, Outsourced Data Services
Learn2Live at Ureach.com
October 15, 2014 at 1:22 pm
Have you adjusted fill factors down to combat the fragmentation problems? That is about the best you can do. There is obviously a tradeoff of taking more disk space, more RAM and more IOs for the same amount of data, but massive, quick fragmentation is truly horrible because it impedes ongoing transaction performance and leads to both tlog and data writes, extra locks/latching, etc. Assuming you use ola.hallengren.com for your index maintenance, you can track every reorg/rebuild and find optimal fill factors in a matter of a few weeks usually.
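For example, something like this on a hot GUID-keyed index. The index and table names are placeholders, the 80 is only a starting point to iterate on, and ONLINE = ON needs Enterprise Edition.

ALTER INDEX IX_YourHotIndex
    ON dbo.YourHotTable
    REBUILD WITH (FILLFACTOR = 80, SORT_IN_TEMPDB = ON);  -- add ONLINE = ON if you are on Enterprise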
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
October 15, 2014 at 4:43 pm
Yes, I use Ola's index tools to work on things.
What I do is dump the output to a table and keep watching things, and adjust the fill factor on an index if it is repeatedly over the reorg threshold of 20%. I do this until it stays below the rebuild threshold each time. I do this during a time period where the user count drops dramatically, so I fully update stats at the same time. The whole thing usually takes about 20 minutes, even on the 400GB database, and our window is an hour for this work, so that part fits in quite well.
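For the curious, since IndexOptimize runs with @LogToTable = 'Y', the "watching" part is roughly a query like this against Ola's dbo.CommandLog table (a sketch; tune the date range and count threshold to taste):

SELECT
    DatabaseName,
    ObjectName,
    IndexName,
    COUNT(*)       AS maintenance_count,
    MAX(StartTime) AS last_touched
FROM dbo.CommandLog
WHERE CommandType = 'ALTER_INDEX'
  AND StartTime >= DATEADD(DAY, -30, SYSDATETIME())
GROUP BY DatabaseName, ObjectName, IndexName
HAVING COUNT(*) >= 5            -- indexes that keep needing work are the fill-factor candidates
ORDER BY maintenance_count DESC;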
I also use the sp_BlitzCache script to catch things like the unmatched indexes, but also parameter sniffing and missing indexes so I can look at what needs to be done.
Furthermore, I actually put Resource Governor in place for all accounts other than the CRMWebService account (which only the CRM websites use to connect to the databases), so that those processes cannot exceed 50% of the processor.
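The Resource Governor piece looks roughly like this. The pool, group and function names are placeholders, the classifier function has to live in master, and only the CRM web-service login is left in the default (unthrottled) group.

CREATE RESOURCE POOL LimitedPool WITH (MAX_CPU_PERCENT = 50);
CREATE WORKLOAD GROUP LimitedGroup USING LimitedPool;
GO

CREATE FUNCTION dbo.fn_CrmClassifier()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    -- CRM web servers keep the default group; everyone else gets capped at 50% CPU.
    IF SUSER_SNAME() = N'CRMWebService'
        RETURN N'default';
    RETURN N'LimitedGroup';
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_CrmClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;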
I capped SQL Server memory at 36 of the 48 GB on the box to keep memory usage pretty steady at 88-90%.
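The memory cap is just the standard setting (36 GB = 36864 MB):

EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'max server memory (MB)', 36864;
RECONFIGURE;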
All in all, despite the db being a mess in CRM 4, we have gotten things performing so much better, and yet all of a sudden this issue cropped up. Actual load on the server at peak times is 800 users out of roughly 1500 total, most of whom use the Outlook client, and the SQL Server CPU generally sits somewhere around 20%.
My task for tonight is to move tempdb onto the new SSD-based IO card, which will definitely help the IO portions. I am still researching the other indexes, as I know there are some other filtered indexes causing issues with execution plans. Since I removed the filter from that index today, things have stayed pretty good for that query, so now I must find the other culprits so that I know which query plans to fix.
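The tempdb move itself is straightforward; the paths below are placeholders, and the files only actually move to the new location when the SQL Server service restarts.

ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, FILENAME = 'F:\TempDB\tempdb.mdf');
ALTER DATABASE tempdb
    MODIFY FILE (NAME = templog, FILENAME = 'F:\TempDB\templog.ldf');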
Did I mention how much I love Dynamics CRM lately?
Tim Blum
Senior Manager, Outsourced Data Services
Learn2Live at Ureach.com
October 16, 2014 at 8:47 am
Sounds like you are well on top of things. Nicely done!
Did I mention how much I love Dynamics CRM lately?
Fortunately you get to call that "JOB SECURITY"!! 😎
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
October 17, 2014 at 8:22 pm
Talk about job security; I feel like I have been doing hand-to-hand combat with CRM 2013 and its deficiencies for two weeks now.
Here are some of the things I found out, just in case anyone needs to fight this battle again.
RCSI is almost a must for CRM, as is MAXDOP 1. That is pretty standard, but it does mean that your tempdb ought to be local, not on the SAN. In my case I have a Fusion-io card with 2 TB local to each node of the cluster, with matched directories and permissions. I have 16 tempdb files, each set to 2GB. Not sure what else I may use the cards for (2 TB was the smallest they could get on short order), but I am thinking I may use them for other things; just not sure what yet. Any ideas there would be great.
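For anyone copying that layout, the file setup is just repeated ALTER DATABASE statements along these lines (logical names and paths are placeholders; repeat the ADD FILE for however many equally sized files you settle on):

ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, SIZE = 2GB);
ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev2, FILENAME = 'F:\TempDB\tempdb2.ndf', SIZE = 2GB);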
That is all pretty much standard, but here is where I have been fighting hand to hand combat.
CRM 2013 uses sp_executesql against tables with filtered indexes, which causes bad query plans all over. I have had to go in and remove the filters from the filtered indexes, so far 24, with another 11 tonight. I added 26GB to my index sizes, which nearly doubled them on those tables, but I have eliminated the high CPU spikes caused by the bad plans, which show up as unmatched indexes in the plans. I have been keeping my traces running, watching for those evil queries, and correlating with the sp_BlitzCache results showing the bad plans.
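Removing a filter without dropping and recreating the index in two steps is just a CREATE INDEX with DROP_EXISTING that keeps the same keys but omits the WHERE clause. Roughly, using the demo index from earlier in the thread:

CREATE NONCLUSTERED INDEX IX_DemoActivityParty_Party
    ON dbo.DemoActivityParty (PartyId, ActivityId)
    WITH (DROP_EXISTING = ON, SORT_IN_TEMPDB = ON);  -- no WHERE clause any more; add ONLINE = ON on Enterprise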
Even with that, I found I still had CPU spikes that were not attributable to the bad plans. And best yet, I know it is something inside CRM being evil, because I see the following symptom: literally every 5 minutes, I see a processor spike, but not a spike that starts 5 minutes from the beginning of the last spike. It always starts 5 minutes on the nose after the completion of the last spike, meaning it is a programmatic wait timer.

I had previously assumed that it was not the async processes doing this, as it was coming from the connection used only by the web servers, but then I realized something: the async process calls the web service on the web servers and fires commands, so even my resource thresholds on the async login are not helping. It only uses its own connection to gather information about what to run and when; then it calls the web service on the web server, which connects just like any other user.

That led me back to the async tables, where I saw all kinds of CRM workflows in a retry state. While I know I read somewhere that you should not have any more than 25 workflows, the company I am doing the work for is a corporate entity with franchises numbering in the hundreds, and each has some workflows, so the total is something like 250 active workflows. The main problem is that some of them created their workflows at the organization level, and those are attempting to fire on items they do not have permission to, failing, and then sitting in a retry state. It seems that every 5 minutes it retries them, while generating new ones all the time. I ended up creating a SQL job to clear out the async tables, keeping RI intact, for anything over 2 hours old, and it runs once an hour. Since then I have reduced the spike CPU usage and shortened the duration, but there is still a sizable spike. We are talking about a system that runs at about 8-15% CPU usage during high-usage times, except for those spikes every 5 minutes that push the CPU up to 90% for about 10-15 seconds. Totally better than what it was before I did the above, where it would spike the processor to 100% for 3-5 minutes at a time.

That being said, I don't get to put away my light saber just yet, as I have determined what is causing the recompiles on the query execution plan, and if you guessed the async process, you guessed right. So here's the rub: while it is doing its every-5-minute thing, if someone hits the same thing it is trying to do, it uses the bad plan that attempts to hit the filtered index through sp_executesql, and if, and only if, the parameter that is passed in matches the field on the filter, then instead of 90% for 10-15 seconds I can get spikes that last up to 3 minutes, cause blocking, and really wreak havoc. So one by one, I am detecting which objects have workflows associated with them, then removing all the filters from the filtered indexes.
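The cleanup job is along these lines. I am deliberately not posting the exact script; anyone copying this should check Microsoft's published AsyncOperationBase cleanup guidance for the correct state/status filters first, and be aware that other child tables may need handling too. Treat the table and column names below as a sketch of the shape, not a tested script.

DECLARE @cutoff datetime = DATEADD(HOUR, -2, GETUTCDATE());

-- Child rows first so referential integrity holds
DELETE wl
FROM dbo.WorkflowLogBase AS wl
JOIN dbo.AsyncOperationBase AS ao
    ON ao.AsyncOperationId = wl.AsyncOperationId
WHERE ao.CompletedOn IS NOT NULL
  AND ao.CompletedOn < @cutoff;     -- assumption: only operations that have already finished

DELETE FROM dbo.AsyncOperationBase
WHERE CompletedOn IS NOT NULL
  AND CompletedOn < @cutoff;        -- add the state/status filters from the official script here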
Oh, and did I mention that the SSISDB is giving me fun to boot? It seems the client turned on verbose logging with 365-day retention, and it is on the same server as CRM, with both the log and the data file on the SAN data drive (bet I can guess how that happened, as the defaults put them on different drives, so obviously someone "helped" that happen). Best yet, they have packages that fire every 2 minutes and every 3 minutes throughout the day, so this thing is getting crazy big. So now I get to clean that out too, move the log files, and try to convince them that scheduling jobs every 2 minutes, plus one every 3 minutes, plus the silly CRM process that fires every 5 minutes, is just asking for trouble. Right now it is like timetable Russian Roulette.
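Dialing SSISDB back down is at least easy; the values here are examples, not recommendations:

EXEC SSISDB.catalog.configure_catalog
    @property_name  = N'RETENTION_WINDOW',
    @property_value = 30;      -- days of execution history to keep
EXEC SSISDB.catalog.configure_catalog
    @property_name  = N'SERVER_LOGGING_LEVEL',
    @property_value = 1;       -- 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose
-- The nightly SSIS Server Maintenance Job then trims history down to the new window.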
So yeah - CRM is definitely job security, but with security like this, actually having time off just is not going to happen.
Tim Blum
Senior Manager, Outsourced Data Services
Learn2Live at Ureach.com