June 2, 2021 at 3:48 am
Hey guys... this post is to discuss some weird behaviour we've had tonight with one of our instances..
I've had a roller coaster of emotions because i couldn't accept that it was the issue, things are fine now and like i said earlier it'd be nice to talk about it..
We have a VLDB with more than 15 datafiles... most of them are closed. and maybe 2-3 of them have autogrowth yet the drives are almost full..
We started getting reports from one of our teams of high cpu usage from this instance, it was weird because this instance is a VLDB but its basically running the same queries always, they never change, they rarely make any changes to the queries of the application. the design is simple to maintain but its a VLDB..
Sometimes we usually clean the orphaned records, of course we usually go through the process it should go, development, QA, approval of the higher ups etc etc and we have been doing this process for a few years and its going well.. (the development team is supposed to work on a fix, but for now we have been cleaning it), most of these records are in data files that are old and have been closed for years.
I'm telling you above some background so you could understand what i'm about to explain, i'm explaining the situation we've had if you guys could talk to me about this types of scenarios (this is my first time in one of these) if you have been through it before it'd be nice.
So.. tonight we starting getting reports of high cpu usage, like 90-100%, this server is always in 50%, it was weird, the process using the most CPU was SQL Server.
We get into the instance we check for transactions and we see the transactions are taking longer, we are getting emails of large queries taking longer than 2 minutes, none of these transactions had any wait_type (according to sp_whoisactive), we see that CPU is high, queries are taking longer, we disabled all the snapshot/backup tools and nothing has changed.
the 2-3 data files I've talked earler are on autogrowth and have a max file size, both of the files according to SQL Server had the MAX size they are supposed to get (doesn't mean they might be full tho), basicalle the SIZE = MAX file size, as i said earlier doesn't mean they are full tho...
so we are scratching our heads because no wait types on the transactions or anything like that...
I check the highest wait_type, its SOS_SCHEDULER_YIELD and its really weird, because usually the highest wait type in this server is BACKUPs wait types...
One of the DBA guys just opened a data file that was closed in one of the drives, the drive had like 100GB left and he gave the data file 50GB.
Everything was fixed... CPU went back to 15-20% (its night time so less users), transactions went smoothly and everything calmed down...
My theory is that the transactions were maybe taking more time and accumulating (hence the cpu spike) looking into the data files searching for free space through all of them since as i said before we do some cleaning internally without shrinking the datafiles..
Everything looks normal now.. but we'll keep an eye into the instance.
have you guys ever been through these kind of scenarios? could you guys go a little bit deeper into it?
Regards,
June 2, 2021 at 12:47 pm
What is the "cost threshold for parallelism" setting? If it's lower than, say, 50, you should seriously consider changing it.
Did you verify there are no physical issues -- controllers, NIC/communication cards, etc.?
Do the missing index stats show lots of missing indexes?
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
June 2, 2021 at 2:07 pm
The cost threshold is actually 25 but the queries/transaction running in the instance never go above it so they are not actually using parallelism, i've checked most queries and none of them seem to take more than a few seconds or even milliseconds.
I'm not that pro checking physical issues, i checked drive queues and all of them where on 0.05 or below, i don't really know how to dig into NIC issues, i checked the event viewers in case something was odd but couldn't find a thing either.
Couldn't find missing indexes and fragmentation in most tables are less than 5%, hell i even got missing indexes in msdb backupset tables and not in that database...
was using the
sys.dm_db_missing_index_details
to get the last information, only showed database_id 4(msdb) and 1 (we have some tables capturing information for diagnostics and stuff).
Regards,
June 2, 2021 at 2:31 pm
Well, good luck. Since your posts seem to disagree with themselves, I don't think I can be of any more assistance.
"we are getting emails of large queries taking longer than 2 minutes ... queries are taking longer, "
"i've checked most queries and none of them seem to take more than a few seconds or even milliseconds."
Given that, and the off-and-on nature of the problem, it is likely that something outside the queries is causing the core issue.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
June 2, 2021 at 2:42 pm
Yeah i completely understand what you mean.
the instance was working fine, transaction times started taking longer in night and what "fixed" the issue at last was increasing a data file size... no warnings in the logs regarding making space in data files either, couldn't see a long queue of transactions as if the data file was full and couldn't grow, that could explain the cpu usage if it spiked to 100, but none of these issues were happening.
Thanks for reading tho :), it always feels nice to have a discussion here regarding sql server and having amazing DBAs replying to the posts.
June 2, 2021 at 7:29 pm
I am not clear on what you mean by 'closed' - data files are not closed. They may be set to read-only but that is different...
It sounds like you have 15 data files for a single filegroup, filled up the previous data files and then created a new file to allow for additional growth. This is not the intended configuration for multi-file filegroups - that is intended to allow balancing of data across all of the files in the filegroup, and since you have 'closed' the other data files SQL Server is not able to balance the data appropriately.
In general, you would want multiple files to all be sized the same and close to the same amount of space available. When SQL Server needs to grow the data file(s) in this configuration - it is going to try and grow all files at the same time. I believe that change occurred in SQL Server 2016 - but can be disabled for user databases.
I am just guessing here - but if you have differently sized data files, and 12 out of the 15 are full and cannot grow (auto growth disabled - or max size set current size), that might explain the issues and would explain how 'opening' one of the previously 'closed' files and allowing it to grow resolved the issues.
If that is your configuration, I would also think index rebuilds are problematic.
Then again, if you are actually talking about partitions then it is a whole different subject.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply