December 4, 2018 at 9:54 am
Hi,
I am hoping some one maybe have any insight or suggestions.
I have a stored proc that returns 10 to 200 rows max it, it is reading by the primary clustered identity key.
95% of the time this runs like it should, then something odd happens, the logical reads start to increase each time the proc is called.
Our monitor captures when it hits the couple million rows logging, as the proc is called the reads increase getting into 20, 30, 40, 50 60 70 million reads for an execution, timing out at the 30sec cut off.
Note it is still will return the 10 to 200 rows.
We have indexes on the tables, the indexes are generaly rebuilt every week.
The table is a hot table being updated and deleted from a lot.
So far I have not seen any pattern as to what may intiate the event.
The only thing I have found so far that seems to correct the run away logical reads is marking the proc for recompile.
If I run the failed proc via ssms it runs as it should in a second and returns the few existing rows.
Some time the proc can go weeks with out an incident and then I can have a day where it has this situation twice in day.
Most of the time when it occurs it is once in a day.
I put in a recompile for the proc as an sql agent job every 15 minutes.
It is a production site and has a large number of users, this issue causes other parts of the system to fail when it is running like this against the table.
There is an index with the first 2 fields as the key and the context field is in as an included field so it only needs the index to do the the search until it pulls the data.
The stored proc is pretty simple.
CREATE PROCEDURE dbo.usp_SelectSubmissionQuestionAnswersByDocumentIDRevisionContext_All(
@documentID INT,
@revision INT,
@context INT
)
AS
BEGIN
SET NOCOUNT ON
SELECT SUBMISSION_QUESTION_ANSWER_ID, DOCUMENTID, REVISION, QUESTIONNAIREID, SUBMISSION_QUESTION_DEFINITION_ID,
RANK, REQUIRED_FOR_SUBMISSION, QUESTION_TEXT, DATE_VALUE, INTEGER_VALUE, DECIMAL_VALUE, TEXT_VALUE, LIST_VALUE,
LIST_DESCRIPTION, LIST_ITEM_ID, DATATYPE, CONTEXT, AVAILABLE_ITEMS, REVISION_INDEPENDENT_ID, CORRESPONDING_AUTHOR, ACTIVE, DRAFT_MODE,
QUESTION_MAP_ID, INCLUDE_IN_MERGE
FROM dbo.SUBMISSION_QUESTION_ANSWERS WITH (NOLOCK)
WHERE DOCUMENTID = @documentID
AND REVISION = @revision
AND CONTEXT = @context
ORDER BY
RANK,
SUBMISSION_QUESTION_DEFINITION_ID
END
Thanks,
Bob
December 4, 2018 at 10:01 am
How many rows does this table have?
A good index for this query would be:CREATE INDEX IX_SUBMISSION_QUESTION_ANSWERS_1
ON dbo.SUBMISSION_QUESTION_ANSWERS(DOCUMENTID,REVISION,CONTEXT,RANK,SUBMISSION_QUESTION_DEFINITION_ID)
Why are you using WITH (NOLOCK)?
December 4, 2018 at 10:20 am
We inherited the data structure and as much as we would like not to use nolock they are not going to redo the data structure and the code.
The table has 19,484,440 rows as of now.
There are many things bad about this, fragmentation is high, they insert temporary rows into the table and then delete them at the end of the coding process.
This just adds to the table running extremely hot with a lot of I/O they should of put into another work table just for the transient data.
When the logical reads runs out of control, the sp_recompile of the proc seems to break the runaway read cycle.
We are trying to figure out the why, hoping if we can understand that we might be able to correct it so the situation does not happen.
The index could be better maybe.
Looking at the behavior and that the recompile of the proc corrects the reads issue, if everything else stays the same what changed by the recompile?
Thanks
Bob
December 4, 2018 at 10:25 am
Robert Menard - Tuesday, December 4, 2018 10:20 AMWe inherited the data structure and as much as we would like not to use nolock they are not going to redo the data structure and the code.
The table has 19,484,440 rows as of now.
There are many things bad about this, fragmentation is high, they insert temporary rows into the table and then delete them at the end of the coding process.
This just adds to the table running extremely hot with a lot of I/O they should of put into another work table just for the transient data.
When the logical reads runs out of control, the sp_recompile of the proc seems to break the runaway read cycle.
We are trying to figure out the why, hoping if we can understand that we might be able to correct it so the situation does not happen.
The index could be better maybe.
Looking at the behavior and that the recompile of the proc corrects the reads issue, if everything else stays the same what changed by the recompile?
Thanks
Bob
Sometimes parameter sniffing is the cause. You can stop this by declaring local variables in the SP and using them in the query.DECLARE @documentIDLocal INT = @documentID,
@revisionLocal INT = @revision,
@contextLocal INT = @context
Then use the local variables in your select:WHERE DOCUMENTID = @documentIDLocal
AND REVISION = @revisionLocal
AND CONTEXT = @contextLocal
Might be worth trying this.
December 4, 2018 at 10:47 am
Interesting, that is true this does seem like it follows the pattern of parameter sniffing.
Funny thing is the stored proc was far worse before I forced it to be rewritten.
It had an or in it and they were selecting either where they had another parameter that pulled all or production only.
That killed the query and we definitely felt parameter sniffing was the issue.
They made each case into a separate proc called from the main proc.
That is why I was not sure if this still was related to parameter sniffing.
I am guessing that by passing off to a local variable it allows the save plan to be more independent form the data?
December 4, 2018 at 12:17 pm
Have you examined the execution plan(s) used by this stored procedure?
You mentioned that the table is "hot", which I assume you mean it's getting a large volume of writes throughout the day. My first guess is that when the query runs fast, it's using something like a nested loop with a relatively small number of (ie: a few hundred) seeks, but when index statistics start getting stale, the compiler falls back to using a table / index scan and maybe a hash match join.
Using your monitoring tool (ie: SentryOne or RedGate), QueryStore, or dm_exec_sql_text, compare the plan used for executions with long duration versus short duration, and you'll probably see a scan operation with millions of rows being returned. If that's the case, then look into setting up statistics maintenance job nightly and/or use QueryStore or some other method to force the execution plan that's consistently working best.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
December 4, 2018 at 12:29 pm
The things I'd try are:
1. Change the stored procedure to stop parameter sniffing.
2. Make sure your statistics are up to date.
3. Drop the current index on dbo.SUBMISSION_QUESTION_ANSWERS(DOCUMENTID,REVISION) and create one on dbo.SUBMISSION_QUESTION_ANSWERS(DOCUMENTID,REVISION,CONTEXT,RANK,SUBMISSION_QUESTION_DEFINITION_ID)
4. Try some hints on the query like OPTION (OPTIMIZE FOR UNKNOWN), OPTION (RECOMPILE) or WITH (FORCESEEK)
December 4, 2018 at 1:41 pm
We have SentryOne montioring, the proc goes from sub second to 30 sec plus, we are capturing anything over 5sec. None of the failed procs captured did it capture the execution plan so I am blind on that, I wish it had caught even a few.
The proc is being called between 2 and 3 times a second to read the table, the proc is used in a number of different places getting called.
On the other side there are a fair number of users updating data in the rows.
What makes it a very hot table is when a user enters editing, the code inserts into this table a clone of the rows to fall back to if the user loses connection. Once the user leaves editing the clone rows are deleted.
Say hello to fragmentation on every level , between the size and the number of deletes and inserts going on it's tough.
We tried rebuilding the index one morning keeping an eye on it and by noon it was as fragmented as it was when we started in the morning.
The hard part is I have seen no pattern yet as to when it happens, we have had 2 weeks where it ran fine and not one instance.
The next week I have 2 days in a row at different times when it will happen.
I appreciate the suggestions and your time, will start looking using them.
Thanks
December 5, 2018 at 7:50 am
What makes it a very hot table is when a user enters editing, the code inserts into this table a clone of the rows to fall back to if the user loses connection. Once the user leaves editing the clone rows are deleted.
What you describe sounds like a hacky attempt to re-invent what snapshot isolation will already do better. If refactoring the application and SQL coding isn't an option, then consider rebuilding table with a very low fill factor, perhaps 70% or even 50%, so perhaps this will minimize page splits and slow the rate of fragmentation.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
December 5, 2018 at 9:42 am
Eric M Russell - Wednesday, December 5, 2018 7:50 AMWhat makes it a very hot table is when a user enters editing, the code inserts into this table a clone of the rows to fall back to if the user loses connection. Once the user leaves editing the clone rows are deleted.
What you describe sounds like a hacky attempt to re-invent what snapshot isolation will already do better. If refactoring the application and SQL coding isn't an option, then consider rebuilding table with a very low fill factor, perhaps 70% or even 50%, so perhaps this will minimize page splits and slow the rate of fragmentation.
Ah... careful now. While that is the "Best Practice" recommendation (which I don't agree with, BTW), it could be a very bad one depending on the current page density of the index and the table itself. A 50% Fill Factor will also be duplicated in memory which basically says that you're going to waste 50% of the memory allocated for this query
--Jeff Moden
Change is inevitable... Change for the better is not.
December 5, 2018 at 9:44 am
I guess my first question would be, when is the last time stats were rebuilt for this index? And, remember, logical fragmentation means very little once something is in memory. It really only matters during the "Read Ahead" part of when something is being loaded into memory. After that, page density matters more.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 5, 2018 at 10:06 am
Jeff Moden - Wednesday, December 5, 2018 9:42 AMEric M Russell - Wednesday, December 5, 2018 7:50 AMWhat makes it a very hot table is when a user enters editing, the code inserts into this table a clone of the rows to fall back to if the user loses connection. Once the user leaves editing the clone rows are deleted.
What you describe sounds like a hacky attempt to re-invent what snapshot isolation will already do better. If refactoring the application and SQL coding isn't an option, then consider rebuilding table with a very low fill factor, perhaps 70% or even 50%, so perhaps this will minimize page splits and slow the rate of fragmentation.
Ah... careful now. While that is the "Best Practice" recommendation (which I don't agree with, BTW), it could be a very bad one depending on the current page density of the index and the table itself. A 50% Fill Factor will also be duplicated in memory which basically says that you're going to waste 50% of the memory allocated for this query
Yes, memory and disk storage would be a trade off here. I don't know why I suggested 50%, that is extreme, but bumping page fill down to 80% would be a good start, assuming it's not already leaving some slack.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
December 10, 2018 at 8:41 am
Sounds to me like page splits are likely here, and in large quantities. Might be worth taking a look into the number of varchar columns (or nvarchar) and see just how much usage these columns get as compared to their length. Have seen an awful lot of situations where nvarchar was in use and there was absolutely no reason for it, as the data never justified it. Then there's the varchar(4000) and no one ever used more than 30 characters. You say that no one is willing to engage in data structure changes, but to be honest, that might well be the "real problem"....
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
December 10, 2018 at 10:36 am
sgmunson - Monday, December 10, 2018 8:41 AMSounds to me like page splits are likely here, and in large quantities. Might be worth taking a look into the number of varchar columns (or nvarchar) and see just how much usage these columns get as compared to their length. Have seen an awful lot of situations where nvarchar was in use and there was absolutely no reason for it, as the data never justified it. Then there's the varchar(4000) and no one ever used more than 30 characters. You say that no one is willing to engage in data structure changes, but to be honest, that might well be the "real problem"....
Just to clarify; the definition of columns as VARCHAR don't cause page splits; it's updating the columns with a value larger than the original that results in an increase of row size and page splitting. So, LastName VARCHAR(4000) might not make sense in terms of data constraint, but it's no more prone to cause page splits than VARCHAR(40), so long as the user isn't updating the values after the initial insert.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
December 10, 2018 at 11:00 am
Indexes are generally rebuilt once a week, we have a several hour window on the weekend. We try to get through as many as we can in the time frame, if it is not rebuilt it is the next week. The rebuild index cycles around again starting over.
For this table in general they are adding approx. 10 to 15 thousand rows a day that will be long term kept.
The temporary insert/deletes per day rough guess can range this can be from 100,000 to 300,000 per day, this could be higher, I don't have stats on this.
The temporary insert/deletes is pure bad design, at this stage of the game it has not hit a pain threshold high enough for them to address rewriting the code that uses this table.
The code is now being used in foreign countries and they are saving text needing nvarchar support.
The size of the fields, I wish they could be smaller but the users at times are saving a large amount of text that they feel the must have.
So trying to make the field smaller is not a option either for us.
I have applied the option unknown and so far it has resolved the time outs and blocking it was showing for 2 databases that this has been happening on.
I appreciate all the suggestions, thoughts, and your time.
It has given me some other ideas to keep in mind.
If I make it through this week with no more timeouts I think I may have the the solution for now to keep the pain at a tolerable level.
I realize that at some point this will read junction where they will have to address it
Viewing 15 posts - 1 through 15 (of 18 total)
You must be logged in to reply to this topic. Login to reply