May 23, 2019 at 9:18 pm
We migrated a major server to a new cluster. Installed SQL Server 2017 and set up all of the nodes and the AGs.
After a couple of months, we started seeing an occasional blocking of jobs in the morning. The indexoptimize_user_databases job was still running.
We put in some code to kill the job at 6:00 am. At the same, time we been troubleshooting. I've studied hallengren's script and I can't see what is taking the index optimize job that starts at 12:00 am to still be running at 6:00 am.
I'm not sure what else to review. I've checked the command table for the hallengren script, I see alter index statements up to a point and then I see a start time and a null for the end time. I assume that is when the job was cancelled.
Does anyone have any other troubleshooting ideas? It didn't take 6 hours before. I don't know why there are blocks when we are using the "ONLINE" parameter. I know transactions have to finish and then the online indexing starts. Not sure, what stops the indexing and puts it wait/blocking state.
Any ideas from anyone would be greatly appreciated.
Script being executed at 12:00 am.
EXECUTE [dbo].[IndexOptimize] @databases = 'USER_DATABASES',
@Indexes = 'ALL_INDEXES',
@LogToTable = 'Y',
@PageCountLevel=200,
@FragmentationLow = NULL,
@FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
@FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
@FragmentationLevel1 = 5,
@FragmentationLevel2 = 30,
@UpdateStatistics = 'ALL',
@OnlyModifiedStatistics = 'Y',
@maxdop = 16
Thanks.
Things will work out. Get back up, change some parameters and recode.
May 24, 2019 at 2:38 am
Having about 400 hours of testing into index maintenance and discovering many of the problems with it, my first suggestion would be to stop using REORGANIZE except when you KNOW that you need to compact LOBs. It's single threaded and isn't the lightweight and friendly little puppy that a lot of people think it is. It also removes critical free space from certain index patterns when the most need it and that causes massive page splits which cause excessive log file writes, extended system transactions during the bad page splits, and all of that cause blocking that you might not be aware of but is having incredible "morning after" affects.
If you can, rebuild your smaller indexes using OFFLINE. ONLINE takes a whole lot longer and you can fly through most of your smaller stuff in very short order without ONLINE.
To be honest, I'll eventually go back to doing regularly schedule index maintenance after I've finished writing some special code to do a better job but... I've not scheduled index maintenance to run since Sunday evening, 17 January 2016 (more than 3 years ago now). I had to watch some of the larger indexes for page density and occasionally rebuild them but CPU dropped from an average of 22% to 8% and stayed there since I stopped doing regular index maintenance and especially since I stopped using REORGANIZE.
I've been giving a 2 hour introduction presentation on how to fix things. If you're in the Columbus, OH area on June 8th, 2019, go to the SQLSaturday there. That's the next place I'm giving the presentation. To whet your whistle, you'll find out that there are 6 different index insert/update patterns and what you can do about them with index maintenance.
Ed Wagner and I are also working (feverishly... there's a lot to cover) on a precon for Pittsburgh this October on this subject including additional things like how LOBs screw you up and what to do about it while still being able to use LOBs, etc, etc, etc.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 24, 2019 at 2:49 am
Getting away from the problems associated with index maintenance in general, you moved to a new machine. You also have replication of some form going and AG. IIRC, all of that relies pretty heavily on the log files. Have you checked the comms between the server and the SAN and the latency of the log file drives?
And, you said you went to a new machine AND 2017... what did you migrate from for SQL Server?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 24, 2019 at 1:40 pm
The old server was SQL Server 2008 R2.
No, I hadn't checked the latency of the log file drives. I will check that as well.
Thanks.
Things will work out. Get back up, change some parameters and recode.
May 24, 2019 at 2:01 pm
Not sure but, since you went from 2008 R2 to 2017, you're also in the realm of the "improvement" they made to the "Cardinality Estimator" (which happened in 2014, IIRC). That caused me some HUGE performance issues in some areas and no gains in the rest. I used the trace flag that allows me to use the old Cardinality Estimator and things went back to "normal". (Search for the trace flag... I can't recall off the top of my head what it was).
--Jeff Moden
Change is inevitable... Change for the better is not.
May 24, 2019 at 4:11 pm
I am still researching the cardinality estimator. While researching that, I noticed that the database compatibility has not changed from the migration. In fact, it looks like it wasn't changed from when we went to SQL Server 2008 R2 from SQL Server 2008. The database compatibility level is still set at 100 for SQL Server 2008.
Still researching, so not sure if that may be part of the problem.
Things will work out. Get back up, change some parameters and recode.
June 5, 2019 at 1:53 pm
Hello Webtechie,
I would not look for the old cardinality trace flag.
As it turns out, you mentioned that the databases are still in the SQL 2008 Compat level, which is a very good way to use the old cardinality algoritm.
Since SQL2016+ you can set this option on a per database level:
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON
Before it was:
Trace Flag 9481 to Disable New Cardinality Estimation on SQL Server 2014.
However, you mentioned also that you went to a new machine, using HAG, and online as the index rebuild option.
SQL 2008 did not have HAG but more a per database synchro (mirroring), where as 2017 HAG have a group synchro (async of sync) with multiple worker threads depending on the number of databases involved.
You set a maxdop of 16 for the index rebuild which is a lot of firepower (and threads..).
Now, some problems may arise here:
You may run out of worker threads (check the wait stats) or possibly you run into locking.
Even an ONLINE index rebuild take 2 (normal short) locks, 1 at the start and 1 at the end, as well as a schema stability lock.
Maybe you can activate (or check if already activated) the duration on a per index base, maybe the system was just waiting to get a hand on the required lock to switch over the index after the online rebuild.
Note on using 16 cores for rebuilding..
Watch out that the indexes involved do allow for page locks, or you may end up getting more fragmentation afterwards.
I'm more with Jeff on this, rebuild only when really needed, but spend more attention on the statistics, less overhead, more gain.
Wkr,
Eddy
June 5, 2019 at 3:18 pm
Jeff / Eddy,
I really appreciate the feedback and replies. I've seen that most of our databases have LOB data.
I am preparing to talk to my manager about your suggestions. Here are my thoughts are reading your replies and doing some research.
Questions
Note
Jeff, I would love to hear your presentation this weekend. I really would, but I am in Texas. Hopefully, the presentation will get posted somewhere after Saturday where I can listen to it.
Thanks.
Things will work out. Get back up, change some parameters and recode.
June 6, 2019 at 6:43 am
This was removed by the editor as SPAM
June 6, 2019 at 7:28 am
This was removed by the editor as SPAM
June 6, 2019 at 7:29 am
This was removed by the editor as SPAM
December 25, 2019 at 8:50 pm
Man... I'm sorry. I lost track of this thread and didn't answer your additional questions.
I guess my biggest question would be, did you stop using REORGANIZE or not?
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply