August 29, 2017 at 8:24 am
GilaMonster - Sunday, August 27, 2017 2:06 PM
Arsh - Sunday, August 27, 2017 1:36 PM: "And what is MARS in the connection strings you mentioned above?"
Something that is off by default and almost never used. I have never seen it in use on any client system. Google will get you the details.
Heh... our predecessors from more than 5 years ago (they're all gone now... consultants) had enabled it, and no one had ever said anything about it. It took a while to finally bite us, but when the code that did 2-22 second recompiles every time it ran started taking more and more hits on larger and larger data, it caused massive KILL/ROLLBACKs with literally hundreds of long-term blocks on the underlying tables, which made the KILL/ROLLBACKs worse in a vicious circle. It was both amazing (48 CPUs went to 90% usage; I'd never seen such a thing before) and painful to watch, because I couldn't do a bloody thing about the KILL/ROLLBACKs once they started, nor could I prevent them from starting, and it would take anywhere from 20 to 40 minutes of nearly total system paralysis to resolve.
Once we discovered the connection strings had MARS enabled and changed them all everywhere, the code was still obnoxious but we no longer had the KILL/ROLLBACKs going on. And, yeah... we finally converted the ORM code that was causing the recompiles to a stored procedure and CPU usage across 48 CPUs dropped from 22% to 5-6%.
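In case anyone wants to run the same check: MARS is enabled per connection string (MultipleActiveResultSets=True in ADO.NET), and you can also spot it from the server side. A minimal sketch using a standard DMV; whether any rows come back obviously depends on what's connected at the time:

SELECT c.session_id,
       c.net_transport,
       c.parent_connection_id
FROM sys.dm_exec_connections AS c
WHERE c.parent_connection_id IS NOT NULL; -- MARS logical sessions report their parent physical connection here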
--Jeff Moden
Change is inevitable... Change for the better is not.
August 30, 2017 at 8:11 am
Jeff Moden - Tuesday, August 29, 2017 8:24 AM (quoted above)
Hey Jeff... I understand the pain of watching a system that is rich in resources but suffering. The concepts about MARS and ORMs made good reading. Thankfully, MARS is not used in our apps; I checked. Glad that I could make some recommendations on the code side (especially an index revamp) and on the DB architecture side, which have been effective to some extent; there's marked improvement, but a lot is still to be achieved. I need a clarification on the fragmented portion of indexes, mostly because of the high I/O I'm seeing here, which feels a bit abnormal. As the indexes are read into the buffer pool, the fragmented portion lands in memory too, hence a cause of unnecessary I/O. In that case, defragmenting must help, I suppose? Any take on that?
Thank you... Arshad
August 30, 2017 at 10:02 am
Arsh - Wednesday, August 30, 2017 8:11 AM: "As the indexes are read into the buffer pool, the fragmented portion lands in memory too, hence a cause of unnecessary I/O. In that case, defragmenting must help, I suppose? Any take on that?"
No, that doesn't cause unnecessary I/O, and no, defragmenting does not help (or hinder) memory usage.
Fragmentation affects large range scans from disk, with the emphasis on *large*. Fragmentation has no meaning once the pages have been read into memory.
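If you want to see how fragmented things actually are before worrying about it, it's easy enough to measure. A rough sketch using the standard DMV (the page_count filter is an arbitrary threshold to skip small indexes, where the percentage is meaningless):

SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id = ips.index_id
WHERE ips.page_count > 1000 -- skip trivially small indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;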
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
August 30, 2017 at 4:54 pm
Arsh - Wednesday, August 30, 2017 8:11 AM (quoted above)
Yes. High I/O is (IMHO) rarely caused by fragmentation (the exception I've found is when you've shrunk the database, which seems to do more than just fragment the hell out of everything, but I haven't put my finger on what and, truthfully, don't need to). What normally causes high I/O is what most people just don't want to hear: bad code. Bad code can be qualified as code that can be cached and reused but simply uses too many resources, and code that needs to recompile every time it's used. Yes, indexes are a part of writing good code, but they aren't a panacea for performance. The right kind of code needs to be present as well. For example, non-SARGable code can't be helped by indexes at all.
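To make the non-SARGable point concrete, here's a minimal sketch against a hypothetical Orders table with an index on order_date. The first predicate wraps the column in a function, so the index can only be scanned; the second leaves the column bare, so the index can be seeked:

-- Non-SARGable: YEAR() must be evaluated for every row
SELECT order_id FROM dbo.Orders WHERE YEAR(order_date) = 2017;

-- SARGable rewrite: same result, but the index on order_date can be seeked
SELECT order_id
FROM dbo.Orders
WHERE order_date >= '20170101'
  AND order_date <  '20180101';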
A great example of such bad code is the "catch all" query. Gail Shaw has a great article on how to write "catch all" queries correctly, without exposure to the likes of SQL injection. Here's the link.
http://www.sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/
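For flavor, here's a minimal sketch of the OPTION (RECOMPILE) pattern that article covers (the table, column, and parameter names here are hypothetical; see the article itself for the trade-offs and the dynamic SQL alternative):

CREATE PROCEDURE dbo.SearchOrders
    @CustomerID int  = NULL,
    @OrderDate  date = NULL
AS
SELECT order_id, customer_id, order_date
FROM dbo.Orders
WHERE (@CustomerID IS NULL OR customer_id = @CustomerID)
  AND (@OrderDate  IS NULL OR order_date  = @OrderDate)
OPTION (RECOMPILE); -- the plan is built for the actual parameter values on each call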
Although good hardware and proper indexes are both necessary, performance truly lives in the code, and that's where I'd concentrate. This isn't some willy-nilly recommendation, either. When I first started at this particular job, the database was only about 60GB, and half of that was in a couple of audit tables. The hardware was 16 CPUs with 128GB of RAM on a good SAN, and yet they'd have 10-minute "outages" where no one could do anything until whatever was running completed. It turned out the batch run that seemed to be the instigator wasn't really the problem... it was "tender" front-end code that suffered and needed to be fixed (we fixed this one by caching results on the web server, because the most they would change was once per day). I/O decreased by more than an order of magnitude (almost two orders), and the web servers ran faster as well.
Also remember that your worst code probably isn't your longest-running code. In our case, it has usually been code that's hit anywhere from thousands of times per hour to millions of times a day (I've got some of that; some has been fixed, some has not). "Single run" code (some call it "ad hoc" queries), code that is executed a lot but with minor to severe changes each time, and so has to be compiled every time it's run, is a big problem and turned out to be a much larger one for us than I could have ever imagined. We've recently found and repaired a whole lot of that. I believe I mentioned it previously, but one such bit o' code was converted to a stored procedure, and that alone dropped CPU from a 22% average across 48 CPUs to just 5-6%. It took a while to find it and fix it (it required a front-end release, because the ORM-generated crap code had to be converted to a proc and the GUI had to have a new call built into it), but it was worth every second.
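If you suspect the same kind of single-use churn, the plan cache will show it. A rough sketch that lists the biggest ad hoc plans that were compiled, used once, and left to bloat the cache (the TOP value is arbitrary):

SELECT TOP (20)
       cp.usecounts,
       cp.size_in_bytes,
       st.text AS query_text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE cp.objtype = 'Adhoc'
  AND cp.usecounts = 1
ORDER BY cp.size_in_bytes DESC;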
Don't get me wrong... batch code is important as well. We did fix a set of audit triggers that were taking 4 minutes to run when just 4 columns and 10,000 rows of a 3-million-row, 138-column table were being updated. After the fix, the code took only 400-800 milliseconds (yeah... I know... I could do better 😉).
If you want lower I/O, find the code that's causing it, because fragmentation probably isn't the cause.
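As a starting point for finding that code, the query stats DMV can rank cached statements by logical reads. A sketch (the counters reset whenever a plan leaves the cache, so treat the numbers as indicative rather than exact):

SELECT TOP (20)
       qs.execution_count,
       qs.total_logical_reads,
       qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;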
--Jeff Moden
Change is inevitable... Change for the better is not.
August 31, 2017 at 1:30 am
Jeff Moden - Wednesday, August 30, 2017 4:54 PM (quoted above)
Thank you so much, Jeff, for the insights. Intriguing. Yeah, we are looking at code as the main culprit, but I've also been adamant about putting the case to management for revamping the architecture as well: separating out the data and log files (as this client only had some DB devs and no DBA some years back, everything was kept on one drive; no design concepts were considered), tempdb management, RAID configurations, HBA queue depth, etc. The people here didn't know (as they were only DB devs) what to look at: how to identify the blocking code, the wait types, etc. Gladly, after I joined, I could give them good stuff to look at, and it has helped a lot. The architecture revamp recommendations are also in the pipeline, to be implemented in a new data center as part of a migration. Fingers crossed!
Thank you again.
August 31, 2017 at 7:11 am
Just a quick note... if the system is on a SAN, then unless the SAN admin is very careful with how the logical drives are set up, guaranteeing that each logical drive is, in fact, on a separate set of spindles for each server, there is no performance advantage to having MDF and LDF files on separate drives.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 31, 2017 at 11:48 am
Spot on, Jeff! I have taken care to put that in specifically. I proposed two different SAN groups, RAID 1+0 and RAID 5, for tempdb and the other DB files respectively. Thank you for adding that.
August 31, 2017 at 2:16 pm
Arsh - Thursday, August 31, 2017 11:48 AM (quoted above)
I'd still be surprised if you saw much of a performance increase. Let us know, please.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 31, 2017 at 2:19 pm
While I certainly appreciate the attention to detail on all the hardware tricks of the trade, I'd really be surprised if much happened in the area of performance. If you get a 2:1 increase in performance thanks to such trickery, consider yourself VERY fortunate, because the normal effect is little to none.
You're going to have to bite the bullet and fix the code someday. Might as well make that a priority now. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
September 1, 2017 at 3:41 am
Jeff Moden - Thursday, August 31, 2017 2:19 PM (quoted above)
Yeah, Jeff, we'll come to know after the migration to the new data center, coming up in a few weeks' time. I'll share the outcome. But yes, tuning the code has been taken as the top priority. I've proposed the architectural changes to set things up as per the Microsoft manuals. As my previous employer (a staunch believer in manuals) was a Premier customer with Microsoft, I would host Microsoft field engineers for RAP sessions at our place and then implement those changes as best practices. But why do you think following these will not help improve performance (the current setup has only one drive for everything)? Thank you.
September 6, 2017 at 9:08 pm
Arsh - Friday, September 1, 2017 3:41 AM (quoted above)
Like I said, unless you can guarantee that the drives will be physically separated on the SAN according to physical spindles, there will be no advantage. And even if you can achieve the physical separation, other instances of SQL Server could be using the same physical spindles, meaning that your instance won't have exclusive use of them. On top of all that is physical fragmentation of the files on the disks. The bottom line is that the read/write heads on a SAN are almost always engaged in the most expensive operation there is for disks, and that is track and sector seeks. Fortunately, hard disk head mechanisms and their related control electronics have come a very long way from the early days of discrete drives, or SANs would be quite the drag. They still can be. That's why having as much memory as possible is good, and why SSDs have a real benefit: no R/W head seeks.
As for code and DB design "best practices", be really careful when it comes to such things as the "Microsoft manuals", unless they were written by one of the several better authors that folks know about out there. You don't have to look too deep into the table structure of MSDB, or any of the stored procedures MS has written, to see what I'm talking about. Remember that "best practices" are frequently the result of a whole bunch of people who don't actually know what they're talking about saying things enough times, and getting equally ignorant people to nod their heads in affirmation, until it's accepted as a supposed "best practice". The "best practices", and the resulting industry, built around defragging indexes are a classic example that I'm ashamed to admit I bought into for over two decades, until 17 Jan 2016. Originally, it made sense because of the drives and the low amount of memory available back then, but it hasn't been necessary since SANs and 64-bit machines hit the streets.
That being said, don't take me wrong. I still think, and likely always will, that SQL Server is the best damned RDBMS there is. And, yes... I'm also a staunch believer in using "manuals"... just not blindly. Test everything. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
September 7, 2017 at 2:30 am
Jeff Moden - Wednesday, September 6, 2017 9:08 PM (quoted above)
You are right, Jeff. I take every word of yours to heart; full of wisdom. Thank you for all the guidance.