February 22, 2010 at 10:26 am
Hi,
We have a table which has grown to 20mm rows. With this table, towards the end of the day
it can have 40000 rows being inserted/deleted/updated during the course of a transaction. If all
goes well, it is committed, rollback if otherwise.
This table is on the primary filegroup of the database, along with a slew of other tables.
When this job runs, it generally takes 2-3 minutes. That's ok. What is interesting to note
is all other jobs (which are NOT accessing this table) comes to a crawl...actually i feel
the other queries don't come back until the 40000 row job is done/committed.
My questions are :
1.) Why is this occurring? I'm thinking because this large table is on the primary filegroup,
everybody is beginning to suffer. How do I go about proving/showing this? Or am i incorrect on
my assumption?
2.) Would putting that 1 specific table on to its own filegroup avoid the other jobs from slowing
down?
Any thoughts, opinions, or advise is appreciated. Thanks!
February 22, 2010 at 2:04 pm
There two possible general causes for this kind of behavior:
1) Physical Resource Contention: This would be if your big job is using up so much of a physical resource, and so aggressively, that other transactions are slowed to an apparent standstill waiting for enough time on that resource to finish their transaction. This is essentially what you are suggesting with respect to your Primary File Group, in which case the "resource" would be the physical disk (or set of disks) supporting the Primary File Group. Thats the most likely case, however, it could easily be some other resource (another disk volume, your CPU, memory, etc.).
2) Virtual Resource Contention: Most commonly seen as Locking & Blocking, wherein the SQL locks that your big job is taking out are blocking other transactions from finishing. That's the most likely kind of virtual resource contention, however, there are other kinds of lock than just SQL locks, and other kinds of Virtual Resource contention than just Locks in general.
In addition to these there are numerous idiosyncratic and/or "pathological" possibilities (such as say, a SQL Server internal bug), but they are much less likely, unless there are other, more unusual symptoms.
There are two general ways for you to proceed here (they are not exclusive of each other). The first is what you suggested which the "systemic" approach, wherein you try to confirm which of the above two is occurring, which actual resource it is happening on, and whether your big job is the cause.
The other is a "constructionist" approach wherein you just assume that your big job is the problem, and that, regardless of the details of why its causing it, it needs to be re-written to be a better DB citizen and start from there. And since they are not exclusive, you can also start with the Systemic approach until you feel you know enough to switch to the Constructionist approach.
Systemic Approach: To start systemically, you use PerfMon to observe and compare the behavior & availability of your Server's physical resources during the time that big job is running, to their availability before and after that, when it is not running. You do this by looking at two specific metrics for each resource: its Utilization% and its Avg Queue Length (there are others that are useful diagnostically, but these are the only ones that you need initially).
For you CPU, you get your Util% by looking at the Total "% Processor Time" under the Processor object, and dividing by the number of cores that you have (this is already there by default in Perfmon, however you should remove the other two metrics, Pages/sec, and Total "Avg. Disk Queue Length"). You get the CPU queue length by taking the "Processor Queue Length" under the System object and adding the Total "% Processor Time"/100% to it (note: do not divide by the number of processors for this use).
For you apparent Disk volumes (i.e., each appears as a single Physical Disk to PerfMon) you want to go to the Physical Disk object and add the following counters for each disk instance there that SQL Server uses (plus, always add the system disk, and any pagefile disks, even if SQL Server is not using them directly, never use the "Total"). The Util% for disks does not come from the "% Disk Time" as many believe (and the MS help claims), its not accurate for anything that I know. Rather you have to use the "% Idle" counter and then subtract it from 100%. Queue length is easy, its the "Avg. Disk Queue Length"
Technically, Memory is not a consumable physical resource, but it is still useful to include it at this point, however, we don't use the same kinds of metrics for it. Instead, you want to go to the Memory object and add "Page Reads/sec" and "Page Faults/sec".
If you want to do this purely visually then set your sampling period to 5 secs (that way everything can fit onto one screen). However if you want to crunch the actual numbers, or if you want us to look at them, then you should set all of this up under PerfMons Counter Logs and set the sample period to 2 seconds. If you do log this then you should specify it as a comma-delimited text file and ideally do the collection remotely so that PerfMons logging does not interfere with any of your SQL Server disks (or vice-versa).
To do the collection/observation, you just start the perfmon collector during a "normal" time when your big job is not running, collect for a couple of minutes, then start the big job, noting exactly when it started, let it run to completion, again noting exactly when it stopped, and finally collect a couple more minutes after it has completed.
What you will be looking for most importantly is what happens to each of the physical resources at the transition from several seconds before the job starts to run, to several seconds after it starts to run. The clearest and easiest is if one of the Physical resources "Pegs", that is, its util% jumps to 100% (or Idle% drops to zero for the disks). That would usually mean that your big job was causing these problems because it was flooding that resource. Other patterns need more analysis, so I will wait until (and if) you have that data/information.
Constructionist Approach: A much shorter description, but certainly no less effort. For us to help you with this path, we will need you to post the code from your big job and necessary associated information & scripts. This article explains how to best do this: http://www.sqlservercentral.com/articles/Best+Practices/61537/.
Let us know what you decide. 🙂
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
February 23, 2010 at 2:54 pm
Hi there,
Thanks for your detailed reply. While I had thought about using perfmon, for whatever reason I was hoping there would be some "magical" SQL/DBCC/showplan thingy to just tell me what was going on :Whistling:
I'll have to dig out some time, and go through the systemic approach advised.
Thanks for the detailed information - it saves time knowing what metrics in Perfmon I'll have to track.
I'll write a followup on this thread when I actually come to a conclusion, and execute a plan to be taken.
February 23, 2010 at 3:03 pm
I think RBarryYoung is on the right track, what you are describing isn't so much as blocking or locking but that other queries are fighting to just complete. That *feels* like resource contention, the problem being what kind, CPU, I/O, memory? This should help you find the userlying issue. I'm pretty comfortable saying that SQL appears to be doing what it is supposed to, just not as fast as we'd like..
CEWII
February 23, 2010 at 3:20 pm
Elliott W (2/23/2010)
I think RBarryYoung is on the right track, what you are describing isn't so much as blocking or locking but that other queries are fighting to just complete. That *feels* like resource contention, the problem being what kind, CPU, I/O, memory? This should help you find the userlying issue. I'm pretty comfortable saying that SQL appears to be doing what it is supposed to, just not as fast as we'd like..CEWII
In the milder cases the two (resource contention vs. lock contention) feel very different. Lock contention as a "Stop & Go" fell, whereas resource contention just make things slower. However, in the more severe cases these two behaviors converge to the point where they a perceptually indistinguishable because "everything just stops/hangs". In those case you have to go to the technical metrics.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
February 23, 2010 at 3:26 pm
samir-847608 (2/23/2010)
... for whatever reason I was hoping there would be some "magical" SQL/DBCC/showplan thingy to just tell me what was going on :Whistling: ...
There is. For your case, its PerfMon. 🙂
I'll have to dig out some time, and go through the systemic approach advised.
Despite my long explanation, it really shouldn't take that long to do. Maybe 10 minutes to set it up (5 if you're more familiar with PerfMon), the time to run the collection, and then another 5 minutes to put the CSV in a zip file and post it here. Well, I guess it does add up, still I encourage you to do it, just for the chance to learn how to do it, if nothing else.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
February 23, 2010 at 3:31 pm
I am absolutely on-board with getting technical metrics. Without them there is really no way to diagnose what the problem here is.
I did get to thinking about foreign key tables off this main table.. During the transaction they are likely shared locked, I could see that locking, but it still highlights the need for metrics..
CEWII
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply