One query is hogging my baby!!!

  • Ok guys, I got a good one here. This is something that I didn't think could happen on my server.

    My boss wrote a query that actually brought the whole server, operating system and all to just about a standstill.

    The real stumper about this is that the query had no table scans and was actually using the indexes correctly(it seems). If the query was ran for one day, then response times were less than 3 seconds. If he ran the query for a range of 10 days, it hosed the system. This machine also runs a webserver which the query caused all of the pages to timeout.(even the static ones)

    Before everyone starts jumping on this thread offering ideas, let me give you a little more info:

    1. Win2000 Advanced Server, SQL 2000 SP4 (8.00.2187), 8gb mem, quad Xeon 2.5ghz proc

    2. I have 8gb total with 7gb allocated to SQL Server, with dynamic allocation turned on

    3. I know that date ranges are by general hard to optimize (I think the BETWEEN clause is nonsargable, but I may be wrong)

    4. The three table sizes - 3mil rows at 5gb, 6mil rows at 6gb and 14k rows at 4mb

    5. I have stressed this server with similar queries(2mil row joins multiple tables), but I have always seen good response time from other processes running at the same time

    6. SQL Server is set to use all available processors for parallel execution

    7. Boost SQL Server priority on Windows is turned on

    8. Everytime that I start the query it makes the server unresponsive and it's a production server, so I can't do too much playing during production hours.

    9 Windows background processes are set to get priority

    Here are some of my thoughts :

    1. First I am thinking about setting the max processors to 7 (8 available). I would guess that this would guarantee that there is always an open process for the webserver and the operating system

    2. I want to fix my server so that if another query like this comes up, only that process suffers and the operating system is still responsive.

    3. I want to setup my server so that no one process can takeover the system.

    Two things that I want to do (in this order) :

    1. Don't allow another query to do this to my baby

    2. Fix this query so that it runs fast


    Live to Throw
    Throw to Live
    Will Summers

  • Will:

    I don't have solutions but I think I might have explanations.

    1. Even though a query is using an index instead of doing table scans is not a guarantee that the query is going to run much faster.  Some clustered indexes can be almost as large as the table and an index scan of a clustered index is expensive.

    2. I assume that you're using a >= or a <= to set the comparison for the dates.  This is reported to be a way to obtain a better performance when queryng on dates.

    3. >> Boost SQL Server priority on Windows is turned on <<  I would not do that since that's probably the cause that you are losing control of the server altogether.

    Carlos.

     

     

     


    Regards,

    Carlos

  • I am using BETWEEN for the two dates.

    The box is in production, so I am going to have to gather a list of stuff to try out this Sunday. The good thing is the I can setup a query window from my machine and run the query to nearly-freeze the server. If I cancel the query from my machine, then the query stops immediately and the server is unfroze. That makes testing much easier. Here is what I planned on doing this sunday:

    1. Test query with 'Boost SQL Server priority' turned off

    2. Test query with 'Boost priority of background applications' turned off

    3. Test query with Lower max memory that SQL Server can allocate

    4. Test query with 7 CPU max that SQL Server can allocate

    Does anyone have anymore suggestions? I don't want to hurt my performance in SQL Server, but I don't want any one query to take over the machine.


    Live to Throw
    Throw to Live
    Will Summers

  • You main problem is in your second line: "My boss wrote a query". Bosses can bring down a hyperdome

     

    Seriously though, I would rather not reduce performance of the server because of a "might happen again" situation. Would rather fix the problem.

    Having said that, I have done it in the past. I had a box with 8 cores, and told SQL to use cores 1-7 leaving core zero for the OS.

    if I remember correctly, most of the kernel, drivers and other bits and pieces use the first core. (But it might have been the last)

    So leaving that off SQL allows you to gain access during a lockup.

    I'll find the document in a minute and post it.

    Another thing you can look at is 2005 DAC. This allows you to connect to SQL even when normal connections are timing out.

    http://coolthingoftheday.blogspot.com/2007/07/using-sql-server-2005-dedicated.html

    Of course, if the box is locked, it's locked.

    One point on keep in mind on the procs: If your procs are HT then you actually have to remove 0 and 1 from SQL.

     

    Cheers,CrispinI can't die, there are too many people who still have to meet me!It's not a bug, SQL just misunderstood me!

  • The number of CPUs that SQL Server could take was my first red flag that came up. I'll test in this order then.

    1. Test query with 7 CPU max that SQL Server can allocate1.

    2. Test query with 'Boost SQL Server priority' turned off

    3. Test query with 'Boost priority of background applications' turned off

    4. Test query with Lower max memory that SQL Server can allocate

    The query that he wrote should have worked fine, I think it just crossed that threashold of what could be done on the server.

    I'm lucky in that I really don't have a 'pointy-haired-access-is-a-real-database' type boss. He's very open to suggestions, is knowledgeable about technology, is flexable about work hours and pays well.

    And sadly no, there are no job openings.


    Live to Throw
    Throw to Live
    Will Summers

  • I could be wrong, but I don't think that anything you do to the hardware or the server settings or indexes is going to fix this one...

    As you said, it works fine for one day and then chokes pretty bad on, say, 10 days.  There are a couple of things that could be going on here in the code (you probably already know most of these but gotta post them anyway )...

    1. View of a view or (especially) self joined view... have seen it where these will peg CPU's in a most ruthless manner.
    2. Accidental cross join... 'nuff said there... can make billions of internal rows during processing.
    3. Triangular join... normally, these occur in the joins and take the form of <, <=, >=, >.  While those relationships aren't necessarily bad, they're really bad if they form the ONLY relationship in a join.  Triangular joins are roughly equal half a cross join (Cartesian Product).  <> is almost as bad as a cross join.
    4. If the query does any inserts or updates to the base tables of a view, it will sometimes cause the view to recalculate the full result set in it's entirety on every row inserted/updated depending on how the view and the modification code is written.
    5. If the query does an update with joins and the target table is not in the from clause of the join, the query will sometimes lock up cpu's if the indexing is just right when parallelism is spawned.  This can be patched using MAXDOP, but it's far better to fix the update.  This problem slammed 4 cpu's of a 4 cpu box into the wall for more than 2 hours at work.  Once we fixed the update, it took only 30 seconds to run the whole query.
    6. Correlated subqueries, especially in the WHERE clause.  Will sometimes drag performance down as if #3 above where present.

    If the query isn't some thousand line monster, you might want to post it and the related DDL... maybe we can help on this...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ususally, there are two reasons to why a server grinds to a halt.

    You've found a 'feature' (aka bug - the thing hangs/crashes) or you've hit a bottleneck (ie resources are at it's end)

    If we assume it's not a bug issue, then you've hit a hard limit somewhere.

    I'd also suggest that you do not use the 'boost' option (7).

    Didn't quite what (9) was, it sounded like something to do with manual priority settings.

    Don't do that. Let the server decide priorities itself.

    Never use the 'boost' option unless you have proved that it's a good thing.

    It's known that this setting alone may produce the exact behaviour you're describing, given the circumstances.

    To me it sounds like a typical case of resource starvation.

    Adding to this, there *may* be in conjunction with this particular query also.

    Can't say without knowing what it does and what it looks like.

    As a sidenote: It's not very hard to write a query that will hog any server. You just need to be a little 'creative' with your SQL and the thing stops breathing...

    /Kenneth

     

  • One other thought - is there a way of 'observing' the query in execution to identify where/what the bottleneck is? While you test this weekend?

  • Any blocking going on ?

  • I've tried to limit my testing on this query. It basically locks everyone else out and this is production. I will looking into everything that everyone else said.

    the #9 that I am referring to is outside of SQL Server, at the Windows level. It's one of the tabs in the properties of My Computer if I remember correctly.

    I'll post an update on Monday.

    Thanks for everyone's suggestions. I will be rereading this post if the processor change doesn't do anything.


    Live to Throw
    Throw to Live
    Will Summers

  • Is it just a "select" query, or it updating ??   If it's just selecting, use WITH(NOLOCK) ..... Also, you might want to consider the bigger picture and replicate to a reporting database for such queries so that production is not affected.

  • Post the code, Will...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • A reporting database would be nice, but the funds just aren't available.

    The code is just a select statement. I didn't post the code before cause I didn't want to optimize the code, I wanted to minimize the ability for one statement to just about bring my machine to a halt at the OS level. Optimizing this query would waster my time because another could be written tomorrow if I didn't prevent the ability in the first place.

    My boss already rewrote the query to pull one day at a time and it runs in under 10 sec per day. He made the changes to it that day. I kept the query the same for my testing.

    Anyway, good news. I limited the # of processors to 7 out of 8, ran the query again and the OS was responsive. Problem Solved! It was nice because it was the first thing that I tried.

    Thanks to all that provided feedback.


    Live to Throw
    Throw to Live
    Will Summers

  • Depending on your setup, it might not require much in the way of funds. If you have a small group of tables that are queried a lot, you could create a new DB on the same server, and just replicate those tables to the new DB, and let people query of those...  Nothing fancy.

  • Setting up a reporting database on the same server wouldn't help distribute the load. The statement was a SELECT statement, not updating any rows. If I switched back to 8 cpu usage and created a report database on the same server, I would expect that the server to near-lockup like it did before.


    Live to Throw
    Throw to Live
    Will Summers

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply