Limiting impact of poorly designed queries

  • Let me play devil's advocate here - it's a role I enjoy :w00t:

    Whilst I wholeheartedly agree with Gail, and with the concept that users should never be allowed to run whatever they want on a database, the reality is that this "ideal" cannot always be achieved.

    I'm a consultant and I've worked for large and small companies. The place I'm in at the moment is a tiny development shop where the application itself is buggy (ok, it's very buggy). The net result is that support staff often have to query certain tables because the code ends up putting the wrong data into them. So there's a need to query those tables to find where it's gone wrong - and there's even a need to edit the data to put it right! (Shock - Horror!)

    The reality that many small companies live in is that they cannot just make their applications bug-free overnight. That would involve months or years of change in the way that they do everything, from design to testing to release.

    oops - I hit the post button accidentally......

    Anyway, so I can advise the client on best practices. I can tell them that they should do this and they shouldn't do that. And I can argue the merits of it till I'm blue in the face. But when the IT director tells me that he's happy to let his staff access the live database, then I have to deal with that reality, and my job is to reduce the impact.

    Sure, I could leave because they don't listen. Or I could document my advice and then set about helping them within the restrictions that they have placed on me.

    It's not ideal but nothing really is if you look closely. Ideal is me sitting on a warm beach with several million in the bank and not having to worry about anything :cool:

  • That is where the correct solution comes into play. At that company, you have a pretty good argument to have a near real-time copy of the database available for querying.

    You could also probably make a pretty good argument that if they spent the time to configure proper testing environments and then actually used them, they would have far fewer bugs to be dealing with during production support.

  • Michael Earl (4/18/2008)


    That is where the correct solution comes into play. At that company, you have a pretty good argument to have a near real-time copy of the database available for querying.

    You could also probably make a pretty good argument that if they spent the time to configure proper testing environments and then actually used them, they would have far fewer bugs to be dealing with during production support.

    Absolutely. Which is what I'd opt for first. A day-old copy of the database is always available (and suitable more often than not), and a log-shipped database is available for queries that require near real-time access. And with mirroring and snapshots it's possible to get even nearer to real-time (a rough sketch of the snapshot option is at the end of this post).

    Problem is, what happens when the IT director says, "sorry, you're not getting another server"? I've not been there yet but I'm sure it happens.

    As for "proper testing". Been there! Maybe I'm not a good communicator or I don't have the right persuasive skills. You'd think it was common sense but I guess many companies are run by people that don't have any. In the end, sales is finding new work and throwing it in the direction of the developers quicker than they can blink - the app is bug-ridden but the company is still making money so no one thinks it's a big problem. Until one day it all comes crashing down.

  • I'll buy that, if you have a transactional production system, making a clone for querying might be a useful practice. But the problem has just moved, not been solved. If lots of users are using the clone, I still do not want a rogue query disrupting all the other users of the clone system.

    Here are my biases:

    1. Rogue queries are pretty rare. It takes some work to bring a server to its knees, in spite of the fears of some paranoid management types who prophesy that the db will be brought to its knees by some user doing an "outer join". (Honest, that is a quote!)

    2. When the data access is limited to some reporting software that tries to treat everything as a large flat file, or relies entirely on canned queries (and no query specification is allowed), much of the important information contained in the data is lost.

  • I would think that you can easily put together a job that queries sys.dm_exec_sessions and sys.dm_exec_requests to find any tasks run by a subset of users that are using excessive resources (using whatever criteria you want to guard against), and then kill the offending sessions - see the sketch below. Have it email you and the user to say why it was killed.

    It won't stop the damage, but it should limit it.
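
    A minimal sketch of that kind of check, assuming a hypothetical dbo.ApprovedUsers table and a five-minute threshold (neither is specified in the post above):

        -- Kill user requests that have run longer than 5 minutes and whose
        -- login is not in the (hypothetical) dbo.ApprovedUsers table.
        DECLARE @spid INT, @cmd NVARCHAR(50);

        DECLARE offenders CURSOR LOCAL FAST_FORWARD FOR
            SELECT r.session_id
            FROM sys.dm_exec_requests AS r
            JOIN sys.dm_exec_sessions AS s ON s.session_id = r.session_id
            WHERE s.is_user_process = 1
              AND r.session_id <> @@SPID
              AND s.login_name NOT IN (SELECT login_name FROM dbo.ApprovedUsers)
              AND r.total_elapsed_time > 5 * 60 * 1000;   -- elapsed time is in milliseconds

        OPEN offenders;
        FETCH NEXT FROM offenders INTO @spid;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            SET @cmd = N'KILL ' + CAST(@spid AS NVARCHAR(10));
            EXEC (@cmd);   -- KILL won't take a variable directly, hence dynamic SQL
            -- msdb.dbo.sp_send_dbmail could be called here to notify the DBA team and the user
            FETCH NEXT FROM offenders INTO @spid;
        END

        CLOSE offenders;
        DEALLOCATE offenders;

    Scheduled as an Agent job every minute or so, it won't prevent the hit, but it caps how long a runaway query can hold on.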

  • Thanks very much Matt. That sounds worth a try. And detecting and limiting the damage is all I need.

  • Jim Russell (4/18/2008)


    "...but what happens when you do need to run a query that's going to take all night to compile its data?"

    Then you get your favorite high-priced DBA (if you have one) to bless the query, or the stored procedure you put the query in, and increase (or remove) the resource limits for that user/procedure/query, after determining that there is no more efficient way to get the data you require.

    "I haven't done it yet, but that doesn't mean it won't ever happen."

    That is exactly the point, all of us (users, developers, programmers, DBAs) are fallible and without resource limits the only two options are:

    1. Hope to hell nobody screws up.

    2. Keep everybody out of the data -- in which case you might as well make your final backup to a 2-by-4, 'cause the data is not going to serve any purpose for anyone!

    So, basically, you think the only two options are to set a throttle on every query and have a DBA turn it off on a query-by-query basis, or to have a useless database?

    I don't understand that.

    The whole purpose of ad hoc user queries is that they don't have to have a DBA review or write them. The users handle their own needs. If that's not the case, why give them any ability to write ad hoc queries at all?

    And I most certainly didn't say you need to totally lock out all access to the database. I said I don't like having anyone, myself included, run ad hoc queries in the production OLTP database. There's an infinity of difference between those two concepts.

    With tested, proven stored procedures, for example, you don't have to worry about someone accidentally setting up a Cartesian join, or a delete command without a Where clause on it, or whatever. The proc, if properly written and tested, can only do exactly what it is designed for. Mess up the call to it, and you almost certainly won't destroy the database. You can't "forget to put a Where" on a call to a properly written, tested proc. Simple as that.

    That's why I say that ad hoc queries should be on a separate server from the main database. That gives users the ability to get what they need, assuming they know enough to do it, without having to come to the DBA for all of their data mining needs. It also means that, if they accidentally or intentionally run a query that will take all night to process, and will use every CPU cycle available in doing so, your various e-commerce, OLTP, etc., sites and applications aren't brought down by it. The half-hour it takes for them to call your cell phone (because you're out at lunch), and for you to get to a computer that can connect to the server, log in, and kill the SPID of the accidental query, doesn't kill the whole office; it just delays a few ad hoc queries for a few users.

    Your "only two options" aren't looking at the thing adequately.

    And, if you have limitations because of budget, you really ought to look into setting this separate copy of the databases up on a virtual server. You can throttle those pretty effectively, in terms of CPU, RAM and HDD use, without having to do it on a query-by-query basis. And virtual server software plus a copy of MS Small Business Server isn't all that expensive.

    It's all about risk vs reward in this case. You want to minimize the risk and maximize the reward, and "hoping to hell nobody ever messes up" is NOT an acceptable solution, nor is "lock everyone out of the database and don't let anyone ever use it". A separate copy of the database is far too easy to set up, at low expense, if you have people who want to run ad hoc queries (which can be very beneficial to the company if the data mining thus enabled is done well). Very low risk, possibility for very high reward, low cost to set up and to maintain.

    Why is that a bad idea?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • why not do the following:

    write some code to check the duration of a query, and if the user isn't in a certain approved group, that spid is killed. The duration threshold would be a variable stored in a table or something.

    select * from whatever where userid not in (select name from approveUsers)

    And before you kill the query, email that query to your group of DBAs and to the user, to let them know why it has been killed (roughly as sketched below).
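
    A rough sketch of that capture-and-notify step, combined with the kill itself. The Database Mail profile, addresses and session id are placeholders, not anything from the post:

        -- Grab the offending statement text before killing the session,
        -- then mail it to the DBA group and the user.
        DECLARE @spid INT, @query NVARCHAR(MAX), @cmd NVARCHAR(50);
        SET @spid = 57;   -- session id picked up by the duration check

        SELECT @query = t.text
        FROM sys.dm_exec_requests AS r
        CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
        WHERE r.session_id = @spid;

        EXEC msdb.dbo.sp_send_dbmail
             @profile_name = N'DBA Alerts',
             @recipients   = N'dba-team@example.com;offending-user@example.com',
             @subject      = N'Long-running query killed',
             @body         = @query;

        SET @cmd = N'KILL ' + CAST(@spid AS NVARCHAR(10));
        EXEC (@cmd);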

    -----------------------------
    www.cbtr.net
    .: SQL Backup Admin Tool :.

  • Oberion (4/18/2008)


    why not do the following:

    write some code to check the duration of a query, and if the user isn't in a certain approved group, that spid is killed. The duration threshold would be a variable stored in a table or something.

    select * from whatever where userid not in (select name from approveUsers)

    And before you kill the query, email that query to your group of DBAs and to the user, to let them know why it has been killed.

    Again, why bother going through all that, and taking the risk of killing a valid query, perhaps even a necessary one?

    Ad hoc queries are (almost) always OLAP in nature. At least, they should be. Why run OLAP queries of whatever sort in your production OLTP database?

    Even for a valid user who's allowed to run long queries, one your SPID-gun proc doesn't kill, what happens when his long-running analysis query, designed for business-critical data mining, gets messed up by row locks taken by OLTP queries in the same database? Page splits, etc.

    The way around that is either snapshot isolation (which essentially keeps extra row versions in tempdb), or another copy of the database set up for that kind of query. With no CUD (CRUD without the "Retrieve"), no exclusive locks, no page splits, etc., just shared select locks. Thus, valid data will be generated. A short sketch of enabling snapshot isolation follows below.
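
    A minimal sketch of turning that on, assuming a database named Sales (the name is a placeholder):

        -- Let readers use row versions kept in tempdb instead of blocking on writers' locks.
        ALTER DATABASE Sales SET ALLOW_SNAPSHOT_ISOLATION ON;

        -- Optionally make plain READ COMMITTED reads use row versioning as well
        -- (this ALTER needs the database to be free of other connections to complete).
        ALTER DATABASE Sales SET READ_COMMITTED_SNAPSHOT ON;

        -- A session that wants full snapshot semantics asks for them explicitly:
        SET TRANSACTION ISOLATION LEVEL SNAPSHOT;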

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • it is a viable solution and it doesn't take a lot of work. If the users aren't in the approval list, then when their long-running query hits or exceeds the duration threshold, it gets killed. The approval list is managed via a table, making it very easy to manage whose query stays and whose goes.

    Sometimes there is justification to run queries against the prod db, and babysitting dev isn't fun; getting a specification from them and writing the code for them doesn't sound all that exciting either. You might as well switch over from Prod DBA to Dev.

    -----------------------------
    www.cbtr.net
    .: SQL Backup Admin Tool :.

  • Sorry - but I have to jump in here.

    What happens to your OLTP web-based application when a user submits a resource-intensive ad-hoc query that blocks access to one of your system's most active tables for five minutes?

    Don't know about you, but I do not want Senior VPs in my cube yelling at me about how slow their business-critical system is - and then having to tell that Senior VP that the reason the system is slow is that the admin ran a query for them.

    If you don't think that can happen - we have a table (~6 million rows now) that every user of the application accesses. We had a programmer testing a query (yeah, in live) that would have returned more than one million rows from that table. Because of the shared read locks on the table, none of the users' inserts or updates would complete in less than 30 seconds.

    30 seconds is a lifetime on a web-based application - and you bet, that programmer does not write any ad-hoc test queries against live anymore.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • that's a cop-out excuse. Why are you even trying to defend yourself to a manager or a VP as to why a system is crawling on its knees?

    Why should (s)he be mad at you because some loser wrote some bad code? I am having this problem right now with two costly queries that will bring SQL to its knees when fired in rapid succession. The dev team said it went through dev, test, and stage without any problem. Well, they forgot that those environments don't involve a crapload of users.

    And no one is yelling at me. I gave them the information they need. The business goes and pressures the development team for a fix, not me.

    The only thing they have asked of me so far is whether there's a way to prevent this from happening. I told them, "No, the dev team needs to fix the code, because it would cost a lot of man-hours to roll back to the previous release."

    My two tables are well over 10 million rows. Try running a join against those two tables with a SELECT * FROM statement. Stupid *** developers. They even tried to turn a blind eye to it and say it is my problem. I gave the middle finger and shoved it back over to them with the backing of my manager, their manager, the business, the PM, and the IT director. Queries ain't my problem to fix, but I will advise. Messed-up data ain't my problem to fix, since I don't interact with the business or know what it should look like.

    Lastly, it ain't my job to babysit developers or review all their code. I have other projects and bigger fish to fry.

    -----------------------------
    www.cbtr.net
    .: SQL Backup Admin Tool :.

  • And yet - you want to give end users the ability to write ad-hoc queries? Even when you know you have a problem with developers (who should know better) writing bad queries against your production database?

    And yes - when the Senior VP and CIO have the CEO and CFO crawling all over them, they come to me to find out what is going on. If I tell them that users submitted ad-hoc queries and caused the problem, the first thing I will hear is why they can do that; the second question is how do I stop it.

    Simple answer, don't let them have that level of access to the OLTP system. Since I do not allow end users to have that access, I will never have them in my office complaining because a user caused a problem.

    Sorry - I think it is a whole lot easier to prevent the situation in the first place than to try to identify long-running queries by users who are on some 'approved' list.

    Just my personal preference...

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Oberion (4/18/2008)


    it is a viable solution and it doesn't take a lot of work. If the users aren't in the approval list, then when their long-running query hits or exceeds the duration threshold, it gets killed. The approval list is managed via a table, making it very easy to manage whose query stays and whose goes.

    Sometimes there is justification to run queries against the prod db, and babysitting dev isn't fun; getting a specification from them and writing the code for them doesn't sound all that exciting either. You might as well switch over from Prod DBA to Dev.

    First, if you want "fun" and "exciting", I highly recommend NOT being a DBA. For a DBA, like an air traffic controller, things are only exciting when a disaster is either about to happen, or it's already too late to prevent it.

    Second, again, why let ANYONE run long-running queries in your production OLTP database? I don't care who they are, why let them create locks like that? Especially when there is a simple, easy-to-implement, non-harmful solution that avoids all of those problems.

    Let's say you have a list of people who are approved to execute long-running queries in your production database. One of them is the marketing manager, who often has to do data mining in the database, in order to analyze customer-sale patterns. By doing so, he is often able to fine-tune the company's marketing efforts in such a way as to increase effectiveness by 10% (which would be phenomenal in most any company).

    So, he sets up a query to run. He's on the approved list. dbo.Customers, dbo.Orders, dbo.OrdersDetails, dbo.Invoices, dbo.CustomersDemographics, and dbo.Employees tables all end up with various row, page and table locks, for the next 2 hours. His query runs slower because it's fighting for resources, and all of your employees and customers are kicked out of the web pages and applications that they normally use for most of those 2 hours.

    What do you think will happen when the CIO asks your boss why this happened, and your boss tells him that the marketing manager is one of the people who is approved, essentially, to sabotage the database? And what do you think will happen when the general manager goes online, asks a DBA who knows his business what to do about this kind of thing, and finds out that there is a cheap, effective, easy way to prevent this kind of problem, while still giving the marketing manager his data mining capabilities? Then he asks you why you don't have it implemented already, and you say, "it isn't fun and exciting"?

    Sorry to go off on this one this way, but the primary duty of a DBA is to make sure accurate data can get in and out of the database in the most efficient way possible, so as to enhance the productivity of the users of that data. Separating OLTP and OLAP is a standard fundamental of achieving that goal. If you can't follow simple standards like that, then maybe the quest for fun and excitement should lead to a career in something fun and exciting, but not database administration.

    (Motto as a DBA: I get well-paid for being as bored as possible. I know that sucks, but that's what having a life is for, outside of work.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Oberion (4/18/2008)


    Lastly, it ain't my job to babysit developers or review all their code. I have other projects and bigger fish to fry.

    Sorry, but I cannot agree with that.

    The DBA is the custodian of the database. Proper code reviews and proper mentoring/training/assistance for the developers will prevent 90% of the potential problems.

    A quick code review would have picked up that SELECT * FROM two 10-million-row tables and could have ensured that the problem never happened in the first place.

    Since SQL code reviews were started at my current company, the incidence of unexplained poor performance, post-deployment issues and the like has dropped virtually to zero.

    Sure, code reviews take from my 'other projects and bigger fish' time, but so does time spent investigating sudden, unexpected problems, regardless of whether I have to fix it or it's the dev's problem.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
