Find the problematic Sql Query Caused Application Crash

  • Hi,

    Is there anyway I can find out the sql script ran on a particular date and which might have causes application down? I mean my application was down on 12 April 2024 and based on investigation it was found some sql query running causing the application down.

    How I can find out what all sql queries ran on 12 April and it has high CPU utilisation and who ran the sql query etc?

    Also how I can prevent it by monitoring such threashold and high cost sql queries and alert us in advance?

  • I am 99.9999999999% sure that the problem is not on the SQL side - that is an application bug. SQL takes a query and provides results. There is literally no way that could cause an app to crash UNLESS the app is doing something stupid. What I mean by "something stupid", I mean like the app is expecting an INT back and the stored procedure is returning a string. Or the app is expecting 2 tables to come back and only 1 table comes back. Or the app is expecting a table with 10 columns to come back and the query is giving them less than 10. Or something similar - the app gets back unexpected results and doesn't have good error handling in it so it just crashes. BUT ALL those things should be handled in the app side and not cause a crash but instead log an error and report the error to the end user. If this is a service, then it should log the error, send out an alert about the error, and then depending on the app and the error, it should either retry, ignore, or stop. At least for any well written programs in my opinion. I've written windows services before and the ones I wrote are specific to my company, so they log things in the windows event log and rely on IT to monitor the event log (they have tools to do this so it was a good business decision). Desktop apps I try to put error handling in wherever I can, even if I am certain some use cases will not be hit. Cautious coding adds extra lines of code and makes the app slightly larger, but it ensures that in the future when things change, my app is prepared for some of that. One example is I have an app that does stuff with serial numbers that it pulls from a database along with other values. For the serial number, I pull it from the database and store it in a string variable and then tryparse it to an int. Why you ask, because in the future, the business MAY decide to use alphanumeric for the serial number. And guess what happened - we started using alphanumeric for serial numbers. So rather than the app crashing, it told users that the serial number was not numeric and asked them to verify the serial. If they were certain the serial number was correct, they should put in a support request to get the app updated. If I just blindly assumed serial numbers would ALWAYS be numeric and then someone scanned in a non-numeric serial, the app would have crashed.

    Defensive coding to handle the user error (scan the wrong things in) or process changes (as described above) is a good coding habit because some day, someone will change the process which will break your app OR they will fat finger something wrong and that shouldn't cause an app to crash.

    The EASIEST way to tell what caused an app to crash is to reproduce the problem in the debugger for the app. IF you don't have access to the debugger (ie no source to the app), then I'd raise a bug/support request with the app developer/vendor to get it fixed. IF it is a Java based app, there is a chance (a pretty good chance based on my experience) that it ran out of memory for the JVM or got stuck in a GC loop that crashed the app. If it is .NET AND the error is not handled, then you should have got a stack trace with the crash so you could narrow down the exact function that caused the problem and from there should be able to narrow down what query hit the database from that.

    If the app is created by a 3rd party, they SHOULD have logs you can review to see what went wrong. If they don't capture logs, or you have logging disabled, or your logs are not helpful, you should enable logging or get the app developer to provide logs that are meaningful.

    BUT going back to the original question - if you want to see what is or was being run on a certain date, some SQL monitoring tools will be your friend (RedGate has one we use that is pretty nice at that IF you configure it well). It'll also tell you CPU utilization and such. Alternately, you can set up an extended events session to capture what is running and when and by who and then use WMI to monitor CPU utilization. That being said, CPU hitting 100% on a SQL instance isn't usually a problem. High CPU on the SQL side isn't going to cause an app to crash UNLESS the app has a bug. And knowing "who" ran the query should be easy to tell as the person who ran the query is very likely also the person who is reporting the app crashed.

    Again though, app crashes are NOT DBA issues. Those are developer issues. If the developer blames the database, the developer should be able to provide you with more information than "the database caused the error".

    The ONLY scenario I can think of for the database being at fault for the app crashing would be IF the app and database lost communication for a period of time (login timeout, query timeout, instance not found) and the app didn't handle that scenario properly. BUT, I would argue that this is an app issue as the app should not crash but should retry and/or notify user. Plus, query timeout is not a SQL configured thing - that is configured application side. BUT login timeout or instance not found could be blamed on the database, but is often more network related in my experience.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Have you checked the SQL Server logs?

    Screenshot 2024-04-17 173015

  • Actually Application is stable and live since many years and such kind of issue happened only few times. I am still trying to find anyway I can see the what sql query took high CPU usage which could have caused application performance down.

  • If you want to find the query you could run something like this soon after the problem occurs.

    DECLARE @Database sysname = 'Test'-- Specify your database

    SELECT DB_NAME(qt.dbid) AS DBName,
    OBJECT_SCHEMA_NAME(qt.objectid, qt.dbid) AS SchemaName,
    OBJECT_NAME(qt.objectid, qt.dbid) AS ObjectName,
    qs.execution_count AS ExecutionCount,
    qs.total_worker_time AS TotalCPUTime,
    qs.total_worker_time / qs.execution_count AS AvgCPUTime,
    qs.total_elapsed_time AS TotalElapsedTime,
    qs.total_logical_reads AS TotalLogicalReads,
    qs.total_physical_reads AS TotalPhysicalReads,
    qs.total_logical_writes AS TotalLogicalWrites,
    SUBSTRING(qt.text, qs.statement_start_offset/2 + 1,
    (CASE WHEN qs.statement_end_offset = -1
    THEN LEN(CONVERT(nvarchar(max), qt.text)) * 2
    ELSE qs.statement_end_offset
    END - qs.statement_start_offset)/2) AS QueryText
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
    WHERE DB_NAME(qt.dbid) IS NOT NULL
    AND DB_NAME(qt.dbid) LIKE @Database
    ORDER BY TotalCPUTime DESC
    ;
  • Quick question, is the application's query interface a predefined API or dynamically constructed SQL?

    😎

    Capturing the queries on the server comes with a considerable overhead, I suggest you do an impact analysis before implementing any such mechanism in a production environment. An alternative would be a protocol-filtered packet capture on the network level, have implemented that in the past by duplicating network streams on the routing level.

  • RCRock wrote:

    Actually Application is stable and live since many years and such kind of issue happened only few times. I am still trying to find anyway I can see the what sql query took high CPU usage which could have caused application performance down.

    I am not trying to say the app is bad. I have apps I've written or helped write that run fine most of the time. But if an app crashes, I've never had high CPU on the SQL instance as the culprit. High CPU usually means that SQL is slower to return results. Similar to if you have a lot of blocking. The application likely needs to be modified to handle the query timeout condition. Query timeout is something configured at the application level, not at the database level. Alternately, the application could retry on query timeout.

    Best case scenario, applications should handle all cases where it could crash. This isn't always realistic and some scenarios are non-recoverable and the app must close/crash. But defensive coding is a good habit to get into and one I need to do better at. I spent longer than I care to admit recently trying to fix a bug in an app that would cause it to crash. It was due to a 3rd party open source module I was using I THOUGHT, but after further investigation it turned out that the bug was in how I was using the module. The bug was one of those bugs that happened in rare cases, but it was reproduceable. It is now fixed though, but it took a while to figure out where the bug was and to correct it.

    My opinion, and from my experience, if an application crashes, then things need to be fixed in the application layer. Even if that just means that the app catches the error and sends out an error message to the end user and provides a link to the stack trace (copied to clipboard and/or dumped to disk are the approaches I usually take) which can be handed over to the developer for review and remedy.

    My comments were not meant to be a jab at your application developers. ALL applications have bugs. It just depends on how you handle them.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • yes I understand it and thanks everyone for providing the answers.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply