August 11, 2020 at 5:51 pm
We have a web application job that runs from a web call to update a database on a SQL Server 2017 box. It is an overnight job.
The job has run slower since this weekend: it used to run in 5 hours, and now it takes 9 hours.
But we see RAM and CPU utilization averaging only around 5% on the database server.
Is there a way to adjust the application to use more CPU resources on the database server?
Thanks
August 11, 2020 at 6:12 pm
It depends on what the job is doing. Since memory and CPU don't appear to be the bottleneck (their utilization is low), finding the actual bottleneck and improving on it would be the way to make the job faster. That being said, it could still be a memory bottleneck.
The bottleneck could be disk I/O, network I/O, page file utilization (if a large contiguous memory request was made that won't fit in memory, it may get dumped into the page file instead of system memory), or the job itself.
What I mean by the job itself is that if your query is (for example) processing rows in a loop, that will make the query slow.
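If you want to see where SQL Server itself is waiting, something like this is a quick first look (a sketch; sys.dm_os_wait_stats is cumulative since the last restart, so compare a snapshot taken before the job with one taken after, and the exclusion list below is only a starting point):

```sql
-- Top cumulative waits since the last restart/clear.
-- Filter out a few common benign waits; this list is not exhaustive.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'XE_TIMER_EVENT', N'CHECKPOINT_QUEUE',
                        N'REQUEST_FOR_DEADLOCK_SEARCH', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```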
Now, going from 5 hours to 9 hours, my guess is this is an SSIS package being run on a server that doesn't have enough free memory to perform the operations it needs to. The change that happened over the weekend was probably some added data that put it at the tipping point where the memory requested by the SSIS package exceeds the available memory and thus pages to disk.
Ways you can verify that this is the problem are to reduce the amount of memory being used by the system and re-run the package (which may not be realistic), migrate SSIS to its own server with more memory and see if it is still slow, or rewrite the package to use less memory.
Having a server with 5% RAM and CPU utilized doesn't really say much either. A server with 128 GB of RAM with 5% utilized sounds like it is over-allocated for memory, but if it has 2 GB of memory and 5% is being utilized when that package runs, you are very likely going to need more memory. Same logic applies to the CPU - a 32 core server with 5% CPU is a lot different than a single core server with 5% CPU.
I would check page file utilization first. If it seems to be jumping around the time your package is running, then you either need more memory, need to allocate memory better on the server, or need to rewrite your package to use less memory. If it is unchanged, my guess is that the performance degradation is due to an increase in data volume.
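One way to watch memory from the SQL side while the package runs (Performance Monitor's paging file counters work just as well; this is just a sketch):

```sql
-- Quick memory/page file check from inside SQL Server.
-- Run while the package is executing and compare to a quiet period.
SELECT total_physical_memory_kb / 1024     AS total_physical_mb,
       available_physical_memory_kb / 1024 AS available_physical_mb,
       total_page_file_kb / 1024           AS total_page_file_mb,
       available_page_file_kb / 1024       AS available_page_file_mb,
       system_memory_state_desc
FROM sys.dm_os_sys_memory;
```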
Now, if this isn't an SSIS package, that is a different beast. My approach here would be to grab an estimated execution plan (actual if you can, but waiting 9 hours for it to complete is a bit of a pain), look at where it can be optimized, and look for anything odd, such as a table/view/index where you expect 1 million rows but the plan estimates 1 row. Updating statistics (if they are out of date) and reducing cross-database queries and linked server usage can help. If you can't remove the cross-database and linked server usage, storing those tables in table variables or temp tables can sometimes improve performance.
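For example, the statistics and linked-server suggestions might look something like this (all object and server names here are made up for illustration):

```sql
-- Hypothetical names throughout. Refresh out-of-date statistics:
UPDATE STATISTICS dbo.Person WITH FULLSCAN;

-- Pull remote data into a local temp table once, instead of joining
-- across the linked server inside the main query:
SELECT PersonID, Status
INTO   #RemotePerson
FROM   [RemoteServer].[RemoteDb].dbo.Person;

-- ...then join to #RemotePerson locally in the update/query.
```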
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
August 11, 2020 at 6:53 pm
Thank you.
This is not an SSIS package; it is a web application/batch service that runs from the web server but updates the database on a separate database server.
The database server has 38 GB of RAM in use and 15 GB available.
The CPU is 2 sockets with 12 virtual processors; utilization averages no more than 8%.
August 11, 2020 at 8:29 pm
OK, so what is the web application actually running against the SQL database?
If it is starting a SQL job (for example), the job may need tuning.
Since it is not SSIS, I am guessing it is T-SQL being run to update data on the database?
Just so I understand the process and what is "slow": you go to the web interface and click a button, which starts a JOB or QUERY on the database, and this job/query is taking 9 hours to complete when it used to take 5 hours?
What is this process doing, and how large is the data? You say it "updates the database", but is this a row-based update or a set-based update? How much data is being updated? And how is the update happening? What I mean by the last question is: does it send data back to the web application after each row is updated, before it updates the next row? If so, that is going to be incredibly slow.
My first step would be to remove as many "middle-ware" applications in this process as possible. Run the command directly on the database side (on a test server), get an execution plan (actual and/or estimated, depending on how long it runs for), and go from there.
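For reference, this is roughly how you would capture a plan and timings on the test server (a sketch; paste your actual update statement where indicated):

```sql
-- Get an estimated plan without running the statement:
SET SHOWPLAN_XML ON;
GO
-- <paste the update statement here>
GO
SET SHOWPLAN_XML OFF;
GO

-- Or, for a shorter test run, measure actual I/O and timing:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- <run the statement and read the Messages tab>
```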
Basically, what is being done in the update, and how much data are you updating? Wanting SQL to use more memory and CPU when there is no way to run the query in parallel just won't work. You can't "trick" SQL into using more CPU or memory. Well, you can tell the SQL Server instance to use more memory, but that is unlikely to make your query any faster.
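For completeness, this is how you would raise the instance memory cap (a sketch; 28672 MB is an arbitrary example value, and as I said, it probably won't help here):

```sql
-- Raise SQL Server's "max server memory" setting. The value must
-- leave enough RAM for the OS and anything else on the box.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', 28672;
RECONFIGURE;
```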
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
August 11, 2020 at 9:24 pm
The job is a web service. It is triggered by a command-line batch (.bat) file on the web server; the .bat file calls a web service that is hosted in IIS on the same web server.
Also, the command-line file is called by our job scheduling system - not SQL Agent, but MS Orchestrator.
The web service processes person data one by one on the database server. In the log we can see it processing, e.g., 189 of 50000 persons... then 190 of 50000, 191 of 50000, and the intervals between entries take longer in the long runs than in a usual run.
For example, a usual run:
Average Call in MS: 314
Number over 1 sec: 0
Number over 2 sec: 30
Number over 5 sec: 30
Number over 10 sec: 0
Number over 20 sec: 0
A long run:
Average Call in MS: 779
Number over 1 sec: 0
Number over 2 sec: 15819
Number over 5 sec: 47
Number over 10 sec: 4
Number over 20 sec: 0
Maximum Time - Row: 42222, MS: 14329
August 11, 2020 at 10:29 pm
My guess as to the slowness: you are doing things row by row, you have a large data set to deal with, and you are processing the data on the web server side. The slowness likely isn't due to the SQL side, but this is easy to test - load up Profiler or Extended Events, do a check on a subset of your person data (such as 100 people), and see how long the SQL side takes compared to the web side. I suspect your web-side processing is going to be your bottleneck, especially processing data on 50,000 people.
Depending on the back-end web technology, that MAY be processed on the client machine OR on the server, so you need to be careful of that. If it is processed on the client side, the web browser and web browser version may cause performance differences, and you may have memory pressure on the web browser. If it is server-side processing, it could be too much of a workload for the worker thread, and it may need to do a lot of garbage collection between processing each user.
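A quick, rough way to see how much time SQL Server itself is spending (a sketch; sys.dm_exec_query_stats is cumulative and only covers plans still in cache, so run your 100-person test shortly beforehand):

```sql
-- Compare SQL Server's own elapsed/CPU time for the heaviest statements
-- against the job's total wall-clock time. If SQL's share is small,
-- the time is going somewhere else (web service, network, etc.).
SELECT TOP (20)
    qs.execution_count,
    qs.total_elapsed_time / 1000.0 AS total_elapsed_ms,
    qs.total_worker_time  / 1000.0 AS total_cpu_ms,
    qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_elapsed_ms,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;
```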
Next step for me would be to find the cause of the slowness - SQL, IIS, web server, or client machine.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
August 11, 2020 at 10:41 pm
Thank you for the analysis.
I don't think we use a web browser; it is not something used by a front end. It is a batch service that uses a web service running on the server to call the database and process each person's data.
I will try to reboot the server tonight to see if it helps.
August 11, 2020 at 11:04 pm
You need to look at the SQL calls being made to the database. It's impossible for anyone to diagnose your performance problems when all we know is that you have a database job that's running slowly.
August 12, 2020 at 12:35 pm
First, set up Extended Events to capture the overall performance of the queries on the system. You can use rpc_completed and sql_batch_completed events to capture that information. Personally, in your case, I'd also suggest capturing wait statistics. Further, I'd use track causality to correlate the queries to the waits. With that in hand, identify the pain points and get the execution plans for those queries in order to understand what's going on.
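Something along these lines (a sketch; the session and file names are placeholders, and the wait_info predicate is just one way to cut down the noise):

```sql
-- Capture completed RPCs and batches plus the waits they incur,
-- with causality tracking so waits can be tied back to queries.
CREATE EVENT SESSION [WebJobPerf] ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name)),
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name)),
ADD EVENT sqlos.wait_info (
    ACTION (sqlserver.sql_text)
    WHERE (duration > 0))   -- skip zero-length waits
ADD TARGET package0.event_file (SET filename = N'WebJobPerf')
WITH (TRACK_CAUSALITY = ON);

ALTER EVENT SESSION [WebJobPerf] ON SERVER STATE = START;
```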
That's how we figure this stuff out. Looking only at system metrics doesn't tell you anything about how queries are behaving. To understand that, look at the queries.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 13, 2020 at 12:50 am
After we restarted the server, last night's run failed in the middle of the process, at 2000 of 6000 records, with this error:
Errors : { System.Data.SqlClient.SqlError: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available. }
The server connection was dropped, which is different from the previous night, when it just took much longer to finish.
We did have network maintenance over the past weekend, but our network guy said what they did should not affect this. He said he would be very surprised if there was much, if any, network latency; it's just straightforward forwarding of traffic directly between two hosts. They did migrate the two DNS servers to a physically different set of switches during the weekend outage. However, the path for DNS queries from each server should be relatively the same, and latency measured before and after the migration was nearly identical for DNS responses. The other, far less significant change to the domain was completed yesterday; that was simply the addition of a subnet to the sites.
August 13, 2020 at 10:12 am
Well, your network person's assurances aside, that looks like a network error.
However, capture the query metrics and the wait statistics. Don't guess. Don't hover at the edges, looking at system settings. Dive in and capture the details that will tell you which queries are running slow and what, specifically, they are waiting on. That's going to give you more answers and better direction than anything else.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 13, 2020 at 3:31 pm
One thing I've learned from my years in IT and as a DBA and an application developer (I wear a lot of hats) - performance in SQL doesn't degrade unless something changes somewhere.
The "something" can be any number of things though that other IT members may tell you "this change won't impact you" should ALWAYS be tested. If you have access to the windows event logs, I would be checking those and if that process fails every time you run it, run a ping -t against the remote server (the SQL server) from the web server and dump the results to file. Let that run while your job is running and check if the pings time out while the job is running. If so, someone somewhere broke something in the network stack.
Since the DNS servers got moved, you may also benefit from dumping your ARP cache and your DNS cache on the web server (arp -d * and ipconfig /flushdns from an elevated prompt).
But back to the suddenly slow job: I would use Extended Events (as both Grant Fritchey and I suggested) and check how long the SQL side is actually running. My expectation is that the slowness isn't caused by SQL but by something running on the web server side. BUT it is always best to test and prove theories in technology rather than just pointing fingers.
My opinion: since things were running in 5 hours before the IT team did network maintenance and performance suddenly dropped to 9 hours, I would not be surprised in the slightest if it was related to the maintenance. I've seen a faulty CAT6 cable cause a drastic performance drop while all network tests from the desktop side said things were working perfectly. We took a network cable tester down and found that one of the CAT6 cables, which SHOULD be able to do 1 Gbps, was running at 10 Mbps. We swapped out the cable for a fresh one and performance was great. The network latency tests had come back saying things were quick, but those tests were not pushing a lot of data, and the difference in how long it takes to push a few KB across a 10 Mbps connection versus a 1 Gbps connection is hardly noticeable.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
August 13, 2020 at 5:06 pm
Just an update: last night's run was successful, and the duration was normal too.
We changed two things. First, as the network tech suggested, we used the FQDN instead of the server name for the application call (we know the last failure was the "network name is no longer available" error when it had processed 18000 of 60000 records), and we also restarted the servers. Second, we excluded some folders from antivirus scanning on the web server that hosts the .exe web service. We will continue to monitor the runs for another 3 days to see if it is consistent. We also have the DPA tool and will add the server to it to monitor SQL wait statistics and other metrics.
Thanks
August 14, 2020 at 4:45 pm
I'll never understand why people do this to themselves. You're talking about a batch job, but you're processing it using RBAR ("Row By Agonizing Row") on steroids that's highly dependent on successful network traffic, and it's taking hours to run - and yet you good folks still think that's OK because you got it running.
If someone spent a little time on this, it could probably be done in a stored procedure or two, which would eliminate the dependency on any network traffic and would take time measured in minutes or seconds rather than hours.
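As a purely hypothetical sketch (every table and column name here is invented), a single set-based UPDATE against a staging table replaces tens of thousands of round trips:

```sql
-- One set-based UPDATE joined to a staging table does in a single
-- statement what the RBAR loop does in 50,000 network round trips.
UPDATE p
SET    p.Status       = s.Status,
       p.LastModified = SYSDATETIME()
FROM   dbo.Person AS p
JOIN   dbo.PersonStaging AS s
       ON s.PersonID = p.PersonID;
```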
My recommendation would be to consider changing it.
--Jeff Moden
Change is inevitable... Change for the better is not.