December 3, 2017 at 2:58 am
Hello,
After updating our company's SQL Server 2016 SP1 instances to CU6 last week, the service has been crashing constantly with stack dumps. We have 6 VMs running on Azure, each with 4 cores and 28 GB of memory, and each runs about 200 databases (total size about 200 GB). In the first hours after a start I can get 3-4 crashes on a server, but if a server survives those first hours, it then runs much more stably.
On Monday, after the weekend update to CU6, I counted 20 crashes across these systems.
Sometimes I see an access violation error, but not always. It looks to me like a memory-handling problem that gets better after some days of running (maybe the server is adjusting its policy to the load?).
I have set min server memory to 0 and max server memory to 24 GB, and enabled Lock Pages in Memory (LPIM). After these crashes I disabled LPIM on 3 of the servers, after restarting.
Any ideas will be appreciated.
Thanks
December 3, 2017 at 3:43 am
What are the errors in the error log and what do you see in the default trace? Any errors in the Windows Event Log?
😎
December 3, 2017 at 4:21 am
Hello Eirikur,
There are no errors in the Windows Event Log. Below are the errors from 14 stack dumps, all from just ONE server:
* BEGIN STACK DUMP:
* 11/26/17 17:19:39 spid 5424
* Exception Address = 00007FFDC626903C Module(sqllang+000000000003903C)
* Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address FFFFFFFFFFFFFFFF
* BEGIN STACK DUMP:
* 11/26/17 17:20:43 spid 6156
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/26/17 21:01:10 spid 12148
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/26/17 22:14:34 spid 11072
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/27/17 11:10:04 spid 8924
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/27/17 11:21:28 spid 12000
* Exception Address = 00007FFEEA42903C Module(sqllang+000000000003903C)
* Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address FFFFFFFFFFFFFFFF
* BEGIN STACK DUMP:
* 11/27/17 11:22:44 spid 4296
* Non-yielding Scheduler
* BEGIN STACK DUMP:
* 11/27/17 12:04:12 spid 5044
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/27/17 12:29:21 spid 5100
* Exception Address = 00007FFABC01903C Module(sqllang+000000000003903C)
* Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address FFFFFFFFFFFFFFFF
* BEGIN STACK DUMP:
* 11/27/17 12:29:22 spid 5728
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/27/17 13:17:48 spid 1896
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/27/17 13:21:21 spid 6464
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/27/17 13:28:01 spid 936
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/27/17 14:03:10 spid 1064
* Non-yielding Scheduler
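A list like the one above is easier to reason about once the dumps are grouped by type. The following is only a minimal sketch, assuming the dump headers have been copied into a text string in exactly the layout shown; the function names and the three category labels are made up for illustration, and `SAMPLE` is a shortened excerpt of the list above.

```python
import re
from collections import Counter

# Shortened excerpt in the same layout as the dump headers quoted above.
SAMPLE = """\
* BEGIN STACK DUMP:
* 11/26/17 17:19:39 spid 5424
* Exception Address = 00007FFDC626903C Module(sqllang+000000000003903C)
* Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address FFFFFFFFFFFFFFFF
* BEGIN STACK DUMP:
* 11/26/17 17:20:43 spid 6156
* ex_terminator - Last chance exception handling
* BEGIN STACK DUMP:
* 11/27/17 11:22:44 spid 4296
* Non-yielding Scheduler
"""

# Matches the "date time spid N" line that follows each BEGIN STACK DUMP.
HEADER = re.compile(r"^\* (\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) spid (\d+)$")


def classify(lines):
    """Map the detail lines of one dump to a coarse category (labels are illustrative)."""
    text = " ".join(lines)
    if "EXCEPTION_ACCESS_VIOLATION" in text:
        return "access violation"
    if "ex_terminator" in text:
        return "ex_terminator"
    if "Non-yielding Scheduler" in text:
        return "non-yielding scheduler"
    return "other"


def parse_dumps(text):
    """Split the text into one record per BEGIN STACK DUMP marker."""
    dumps = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "* BEGIN STACK DUMP:":
            current = {"when": None, "spid": None, "lines": []}
            dumps.append(current)
        elif current is not None:
            match = HEADER.match(line)
            if match and current["when"] is None:
                current["when"] = match.group(1)
                current["spid"] = int(match.group(2))
            else:
                current["lines"].append(line)
    for dump in dumps:
        dump["kind"] = classify(dump["lines"])
    return dumps


def tally(text):
    """Count dumps per category."""
    return Counter(dump["kind"] for dump in parse_dumps(text))
```

Counting the full list above by hand gives 3 access violations, 9 ex_terminator dumps and 2 non-yielding scheduler dumps, and all three access violations are at the same offset, sqllang+3903C.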
This is typical memory info from the dumps:
MemoryLoad = 84%
Total Physical = 28671 MB
Available Physical = 4586 MB
Total Page File = 33023 MB
Available Page File = 8514 MB
Total Virtual = 134217727 MB
Available Virtual = 134174939 MB
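As a quick sanity check on these figures, the arithmetic below reproduces the reported load from the physical memory numbers. It is a sketch that assumes MemoryLoad is the share of physical memory in use and that the 24 GB max server memory setting mentioned earlier applies to this server; the function and variable names are illustrative only.

```python
def memory_load_percent(total_mb, available_mb):
    """Share of physical memory in use, as a percentage."""
    return 100.0 * (total_mb - available_mb) / total_mb


# Values copied from the memory info above.
TOTAL_PHYSICAL_MB = 28671
AVAILABLE_PHYSICAL_MB = 4586
TOTAL_PAGE_FILE_MB = 33023
AVAILABLE_PAGE_FILE_MB = 8514

# Assumption: max server memory is 24 GB on this server, as stated in the first post.
MAX_SERVER_MEMORY_MB = 24 * 1024

used_physical_mb = TOTAL_PHYSICAL_MB - AVAILABLE_PHYSICAL_MB      # 24085 MB in use
load = memory_load_percent(TOTAL_PHYSICAL_MB, AVAILABLE_PHYSICAL_MB)
committed_mb = TOTAL_PAGE_FILE_MB - AVAILABLE_PAGE_FILE_MB        # 24509 MB committed
headroom_mb = TOTAL_PHYSICAL_MB - MAX_SERVER_MEMORY_MB            # 4095 MB outside the cap

print(f"physical in use: {used_physical_mb} MB ({load:.1f}%)")
print(f"commit charge:   {committed_mb} MB")
print(f"left outside max server memory: {headroom_mb} MB")
```

Under those assumptions the numbers are consistent with each other: about 24 GB of physical memory is in use, which matches the configured cap, and roughly 4 GB remains for the operating system and anything SQL Server allocates outside that cap.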
After this episode (14 crashes in one day), the server has been running continuously for 5 days.
One other thing changed in the last 3 days: the server used to be exposed to the internet (it has to be accessed by our customers' utilities), and since 3 days ago it has been behind a firewall.
Thanks.
December 3, 2017 at 4:39 am
What other services apart from SQL Server Service are running on the servers?
😎
Given this: Access Violation occurred reading address FFFFFFFFFFFFFFFF
this looks like either a bug or a hacking attempt. Have you checked the incoming connections prior to the crashes? I suggest you persist sys.dm_exec_connections and sys.dm_exec_sessions for post-crash analysis.
December 3, 2017 at 5:00 am
How can I check for a hacking attempt?
I know for sure that during the period with the crashes the server was known to hackers, and there were login attempts (from China, South America, Russia, etc.) using sa and a wrong password. I wrote a service that blocks the IP in the firewall as soon as it sees a failed login attempt logged. But could there be a hacking attempt without a logged login attempt?
Anyway, I now permit logins only from IPs in my country, and I no longer see failed password attempts.
Thanks
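The blocking service mentioned above is not shown in the thread. As a rough illustration of its log-scanning half only, here is a minimal sketch that assumes failed logins appear in the error log as "Login failed for user ..." messages ending in a `[CLIENT: address]` suffix. The sample lines use made-up documentation addresses, the function names and the threshold are hypothetical, and the actual firewall call is deliberately left out.

```python
import re
from collections import Counter

# Made-up sample lines; the addresses are documentation-range placeholders.
SAMPLE_LOG = """\
2017-11-26 17:01:02.33 Logon       Error: 18456, Severity: 14, State: 8.
2017-11-26 17:01:02.33 Logon       Login failed for user 'sa'. Reason: Password did not match that for the login provided. [CLIENT: 203.0.113.7]
2017-11-26 17:01:05.10 Logon       Login failed for user 'sa'. Reason: Password did not match that for the login provided. [CLIENT: 203.0.113.7]
2017-11-26 17:02:11.48 Logon       Login failed for user 'admin'. Reason: Could not find a login matching the name provided. [CLIENT: 198.51.100.23]
2017-11-26 17:03:00.00 spid51      Starting up database 'example'.
"""

LOGIN_FAILED = re.compile(
    r"Login failed for user '(?P<user>[^']*)'.*\[CLIENT: (?P<ip>[^\]]+)\]"
)


def failed_logins_by_ip(log_text):
    """Count failed login messages per client address."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOGIN_FAILED.search(line)
        if match:
            counts[match.group("ip").strip()] += 1
    return counts


def ips_to_block(counts, threshold=1):
    """Addresses with at least `threshold` failures (threshold is an arbitrary choice)."""
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

A scan like this only sees attempts that reach the login stage and get logged, so it cannot rule out traffic that fails earlier, which is the gap the question above is asking about.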
December 3, 2017 at 6:36 am
What I normally do is persist sys.dm_exec_connections and sys.dm_exec_sessions by writing the deltas of those into permanent tables, correlate them with any available network logs and, on top of that, use Wireshark or something similar to gather packet-level information. My favorite is to introduce a Linux box connected to a full dump port on the closest managed switch; it normally catches everything in any direction.
😎
IPs can be forged, so filtering by IP does not give full security unless one is using SSL etc. Are any of the connections unencrypted?
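The delta idea described above can be sketched without a database: take two snapshots of the connection list and keep only what appeared or disappeared between them. This is an illustration under stated assumptions, not the poster's actual implementation. The rows are plain dictionaries with made-up values, the column names mirror sys.dm_exec_connections, and the code that would read the DMV and write to a permanent table is omitted.

```python
def snapshot_delta(previous, current):
    """Return (opened, closed) connection rows between two snapshots.

    Each snapshot is a dict keyed by connection_id.
    """
    opened_ids = current.keys() - previous.keys()
    closed_ids = previous.keys() - current.keys()
    opened = [current[cid] for cid in sorted(opened_ids)]
    closed = [previous[cid] for cid in sorted(closed_ids)]
    return opened, closed


def unencrypted(rows):
    """Rows whose encrypt_option is not TRUE."""
    return [row for row in rows if row["encrypt_option"].upper() != "TRUE"]


# Made-up sample data; addresses are documentation-range placeholders.
SNAPSHOT_1 = {
    "c-001": {"connection_id": "c-001", "session_id": 51,
              "client_net_address": "192.0.2.10", "encrypt_option": "FALSE"},
    "c-002": {"connection_id": "c-002", "session_id": 52,
              "client_net_address": "192.0.2.11", "encrypt_option": "TRUE"},
}
SNAPSHOT_2 = {
    "c-002": {"connection_id": "c-002", "session_id": 52,
              "client_net_address": "192.0.2.11", "encrypt_option": "TRUE"},
    "c-003": {"connection_id": "c-003", "session_id": 53,
              "client_net_address": "203.0.113.7", "encrypt_option": "FALSE"},
}

opened, closed = snapshot_delta(SNAPSHOT_1, SNAPSHOT_2)
for row in opened:
    print("opened:", row["session_id"], row["client_net_address"])
for row in closed:
    print("closed:", row["session_id"], row["client_net_address"])
```

Storing only the deltas keeps the history small while still showing which clients connected shortly before a crash; how often to take snapshots is a trade-off between overhead and missing short-lived connections.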
December 3, 2017 at 8:41 am
Connections are unencrypted, but each user has access only to his own database. I use contained databases. Do you think an access violation could be caused by a hacking attack? Or could a user who has access only to his own database bring the server down?
Using WinDbg to examine the dumps, I see the problem is always in module sqllang.dll, which as far as I can tell is the T-SQL processor of the server.
Thank you
December 3, 2017 at 8:59 am
This is very interesting. I haven't analyzed contained DBs from this perspective; it is probably about time I did so 😉
😎
December 4, 2017 at 10:19 am
I did see one report. Try restarting the service with the -x startup option.
December 6, 2017 at 5:04 am
Hello Steve,
The Microsoft docs warn:
Warning: When you use the -x startup option, the information that is available for you to diagnose performance and functional problems with SQL Server is greatly reduced.
Do you think that this is OK?
Operating without the ability to diagnose problems may be a bigger problem. If you can provide a link to this report, I would like to check it.
Thank you
December 6, 2017 at 12:38 pm
I was just thinking of it as a test, since this worked for someone else.
December 11, 2017 at 4:17 am
In a new crash on another server (SQL Server 2016 SP1 CU6) I got the following error:
2017-12-11 10:14:22.60 Server Error: 17066, Severity: 16, State: 1.
2017-12-11 10:14:22.60 Server SQL Server Assertion: File: <sosmemobj.cpp>, line=2772 Failed Assertion = 'pvb->FInUse ()'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.
Databases are not corrupted.
The stack dump is always at the same module (sqllang):
2017-12-11 10:14:18.36 Server * Short Stack Dump
2017-12-11 10:14:18.41 Server 00007FFAE4C73C58 Module(KERNELBASE+0000000000033C58)
2017-12-11 10:14:18.41 Server 00007FFAD4DCC54E Module(sqllang+000000000102C54E)
2017-12-11 10:14:18.41 Server 00007FFAD4DD02C9 Module(sqllang+00000000010302C9)
2017-12-11 10:14:18.42 Server 00007FFAD4E0AD49 Module(sqllang+000000000106AD49)
2017-12-11 10:14:18.42 Server 00007FFAD309E294 Module(sqldk+000000000005E294)
Thanks
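Each frame in the dumps above is printed as an absolute address followed by Module(name+offset), so subtracting the offset from the address gives the address the module was loaded at. The sketch below does that arithmetic for the short stack dump; the function names are illustrative. It shows that every sqllang frame resolves to one load base, which is what makes the module-relative offsets (rather than the absolute addresses) the useful thing to compare between crashes.

```python
import re

# Frames copied from the short stack dump above (timestamps removed).
SHORT_STACK = """\
00007FFAE4C73C58 Module(KERNELBASE+0000000000033C58)
00007FFAD4DCC54E Module(sqllang+000000000102C54E)
00007FFAD4DD02C9 Module(sqllang+00000000010302C9)
00007FFAD4E0AD49 Module(sqllang+000000000106AD49)
00007FFAD309E294 Module(sqldk+000000000005E294)
"""

FRAME = re.compile(r"([0-9A-Fa-f]{16}) Module\((\w+)\+([0-9A-Fa-f]{16})\)")


def parse_frames(text):
    """Return (module, absolute_address, offset) for every frame found in text."""
    return [
        (module, int(address, 16), int(offset, 16))
        for address, module, offset in FRAME.findall(text)
    ]


def module_bases(text):
    """Map each module name to the set of load bases implied by its frames."""
    bases = {}
    for module, address, offset in parse_frames(text):
        bases.setdefault(module, set()).add(address - offset)
    return bases


for module, found in module_bases(SHORT_STACK).items():
    print(module, sorted(f"{base:016X}" for base in found))
```

The same subtraction applied to the three earlier access violations (00007FFDC626903C, 00007FFEEA42903C and 00007FFABC01903C, all at sqllang+3903C) gives three different load bases but one identical offset, which suggests the same code location failed each time even though the absolute addresses differ between service starts.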
December 11, 2017 at 6:58 am
At this point, I recommend you get help from Microsoft. We're not going to be able to solve this for you especially since we have no clue about the firewall you also brought up.
In the future, I recommend you only make one change at a time.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 11, 2017 at 8:03 am
Ok Jeff, thank you.
PS: The firewall was the standard Windows Advanced Firewall.
December 11, 2017 at 10:50 am
Somehow FInUse rings a bell. I have come across problems highlighting that function when the authentication mode was incorrectly configured; it might be worth looking into.
😎