The Story
Logon triggers were implemented in SQLServer 2005 with SP2. Now we - the instance administrators - are able to perform some stuff at the actual logon time of a user. This way, we could, for example, restrict users access during a certain part of the day, only allow connections from certain addresses.
To check on my licenses, I believed I could use logon triggers to track how many users connected from how many client machines.
I developed a little logon trigger, which executed using the 'sa' context, so I didn't have to grant insert to my logging database and table. I didn't mind that I would have to use "original_login_name" to get the actual connecting user id. This all looked pretty straight forward, not much rocket science at all.
I first tested it at my dba-dev instance. It worked just fine, not much overhead to be noticed, ..... cool. Since the test succeeded, it was time to roll out on the developers instances (8 instances).
Once again, everything went just fine.
We managed to get a good overview of the number of clients hitting our instances and were able to store all connection related data.
One week after the implementation in DEV, it was rolled out to QA.
Even there, no problems at all.
In noticed a number of connections being registrated, so I got at ease and planned implementation in our production environments.
All needed paperwork was done and the "request for change" was submitted and approved.
The rollout was prepared (including fall back scenarios, mind you) for our 15 instances of sql2005 and got launched at the approved time. (Day x at noon).
The production rollout
Within seconds after the launch, the first production instance dumped and crashed !
However, I didn't notice the instance being failed over to the other node of the cluster.
But that wasn't the hottest issue at that time.
Alerts were coming in from our monitoring system and hell opened some doors.
Messages pointed to:
------
Process ID 59 attempted to unlock a resource it does not own: DATABASE: 9 Retry the transaction, because this error may be caused by a timing condition. If the problem persists, contact the database administrator.
------
Process ID 108 attempted to unlock a resource it does not own: DATABASE: 9 Retry the transaction, because this error may be caused by a timing condition. If the problem persists, contact the database administrator.
------
[sqsrvres] ODBC sqldriverconnect failed
------
[sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = 102; message = [Microsoft][SQL Native Client]TCP Provider: Timeout error [258].
------
[sqsrvres] checkODBCConnectError: sqlstate = HYT00; native error = 0; message = [Microsoft][SQL Native Client]Login timeout expired
------
[sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = 102; message = [Microsoft][SQL Native Client]Unable to complete login process due to delay in login response
------
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
------
Logon failed for login 'thelogin' due to trigger execution. [CLIENT:10.16.108.226]
It was quite obvious: my rollout might trigger this kind of errors.
The fallback scenario
Time to roll out the fallback scenario for the cumbersome instance.
The fallback scenario was just a SQLCMD query which would disable the logon trigger.
This scenario was also tested at Dev and QA, so I was pretty at ease with it.
I ran the SQLCMD and it failed !!
It failed, because it was exactly the implemented logon trigger that caused the issues.
It was at that time I really appreciated the Microsoft Dev team's efforts to implement the Dedicated Administrator Connection (DAC) !
That was the way to go to turn off this trigger.
/*
if due to a logon trigger you can nolonger connect to a sql2005 instance you need the local DAC (dedicated administrator connection) using the -A parameter to be able to connect to the instance !
*/
/* There can only be ONE active DAC !! */
[DOSBOX]
sqlcmd -A -d master -S TheInstance
Or start SSMS not connecting to an instance
(Object browser not supported with DAC !! )
then click "New Query" and connect using "admin:TheInstance"
SQLCMD will give you a commandline like this
1>
You will need to enter your statement and the command will only be
executed after you enter 'GO' at the next line.
1> DISABLE TRIGGER S_tr_DBA_ConnectionTracker ON ALL SERVER;
2> go
1> exit
After executing this at the cumbersome instance, things turned back to normal.
However, because of the dumps sqlserver generated, I asked for an emergency intervention to restart the instance so it would no longer suffer any leftovers of the dumps.
It was an instance hosting biztalk databases, so you can imagine the scope of the impact.
Off course I also immediately disabled all these logon triggers at the other production instances that were involved in this rollout operation.
The cause
Now I started digging into the system, trying to figure out what caused all these dumps.
I started a SQLtrace at the instance that had dumped.
At first sight, nothing abnormal going on, until I noticed some of the connections were using the serializable transaction isolation level.
I double checked with my biztalk dev team....Isolation level ??? What the ***...
Like many developers, they were unaware of what kind of isolation level their applications are using.
Why didn't this occur in DEV or QA ?
"We weren't using DEV nor QA for the last month or so"... developers on vacation, others only used their local virtual pc with their full dev environment.
No wonder I missed this kind of errors in the whole process of DEV and QA.
Back to the drawing board for the logon tracker, because its scope was actually not restricted to the little registration at the beginning of a connection, as I supposed it would be.
The solution
Would it be worth the effort of converting it to a service broker application, by just forwarding the connection info into a service broker (SSB) queue, and processing that asynchronously.
As a matter of fact, that concept already exists in SQLServer 2005.
It's called "Event Notifications". In this case using the login events.
Asynchrone by design and the upmost valid alternative for my quest.
Why didn't I think of that in an earlier stage ?
It provided me about all the data I had available with the logon trigger, except for the ip address of the connecting device.
I started off reading this nice article and elaborated on it:
- SQL Server 2005 Logon Triggers by Frederik Vandeputte as SSC
- Logon triggers by Cristian Lefter as Simple-talk
My solution is very similar to the solution Frederik Vandeputte posted as SSC.
So, in my stored procedure that handles the queue, I just added an extra join to see if that user was still connected via that SPID and pull out the ip address.
I didn't want to get caught twice by the same pitfall, so I made sure my biztalk devs tested their stuff whilst the new implementation was active.
Lessons learned:
1) Never assume a server has a typical usage. Double check with all dev teams to test.
2) Always prepare fallback scenarios and make sure you know about the Dedicated Administrator Connection (DAC).
3) Just like with any trigger, be sure its scope is very small.
4) And last but not least: First consider if things actually need to be performed in realtime mode.
p.s. Notice to myself: Check all servers for the isolation levels that are being used.
For your information: my bogus logon trigger:
-- ALZDBA DD 20080903 --
DO NOT USE THIS TRIGGER !!
IT MAKES YOUR SERVER CRASH WHEN USING SERIALIZABLE CONNECTIONS !!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--USE master;
--GO
--
--CREATE TRIGGER S_tr_DBA_ConnectionTracker
--ON ALL SERVER
--WITH EXECUTE AS 'sa'
--FOR LOGON
--AS
--BEGIN
--set nocount on
---- need to execute as sa to avoid grant serverstate to public ?
--Update T
-- set [tsLastUpdate] = getdate()
--from master.dbo.T_DBA_ConnectionTracker T
--inner join sys.dm_exec_sessions ES
-- on ES.[session_id] = @@spid
-- and ES.[host_name] = T.[host_name]
-- and ES.[program_name] = T.[program_name]
-- and ES.[nt_domain] = T.[nt_domain]
-- and ES.[nt_user_name] = T.[nt_user_name]
-- and ES.[login_name] = T.[login_name]
-- and ES.[original_login_name] = T.[original_login_name]
--inner join sys.dm_exec_connections EC
-- on EC.[session_id] = @@spid
-- and EC.[client_net_address] = T.[client_net_address] ;
--
--if @@rowcount = 0
--begin
-- INSERT INTO [master].[dbo].[T_DBA_ConnectionTracker] ([host_name], [program_name], [nt_domain], [nt_user_name], [login_name], [original_login_name], [client_net_address] )
-- select rtrim(ES.[host_name])
-- , rtrim(ES.[program_name])
-- , rtrim(ES.[nt_domain])
-- , rtrim(ES.[nt_user_name])
-- , rtrim(ES.[login_name])
-- , rtrim(ES.[original_login_name])
-- , rtrim(EC.[client_net_address])
-- from sys.dm_exec_sessions ES
-- inner join sys.dm_exec_connections EC
-- on EC.[session_id] = ES.[session_id]
-- where ES.[session_id] = @@SPID
-- and not exists (select *
-- from master.dbo.T_DBA_ConnectionTracker T
-- where T.[host_name] = ES.[host_name]
-- and T.[program_name] = ES.[program_name]
-- and T.[nt_domain] = ES.[nt_domain]
-- and T.[nt_user_name] = ES.[nt_user_name]
-- and T.[login_name] = ES.[login_name]
-- and T.[original_login_name] = ES.[original_login_name]
-- and T.[client_net_address] = EC.[client_net_address] ) ;
--
--end
--
--END;
--