January 26, 2008 at 2:18 am
So I've been banging my head against this problem for a bit looking into self-joins and nested queries, but I just can't seem to get my head around the sql needed. Now I'm looking for guidance. We're tracking usage data for a computer lab using the following (simplified) schema.
Events {
hostID int,
eventType int, (1 = Login, 2 = Logout)
eventTime smalldatetime
}
I'm basically trying to create a query that will combine a host's Login event with the next Logout event for the same host in a single row. For example this data set:
hostID, eventType, eventTime
50, 1, '2008-01-01 09:00'
50, 2, '2008-01-01 10:00'
60, 1, '2008-01-01 09:30'
60, 2, '2008-01-01 10:30'
would get reduced into:
hostID, loginTime, logoutTime
50, '2008-01-01 09:00', '2008-01-01 10:00'
60, '2008-01-01 09:30', '2008-01-01 10:30'
In retrospect, I feel like the reduced format should've been the schema originally. I may actually change it in the near future. But I'll still be looking for an efficient way to convert the old data using T-SQL rather than brute forcing it with an external data migration app I'd write.
January 26, 2008 at 4:28 am
There are a few solutions. I would check the execution plan on each of them and pick your favourite.
Both assume that there is a logon always followed by a logoff. There won't be two consecutive logons without a logoff in the middle. I haven't tested them but the idea's there.
1.
select
logon.hostID,
logon.eventTime as logonTime,
(select min(eventTime)
from Events logoff
where logoff.hostID = logon.hostID
and logoff.eventTime > logon.eventTime
and logoff.eventType = 2
) as logoffTime
from Events logon
where logon.eventType = 1
2.
--You could use the row_number() function in SQL 2005 to match up the events with the same row number
select
logon.hostID,
logon.eventTime as logonTime,
logoff.eventTime as logoffTime
from events logon
inner join events logoff
on logon.hostID = logoff.hostID
and (row_number() OVER (logon.hostID, logon.eventType) ORDER BY (logon.eventTime))
=
(row_number() OVER (logoff.hostID, logoff.eventType) ORDER BY (logon.eventTime))
and logon.eventType = 1
and logoff.eventType = 2
I think I may have stuffed up the syntax on the row_number() above. Even if I didn't, the extra eventType column needed in the OVER clause may not be efficient. May sure you have appropriate indices defined - one on hostID, eventType and then eventTime would do the trick. You probably have the clustered index on the table involving eventTime, which is fine.
If the extra eventType in the OVER clause is causing some grief then you should convert the logon and logoff table aliases to derived tables so that you would write
from ( select hostID, eventTime as logonTime, row_number() over hostID as rowNum from events where eventType=1 ) logon
inner join ....
Might be more efficient. Perhaps not... At least the second format lets you do a left join so that the current logon (ie there's no logoff for it yet) gets a null logoff rather than not appearing in the output due to the inner join.
Of course, the queries and ideas above only apply if the pattern always follows logon, logoff, logon, logoff, ....
I think redesigning your schema may be best. 😀
January 26, 2008 at 11:05 am
Thank you so much! You've definitely put me on the right track. I was so close to coming up with the solution in 1. on my own last night and now I don't even remember what was tripping me up.
A quick look in BOL let me fix up the syntax for the row_number() function in 2. However, management studio then complains that "Windowed functions can only appear in the SELECT or ORDER BY clauses." So I moved them into the derived tables like you suggested which works just fine. I had honestly forgotten all about the row_number() function as a possibility to use.
Thanks again for your help. I'm ready to go start tinkering on my own again. But if you're still up for a challenge, here's the next step. As you so insightfully guessed, Logons are not always followed by Logoffs in the production data. Machines can get manually power cycled which prevents the logoff event from getting recorded. Data gets lost in transit or the agent is malfunctioning. Any number of things can seem to produce "bad" data from a reporting perspective without actually invalidating any DB constraints.
So basically, do you think it's even possible to write something that will produce the same basic result with either of the following two conditions:
1) Logons without a corresponding Logoff before the next Logon are just dropped from the results.
2) Logons without a corresponding Logoff before the next Logon are given a Logoff time equivalent to whatever the eventTime of the next event happens to be.
January 26, 2008 at 12:02 pm
ian's first solution should be all you need with this minor change:
select logon.hostID,
logon.eventTime as logonTime,
(select min(eventTime) from Events logoff
where logoff.hostID = logon.hostID
and logoff.eventTime > logon.eventTime
-- and logoff.eventType = 2
) as logoffTime
from Events logonwhere logon.eventType = 1
by eliminating the logoff event filter, the query will now consider the next event, regardless of type, as the logoff time. that should give you implicit and explicit logoutTimes.
January 26, 2008 at 5:27 pm
Indeed. I ended up playing around with it this morning. Thanks again for the help, both of you.
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply