Combining sequential events in a table

  • So I've been banging my head against this problem for a bit looking into self-joins and nested queries, but I just can't seem to get my head around the sql needed. Now I'm looking for guidance. We're tracking usage data for a computer lab using the following (simplified) schema.

    Events {

    hostID int,

    eventType int, (1 = Login, 2 = Logout)

    eventTime smalldatetime

    }

    I'm basically trying to create a query that will combine a host's Login event with the next Logout event for the same host in a single row. For example this data set:

    hostID, eventType, eventTime

    50, 1, '2008-01-01 09:00'

    50, 2, '2008-01-01 10:00'

    60, 1, '2008-01-01 09:30'

    60, 2, '2008-01-01 10:30'

    would get reduced into:

    hostID, loginTime, logoutTime

    50, '2008-01-01 09:00', '2008-01-01 10:00'

    60, '2008-01-01 09:30', '2008-01-01 10:30'

    In retrospect, I feel like the reduced format should've been the schema originally. I may actually change it in the near future. But I'll still be looking for an efficient way to convert the old data using T-SQL rather than brute forcing it with an external data migration app I'd write.

  • There are a few solutions. I would check the execution plan on each of them and pick your favourite.

    Both assume that there is a logon always followed by a logoff. There won't be two consecutive logons without a logoff in the middle. I haven't tested them but the idea's there.

    1.

    select

    logon.hostID,

    logon.eventTime as logonTime,

    (select min(eventTime)

    from Events logoff

    where logoff.hostID = logon.hostID

    and logoff.eventTime > logon.eventTime

    and logoff.eventType = 2

    ) as logoffTime

    from Events logon

    where logon.eventType = 1

    2.

    --You could use the row_number() function in SQL 2005 to match up the events with the same row number

    select

    logon.hostID,

    logon.eventTime as logonTime,

    logoff.eventTime as logoffTime

    from events logon

    inner join events logoff

    on logon.hostID = logoff.hostID

    and (row_number() OVER (logon.hostID, logon.eventType) ORDER BY (logon.eventTime))

    =

    (row_number() OVER (logoff.hostID, logoff.eventType) ORDER BY (logon.eventTime))

    and logon.eventType = 1

    and logoff.eventType = 2

    I think I may have stuffed up the syntax on the row_number() above. Even if I didn't, the extra eventType column needed in the OVER clause may not be efficient. May sure you have appropriate indices defined - one on hostID, eventType and then eventTime would do the trick. You probably have the clustered index on the table involving eventTime, which is fine.

    If the extra eventType in the OVER clause is causing some grief then you should convert the logon and logoff table aliases to derived tables so that you would write

    from ( select hostID, eventTime as logonTime, row_number() over hostID as rowNum from events where eventType=1 ) logon

    inner join ....

    Might be more efficient. Perhaps not... At least the second format lets you do a left join so that the current logon (ie there's no logoff for it yet) gets a null logoff rather than not appearing in the output due to the inner join.

    Of course, the queries and ideas above only apply if the pattern always follows logon, logoff, logon, logoff, ....

    I think redesigning your schema may be best. 😀

  • Thank you so much! You've definitely put me on the right track. I was so close to coming up with the solution in 1. on my own last night and now I don't even remember what was tripping me up.

    A quick look in BOL let me fix up the syntax for the row_number() function in 2. However, management studio then complains that "Windowed functions can only appear in the SELECT or ORDER BY clauses." So I moved them into the derived tables like you suggested which works just fine. I had honestly forgotten all about the row_number() function as a possibility to use.

    Thanks again for your help. I'm ready to go start tinkering on my own again. But if you're still up for a challenge, here's the next step. As you so insightfully guessed, Logons are not always followed by Logoffs in the production data. Machines can get manually power cycled which prevents the logoff event from getting recorded. Data gets lost in transit or the agent is malfunctioning. Any number of things can seem to produce "bad" data from a reporting perspective without actually invalidating any DB constraints.

    So basically, do you think it's even possible to write something that will produce the same basic result with either of the following two conditions:

    1) Logons without a corresponding Logoff before the next Logon are just dropped from the results.

    2) Logons without a corresponding Logoff before the next Logon are given a Logoff time equivalent to whatever the eventTime of the next event happens to be.

  • ian's first solution should be all you need with this minor change:

    select logon.hostID,

    logon.eventTime as logonTime,

    (select min(eventTime) from Events logoff

    where logoff.hostID = logon.hostID

    and logoff.eventTime > logon.eventTime

    -- and logoff.eventType = 2

    ) as logoffTime

    from Events logonwhere logon.eventType = 1

    by eliminating the logoff event filter, the query will now consider the next event, regardless of type, as the logoff time. that should give you implicit and explicit logoutTimes.

  • Indeed. I ended up playing around with it this morning. Thanks again for the help, both of you.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply