Deriving a Session Number

Question

Post reply

Deriving a Session Number

Richie T

SSC-Addicted

Points: 478
More actions
October 9, 2013 at 9:57 am

#281116

Hi Guys,
Could really do with some help trying to derive a field (session_number).
I have a staging table which is populated with around 10 million rows daily, each day I need to identify the session number for each row by user_id.
To identify the session_number we need to check if there is a gap greater than 5 minutes between the previous and current start_time.
I have been able to do this on a small subset, but when dealing with the entire 10 million rows the database seems to process for days, any ideas?
e.g.
user_id start_time session_number
123 09:00:00 1
123 09:01:00 1
123 09:02:00 1
123 09:03:00 1
123 09:09:00 2
123 09:22:00 3
Thanks,
Rich

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Keith Tate SSCrazy Eights Points: 9938 More actions · Answer 1

Would this work?

create table SessionNumber (userID int, startTime time, sessionNum int)

insert into SessionNumber

values (123, '09:00:00', 0)

, (123, '09:01:00', 0)

, (123, '09:02:00', 0)

, (123, '09:05:00', 0)

, (123, '09:08:00', 0)

, (123, '09:11:00', 0)

, (123, '09:16:00', 0)

, (123, '09:35:00', 0)

select userID, StartTime, (DATEPART(mi,StartTime) /5)+1 as SessionNum

from SessionNumber

I'm not sure how you handle different hours, i.e. does 10:00 get reset back to 1? If not, do you need to find the earliest start time for a userID and start from there?

Microsoft Certified Master - SQL Server 2008
Follow me on twitter: @keith_tate

Forum Etiquette: How to post data/code on a forum to get the best help[/url]

Richie T SSC-Addicted Points: 478 More actions · Answer 2

Thanks Keith, but I don't think I explained it very well, I need to compare the start_time between the current row and previous row ordered by userid and start_time.

e.g.

userid starttime diffrence_from_previous_row session_number

123 09:00 0 1

123 09:02 2 1

123 09:05 3 1

123 09:11 6 2

123 09:14 3 2

123 09:16 2 2

123 09:25 9 3

As you can see above, when there is a difference of 5 or greater the session number is incremented.

The issue I have is that we're talking about doing this everyday for around 10 million rows, so need to do this in the most efficient manner.

Thanks,

Rich

Keith Tate SSCrazy Eights Points: 9938 More actions · Answer 3

This will work, but I'm not sure how the performance will be on 10 million rows.

insert into SessionNumber

values (123, '09:00:00')

, (123, '09:01:00')

, (123, '09:02:00')

, (123, '09:05:00')

, (123, '09:08:00')

, (123, '09:11:00')

, (123, '09:19:00')

, (123, '09:35:00')

, (321, '09:01:00')

, (321, '09:03:00')

, (321, '09:14:00');

with SessionByUser as

(

select userID, StartTime, 1 as SessionNumber, ROW_NUMBER() over (Partition by userID order by StartTime) as RowNum

from SessionNumber

)

, cte as

(

select a.userID, a.startTime, datediff(mi,coalesce(b.startTime,a.StartTime),a.startTime) / 5 as MinDiff

from SessionByUser a

left join SessionByUser b on a.RowNum = b.RowNum + 1

and a.userID = b.userID

)

select userID, StartTime, dense_RANK() over (partition by userID order by MinDiff) as SessionID

from cte

order by UserID,StartTime

Microsoft Certified Master - SQL Server 2008
Follow me on twitter: @keith_tate

Forum Etiquette: How to post data/code on a forum to get the best help[/url]

jimbobmcgee Ten Centuries Points: 1108 More actions · Answer 4

Since you're using SQL2012, have you looked at LEAD and LAG?

My first naive stab would be:

;WITH

priors AS (

SELECT

*

,prev = LAG(startTime) OVER (

PARTITION BY userID

ORDER BY startTime

)

FROM SessionNumber

),

starts AS (

SELECT DISTINCT

*

,sessionID = ROW_NUMBER() OVER (

PARTITION BY userID

ORDER BY startTime

)

FROM priors s

WHERE prev IS NULL

OR ABS(DATEDIFF(SECOND, s.startTime, s.prev)) >= (5 * 60)

),

boundaries AS (

SELECT DISTINCT

userID

,sessionID

,startTime

,endTime = COALESCE(

LEAD(startTime) OVER (

PARTITION BY userID

ORDER BY startTime

),

CONVERT(TIME, SYSDATETIME())

)

FROM starts

)

SELECT *

FROM boundaries b

LEFT JOIN SessionNumber s ON

s.userID = b.userID AND

s.startTime >= b.startTime AND

s.startTime < b.endTime

...but I'm not sure how well it will work across a large table

J.