October 9, 2013 at 9:57 am
Hi Guys,
Could really do with some help trying to derive a field (session_number).
I have a staging table which is populated with around 10 million rows daily, each day I need to identify the session number for each row by user_id.
To identify the session_number we need to check if there is a gap greater than 5 minutes between the previous and current start_time.
I have been able to do this on a small subset, but when dealing with the entire 10 million rows the database seems to process for days, any ideas?
e.g.
user_id start_time session_number
123 09:00:00 1
123 09:01:00 1
123 09:02:00 1
123 09:03:00 1
123 09:09:00 2
123 09:22:00 3
Thanks,
Rich
October 9, 2013 at 10:09 am
Would this work?
create table SessionNumber (userID int, startTime time, sessionNum int)
insert into SessionNumber
values (123, '09:00:00', 0)
, (123, '09:01:00', 0)
, (123, '09:02:00', 0)
, (123, '09:05:00', 0)
, (123, '09:08:00', 0)
, (123, '09:11:00', 0)
, (123, '09:16:00', 0)
, (123, '09:35:00', 0)
select userID, StartTime, (DATEPART(mi,StartTime) /5)+1 as SessionNum
from SessionNumber
I'm not sure how you handle different hours, i.e. does 10:00 get reset back to 1? If not, do you need to find the earliest start time for a userID and start from there?
October 9, 2013 at 10:29 am
Thanks Keith, but I don't think I explained it very well, I need to compare the start_time between the current row and previous row ordered by userid and start_time.
e.g.
userid starttime diffrence_from_previous_row session_number
123 09:00 0 1
123 09:02 2 1
123 09:05 3 1
123 09:11 6 2
123 09:14 3 2
123 09:16 2 2
123 09:25 9 3
As you can see above, when there is a difference of 5 or greater the session number is incremented.
The issue I have is that we're talking about doing this everyday for around 10 million rows, so need to do this in the most efficient manner.
Thanks,
Rich
October 9, 2013 at 11:47 am
This will work, but I'm not sure how the performance will be on 10 million rows.
insert into SessionNumber
values (123, '09:00:00')
, (123, '09:01:00')
, (123, '09:02:00')
, (123, '09:05:00')
, (123, '09:08:00')
, (123, '09:11:00')
, (123, '09:19:00')
, (123, '09:35:00')
, (321, '09:01:00')
, (321, '09:03:00')
, (321, '09:14:00');
with SessionByUser as
(
select userID, StartTime, 1 as SessionNumber, ROW_NUMBER() over (Partition by userID order by StartTime) as RowNum
from SessionNumber
)
, cte as
(
select a.userID, a.startTime, datediff(mi,coalesce(b.startTime,a.StartTime),a.startTime) / 5 as MinDiff
from SessionByUser a
left join SessionByUser b on a.RowNum = b.RowNum + 1
and a.userID = b.userID
)
select userID, StartTime, dense_RANK() over (partition by userID order by MinDiff) as SessionID
from cte
order by UserID,StartTime
October 11, 2013 at 5:22 am
Since you're using SQL2012, have you looked at LEAD and LAG?
My first naive stab would be:
;WITH
priors AS (
SELECT
*
,prev = LAG(startTime) OVER (
PARTITION BY userID
ORDER BY startTime
)
FROM SessionNumber
),
starts AS (
SELECT DISTINCT
*
,sessionID = ROW_NUMBER() OVER (
PARTITION BY userID
ORDER BY startTime
)
FROM priors s
WHERE prev IS NULL
OR ABS(DATEDIFF(SECOND, s.startTime, s.prev)) >= (5 * 60)
),
boundaries AS (
SELECT DISTINCT
userID
,sessionID
,startTime
,endTime = COALESCE(
LEAD(startTime) OVER (
PARTITION BY userID
ORDER BY startTime
),
CONVERT(TIME, SYSDATETIME())
)
FROM starts
)
SELECT *
FROM boundaries b
LEFT JOIN SessionNumber s ON
s.userID = b.userID AND
s.startTime >= b.startTime AND
s.startTime < b.endTime
...but I'm not sure how well it will work across a large table
J.
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply