March 13, 2008 at 7:44 am
Using SQL 2005, I have a table, we'll call table A. It looks like this:
create table PageFlow (
ID int,
Step int,
MatchValue varchar(1000)
);
insert into PageFlow(ID, Step, MatchValue) values (1, 1, 'Test.aspx');
insert into PageFlow(ID, Step, MatchValue) values (1, 2, 'Test2.aspx');
insert into PageFlow(ID, Step, MatchValue) values (1, 3, 'Test3.aspx');
Elsewhere, I have another table, we'll call table B, which looks like this:
create table Requests (
SessionID varchar(36),
PageUrl varchar(36),
Accessed datetime
)
insert into Requests (SessionID, PageUrl, Accessed)
values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test2.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test3.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test3.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('685fd28f-6246-4197-8294-400b3b80758e', 'Test.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('685fd28f-6246-4197-8294-400b3b80758e', 'Test2.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('685fd28f-6246-4197-8294-400b3b80758e', 'Test2.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('685fd28f-6246-4197-8294-400b3b80758e', 'SomeOtherTest.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('685fd28f-6246-4197-8294-400b3b80758e', 'Test3.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'Test.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'AnotherTest.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'YetAnotherTest.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'Test3.aspx', getdate());
insert into Requests (SessionID, PageUrl, Accessed)
values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'Test3.aspx', getdate());
My goal is to find all session IDs which have accessed the entire sequence of pages listed in PageFlow for a given page flow ID. Using the example above, I'd like to find all session IDs which have accessed Test1.aspx, Test2.aspx, AND Test3.aspx. At some point, I'd like to also be able to find only the ones that accessed the steps in sequence, rather than in random order. Note that different page flows will have varying number of steps and pages that must be accessed, so I cannot simply hard code the where clause to match the existing records in my sample.
As it stands, I have a stored proc that performs a couple of "while" loops to loop through each page flow and it steps and find sessions that continue to match the sequence. Those sessions which match all pages in sequence are stored in a separate summary table for later use. However, the stored proc is fairly unwieldy and "ugly". I'd much rather do this using something more elegant and manageable such as using CTEs or a View or a few select statements.
Any suggestions from the gurus?
March 13, 2008 at 7:54 am
If it's a variable number of pages in the flow, I'm not sure what a good T-SQL solution would be. In fact, it might be worse than some loops. You'd be looking at the "x = y and x1 = y1 and x2 = y2..." to ensure all the pages matched. Then you'd have to worry about ordering, which would add another set of complex steps.
It's possible you could use a staging table, then loop and insert data in batches, looking for matches, but I'm not sure that would be better. If you can post the code, perhaps someone has some optimizations.
March 13, 2008 at 8:00 am
Here is the loop I referred to that is being used currently:
set @Campaign = dbo.GetNextCampaign('');
while @Campaign is not null begin
set @GoalID = dbo.GetNextGoal(@Campaign, -1);
while @GoalID is not null begin
delete from @FunnelTemp;
insert into
@FunnelTemp
select
SessionID,
dateAdd(day, datediff(day, 0, Created), 0) as Created,
cast(0 as int) as Converted,
cast('2000-01-01' as datetime) as LastAccess
from
Sessions with (nolock)
where
Campaign = @Campaign;
set @StepNum = dbo.GetNextFunnelStep(@GoalID, -1);
set @MatchValue = dbo.GetFunnelMatchValue(@GoalID, @StepNum);
set @OldStepNum = cast(0 as int);
while @StepNum is not null begin
update
@FunnelTemp
set
Converted = @StepNum,
LastAccess = r.Accessed
from
@FunnelTemp f , Requests r with (nolock)
where
f.SessionID = r.SessionID AND
Converted = @OldStepNum AND
r.Accessed > f.LastAccess AND
charindex(@MatchValue, url) > 0;
set @OldStepNum = @StepNum;
set @StepNum = dbo.GetNextFunnelStep(@GoalID, @StepNum);
set @MatchValue = dbo.GetFunnelMatchValue(@GoalID, @StepNum);
end
insert into
@ConvertTable
select
@GoalID,
@Campaign,
Created,
count(SessionID),
sum (case when converted = @OldStepNum then 1 else 0 end),
sum (case when converted between 1 and (@OldStepNum-1) then 1 else 0 end),
count(SessionID) -
sum (case when converted = @OldStepNum then 1 else 0 end) -
sum (case when converted between 1 and (@OldStepNum-1) then 1 else 0 end)
from
@FunnelTemp
group by
Created
set @GoalID = dbo.GetNextGoal(@Campaign, @GoalID);
end
set @Campaign = dbo.GetNextCampaign(@Campaign);
end
The GetNext functions are just helper functions to eliminate clutter. They simply wrap "set @variable = min(column) from table where XYZ" statements. It should be noted that this is a summary proc that gets executed only 1-4 times a day on a few million records, so performance is less important than say, code clarity.
March 13, 2008 at 8:02 am
Maybe this ?
select r.SessionID
from Requests r
inner join PageFlow p on p.MatchValue=r.PageUrl
group by r.SessionID
having count(distinct r.PageUrl)=(select count(distinct MatchValue) from PageFlow)
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537March 13, 2008 at 8:05 am
Mark (3/13/2008)
Maybe this ?
select r.SessionID
from Requests r
inner join PageFlow p on p.MatchValue=r.PageUrl
group by r.SessionID
having count(distinct r.PageUrl)=(select count(distinct MatchValue) from PageFlow)
This seems like it would most definitely find the matching sessions, but doesn't leave much hope for determining which ones match the steps in sequence. Thanks for your help though, I'll see if I can adapt this. 🙂
March 13, 2008 at 8:13 am
The post that's close uses "relational division". Perhaps you could extend it a bit using some of the ideas in one of today's articles (http://www.sqlservercentral.com/articles/T-SQL/62159/)? specifically (and I haven't thought this through as it just passed midnight here) could you not just match on URLs but also use CTEs to get a row_number of page visits and match on this too. If you look up "relational division" in Google you'll find several great articles and university papers/presentations on various implementations. The method presented was made popular by Joe Celko - see http://www.developersdex.com/gurus/articles/113.asp?Page=1. Particularly look at page #3 and "exact division".
Of course I may be way off base but that should get you roughly on the right track 😀
March 13, 2008 at 8:56 am
Assuming that you might have more than one "PageFlow ID" it might be translated into relational division with something such as:
select
sessionId,
id
from
( select
sessionId,
id,
row_number() over(partition by sessionId, id order by accessed, pageUrl)
as Seq,
pageUrl
from
( select
sessionId,
id,
j.PageUrl,
min(accessed) as accessed
from @Requests j
join @pageFlow k
on j.pageUrl = k.matchValue
group by sessionId, id, j.pageUrl
) as f
) as a
group by sessionId, id
having count(*)
= ( select count(distinct matchValue) from @pageFlow b
where a.id = b.id
)
/* -------- Sample Output: --------
sessionId id
------------------------------------ -----------
685fd28f-6246-4197-8294-400b3b80758e 1
9aef054d-1751-426e-9514-f8da2a02acc1 1
*/
Now how it might perform? That is another matter.
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply