Finding a sequence of rows that match another sequence of rows

  • Using SQL 2005, I have a table, we'll call table A. It looks like this:

    create table PageFlow (

    ID int,

    Step int,

    MatchValue varchar(1000)

    );

    insert into PageFlow(ID, Step, MatchValue) values (1, 1, 'Test.aspx');

    insert into PageFlow(ID, Step, MatchValue) values (1, 2, 'Test2.aspx');

    insert into PageFlow(ID, Step, MatchValue) values (1, 3, 'Test3.aspx');

    Elsewhere, I have another table, we'll call table B, which looks like this:

    create table Requests (

    SessionID varchar(36),

    PageUrl varchar(36),

    Accessed datetime

    )

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test2.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test3.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('9aef054d-1751-426e-9514-f8da2a02acc1', 'Test3.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('685fd28f-6246-4197-8294-400b3b80758e', 'Test.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('685fd28f-6246-4197-8294-400b3b80758e', 'Test2.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('685fd28f-6246-4197-8294-400b3b80758e', 'Test2.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('685fd28f-6246-4197-8294-400b3b80758e', 'SomeOtherTest.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('685fd28f-6246-4197-8294-400b3b80758e', 'Test3.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'Test.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'AnotherTest.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'YetAnotherTest.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'Test3.aspx', getdate());

    insert into Requests (SessionID, PageUrl, Accessed)

    values ('1e7dde9d-399d-467d-9434-8202b0eecf69', 'Test3.aspx', getdate());

    My goal is to find all session IDs which have accessed the entire sequence of pages listed in PageFlow for a given page flow ID. Using the example above, I'd like to find all session IDs which have accessed Test1.aspx, Test2.aspx, AND Test3.aspx. At some point, I'd like to also be able to find only the ones that accessed the steps in sequence, rather than in random order. Note that different page flows will have varying number of steps and pages that must be accessed, so I cannot simply hard code the where clause to match the existing records in my sample.

    As it stands, I have a stored proc that performs a couple of "while" loops to loop through each page flow and it steps and find sessions that continue to match the sequence. Those sessions which match all pages in sequence are stored in a separate summary table for later use. However, the stored proc is fairly unwieldy and "ugly". I'd much rather do this using something more elegant and manageable such as using CTEs or a View or a few select statements.

    Any suggestions from the gurus?

  • If it's a variable number of pages in the flow, I'm not sure what a good T-SQL solution would be. In fact, it might be worse than some loops. You'd be looking at the "x = y and x1 = y1 and x2 = y2..." to ensure all the pages matched. Then you'd have to worry about ordering, which would add another set of complex steps.

    It's possible you could use a staging table, then loop and insert data in batches, looking for matches, but I'm not sure that would be better. If you can post the code, perhaps someone has some optimizations.

  • Here is the loop I referred to that is being used currently:

    set @Campaign = dbo.GetNextCampaign('');

    while @Campaign is not null begin

    set @GoalID = dbo.GetNextGoal(@Campaign, -1);

    while @GoalID is not null begin

    delete from @FunnelTemp;

    insert into

    @FunnelTemp

    select

    SessionID,

    dateAdd(day, datediff(day, 0, Created), 0) as Created,

    cast(0 as int) as Converted,

    cast('2000-01-01' as datetime) as LastAccess

    from

    Sessions with (nolock)

    where

    Campaign = @Campaign;

    set @StepNum = dbo.GetNextFunnelStep(@GoalID, -1);

    set @MatchValue = dbo.GetFunnelMatchValue(@GoalID, @StepNum);

    set @OldStepNum = cast(0 as int);

    while @StepNum is not null begin

    update

    @FunnelTemp

    set

    Converted = @StepNum,

    LastAccess = r.Accessed

    from

    @FunnelTemp f , Requests r with (nolock)

    where

    f.SessionID = r.SessionID AND

    Converted = @OldStepNum AND

    r.Accessed > f.LastAccess AND

    charindex(@MatchValue, url) > 0;

    set @OldStepNum = @StepNum;

    set @StepNum = dbo.GetNextFunnelStep(@GoalID, @StepNum);

    set @MatchValue = dbo.GetFunnelMatchValue(@GoalID, @StepNum);

    end

    insert into

    @ConvertTable

    select

    @GoalID,

    @Campaign,

    Created,

    count(SessionID),

    sum (case when converted = @OldStepNum then 1 else 0 end),

    sum (case when converted between 1 and (@OldStepNum-1) then 1 else 0 end),

    count(SessionID) -

    sum (case when converted = @OldStepNum then 1 else 0 end) -

    sum (case when converted between 1 and (@OldStepNum-1) then 1 else 0 end)

    from

    @FunnelTemp

    group by

    Created

    set @GoalID = dbo.GetNextGoal(@Campaign, @GoalID);

    end

    set @Campaign = dbo.GetNextCampaign(@Campaign);

    end

    The GetNext functions are just helper functions to eliminate clutter. They simply wrap "set @variable = min(column) from table where XYZ" statements. It should be noted that this is a summary proc that gets executed only 1-4 times a day on a few million records, so performance is less important than say, code clarity.

  • Maybe this ?

    select r.SessionID

    from Requests r

    inner join PageFlow p on p.MatchValue=r.PageUrl

    group by r.SessionID

    having count(distinct r.PageUrl)=(select count(distinct MatchValue) from PageFlow)

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Mark (3/13/2008)


    Maybe this ?

    select r.SessionID

    from Requests r

    inner join PageFlow p on p.MatchValue=r.PageUrl

    group by r.SessionID

    having count(distinct r.PageUrl)=(select count(distinct MatchValue) from PageFlow)

    This seems like it would most definitely find the matching sessions, but doesn't leave much hope for determining which ones match the steps in sequence. Thanks for your help though, I'll see if I can adapt this. 🙂

  • The post that's close uses "relational division". Perhaps you could extend it a bit using some of the ideas in one of today's articles (http://www.sqlservercentral.com/articles/T-SQL/62159/)? specifically (and I haven't thought this through as it just passed midnight here) could you not just match on URLs but also use CTEs to get a row_number of page visits and match on this too. If you look up "relational division" in Google you'll find several great articles and university papers/presentations on various implementations. The method presented was made popular by Joe Celko - see http://www.developersdex.com/gurus/articles/113.asp?Page=1. Particularly look at page #3 and "exact division".

    Of course I may be way off base but that should get you roughly on the right track 😀

  • Assuming that you might have more than one "PageFlow ID" it might be translated into relational division with something such as:

    select

    sessionId,

    id

    from

    ( select

    sessionId,

    id,

    row_number() over(partition by sessionId, id order by accessed, pageUrl)

    as Seq,

    pageUrl

    from

    ( select

    sessionId,

    id,

    j.PageUrl,

    min(accessed) as accessed

    from @Requests j

    join @pageFlow k

    on j.pageUrl = k.matchValue

    group by sessionId, id, j.pageUrl

    ) as f

    ) as a

    group by sessionId, id

    having count(*)

    = ( select count(distinct matchValue) from @pageFlow b

    where a.id = b.id

    )

    /* -------- Sample Output: --------

    sessionId id

    ------------------------------------ -----------

    685fd28f-6246-4197-8294-400b3b80758e 1

    9aef054d-1751-426e-9514-f8da2a02acc1 1

    */

    Now how it might perform? That is another matter.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply