Unable to understand the Query Logic

  • J Livingston SQL - Tuesday, June 27, 2017 6:20 AM

    I think that this is the origin of the question, along with the some of the solutions

    by Itzik Ben-Gan

    http://sqlmag.com/blog/tsql-challenge-reoccurring-visits
    http://sqlmag.com/blog/solutions-tsql-challenge-reoccurring-visits

    Cool.  Thanks, Graham.  Looks like I got the problem definition wrong after all.  This is for day to day, not overall like my code has it.  Leave it to Peter Larsson on these types of things.  He has one of "those" brains.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Tuesday, June 27, 2017 9:18 AM

    J Livingston SQL - Tuesday, June 27, 2017 6:20 AM

    I think that this is the origin of the question, along with the some of the solutions

    by Itzik Ben-Gan

    http://sqlmag.com/blog/tsql-challenge-reoccurring-visits
    http://sqlmag.com/blog/solutions-tsql-challenge-reoccurring-visits

    Cool.  Thanks, Graham.  Looks like I got the problem definition wrong after all.  This is for day to day, not overall like my code has it.  Leave it to Peter Larsson on these types of things.  He has one of "those" brains.

    Okay, here goes.   This appears to both solve the problem in less than a second even for 10,000+ rows of input data, although I'm getting a SORT operator and it's probably that I need to adjust an index somewhere.
    USE LOCAL_DB;
    GO

    SET NOCOUNT ON;

    IF OBJECT_ID(N'dbo.DailyVisits', N'U') IS NOT NULL
        BEGIN
        DROP TABLE dbo.DailyVisits;
        END;

    GO

    CREATE TABLE dbo.DailyVisits (
        dt date NOT NULL,
        visitor varchar(10) NOT NULL,
        CONSTRAINT PK_DailyVisits PRIMARY KEY(dt, visitor)
    );
    CREATE NONCLUSTERED INDEX IX_DailyVisits_visitor_dt ON dbo.DailyVisits
        (
        visitor ASC,
        dt ASC
        );

    --===== 10,000 visitors over a 5 year period
    WITH RAW_DATA AS (

        SELECT DISTINCT dt, visitor
        FROM (
            SELECT TOP (11840)
                dt  = DATEADD(dd,ABS(CHECKSUM(NEWID())%DATEDIFF(dd, '2011', '2017')), '2011'), --2017 is exclusive here
                visitor = RIGHT(NEWID(), 1)
            FROM sys.all_columns ac1
                CROSS JOIN sys.all_columns ac2
            ) AS X
    )
    INSERT INTO dbo.DailyVisits (dt, visitor)
    SELECT RD.dt, RD.visitor
    FROM RAW_DATA AS RD
        LEFT OUTER JOIN dbo.DailyVisits AS DV
            ON DV.dt = RD.dt
            AND DV.visitor = RD.visitor
    WHERE DV.dt IS NULL
    ORDER BY dt, visitor;

    SELECT *
    FROM dbo.DailyVisits
    ORDER BY dt, visitor;

    --Solution:
    DECLARE @Start AS date, @End AS date;
    SELECT @Start = MIN(dt), @End = MAX(dt)
    FROM dbo.DailyVisits;

    CREATE TABLE #MERGED_DATA (
        dt date NOT NULL,
        prevday date NOT NULL,
        visitor varchar(10) NULL
    );
    CREATE UNIQUE CLUSTERED INDEX IX_MERGED_DATA_dt_prevday_visitor ON #MERGED_DATA
        (
        dt ASC,
        prevday ASC,
        visitor ASC
        );
    CREATE UNIQUE NONCLUSTERED INDEX IX_MERGED_DATA_visitor_prevday_dt ON #MERGED_DATA
        (
        visitor ASC,
        prevday ASC,
        dt ASC
        );

    WITH E1 AS (

        SELECT 1 AS N UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ),
        CALENDAR AS (

            SELECT X.dt, CAST(DATEADD(day, -1, X.dt) AS date) AS prevday
            FROM (
                SELECT TOP (10000) CAST(DATEADD(day, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) - 1, @Start) AS date) AS dt,
                    CAST(DATEADD(day, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) - 2, @Start) AS date) AS prevday
                FROM E1 AS A, E1 AS B, E1 AS C, E1 AS D
                ) AS X
            WHERE X.dt <= @End
    )
    INSERT INTO #MERGED_DATA(dt, prevday, visitor)
    SELECT C.dt, C.prevday, DV.visitor
    FROM CALENDAR AS C
        LEFT OUTER JOIN dbo.DailyVisits AS DV
            ON C.dt = DV.dt;

    SELECT MD.dt,
        COUNT(DISTINCT MD.visitor) AS NumVisits,
        ISNULL(SUM(CASE WHEN MD.visitor IS NOT NULL AND M2.visitor IS NULL THEN 1 ELSE 0 END), 0) AS Added,
        MIN(R.Removed) AS Removed,
        ISNULL(SUM(CASE WHEN MD.visitor = M2.visitor THEN 1 ELSE 0 END), 0) AS Remained
    FROM #MERGED_DATA AS MD
        LEFT OUTER JOIN #MERGED_DATA AS M2
            ON MD.prevday = M2.dt
            AND MD.visitor = M2.visitor
        OUTER APPLY (
            SELECT COUNT(DISTINCT MD2.visitor) AS Removed
            FROM #MERGED_DATA AS MD2
            WHERE MD2.dt = MD.prevday
                AND MD2.visitor NOT IN (
                    SELECT visitor
                    FROM #MERGED_DATA AS MD3
                    WHERE MD3.dt = MD.dt
                        AND MD3.visitor IS NOT NULL
                    )
            ) AS R
    GROUP BY MD.dt
    ORDER BY MD.dt;

    DROP TABLE #MERGED_DATA;
    DROP TABLE dbo.DailyVisits;
    /*    -- RESULTS SHOULD BE
      dt  #visits #added #removed #remained
    2011-06-01     3         3        0        0
    2011-06-02     2         0        1        2
    2011-06-03     2         1        1        1
    2011-06-04     0         0        2        0
    2011-06-05     0         0        0        0
    2011-06-06     0         0        0        0
    2011-06-07     2         2        0        0
    2011-06-08     3         2        1        1
    */

    EDIT: added the actual execution plan...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • I love you Jeff Moden.


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • I've been doing some testing of the other solutions in the article and comments (still have some to go, including another from Peso that might beat the rest).

    Peso's version in the body of the article is probably the most clever/elegant to my eyes so far, but is beaten on performance by a couple versions in the comments.

    His own #6 solution is significantly faster percentage-wise, and dova's query is faster still, by a narrow margin.

    dova's isn't quite to spec, though, since his query only works by getting the MIN and MAX dt from DailyVisits and using that. It can be modified to accept variables for start and end, but that prevents its most efficient query plan, and it then becomes very, very slow.

    Just to make it more accessible, here are Peso's solutions from the body of the article and the comments (just the #6), with some formatting applied:to remove all the white space and sprawl in the former case, and to provide any formatting at all in the latter case:

    I also have the variables set to the range used in Steve's test data, not the test harness they used to test their solutions, as an FYI. Further, do note that these solutions require the GetDates function they define in the article (just a cascading CTE tally to generate all the dates in the range).

    DECLARE
    @from AS DATE = '20110101',
    @to AS DATE = '20161231';

    WITH cteSource(dt, NumVisits, Added, Removed)
    AS
    (
    SELECT  dt,
        SUM(NumVisits) AS NumVisits,
        SUM(Added) AS Added,
        SUM(Removed) AS Removed
    FROM  (
        SELECT DATEADD(DAY, f.DayDelta, w.dt) AS dt,
          w.Visitor,
          f.NumVisits,
          f.Added,
          f.Removed
        FROM dbo.DailyVisits AS w
          CROSS JOIN
          (
          VALUES (0, 1, 1, -1),
             (1, 0, -1, 1)
          ) AS f(DayDelta, NumVisits, Added, Removed)
        WHERE w.dt BETWEEN DATEADD(DAY, -1, @From) AND @To
        ) AS d
    WHERE  dt BETWEEN @From AND @To
    GROUP BY dt,
        Visitor
    )

    SELECT dt,
       SUM(NumVisits) AS NumVisits,
       SUM(Added) AS Added,
       SUM(Removed) AS Removed,
       SUM(Remained) AS Remained
    FROM  (
       SELECT  dt,
           SUM(NumVisits) AS NumVisits,
           SUM(CASE WHEN Added = 1 THEN 1 ELSE 0 END) AS Added,
           SUM(CASE WHEN Removed = 1 THEN 1 ELSE 0 END) AS Removed,
           SUM(CASE WHEN 1 IN (Added, Removed) THEN 0 ELSE 1 END) AS Remained
       FROM  cteSource
       GROUP BY dt
       UNION ALL
       SELECT dt,
          0 AS NumVisits,
          0 AS Added,
          0 AS Removed,
          0 AS Remained
       FROM  dbo.GetDates(@From, @To)
       ) AS d
    GROUP BY dt;

    And the #6:

    DECLARE
    @from AS DATE = '20110101',
    @to AS DATE = '20161231';

    WITH cte AS
    (
    SELECT    f.dt,
       f.NumVisits,
       f.Visitor,
       f.Added,
       f.Removed,
       f.Remained
    FROM  dbo.GetDates(@From, @To) AS gd
       OUTER APPLY
       (
       SELECT
       q.dt,
       q.Visitor,
       SUM(q.NumVisits) AS NumVisits,
       CASE WHEN SUM(q.Added) > 0 THEN 1 ELSE 0 END AS Added,
       CASE WHEN SUM(q.Removed) > 0 THEN 1 ELSE 0 END AS Removed,
       CASE WHEN COUNT(*) = 2 THEN 1 ELSE 0 END AS Remained
       FROM
       (
       SELECT gd.dt,
         d.Visitor,
         1 - d.theDiff AS NumVisits,
         1 - 2 * d.theDiff AS Added,
         2 * d.theDiff - 1 AS Removed
       FROM (
         SELECT dv.Visitor,
           CASE WHEN dv.dt = gd.dt THEN 0 ELSE 1 END AS theDiff
         FROM    dbo.DailyVisits AS dv
         WHERE dv.dt BETWEEN DATEADD(DAY, -1, gd.dt) AND gd.dt
         ) AS d
       ) AS q
       GROUP BY q.dt,
          q.Visitor
       ) AS f
    )

    SELECT dt,
       SUM(NumVisits) AS NumVisits,
       SUM(Added) AS Added,
       SUM(Removed) AS Removed,
       SUM(Remained) AS Remained
    FROM  cte
    GROUP BY dt;

    Cheers!

  • autoexcrement - Tuesday, June 27, 2017 2:18 PM

    I love you Jeff Moden.

    Thanks, but nah... I got the problem definition wrong.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Tuesday, June 27, 2017 5:51 PM

    autoexcrement - Tuesday, June 27, 2017 2:18 PM

    I love you Jeff Moden.

    Thanks, but nah... I got the problem definition wrong.

    And there I was thinking that the "love" post was just kinda creepy...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Jeff is awesome.


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • The solutions posted seem to have too twisted logic.
    A bit difficult to follow.
    I believe it must be way simpler.

    First part is obvious:
    declare @From datetime, @To Datetime

    select @From = dateadd(dd, -2, MIN(Dt) ), @To = MAX(Dt)
    FROM #DailyVisits

    Then, ideally we should have a Calendar table - a Tally table for dates.
    If there is no such table in place, we have to create it on fly:
    select DATEADD(dd, N, @From) Dte
    INTO #Calendar
    from dbo.TallyGenerator (0, DATEDIFF(dd, @From, @To ), null, 1)

    Function TallyGenerator was posted on this site is Scripts.
    Parameters: From Number, To Number, Number or rows to return (ignored if previous 2 are populated), Step.

    Then we apply all the visitors recorded in the period to every day in the period:
    select *
    from #Calendar Calendar
        cross join (select visitor from #DailyVisits where dt between @From and @To group by visitor) V

    Next, we check if there were actual visits for any customer on the date, and on the date before:
    select Calendar.Dte,
        D.visitor, -- if not NULL then the visitor was in here today
        case when PD.visitor = D.visitor then 1 else 0 end as Remained, -- was in on the previous day, and is in here today
        case when PD.visitor is null and D.visitor is not null then 1 else 0 end as Added, -- was not here previous day, but in today
        case when PD.visitor is not null and D.visitor is null then 1 else 0 end as Removed -- was in previous day, not here today
    from #Calendar Calendar
        --    Assigning every visitor recorded within the specified period to every day within the period
        cross join (select visitor from #DailyVisits where dt between @From and @To group by visitor) V
        --    See if there was actually a visit for the visitor on the date
        left join #DailyVisits D on D.dt = Calendar.Dte and D.visitor = V.visitor
        --    See if there was actually a visit for the visitor on the previous date
        left join #DailyVisits PD on PD.visitor = V.visitor and PD.dt = DATEADD(dd, -1, Calendar.Dte )

    And the last, easiest part, aggregation per day:
    select Dte,
        Count(visitor) NumOfVisits,
        SUM(Added) Added,
        SUM(Remained) Remained,
        SUM(Removed) Removed
    from (    select Calendar.Dte, D.visitor,
            case when PD.visitor = D.visitor then 1 else 0 end as Remained,
            case when PD.visitor is null and D.visitor is not null then 1 else 0 end as Added,
            case when PD.visitor is not null and D.visitor is null then 1 else 0 end as Removed
        from #Calendar Calendar
            --    Assigning every visitor recorded within the specified period to every day within the period
            cross join (select visitor from #DailyVisits where dt between @From and @To group by visitor) V
            --    See if there was actually a visit for the visitor on the date
            left join #DailyVisits D on D.dt = Calendar.Dte and D.visitor = V.visitor
            --    See if there was actually a visit for the visitor on the previous date
            left join #DailyVisits PD on PD.visitor = V.visitor and PD.dt = DATEADD(dd, -1, Calendar.Dte )
        ) A
    group by Dte
    order by Dte

    Performance is marginally better than Jacob's #5.
    The performance difference with Steve's solution appears mainly due to creation and population of #MERGED_DATA table.

    _____________
    Code for TallyGenerator

  • Yeah, that's a very similar approach to dova's from the comments of the original article (I mentioned his solution briefly in my last post), but performs even a bit better than that one.

    Like dova's, it does perform better under the right circumstances, like with an actual table of dates, but when used with an iTVF to generate dates the performance gets very bad.

    Still, in terms of raw performance, that approach with a table of dates is the top of the heap so far.

    Cheers!

  • Thanks a lot Jeff for such an exploring content though you had understood the problem definition incorrect which helped us in different way..
    Thanks sgmunson  for clearly sitting top on this and for clear discussion on this.
    Finally heartfelt thanks to Sergiy who actually helped us me in understanding the solution clearly.
    Thanks again all of you.

  • Anjan@Sql - Friday, June 30, 2017 6:47 AM

    Thanks a lot Jeff for such an exploring content though you had understood the problem definition incorrect which helped us in different way..
    Thanks sgmunson  for clearly sitting top on this and for clear discussion on this.
    Finally heartfelt thanks to Sergiy who actually helped us me in understanding the solution clearly.
    Thanks again all of you.

    Glad I could help.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

Viewing 11 posts - 16 through 25 (of 25 total)

You must be logged in to reply to this topic. Login to reply