June 27, 2017 at 9:18 am
J Livingston SQL - Tuesday, June 27, 2017 6:20 AMI think that this is the origin of the question, along with the some of the solutionsby Itzik Ben-Gan
http://sqlmag.com/blog/tsql-challenge-reoccurring-visits
http://sqlmag.com/blog/solutions-tsql-challenge-reoccurring-visits
Cool. Thanks, Graham. Looks like I got the problem definition wrong after all. This is for day to day, not overall like my code has it. Leave it to Peter Larsson on these types of things. He has one of "those" brains.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 27, 2017 at 1:43 pm
Jeff Moden - Tuesday, June 27, 2017 9:18 AMJ Livingston SQL - Tuesday, June 27, 2017 6:20 AMI think that this is the origin of the question, along with the some of the solutionsby Itzik Ben-Gan
http://sqlmag.com/blog/tsql-challenge-reoccurring-visits
http://sqlmag.com/blog/solutions-tsql-challenge-reoccurring-visitsCool. Thanks, Graham. Looks like I got the problem definition wrong after all. This is for day to day, not overall like my code has it. Leave it to Peter Larsson on these types of things. He has one of "those" brains.
Okay, here goes. This appears to both solve the problem in less than a second even for 10,000+ rows of input data, although I'm getting a SORT operator and it's probably that I need to adjust an index somewhere.USE LOCAL_DB;
GO
SET NOCOUNT ON;
IF OBJECT_ID(N'dbo.DailyVisits', N'U') IS NOT NULL
BEGIN
DROP TABLE dbo.DailyVisits;
END;
GO
CREATE TABLE dbo.DailyVisits (
dt date NOT NULL,
visitor varchar(10) NOT NULL,
CONSTRAINT PK_DailyVisits PRIMARY KEY(dt, visitor)
);
CREATE NONCLUSTERED INDEX IX_DailyVisits_visitor_dt ON dbo.DailyVisits
(
visitor ASC,
dt ASC
);
--===== 10,000 visitors over a 5 year period
WITH RAW_DATA AS (
SELECT DISTINCT dt, visitor
FROM (
SELECT TOP (11840)
dt = DATEADD(dd,ABS(CHECKSUM(NEWID())%DATEDIFF(dd, '2011', '2017')), '2011'), --2017 is exclusive here
visitor = RIGHT(NEWID(), 1)
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
) AS X
)
INSERT INTO dbo.DailyVisits (dt, visitor)
SELECT RD.dt, RD.visitor
FROM RAW_DATA AS RD
LEFT OUTER JOIN dbo.DailyVisits AS DV
ON DV.dt = RD.dt
AND DV.visitor = RD.visitor
WHERE DV.dt IS NULL
ORDER BY dt, visitor;
SELECT *
FROM dbo.DailyVisits
ORDER BY dt, visitor;
--Solution:
DECLARE @Start AS date, @End AS date;
SELECT @Start = MIN(dt), @End = MAX(dt)
FROM dbo.DailyVisits;
CREATE TABLE #MERGED_DATA (
dt date NOT NULL,
prevday date NOT NULL,
visitor varchar(10) NULL
);
CREATE UNIQUE CLUSTERED INDEX IX_MERGED_DATA_dt_prevday_visitor ON #MERGED_DATA
(
dt ASC,
prevday ASC,
visitor ASC
);
CREATE UNIQUE NONCLUSTERED INDEX IX_MERGED_DATA_visitor_prevday_dt ON #MERGED_DATA
(
visitor ASC,
prevday ASC,
dt ASC
);
WITH E1 AS (
SELECT 1 AS N UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
CALENDAR AS (
SELECT X.dt, CAST(DATEADD(day, -1, X.dt) AS date) AS prevday
FROM (
SELECT TOP (10000) CAST(DATEADD(day, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) - 1, @Start) AS date) AS dt,
CAST(DATEADD(day, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) - 2, @Start) AS date) AS prevday
FROM E1 AS A, E1 AS B, E1 AS C, E1 AS D
) AS X
WHERE X.dt <= @End
)
INSERT INTO #MERGED_DATA(dt, prevday, visitor)
SELECT C.dt, C.prevday, DV.visitor
FROM CALENDAR AS C
LEFT OUTER JOIN dbo.DailyVisits AS DV
ON C.dt = DV.dt;
SELECT MD.dt,
COUNT(DISTINCT MD.visitor) AS NumVisits,
ISNULL(SUM(CASE WHEN MD.visitor IS NOT NULL AND M2.visitor IS NULL THEN 1 ELSE 0 END), 0) AS Added,
MIN(R.Removed) AS Removed,
ISNULL(SUM(CASE WHEN MD.visitor = M2.visitor THEN 1 ELSE 0 END), 0) AS Remained
FROM #MERGED_DATA AS MD
LEFT OUTER JOIN #MERGED_DATA AS M2
ON MD.prevday = M2.dt
AND MD.visitor = M2.visitor
OUTER APPLY (
SELECT COUNT(DISTINCT MD2.visitor) AS Removed
FROM #MERGED_DATA AS MD2
WHERE MD2.dt = MD.prevday
AND MD2.visitor NOT IN (
SELECT visitor
FROM #MERGED_DATA AS MD3
WHERE MD3.dt = MD.dt
AND MD3.visitor IS NOT NULL
)
) AS R
GROUP BY MD.dt
ORDER BY MD.dt;
DROP TABLE #MERGED_DATA;
DROP TABLE dbo.DailyVisits;
/* -- RESULTS SHOULD BE
dt #visits #added #removed #remained
2011-06-01 3 3 0 0
2011-06-02 2 0 1 2
2011-06-03 2 1 1 1
2011-06-04 0 0 2 0
2011-06-05 0 0 0 0
2011-06-06 0 0 0 0
2011-06-07 2 2 0 0
2011-06-08 3 2 1 1
*/
EDIT: added the actual execution plan...
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
June 27, 2017 at 2:18 pm
I love you Jeff Moden.
June 27, 2017 at 3:12 pm
I've been doing some testing of the other solutions in the article and comments (still have some to go, including another from Peso that might beat the rest).
Peso's version in the body of the article is probably the most clever/elegant to my eyes so far, but is beaten on performance by a couple versions in the comments.
His own #6 solution is significantly faster percentage-wise, and dova's query is faster still, by a narrow margin.
dova's isn't quite to spec, though, since his query only works by getting the MIN and MAX dt from DailyVisits and using that. It can be modified to accept variables for start and end, but that prevents its most efficient query plan, and it then becomes very, very slow.
Just to make it more accessible, here are Peso's solutions from the body of the article and the comments (just the #6), with some formatting applied:to remove all the white space and sprawl in the former case, and to provide any formatting at all in the latter case:
I also have the variables set to the range used in Steve's test data, not the test harness they used to test their solutions, as an FYI. Further, do note that these solutions require the GetDates function they define in the article (just a cascading CTE tally to generate all the dates in the range).
DECLARE
@from AS DATE = '20110101',
@to AS DATE = '20161231';
WITH cteSource(dt, NumVisits, Added, Removed)
AS
(
SELECT dt,
SUM(NumVisits) AS NumVisits,
SUM(Added) AS Added,
SUM(Removed) AS Removed
FROM (
SELECT DATEADD(DAY, f.DayDelta, w.dt) AS dt,
w.Visitor,
f.NumVisits,
f.Added,
f.Removed
FROM dbo.DailyVisits AS w
CROSS JOIN
(
VALUES (0, 1, 1, -1),
(1, 0, -1, 1)
) AS f(DayDelta, NumVisits, Added, Removed)
WHERE w.dt BETWEEN DATEADD(DAY, -1, @From) AND @To
) AS d
WHERE dt BETWEEN @From AND @To
GROUP BY dt,
Visitor
)
SELECT dt,
SUM(NumVisits) AS NumVisits,
SUM(Added) AS Added,
SUM(Removed) AS Removed,
SUM(Remained) AS Remained
FROM (
SELECT dt,
SUM(NumVisits) AS NumVisits,
SUM(CASE WHEN Added = 1 THEN 1 ELSE 0 END) AS Added,
SUM(CASE WHEN Removed = 1 THEN 1 ELSE 0 END) AS Removed,
SUM(CASE WHEN 1 IN (Added, Removed) THEN 0 ELSE 1 END) AS Remained
FROM cteSource
GROUP BY dt
UNION ALL
SELECT dt,
0 AS NumVisits,
0 AS Added,
0 AS Removed,
0 AS Remained
FROM dbo.GetDates(@From, @To)
) AS d
GROUP BY dt;
And the #6:
DECLARE
@from AS DATE = '20110101',
@to AS DATE = '20161231';
WITH cte AS
(
SELECT f.dt,
f.NumVisits,
f.Visitor,
f.Added,
f.Removed,
f.Remained
FROM dbo.GetDates(@From, @To) AS gd
OUTER APPLY
(
SELECT
q.dt,
q.Visitor,
SUM(q.NumVisits) AS NumVisits,
CASE WHEN SUM(q.Added) > 0 THEN 1 ELSE 0 END AS Added,
CASE WHEN SUM(q.Removed) > 0 THEN 1 ELSE 0 END AS Removed,
CASE WHEN COUNT(*) = 2 THEN 1 ELSE 0 END AS Remained
FROM
(
SELECT gd.dt,
d.Visitor,
1 - d.theDiff AS NumVisits,
1 - 2 * d.theDiff AS Added,
2 * d.theDiff - 1 AS Removed
FROM (
SELECT dv.Visitor,
CASE WHEN dv.dt = gd.dt THEN 0 ELSE 1 END AS theDiff
FROM dbo.DailyVisits AS dv
WHERE dv.dt BETWEEN DATEADD(DAY, -1, gd.dt) AND gd.dt
) AS d
) AS q
GROUP BY q.dt,
q.Visitor
) AS f
)
SELECT dt,
SUM(NumVisits) AS NumVisits,
SUM(Added) AS Added,
SUM(Removed) AS Removed,
SUM(Remained) AS Remained
FROM cte
GROUP BY dt;
Cheers!
June 27, 2017 at 5:51 pm
autoexcrement - Tuesday, June 27, 2017 2:18 PMI love you Jeff Moden.
Thanks, but nah... I got the problem definition wrong.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 28, 2017 at 11:29 am
Jeff Moden - Tuesday, June 27, 2017 5:51 PMautoexcrement - Tuesday, June 27, 2017 2:18 PMI love you Jeff Moden.Thanks, but nah... I got the problem definition wrong.
And there I was thinking that the "love" post was just kinda creepy...
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
June 28, 2017 at 6:15 pm
Jeff is awesome.
June 29, 2017 at 6:55 am
The solutions posted seem to have too twisted logic.
A bit difficult to follow.
I believe it must be way simpler.
First part is obvious:declare @From datetime, @To Datetime
select @From = dateadd(dd, -2, MIN(Dt) ), @To = MAX(Dt)
FROM #DailyVisits
Then, ideally we should have a Calendar table - a Tally table for dates.
If there is no such table in place, we have to create it on fly:select DATEADD(dd, N, @From) Dte
INTO #Calendar
from dbo.TallyGenerator (0, DATEDIFF(dd, @From, @To ), null, 1)
Function TallyGenerator was posted on this site is Scripts.
Parameters: From Number, To Number, Number or rows to return (ignored if previous 2 are populated), Step.
Then we apply all the visitors recorded in the period to every day in the period:select *
from #Calendar Calendar
cross join (select visitor from #DailyVisits where dt between @From and @To group by visitor) V
Next, we check if there were actual visits for any customer on the date, and on the date before:select Calendar.Dte,
D.visitor, -- if not NULL then the visitor was in here today
case when PD.visitor = D.visitor then 1 else 0 end as Remained, -- was in on the previous day, and is in here today
case when PD.visitor is null and D.visitor is not null then 1 else 0 end as Added, -- was not here previous day, but in today
case when PD.visitor is not null and D.visitor is null then 1 else 0 end as Removed -- was in previous day, not here today
from #Calendar Calendar
-- Assigning every visitor recorded within the specified period to every day within the period
cross join (select visitor from #DailyVisits where dt between @From and @To group by visitor) V
-- See if there was actually a visit for the visitor on the date
left join #DailyVisits D on D.dt = Calendar.Dte and D.visitor = V.visitor
-- See if there was actually a visit for the visitor on the previous date
left join #DailyVisits PD on PD.visitor = V.visitor and PD.dt = DATEADD(dd, -1, Calendar.Dte )
And the last, easiest part, aggregation per day:select Dte,
Count(visitor) NumOfVisits,
SUM(Added) Added,
SUM(Remained) Remained,
SUM(Removed) Removed
from ( select Calendar.Dte, D.visitor,
case when PD.visitor = D.visitor then 1 else 0 end as Remained,
case when PD.visitor is null and D.visitor is not null then 1 else 0 end as Added,
case when PD.visitor is not null and D.visitor is null then 1 else 0 end as Removed
from #Calendar Calendar
-- Assigning every visitor recorded within the specified period to every day within the period
cross join (select visitor from #DailyVisits where dt between @From and @To group by visitor) V
-- See if there was actually a visit for the visitor on the date
left join #DailyVisits D on D.dt = Calendar.Dte and D.visitor = V.visitor
-- See if there was actually a visit for the visitor on the previous date
left join #DailyVisits PD on PD.visitor = V.visitor and PD.dt = DATEADD(dd, -1, Calendar.Dte )
) A
group by Dte
order by Dte
Performance is marginally better than Jacob's #5.
The performance difference with Steve's solution appears mainly due to creation and population of #MERGED_DATA table.
_____________
Code for TallyGenerator
June 29, 2017 at 8:26 am
Yeah, that's a very similar approach to dova's from the comments of the original article (I mentioned his solution briefly in my last post), but performs even a bit better than that one.
Like dova's, it does perform better under the right circumstances, like with an actual table of dates, but when used with an iTVF to generate dates the performance gets very bad.
Still, in terms of raw performance, that approach with a table of dates is the top of the heap so far.
Cheers!
June 30, 2017 at 6:47 am
Thanks a lot Jeff for such an exploring content though you had understood the problem definition incorrect which helped us in different way..
Thanks sgmunson for clearly sitting top on this and for clear discussion on this.
Finally heartfelt thanks to Sergiy who actually helped us me in understanding the solution clearly.
Thanks again all of you.
June 30, 2017 at 10:56 am
Anjan@Sql - Friday, June 30, 2017 6:47 AMThanks a lot Jeff for such an exploring content though you had understood the problem definition incorrect which helped us in different way..
Thanks sgmunson for clearly sitting top on this and for clear discussion on this.
Finally heartfelt thanks to Sergiy who actually helped us me in understanding the solution clearly.
Thanks again all of you.
Glad I could help.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
Viewing 11 posts - 16 through 25 (of 25 total)
You must be logged in to reply to this topic. Login to reply