Hello,
I've written a query that is way to slow and I'm woundering if someone has a smarter (and correct) way to solve my problem and care to share that. Also, my query doesn't produce the perfect outcome, barely good enough.
What needs to be done is to find how many employees (for each GroupingId) there are at each time during a day, from 00:00 to 23:59 (24 hour time) grouped by min and max times, for every date. If number of employees stays the same the time interval continues.
The source table has 8 weeks of data and is updated daily (truncate/load)
Ideally the query is fast enough for a view, that way the daily load can stay the as is. Atm the view is so slow that I have to do it as a separate insert step after the daily update, moving the logic from the view to a proc, because I can't have a view that takes up to 2 minutes to execute.
In source table each row is allways for one date, so if the work shift continues over midnight it's broken into two parts. But for expected output the time interval can continue over midnight.
In my query the EndDateTime sometime is the minute before actual EndDateTime which is fine but not optimal.
It also cuts at midnight, also fine but not optimal.
Not good is that I get wrong number of Employees if there are the same EndDateTime as StartDateTime (see hour 01:00 and 16:00 from my query)
Expected output
GroupingId|StartDateTime|EndDateTime|NumberOfEmployees
1|2023-10-16 19:00:00|2023-10-17 08:00:00|1
1|2023-10-17 08:00:00|2023-10-17 10:00:00|2
1|2023-10-17 10:00:00|2023-10-17 10:30:00|0
1|2023-10-17 10:30:00|2023-10-17 16:00:00|1
1|2023-10-17 16:00:00|2023-10-18 00:00:00|2
Sample data
create table dbo.TimeShift(
GroupingId int
, EmpId int
, RegDate date
, StartDateTime datetime2(0)
, EndDateTime datetime2(0)
)
insert into dbo.TimeShift (GroupingId, Empid, RegDate, StartDateTime, EndDateTime) values
(1, 1, '2023-10-16', '2023-10-16 19:00:00', '2023-10-17 00:00:00'),
(1, 1, '2023-10-17', '2023-10-17 00:00:00', '2023-10-17 01:00:00'),
(1, 7, '2023-10-17', '2023-10-17 01:00:00', '2023-10-17 10:00:00'),
(1, 2, '2023-10-17', '2023-10-17 08:00:00', '2023-10-17 10:00:00'),
(1, 2, '2023-10-17', '2023-10-17 10:30:00', '2023-10-17 16:00:00'),
(1, 3, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(1, 4, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00')
http://sqlfiddle.com/#!18/401a43/6
My query
WITH CTE_cal AS(
SELECT
DATEADD(
MINUTE
, rn
, CAST((SELECT MIN(RegDate) FROM dbo.TimeShift) AS datetime2(0))
) AS CalendarMinuteId
FROM(
SELECT TOP (83520) --all minutes between "MIN(RegDate) AND 8 weeks ahead plus one day"
ROW_NUMBER() OVER(ORDER BY a.object_id) - 1 AS rn
FROM
sys.all_objects AS a
CROSS APPLY
sys.all_objects AS b
) AS n
), CTE_Presence_Intervals AS(
SELECT
GroupingId
, RegDate
, NumberOfEmployees
, MIN(CalendarMinuteId) AS StartDateTime
, MAX(CalendarMinuteId) AS EndDateTime
FROM(
SELECT
*
,COUNT(InitGroup) OVER (ORDER BY GroupingId, RegDate, CalendarMinuteId ROWS UNBOUNDED PRECEDING) AS Grp
FROM(
SELECT
*
,DATEDIFF(MINUTE, LAG(CalendarMinuteId, 1) OVER(PARTITION BY GroupingId ORDER BY CalendarMinuteId), CalendarMinuteId) AS diffwithnextrow
,CASE WHEN
DATEDIFF(MINUTE, LAG(CalendarMinuteId, 1) OVER(PARTITION BY GroupingId, NumberOfEmployees ORDER BY CalendarMinuteId), CalendarMinuteId) IS NULL
THEN 1
ELSE CASE
WHEN DATEDIFF(MINUTE, LAG(CalendarMinuteId, 1) OVER(PARTITION BY GroupingId, NumberOfEmployees ORDER BY CalendarMinuteId), CalendarMinuteId) > 1
THEN 1
ELSE NULL
END
END AS InitGroup
FROM(
SELECT
GroupingId
, RegDate
, CalendarMinuteId
, COUNT(DISTINCT EmpId) AS NumberOfEmployees
FROM(
SELECT --top 3000
tr.GroupingId
, tr.RegDate
, cal.CalendarMinuteId
, tr.EmpId
FROM
dbo.TimeShift AS tr
LEFT OUTER JOIN
CTE_cal AS cal
ON cal.CalendarMinuteId BETWEEN tr.StartDateTime AND tr.EndDateTime
WHERE 1=1
GROUP BY GroupingId, tr.RegDate, cal.CalendarMinuteId, tr.EmpId
) AS n
GROUP BY GroupingId, RegDate, CalendarMinuteId
) AS y
) AS x
) AS m
GROUP BY
GroupingId
, RegDate
, NumberOfEmployees
, Grp
)
SELECT
*
FROM
CTE_Presence_Intervals
UNION ALL
SELECT
GroupingId
, RegDate
, 0 AS NumberOfEmployees
, PreviousEndDateTime AS StartDateTime
, StartDateTime AS EndDateTime
FROM(
SELECT
*
, DATEDIFF(MINUTE, LAG(EndDateTime, 1) OVER(PARTITION BY GroupingId ORDER BY StartDateTime), StartDateTime) AS DiffPreviousEndDateTime
, LAG(EndDateTime, 1) OVER(PARTITION BY GroupingId ORDER BY StartDateTime) AS PreviousEndDateTime
FROM
CTE_Presence_Intervals
) AS n
WHERE 1=1
AND DiffPreviousEndDateTime > 1
ORDER BY 4
Any help is appreciated!
Try something like:
WITH InOuts
AS
(
SELECT T.GroupingId, X.InOut
,CASE
WHEN X.InOut = 1
THEN DATEADD(minute, DATEDIFF(minute, '2020', T.StartDateTime), '2020')
ELSE DATEADD(minute, DATEDIFF(minute, '2020', T.EndDateTime), '2020')
END AS InOutDate
FROM dbo.TimeShift T
CROSS APPLY (VALUES (1), (-1)) X (InOut)
)
,Nums
AS
(
SELECT GroupingId, InOutDate
,SUM(InOut) OVER (PARTITION BY GroupingId ORDER BY InOutDate) AS NumOfEmps
FROM InOuts
)
,Boundaries
AS
(
SELECT GroupingId, InOutDate, NumOfEmps
,CASE
WHEN NumOfEmps <> LAG(NumOfEmps) OVER (PARTITION BY GroupingId ORDER BY InOutDate)
THEN 1
ELSE 0
END AS Boundary
FROM Nums
)
,Grps
AS
(
SELECT GroupingId, InOutDate, NumOfEmps
,SUM(Boundary) OVER (PARTITION BY GroupingId ORDER BY InOutDate) AS Grp
,LEAD(InOutDate) OVER (PARTITION BY GroupingId ORDER BY InOutDate) AS NextInOutDate
FROM Boundaries
)
SELECT GroupingId
,MIN(InOutDate) AS StartDateTime
,MAX(NextInOutDate) AS EndDateTime
,MIN(NumOfEmps) AS NumOfEmps
FROM Grps
WHERE NextInOutDate IS NOT NULL
GROUP BY GroupingId, Grp
HAVING MIN(InOutDate) <> MAX(NextInOutDate)
ORDER BY GroupingId, StartDateTime;
October 17, 2023 at 1:11 pm
Wow that is so amazing, thank you so much!
One qustion: whats the purpose of
THEN DATEADD(minute, DATEDIFF(minute, '2020', T.StartDateTime), '2020')
ELSE DATEADD(minute, DATEDIFF(minute, '2020', T.EndDateTime), '2020')
instead of
THEN T.StartDateTime
ELSE T.EndDateTime
Looks like it produces the same result, but guess I'm missing something.
October 17, 2023 at 2:10 pm
One qustion: whats the purpose of
THEN DATEADD(minute, DATEDIFF(minute, '2020', T.StartDateTime), '2020')
ELSE DATEADD(minute, DATEDIFF(minute, '2020', T.EndDateTime), '2020')instead of
THEN T.StartDateTime
ELSE T.EndDateTimeLooks like it produces the same result, but guess I'm missing something.
You implied per minute so this rounds the time down to the minute. As your test data has no seconds the results will be the same.
You might also like to experiment with making the Nums CTE a SELECT DISTINCT. As DISTINCT is expensive I suspect the number of rows, going into the Boundaries CTE, would need to be less than 15% of Nums without DISTINCT to make it worthwhile.
October 17, 2023 at 2:25 pm
Thanks for your reply.
I've tested the query on a larger set of testdata and it's very fast. But will experiment with distinct as well.
October 17, 2023 at 2:33 pm
This gives the same results as Ken's query. (NOTE: I changed your permanent table to a temp table.)
WITH InOutStatuses AS
(
SELECT ts.GroupingId, i.InOutDate, SUM(i.InOutStatus) AS EmployeeChanges, MAX(i.InOutDate) OVER(PARTITION BY ts.GroupingId) AS EndDate
FROM #TimeShift AS ts
CROSS APPLY (VALUES(ts.StartDateTime, 1), (ts.EndDateTime, -1)) AS i(InOutDate, InOutStatus)
GROUP BY ts.GroupingId, i.InOutDate
HAVING SUM(i.InOutStatus) <> 0
)
SELECT i.GroupingId, i.InOutDate, LEAD(i.InOutDate, 1, i.EndDate) OVER(PARTITION BY i.GroupingId ORDER BY i.InOutDate), SUM(i.EmployeeChanges) OVER(ORDER BY i.InOutDate ROWS UNBOUNDED PRECEDING) AS NumEmployees
FROM InOutStatuses AS i
WHERE i.InOutDate < i.EndDate
ORDER BY i.InOutDate
Here is a comparison of the reads:
-- Ken's query
Table 'Worktable'. Scan count 18, logical reads 146, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#TimeShift___000000009E6D'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
-- Drew's query
Table 'Worktable'. Scan count 3, logical reads 17, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#TimeShift___000000009E6D'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 17, 2023 at 3:47 pm
...
In source table each row is allways for one date, so if the work shift continues over midnight it's broken into two parts. But for expected output the time interval can continue over midnight.
In my query the EndDateTime sometime is the minute before actual EndDateTime which is fine but not optimal.
It also cuts at midnight, also fine but not optimal.
Imo if the EndDateTime is sometimes the minute before it makes sense to address this issue when the data gets loaded to the table. Fixing "close but not matching" issues like this in a query is often problematic. Suppose I just want to fix the interval
drop TABLE if exists #TimeShift;
go
create table #TimeShift(
GroupingId int
, EmpId int
, RegDate date
, StartDateTime datetime2(0)
, EndDateTime datetime2(0)
)
insert into #TimeShift (GroupingId, Empid, RegDate, StartDateTime, EndDateTime) values
(1, 1, '2023-10-16', '2023-10-16 19:00:00', '2023-10-17 00:00:00'),
(1, 1, '2023-10-17', '2023-10-17 00:00:00', '2023-10-17 01:00:00'),
(1, 7, '2023-10-17', '2023-10-17 01:00:00', '2023-10-17 10:00:00'),
(1, 2, '2023-10-17', '2023-10-17 08:00:00', '2023-10-17 10:00:00'),
(1, 2, '2023-10-17', '2023-10-17 10:30:00', '2023-10-17 16:00:00'),
(1, 3, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(1, 4, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(2, 1, '2023-10-16', '2023-10-16 19:00:00', '2023-10-16 23:59:00'),/* changed */
(2, 1, '2023-10-17', '2023-10-17 00:00:00', '2023-10-17 01:00:00'),
(2, 7, '2023-10-17', '2023-10-17 01:00:00', '2023-10-17 10:00:00'),
(2, 2, '2023-10-17', '2023-10-17 08:00:00', '2023-10-17 10:00:00'),
(2, 2, '2023-10-17', '2023-10-17 10:30:00', '2023-10-17 16:00:00'),
(2, 3, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(2, 4, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(3, 1, '2023-10-16', '2023-10-16 19:00:00', '2023-10-17 00:00:00'),
(3, 1, '2023-10-17', '2023-10-17 00:00:00', '2023-10-17 01:00:00'),
(3, 7, '2023-10-17', '2023-10-17 01:00:00', '2023-10-17 10:00:00'),
(3, 2, '2023-10-17', '2023-10-17 08:00:00', '2023-10-17 10:00:00'),
(3, 2, '2023-10-17', '2023-10-17 10:30:00', '2023-10-17 16:00:00'),
(3, 3, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(3, 4, '2023-10-17', '2023-10-17 16:00:00', '2023-10-17 23:58:00'); /* changed */
/* fix the EndDateTime */
declare @minutes_to_midnight int=3;
select ts.*, ddif.dtzm, v.FixedEndDateTime
from #TimeShift ts
cross apply (values (1440+datediff(minute, cast(ts.EndDateTime as time), cast('00:00:00' as time)))) ddif(dtzm)
cross apply (values (iif(ddif.dtzm between 1 and @minutes_to_midnight,
dateadd(minute, ddif.dtzm, ts.EndDateTime),
ts.EndDateTime))) v(FixedEndDateTime);
/* fix the interval */
select ts.*, ddif.dtzm, v.FixedEndDateTime, coalesce(interval.FinalEndDateTime, v.FixedEndDateTime) FinalEndDateTime
from #TimeShift ts
cross apply (values (1440+datediff(minute, cast(ts.EndDateTime as time), cast('00:00:00' as time)))) ddif(dtzm)
cross apply (values (iif(ddif.dtzm between 1 and @minutes_to_midnight,
dateadd(minute, ddif.dtzm, ts.EndDateTime),
ts.EndDateTime))) v(FixedEndDateTime)
outer apply (select tt.EndDateTime
from #TimeShift tt
where tt.GroupingId=ts.GroupingId
and tt.EmpId=ts.EmpId
and tt.StartDateTime=v.FixedEndDateTime) interval(FinalEndDateTime)
where not exists (select 1
from #TimeShift tt
where tt.GroupingId=ts.GroupingId
and tt.EmpId=ts.EmpId
and tt.EndDateTime=ts.StartDateTime); /* should be the FixedEndDateTime */
My query was going to OUTER APPLY Drew Allen's query to the fixed interval above. In the NOT EXISTS condition in the WHERE clause the JOIN condition should use the FixedEndDateTime imo
Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können
October 17, 2023 at 3:59 pm
This gives the same results as Ken's query. (NOTE: I changed your permanent table to a temp table.)
WITH InOutStatuses AS
(
SELECT ts.GroupingId, i.InOutDate, SUM(i.InOutStatus) AS EmployeeChanges, MAX(i.InOutDate) OVER(PARTITION BY ts.GroupingId) AS EndDate
FROM #TimeShift AS ts
CROSS APPLY (VALUES(ts.StartDateTime, 1), (ts.EndDateTime, -1)) AS i(InOutDate, InOutStatus)
GROUP BY ts.GroupingId, i.InOutDate
HAVING SUM(i.InOutStatus) <> 0
)
SELECT i.GroupingId, i.InOutDate, LEAD(i.InOutDate, 1, i.EndDate) OVER(PARTITION BY i.GroupingId ORDER BY i.InOutDate), SUM(i.EmployeeChanges) OVER(ORDER BY i.InOutDate ROWS UNBOUNDED PRECEDING) AS NumEmployees
FROM InOutStatuses AS i
WHERE i.InOutDate < i.EndDate
ORDER BY i.InOutDateHere is a comparison of the reads:
-- Ken's query
Table 'Worktable'. Scan count 18, logical reads 146, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#TimeShift___000000009E6D'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
-- Drew's query
Table 'Worktable'. Scan count 3, logical reads 17, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#TimeShift___000000009E6D'. Scan count 1, logical reads 1, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.Drew
Thank you Drew!
Alfred wrote:... In source table each row is allways for one date, so if the work shift continues over midnight it's broken into two parts. But for expected output the time interval can continue over midnight.
In my query the EndDateTime sometime is the minute before actual EndDateTime which is fine but not optimal.
It also cuts at midnight, also fine but not optimal.
Imo if the EndDateTime is sometimes the minute before it makes sense to address this issue when the data gets loaded to the table. Fixing "close but not matching" issues like this in a query is often problematic. Suppose I just want to fix the interval
drop TABLE if exists #TimeShift;
go
create table #TimeShift(
GroupingId int
, EmpId int
, RegDate date
, StartDateTime datetime2(0)
, EndDateTime datetime2(0)
)
insert into #TimeShift (GroupingId, Empid, RegDate, StartDateTime, EndDateTime) values
(1, 1, '2023-10-16', '2023-10-16 19:00:00', '2023-10-17 00:00:00'),
(1, 1, '2023-10-17', '2023-10-17 00:00:00', '2023-10-17 01:00:00'),
(1, 7, '2023-10-17', '2023-10-17 01:00:00', '2023-10-17 10:00:00'),
(1, 2, '2023-10-17', '2023-10-17 08:00:00', '2023-10-17 10:00:00'),
(1, 2, '2023-10-17', '2023-10-17 10:30:00', '2023-10-17 16:00:00'),
(1, 3, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(1, 4, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(2, 1, '2023-10-16', '2023-10-16 19:00:00', '2023-10-16 23:59:00'),/* changed */
(2, 1, '2023-10-17', '2023-10-17 00:00:00', '2023-10-17 01:00:00'),
(2, 7, '2023-10-17', '2023-10-17 01:00:00', '2023-10-17 10:00:00'),
(2, 2, '2023-10-17', '2023-10-17 08:00:00', '2023-10-17 10:00:00'),
(2, 2, '2023-10-17', '2023-10-17 10:30:00', '2023-10-17 16:00:00'),
(2, 3, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(2, 4, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(3, 1, '2023-10-16', '2023-10-16 19:00:00', '2023-10-17 00:00:00'),
(3, 1, '2023-10-17', '2023-10-17 00:00:00', '2023-10-17 01:00:00'),
(3, 7, '2023-10-17', '2023-10-17 01:00:00', '2023-10-17 10:00:00'),
(3, 2, '2023-10-17', '2023-10-17 08:00:00', '2023-10-17 10:00:00'),
(3, 2, '2023-10-17', '2023-10-17 10:30:00', '2023-10-17 16:00:00'),
(3, 3, '2023-10-17', '2023-10-17 16:00:00', '2023-10-18 00:00:00'),
(3, 4, '2023-10-17', '2023-10-17 16:00:00', '2023-10-17 23:58:00'); /* changed */
/* fix the EndDateTime */
declare @minutes_to_midnight int=3;
select ts.*, ddif.dtzm, v.FixedEndDateTime
from #TimeShift ts
cross apply (values (1440+datediff(minute, cast(ts.EndDateTime as time), cast('00:00:00' as time)))) ddif(dtzm)
cross apply (values (iif(ddif.dtzm between 1 and @minutes_to_midnight,
dateadd(minute, ddif.dtzm, ts.EndDateTime),
ts.EndDateTime))) v(FixedEndDateTime);
/* fix the interval */
select ts.*, ddif.dtzm, v.FixedEndDateTime, coalesce(interval.FinalEndDateTime, v.FixedEndDateTime) FinalEndDateTime
from #TimeShift ts
cross apply (values (1440+datediff(minute, cast(ts.EndDateTime as time), cast('00:00:00' as time)))) ddif(dtzm)
cross apply (values (iif(ddif.dtzm between 1 and @minutes_to_midnight,
dateadd(minute, ddif.dtzm, ts.EndDateTime),
ts.EndDateTime))) v(FixedEndDateTime)
outer apply (select tt.EndDateTime
from #TimeShift tt
where tt.GroupingId=ts.GroupingId
and tt.EmpId=ts.EmpId
and tt.StartDateTime=v.FixedEndDateTime) interval(FinalEndDateTime)
where not exists (select 1
from #TimeShift tt
where tt.GroupingId=ts.GroupingId
and tt.EmpId=ts.EmpId
and tt.EndDateTime=ts.StartDateTime); /* should be the FixedEndDateTime */My query was going to OUTER APPLY Drew Allen's query to the fixed interval above. In the NOT EXISTS condition in the WHERE clause the JOIN condition should use the FixedEndDateTime imo
What I meant by a minute before is the result my bad query produced, not the source data. The source time shift rows can start and end at any minute during the day, never in seconds. And never over midnight. So regardless of type of query it should show same EndDateTime as the actual source row.
I can't try any of the answers now but will do that first thing tomorrow.
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply