Hi,
I'm trying to figure out how to group a set of data based on how many days a person is sick. The problem is a gaps and island issue combined with running streak, with a twist.
The records are either of type Work or Sick and are only saved on it's registered date. So there are gaps and islands. I think it's easies to explain based of an example from SQL Fiddle
A group of consecutive rows of type Sick are considered a group if there's not more than 5 days of work between two rows of Sick. Or if there's not more than 7 days between without any Work between. If there's Work between two rows of type Sick the number of sickdays are only incremented by "missing" days between a row of Sick until next row of Work (see date 2022-03-11).
If I do this example in Excel it's easy. But there it's possible to reference the calculated column for previous row to easily increment. The formula then looks like this (I've manually added 0 to previous row before a new group by type Sick:
Column AColumn BColumn C
2022-03-01Work0
2022-03-03Sick=IF(A3-A2>5;0;IF(B3="Sick"; IF(B2="Sick";A3-A2+C2;C2+1);IF(B2="Sick";C2+A3-1-A2;C2)))
In SQL I've solved my problem using Recursive CTE, but that takes way to long. I've manged do narrow it down to roughly 3 seconds per person, but Im dealing with hundreds of thousands of rows so that solution is not doable.
Is it possible to achive the numbering showing same as GoalNbr using SQL technique for gaps and islands (lag, lead, row_number, dense_rank and so on), and how would that solution look?
Regards,
Alfred
SQL Code (if link gets broken)
create table sample
(
RegDate date
, Type varchar(10)
, GoalNbr int
)
insert into sample values ('2022-02-15', 'Work', null)
insert into sample values ('2022-02-17', 'Work', null)
insert into sample values ('2022-02-18', 'Work', null)
insert into sample values ('2022-02-21', 'Work', null)
insert into sample values ('2022-02-22', 'Work', null)
insert into sample values ('2022-02-23', 'Work', null)
insert into sample values ('2022-02-24', 'Sick', 1)
insert into sample values ('2022-02-25', 'Work', null)
insert into sample values ('2022-02-26', 'Work', null)
insert into sample values ('2022-02-27', 'Work', null)
insert into sample values ('2022-02-28', 'Work', null)
insert into sample values ('2022-03-01', 'Work', null)
insert into sample values ('2022-03-03', 'Sick', 1)
insert into sample values ('2022-03-07', 'Sick', 5)
insert into sample values ('2022-03-08', 'Sick', 6)
insert into sample values ('2022-03-10', 'Work', null)
insert into sample values ('2022-03-11', 'Sick', 8)
insert into sample values ('2022-03-14', 'Work', null)
insert into sample values ('2022-03-15', 'Work', null)
insert into sample values ('2022-03-17', 'Sick', 11)
insert into sample values ('2022-03-18', 'Work', null)
insert into sample values ('2022-03-19', 'Work', null)
insert into sample values ('2022-03-20', 'Work', null)
insert into sample values ('2022-03-21', 'Sick', 12)
insert into sample values ('2022-03-22', 'Work', null)
insert into sample values ('2022-03-24', 'Sick', 13)
insert into sample values ('2022-03-25', 'Work', null)
insert into sample values ('2022-03-28', 'Work', null)
insert into sample values ('2022-03-29', 'Sick', 14)
insert into sample values ('2022-03-31', 'Sick', 15)
insert into sample values ('2022-04-01', 'Sick', 16)
insert into sample values ('2022-04-04', 'Work', null)
insert into sample values ('2022-04-05', 'Work', null)
insert into sample values ('2022-04-07', 'Work', null)
select * from sample order by RegDate
February 20, 2023 at 5:31 pm
I don't understand how you get to those results, can you attach the spreadsheet with some data and the formula?
As you have not included any examples of the 7 day condition in the test data I am going to ignore it.
What I have done works with the test data and should give you an idea of how to achieve your result.
WITH Boundaries
AS
(
SELECT RegDate, [Type], GoalNbr
,CASE
WHEN DATEDIFF(day, COALESCE(MAX(CASE WHEN [Type] = 'Sick' THEN RegDate ELSE '1900' END) OVER (ORDER BY RegDate ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING), '1900'), RegDATE) > 6
AND [Type] = 'Sick'
THEN 1
ELSE 0
END AS Boundary5
,CASE
WHEN LAG([Type]) OVER (ORDER BY RegDate) = 'Sick'
THEN (DATEDIFF(day, LAG(RegDate) OVER (ORDER BY RegDate), RegDate)) - 1
ELSE 0
END AS Days2Add
FROM [sample]
)
,Grps
AS
(
SELECT RegDate, [Type], GoalNbr
,SUM(Boundary5) OVER (ORDER BY RegDate) AS Grp
,Days2Add
FROM Boundaries
)
,RNs
AS
(
SELECT RegDate, [Type], GoalNbr, Grp, Days2Add
,ROW_NUMBER() OVER (PARTITION BY Grp, [Type] ORDER BY RegDate) AS rn
FROM Grps
)
SELECT RegDate, [Type], GoalNbr
,CASE
WHEN [Type] = 'Sick'
THEN rn + SUM(Days2Add) OVER (PARTITION BY Grp ORDER BY RegDate)
END AS CalcValue
FROM RNs
ORDER BY RegDate;
February 21, 2023 at 10:02 am
I've done some testing and when I added partition by PersonId and Group7 it gives me exactly what I want.
The trick with datediff (Boundary5 in CTE Boundaries) was what I missed. Very clever solution
Thank you so much Ken ❤️
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply