Advanced grouping for gaps and islands

Question

Post reply

Advanced grouping for gaps and islands

Alfred

SSC-Addicted

Points: 436

February 20, 2023 at 9:12 am

Go to Answer

#4151866

Hi,

I'm trying to figure out how to group a set of data based on how many days a person is sick. The problem is a gaps and island issue combined with running streak, with a twist.

The records are either of type Work or Sick and are only saved on it's registered date. So there are gaps and islands. I think it's easies to explain based of an example from SQL Fiddle

A group of consecutive rows of type Sick are considered a group if there's not more than 5 days of work between two rows of Sick. Or if there's not more than 7 days between without any Work between. If there's Work between two rows of type Sick the number of sickdays are only incremented by "missing" days between a row of Sick until next row of Work (see date 2022-03-11).

If I do this example in Excel it's easy. But there it's possible to reference the calculated column for previous row to easily increment. The formula then looks like this (I've manually added 0 to previous row before a new group by type Sick:

Column AColumn BColumn C
2022-03-01Work0
2022-03-03Sick=IF(A3-A2>5;0;IF(B3="Sick"; IF(B2="Sick";A3-A2+C2;C2+1);IF(B2="Sick";C2+A3-1-A2;C2)))

In SQL I've solved my problem using Recursive CTE, but that takes way to long. I've manged do narrow it down to roughly 3 seconds per person, but Im dealing with hundreds of thousands of rows so that solution is not doable.

Is it possible to achive the numbering showing same as GoalNbr using SQL technique for gaps and islands (lag, lead, row_number, dense_rank and so on), and how would that solution look?

Regards,

Alfred

SQL Code (if link gets broken)

create table sample
(
  RegDate date
  , Type varchar(10)
  , GoalNbr int
)

insert into sample values ('2022-02-15', 'Work', null)
insert into sample values ('2022-02-17', 'Work', null)
insert into sample values ('2022-02-18', 'Work', null)
insert into sample values ('2022-02-21', 'Work', null)
insert into sample values ('2022-02-22', 'Work', null)
insert into sample values ('2022-02-23', 'Work', null)
insert into sample values ('2022-02-24', 'Sick', 1)
insert into sample values ('2022-02-25', 'Work', null)
insert into sample values ('2022-02-26', 'Work', null)
insert into sample values ('2022-02-27', 'Work', null)
insert into sample values ('2022-02-28', 'Work', null)
insert into sample values ('2022-03-01', 'Work', null)
insert into sample values ('2022-03-03', 'Sick', 1)
insert into sample values ('2022-03-07', 'Sick', 5)
insert into sample values ('2022-03-08', 'Sick', 6)
insert into sample values ('2022-03-10', 'Work', null)
insert into sample values ('2022-03-11', 'Sick', 8)
insert into sample values ('2022-03-14', 'Work', null)
insert into sample values ('2022-03-15', 'Work', null)
insert into sample values ('2022-03-17', 'Sick', 11)
insert into sample values ('2022-03-18', 'Work', null)
insert into sample values ('2022-03-19', 'Work', null)
insert into sample values ('2022-03-20', 'Work', null)
insert into sample values ('2022-03-21', 'Sick', 12)
insert into sample values ('2022-03-22', 'Work', null)
insert into sample values ('2022-03-24', 'Sick', 13)
insert into sample values ('2022-03-25', 'Work', null)
insert into sample values ('2022-03-28', 'Work', null)
insert into sample values ('2022-03-29', 'Sick', 14)
insert into sample values ('2022-03-31', 'Sick', 15)
insert into sample values ('2022-04-01', 'Sick', 16)
insert into sample values ('2022-04-04', 'Work', null)
insert into sample values ('2022-04-05', 'Work', null)
insert into sample values ('2022-04-07', 'Work', null)

select * from sample order by RegDate

Ken McKelvey

SSCoach

Points: 18842

February 20, 2023 at 11:15 pm

Answer

#4152109

As you have not included any examples of the 7 day condition in the test data I am going to ignore it.

What I have done works with the test data and should give you an idea of how to achieve your result.

WITH Boundaries
AS
(
   SELECT RegDate, [Type], GoalNbr
      ,CASE
         WHEN DATEDIFF(day, COALESCE(MAX(CASE WHEN [Type] = 'Sick' THEN RegDate ELSE '1900' END) OVER (ORDER BY RegDate ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING), '1900'), RegDATE) > 6
            AND [Type] = 'Sick'
         THEN 1
         ELSE 0
      END AS Boundary5
      ,CASE
         WHEN LAG([Type]) OVER (ORDER BY RegDate) = 'Sick'
         THEN (DATEDIFF(day, LAG(RegDate) OVER (ORDER BY RegDate), RegDate)) - 1
         ELSE 0
      END AS Days2Add
   FROM [sample]
)
,Grps
AS
(
   SELECT RegDate, [Type], GoalNbr
      ,SUM(Boundary5) OVER (ORDER BY RegDate) AS Grp
      ,Days2Add
   FROM Boundaries
)
,RNs
AS
(
   SELECT RegDate, [Type], GoalNbr, Grp, Days2Add
      ,ROW_NUMBER() OVER (PARTITION BY Grp, [Type] ORDER BY RegDate) AS rn
   FROM Grps
)
SELECT RegDate, [Type], GoalNbr
   ,CASE
      WHEN [Type] = 'Sick'
      THEN rn + SUM(Days2Add) OVER (PARTITION BY Grp ORDER BY RegDate)
   END AS CalcValue
FROM RNs
ORDER BY RegDate;

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

Jonathan AC Roberts SSCoach Points: 18354 More actions · Answer 1

I don't understand how you get to those results, can you attach the spreadsheet with some data and the formula?

Alfred SSC-Addicted Points: 436 More actions · Answer 2

I've done some testing and when I added partition by PersonId and Group7 it gives me exactly what I want.

The trick with datediff (Boundary5 in CTE Boundaries) was what I missed. Very clever solution

Thank you so much Ken ❤️