Counting the number of unique days between ranges of dates

  • Thanks all for your help with this one. Reading through all your suggestions it looks like using a tally is what'll work for us. The dataset is only about 2,800 rows so performance isn't a huge problem.

    Also came across this solution which is very similar and seems to work on the test data:

    https://bertwagner.com/posts/gaps-and-islands/

     

  • >> Example (dates changed to integers to simplify the problem): <<

    Don't do this. We were very proud of putting temporal data types into the language. Unfortunately, what you posted for DDL is not a table and can never be a table, by definition. A key not as an option, a table has to have a key. Keys cannot have NULLs, by definition. The first thing we need to do is try to fix your DDL and turn it into a real relational database instead of a deck of punch cards written in SQL.

    Since a table models a set of entities, the name should be a collective noun or plural. Unless of course you really do have only one element in the set. The reason people use singular names as the due assembly early standards for models of unit record processing like IDEF. The reason they used singular names is that they did record at a time processing so that was the unit of work. SQL uses sets

    Temporal data also comes in intervals. In SQL Server, we have to model it with start and stop timestamps. The ISO 11179 naming conventions use"<attribute name>_<attribute property>", instead of the old file system single name conventions.

    An identifier ("_id") is on a nominal scale and cannot be integers again, by definition. Have you ever read any introduction to data modeling so that you will know what the scales and units of measurement are for a valid data model?

    Your table should have looked like this in all probability:

    CREATE TABLE UnitMemberships

    (membership_id CHAR(5) NOT NULL,

    unit_id CHAR(5) NOT NULL,

    PRIMARY KEY (membership_id, unit_id, membership_start_date), --- required in RDBMS!!

    membership_start_date DATE NOT NULL,

    membership_end_date DATE NOT NULL,

    CHECK (membership_start_date <= membership_end_date)

    );

    Please pay attention to how the primary key is declared. After 30 years of doing this. I have found that failure to do valid DDL is the cause of most failed RDBM projects. And it's also the reason I can charge $$$$$ as a consultant when I come in to fix them. :-).

    There is no reason to use an old row at a time INSERT INTO syntax that was required decades ago, in the first releases of Sybase SQL Server. We also have date datatypes now in SQL Server (this was always part of the ANSI ISO standards but Microsoft was slow).

    INSERT INTO UnitMemberships

    VALUES

    ('00001', '00001', '2021-01-01', '2021-01-05'),

    ('00001', '00002', '2021-01-08', '2021-01-15'),

    ('00001', '00003', '2021-01-10', '2021-01-18');

    The reason for writing it all in one values clause is that this lets the optimizer take the table constructor as a whole, instead of having to do it row by row by row as if you were still using punch cards.

    You'll probably want to download a copy of "Temporal Queries in SQL" by Rick Snodgrass. It's a free PDF file from the University of Arizona. It is a classic and it should be on your bookshelf. One of the standard programming idioms in SQL is a calendar table. It's just what you think it is but may include other information.

    To get the set of all the dates that are included in your membership table, you can use:

    SELECT COUNT (DISTINCT C.cal_date) AS unique_membership_date_count

    FROM Calendar AS C, UnitMemberships AS U

    WHERE C.cal_date BETWEEN U.membership_start_date AND U.membership_end_date;

    Generally speaking, you want to avoid using function calls and computations in SQL. The language was designed for set-oriented operations and data manipulation. This is not FORTRAN.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • nicod wrote:

    Thanks all for your help with this one. Reading through all your suggestions it looks like using a tally is what'll work for us. The dataset is only about 2,800 rows so performance isn't a huge problem.

    Also came across this solution which is very similar and seems to work on the test data:

    https://bertwagner.com/posts/gaps-and-islands/

    Bert's solution doesn't appear to handle multiple distinct sets.  Could you post the actual code you used?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • nicod wrote:

    ...Example (dates changed to integers to simplify the problem) ...

    Just wondering how the current solution performs with date/datetime data types on a larger set of logging data.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • So, we have 2 problems to solve

    • Overlapping ranges
    • Islands

    Bert's solution can be tweaked to 1st consolidate the overlapping sets, then calculate the islands, and finally do our range calculations.

    DECLARE @UnitMembership table (
    IdentityID int NOT NULL
    , UnitID int NOT NULL
    , StartDate int NOT NULL
    , EndDate int NOT NULL
    );

    INSERT INTO @UnitMembership ( IdentityID, UnitID, StartDate, EndDate )
    VALUES-- Jeffs data set -------------------------------------------
    (6, 1, 1, 10)
    , (6, 2, 2, 9)
    , (6, 3, 3, 8)
    , (6, 4, 4, 7)
    , (6, 5, 5, 6)
    , (6, 6, 5, 5)
    -- Overlapping data set --------------------------------------
    , (9, 1, 1, 3)
    , (9, 2, 5, 7)
    , (9, 3, 6, 10)
    , (9, 4, 7, 8)
    , (9, 5, 9, 11)
    , (9, 5, 16, 18)
    , (9, 6, 17, 19);

    WITH cteOverlaps AS (
    -- We need to know the previous end date, so that we can determine whether the current record is the start of a new island
    SELECT um.IdentityID
    , um.StartDate
    , um.EndDate
    , PrevEndDate = LAG(um.EndDate, 1, 0) OVER (PARTITION BY um.IdentityID ORDER BY um.StartDate, um.EndDate)
    FROM @UnitMembership AS um
    )
    , cteIslands AS (
    -- Now we calculate the start of each island
    SELECT o.IdentityID, o.StartDate, o.EndDate
    , isStartNewIsland = CASE
    WHEN o.StartDate > MAX(o.PrevEndDate) OVER (PARTITION BY o.IdentityID ORDER BY o.StartDate, o.EndDate
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    THEN 1
    ELSE 0
    END
    FROM cteOverlaps AS o
    )
    , cteGroups AS (
    -- A windowed running total will define the group
    SELECT i.IdentityID, i.StartDate, i.EndDate, i.isStartNewIsland
    , GroupID = SUM(i.isStartNewIsland) OVER (PARTITION BY i.IdentityID ORDER BY i.StartDate, i.EndDate
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM cteIslands AS i
    )
    -- Finally we do our calculations per IdentityID and GroupID
    SELECT g.IdentityID
    , StartDate = MIN(g.StartDate)
    , EndDate = MAX(g.EndDate)
    , Duration = MAX(g.EndDate) - MIN(g.StartDate) +1
    FROM cteGroups AS g
    GROUP BY g.IdentityID, g.GroupID
    ORDER BY g.IdentityID, g.GroupID
  • The solution seems much simpler, once you look at the data sorted by startdate and enddate within each IdentityID.    A gap only occurs when the current start date is at least two days greater than the preceding maximum end date.     Could someone (Jeff?) try to break the following code?

    DECLARE @UnitMembership TABLE
    (
    IdentityID int NOT NULL,
    UnitID int not NULL,
    StartDate INT not NULL,
    EndDate INT not NULL,
    primary key(identityid,startdate,enddate,unitid)
    );

    INSERT @UnitMembership
    VALUES (1, 1, 1, 5),
    (1, 2, 8, 15),
    (1, 3, 14, 18),
    (6, 4,1, 10),
    (6, 5,2, 9),
    (6, 6,3, 8),
    (6, 7,4, 7),
    (6, 8,5, 6),
    (6, 9,5, 5),
    (9,10,1,30),
    (9,11,2,10),
    (9,12,15,25);




    set statistics time, io on;

    with cte as (select IdentityID, StartDate, EndDate
    , MX = max(EndDate) over(partition by IdentityID order by StartDate, EndDate ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
    from @UnitMembership e
    )


    select IdentityID, TotalDays = 1 + Max(EndDate) - Min(StartDate)
    , MissedDays = sum(case when StartDate > MX + 1 then StartDate - MX - 1 else 0 end )
    from cte
    group by IdentityID
    order by IdentityID

    set statistics time, io off;

    select * from @UnitMembership


    • This reply was modified 3 years, 9 months ago by  The Dixie Flatline. Reason: Had the wrong version of code. TotalDays wasn't calculated correctly

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 6 posts - 16 through 20 (of 20 total)

You must be logged in to reply to this topic. Login to reply