Deriving covered date ranges for drug Prescription data (without using a cursor)

  • I have an interesting little T-SQL programming conundrum that I'm trying to figure out how to solve without the use of a cursor.  I have a solution using a cursor that works, but (as is usually the case with cursors), it's not very performant, especially as the rowcounts grow (and this proc will eventually be churning through a lot of data).  I've read a bunch of articles on dealing with gaps and islands in date ranges and have not found a solution that works for my situation.

    The problem:

    We have data that shows prescriptions for populations.  The data is stored at a level of granularity of person (id), NDC (National Drug Code for the medication) and start_date of the prescription.  There is also a days_supply column which defines the length of the Rx in days, and a computed column called end_date which is the start_date plus days_supply.

    The challenge is that I need to take this data and group the NDC's into "Drug Classes", and then figure out for each combination of PersonID and DrugClass a range of dates of coverage.  There can be gaps and there can be overlaps in the source data. The fun part is that if a person has multiple rows for drugs that fall within the same drug class, I need to "push" the dates of subsequent rows for each combination, if there is no more than a 1 day gap between the end_date of one row and the start_date of the next row.  So if Person 1 has 3 rows for DrugClass "Anti-Inflammatories", for example:

    Person 1, Anti-Inflammatories, 1/1/2018,  30, 1/31/2018

    Person1, Anti-Inflammatories, 2/1/2018,  30, 3/2/2018

    Person1, Anti-Inflammatories, 2/15/2018,  30, 3/17/2018

    Person1, Anti-Inflammatories, 4/1/2018,  30, 5/1/2018

    Person1, Anti-Inflammatories, 7/1/2018,  30, 7/31/2018

    Because row 3's start_date overlaps with the start_date and end_date of row 2, which is for the same person and drug class, the requirements dictate that the third row above gets "pushed" out, to start after the end date of the row prior to it. Then the end_date of that row needs to be re-calculated based on the DaysSupply.  This has a cascade effect on the data above, because the updated row3 now has an end_date that is after the start_date of row4.  Which means row4 needs to be pushed out as well. Rows 1 and 2 remain unchanged, because no overlaps, and row 5 stays unchanged because it still has a gap after the immediate prior row is pushed. So, the final output should look like this:

    Person 1, Anti-Inflammatories, 1/1/2018,  30, 1/31/2018

    Person1, Anti-Inflammatories, 2/1/2018,  30, 3/2/2018

    Person1, Anti-Inflammatories, 3/3/2018,  30, 4/2/2018

    Person1, Anti-Inflammatories, 4/3/2018,  30, 5/3/2018

    Person1, Anti-Inflammatories, 7/1/2018,  30, 7/31/2018

    I was able to get most of it working using the LAG window function in a set-based operation, but ran into problems where there were overlaps.  The problem is that I need to work through the data (sorting by start_date) in order to properly push the dates on the rows within each grouping.  It works until I end up with a new overlap that is created as a consequence of a prior row being updated.  I could not figure out a way to access the new LAG(end_date) of the adjacent row within a set-based query, because the end_date hasn't been updated yet, and we can't nest window functions in SQL Server 2012. And I need to know that value in order to properly set the new start_date for the current row.

    So I came up with a cursor based solution that works by adding two new columns to my temp table that acts as the source data for the proc to consume, new_start_date and new_end_date.  They are initially seeded with the same value as start_date and end_date for that row, and then they are updated by a cursor that iterates datewise through the data for each PersonID/DrugClass grouping. Then when the cursor fetches the next row, it has the "pushed" dates for the prior row (LAG(new_end_date)) available via a CTE to update the current row if needed.  Once the temp table is fully updated by the cursor, I exit the cursor and do a single insert from the temp table into the final target table, which is truncated at the start of each run.

    This procedure works, but it is slow, especially when I start scaling up the number of rows to process, which could end up being in the millions.  I used the usual cursor tricks to try to speed it up as much as possible, (LOCAL and FAST_FORWARD options), but it's still unacceptably slow. Everything I've read about dealing with gaps and overlaps and such with dates does not seem to address this particular problem (basically needing to see a future state of the data, the LAG of a LAG, so to speak). The code is below.  If anyone has any ideas about how to solve this conundrum in a set-based fashion, I would love to hear your ideas.

    Sorry for the utter lack of brevity, and thanks for your ideas. Please let me know if I can provide any more detail to illuminate the issue further.

    -- target table ddl:
    CREATE TABLE [dbo].[PQAMeasureDateRange](
    [PersonID] [int] NOT NULL,
    [EventCodeID] [int] NOT NULL,
    [MedGroup] [varchar](255) NOT NULL,
    [DrugClass] [varchar](50) NOT NULL,
    [StartDate] [date] NOT NULL,
    [EndDate] [date] NOT NULL,
    [InsertedTimestamp] [datetime] NOT NULL
    CONSTRAINT [pk_PQAMeasureDateRange] PRIMARY KEY CLUSTERED
    (
    [PersonID] ASC,
    [StartDate] ASC,
    [DrugClass] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = ON, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    -- Stored Proc:

    CREATE PROCEDURE dbo.uspPopulatePQAMeasureDateRange
    (
    @BeginDate DATE = NULL
    ,@EndDate DATE = NULL
    )

    AS

    SET NOCOUNT ON

    BEGIN

    BEGIN TRY

    DECLARE @ErrMsg NVARCHAR(255)
    ,@RowsInserted INT
    ,@GetDate DATETIME = GETDATE()

    TRUNCATE TABLE dbo.PQAMeasureDateRange;

    IF @BeginDate IS NULL
    SET @BeginDate = DATEADD(YY,-1,CAST(GETDATE() AS DATE))

    IF @EndDate IS NULL
    SET @EndDate = CAST(GETDATE() AS DATE)

    -- Get data we need to work with into a temp table

    IF OBJECT_ID(N'tempdb..#tempRxDates') IS NOT NULL
    BEGIN
    DROP TABLE #tempRxDates
    END

    CREATE TABLE #tempRxDates
    (
    person_id INT NOT NULL
    ,EventCodeID INT NOT NULL
    ,MedGroup VARCHAR(255) NOT NULL
    ,DrugClass VARCHAR(50) NOT NULL
    ,[start_date] DATE NOT NULL
    ,end_date DATE NOT NULL
    ,new_start_date DATE NOT NULL
    ,new_end_date DATE NOT NULL
    ,days_supply VARCHAR(50) NULL

    )

    INSERT
    INTO #tempRxDates
    SELECT DISTINCT
    pm.person_id
    ,e.EventCodeID
    ,p.MedGroup
    ,p.DrugClass
    ,CAST(pcr.fill_date AS DATE) as [start_date]
    ,pcr.end_date
    ,CAST(pcr.fill_date AS DATE) as new_start_date
    ,pcr.end_date as new_end_date
    ,pcr.days_supply
    FROM dbo.prescription pcr
    INNER JOIN dbo.personMember pm
    ON pcr.member_id = pm.member_id
    INNER JOIN wrk.PQA_NDC_DrugClass p
    ON pcr.ndc = p.NDC
    INNER JOIN rpt.EventCode e
    ON p.MedGroup = e.CodeValue
    WHERE 1=1
    AND pm.active_ind = 'Y'
    AND pcr.delete_ind = 'N'
    AND pcr.fill_date BETWEEN @BeginDate AND @EndDate
    ORDER BY pm.person_id
    ,p.DrugClass
    ,[start_date]

    -- add a clustered index for peformance
    CREATE CLUSTERED INDEX ix_tempRxDates ON #tempRxDates(person_id,DrugClass,[start_date])

    DECLARE @PersonID INT
    ,@DrugClass VARCHAR(50)
    ,@StartDate Date

    DECLARE date_update_csr CURSOR
    LOCAL FAST_FORWARD FOR
    SELECT person_id
    ,DrugClass
    ,[start_date]
    FROM #tempRxDates
    ORDER BY person_id ASC
    ,DrugClass ASC
    ,[start_date]

    OPEN date_update_csr

    FETCH NEXT FROM date_update_csr
    INTO @PersonID
    ,@DrugClass
    ,@StartDate

    WHILE @@FETCH_STATUS = 0
    BEGIN
    -- update new_start_date and new_end_date for each row in the temp table,
    -- working forward, start_date-wise.

    WITH update_dates_cte AS (
    SELECT TOP 100 PERCENT
    person_id
    ,DrugClass
    ,[start_date]
    ,CASE
    WHEN [start_date] BETWEEN LAG([new_start_date]) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
    AND LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
    THEN DATEADD(DD,1,LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] ))
    ELSE new_start_date
    END AS new_start_date
    ,CASE
    WHEN [start_date] BETWEEN LAG([new_start_date]) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
    AND LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
    THEN DATEADD(dd,ISNULL(TRY_CAST(days_supply AS INT),0),LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] ))
    ELSE new_end_date
    END AS new_end_date
    FROM #tempRxDates t
    ORDER BY person_id
    ,DrugClass
    ,[start_date]

    )


    UPDATE t
    SET new_start_date = cte.new_start_date
    ,new_end_date = cte.new_end_date
    FROM #tempRxDates t
    INNER JOIN update_dates_cte cte
    ON t.person_id = cte.person_id
    AND t.DrugClass = cte.DrugClass
    AND t.[start_date] = cte.[start_date]
    WHERE t.person_id = @PersonID
    AND t.DrugClass = @DrugClass
    AND t.[start_date] = @StartDate

    FETCH NEXT FROM date_update_csr
    INTO @PersonID
    ,@DrugClass
    ,@StartDate
    END

    CLOSE date_update_csr
    DEALLOCATE date_update_csr

    -- take just the distinct results for the columns we need:
    INSERT
    INTO dbo.PQAMeasureDateRange
    (
    PersonID
    ,EventCodeID
    ,MedGroup
    ,DrugClass
    ,StartDate
    ,EndDate
    ,InsertedTimestamp
    )
    SELECT DISTINCT
    person_id AS PersonID
    ,EventCodeID
    ,MedGroup
    ,DrugClass
    ,new_start_date AS StartDate
    ,new_end_date AS EndDate
    ,@GetDate
    FROM #tempRxDates
    ORDER BY person_id
    ,DrugClass
    ,new_start_date

    SET @RowsInserted = @@rowcount

    DROP TABLE #tempRxDates

    END TRY

    BEGIN CATCH

    SET @errmsg = ERROR_MESSAGE()

    RAISERROR (@errmsg, 16, 1);

    IF OBJECT_ID(N'tempdb..#tempRxDates') IS NOT NULL
    BEGIN
    DROP TABLE #tempRxDates
    END

    END CATCH

    END

    SET NOCOUNT OFF
  • This is a packing intervals problem, not a gaps and islands problem. The two are different, but related.

    Please provide sample data and expected results. We can't run your query, because we don't have access to your tables.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Thanks for your response, Drew.  After some brainstorming with a coworker, we came up with a potential solution that looks like it is going to work and be much, much faster than the cursor based approach.  Once I'm done testing it I can post the solution here if people are interested.

    Joe

  • I'd definitely be interested in how someone might do this to avoid a Cursor and other forms of RBAR.

     

    Even more interesting would be why the requirement to skew data in such a fashion is a requirement to begin with.  So far as I can see, the data is the data and should not be artificially skewed to prevent overlaps.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply