Deriving covered date ranges for drug Prescription data (without using a cursor)

Question

Deriving covered date ranges for drug Prescription data (without using a cursor)

SSC Eights!

Points: 870

August 27, 2019 at 5:04 am

I have an interesting little T-SQL programming conundrum that I'm trying to figure out how to solve without the use of a cursor. I have a solution using a cursor that works, but (as is usually the case with cursors), it's not very performant, especially as the rowcounts grow (and this proc will eventually be churning through a lot of data). I've read a bunch of articles on dealing with gaps and islands in date ranges and have not found a solution that works for my situation.

The problem:

We have data that shows prescriptions for populations. The data is stored at a level of granularity of person (id), NDC (National Drug Code for the medication) and start_date of the prescription. There is also a days_supply column which defines the length of the Rx in days, and a computed column called end_date which is the start_date plus days_supply.

The challenge is that I need to take this data and group the NDC's into "Drug Classes", and then figure out for each combination of PersonID and DrugClass a range of dates of coverage. There can be gaps and there can be overlaps in the source data. The fun part is that if a person has multiple rows for drugs that fall within the same drug class, I need to "push" the dates of subsequent rows for each combination, if there is no more than a 1 day gap between the end_date of one row and the start_date of the next row. So if Person 1 has 3 rows for DrugClass "Anti-Inflammatories", for example:

Person 1, Anti-Inflammatories, 1/1/2018, 30, 1/31/2018

Person1, Anti-Inflammatories, 2/1/2018, 30, 3/2/2018

Person1, Anti-Inflammatories, 2/15/2018, 30, 3/17/2018

Person1, Anti-Inflammatories, 4/1/2018, 30, 5/1/2018

Person1, Anti-Inflammatories, 7/1/2018, 30, 7/31/2018

Because row 3's start_date overlaps with the start_date and end_date of row 2, which is for the same person and drug class, the requirements dictate that the third row above gets "pushed" out, to start after the end date of the row prior to it. Then the end_date of that row needs to be re-calculated based on the DaysSupply. This has a cascade effect on the data above, because the updated row3 now has an end_date that is after the start_date of row4. Which means row4 needs to be pushed out as well. Rows 1 and 2 remain unchanged, because no overlaps, and row 5 stays unchanged because it still has a gap after the immediate prior row is pushed. So, the final output should look like this:

Person 1, Anti-Inflammatories, 1/1/2018, 30, 1/31/2018

Person1, Anti-Inflammatories, 2/1/2018, 30, 3/2/2018

Person1, Anti-Inflammatories, 3/3/2018, 30, 4/2/2018

Person1, Anti-Inflammatories, 4/3/2018, 30, 5/3/2018

Person1, Anti-Inflammatories, 7/1/2018, 30, 7/31/2018

I was able to get most of it working using the LAG window function in a set-based operation, but ran into problems where there were overlaps. The problem is that I need to work through the data (sorting by start_date) in order to properly push the dates on the rows within each grouping. It works until I end up with a new overlap that is created as a consequence of a prior row being updated. I could not figure out a way to access the new LAG(end_date) of the adjacent row within a set-based query, because the end_date hasn't been updated yet, and we can't nest window functions in SQL Server 2012. And I need to know that value in order to properly set the new start_date for the current row.

So I came up with a cursor based solution that works by adding two new columns to my temp table that acts as the source data for the proc to consume, new_start_date and new_end_date. They are initially seeded with the same value as start_date and end_date for that row, and then they are updated by a cursor that iterates datewise through the data for each PersonID/DrugClass grouping. Then when the cursor fetches the next row, it has the "pushed" dates for the prior row (LAG(new_end_date)) available via a CTE to update the current row if needed. Once the temp table is fully updated by the cursor, I exit the cursor and do a single insert from the temp table into the final target table, which is truncated at the start of each run.

This procedure works, but it is slow, especially when I start scaling up the number of rows to process, which could end up being in the millions. I used the usual cursor tricks to try to speed it up as much as possible, (LOCAL and FAST_FORWARD options), but it's still unacceptably slow. Everything I've read about dealing with gaps and overlaps and such with dates does not seem to address this particular problem (basically needing to see a future state of the data, the LAG of a LAG, so to speak). The code is below. If anyone has any ideas about how to solve this conundrum in a set-based fashion, I would love to hear your ideas.

Sorry for the utter lack of brevity, and thanks for your ideas. Please let me know if I can provide any more detail to illuminate the issue further.

-- target table ddl:
CREATE TABLE [dbo].[PQAMeasureDateRange](
[PersonID] [int] NOT NULL,
[EventCodeID] [int] NOT NULL,
[MedGroup] [varchar](255) NOT NULL,
[DrugClass] [varchar](50) NOT NULL,
[StartDate] [date] NOT NULL,
[EndDate] [date] NOT NULL,
    [InsertedTimestamp] [datetime] NOT NULL
 CONSTRAINT [pk_PQAMeasureDateRange] PRIMARY KEY CLUSTERED 
(
[PersonID] ASC,
[StartDate] ASC,
[DrugClass] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = ON, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

-- Stored Proc:

CREATE PROCEDURE dbo.uspPopulatePQAMeasureDateRange
    (
         @BeginDate DATE = NULL
        ,@EndDate DATE = NULL
    )

AS

SET NOCOUNT ON

BEGIN    

    BEGIN TRY

        DECLARE  @ErrMsg NVARCHAR(255)
                ,@RowsInserted INT
,@GetDate DATETIME = GETDATE()

        TRUNCATE TABLE dbo.PQAMeasureDateRange;

        IF @BeginDate IS NULL 
            SET @BeginDate = DATEADD(YY,-1,CAST(GETDATE() AS DATE))

        IF @EndDate IS NULL 
            SET @EndDate = CAST(GETDATE() AS DATE)

        -- Get data we need to work with into a temp table

        IF OBJECT_ID(N'tempdb..#tempRxDates') IS NOT NULL
            BEGIN
                DROP TABLE #tempRxDates
            END

        CREATE TABLE #tempRxDates
            (
                 person_id INT NOT NULL
                ,EventCodeID INT NOT NULL
                ,MedGroup VARCHAR(255) NOT NULL
                ,DrugClass VARCHAR(50) NOT NULL
                ,[start_date] DATE NOT NULL
                ,end_date DATE NOT NULL
                ,new_start_date DATE NOT NULL
                ,new_end_date DATE NOT NULL
                ,days_supply VARCHAR(50) NULL

            )

        INSERT
        INTO #tempRxDates        
        SELECT DISTINCT
             pm.person_id
            ,e.EventCodeID
            ,p.MedGroup
            ,p.DrugClass
            ,CAST(pcr.fill_date AS DATE) as [start_date]
            ,pcr.end_date 
            ,CAST(pcr.fill_date AS DATE) as new_start_date
            ,pcr.end_date as new_end_date
            ,pcr.days_supply
        FROM dbo.prescription pcr
        INNER JOIN dbo.personMember pm
            ON pcr.member_id = pm.member_id
        INNER JOIN wrk.PQA_NDC_DrugClass p
            ON pcr.ndc = p.NDC
        INNER JOIN rpt.EventCode e
            ON p.MedGroup = e.CodeValue
        WHERE 1=1
            AND pm.active_ind = 'Y'
            AND pcr.delete_ind = 'N'
            AND pcr.fill_date BETWEEN @BeginDate AND @EndDate
        ORDER BY pm.person_id
                ,p.DrugClass
                ,[start_date] 

        -- add a clustered index for peformance
        CREATE CLUSTERED INDEX ix_tempRxDates ON #tempRxDates(person_id,DrugClass,[start_date])
    
        DECLARE  @PersonID INT
                ,@DrugClass VARCHAR(50)
                ,@StartDate Date

        DECLARE date_update_csr CURSOR 
        LOCAL FAST_FORWARD FOR
        SELECT   person_id    
                ,DrugClass
                ,[start_date]        
        FROM #tempRxDates
        ORDER BY person_id ASC
                ,DrugClass ASC
                ,[start_date] 

        OPEN date_update_csr  

        FETCH NEXT FROM date_update_csr 
        INTO @PersonID
            ,@DrugClass
            ,@StartDate

        WHILE @@FETCH_STATUS = 0  
            BEGIN  
                -- update new_start_date and new_end_date for each row in the temp table,
                -- working forward, start_date-wise.

                WITH update_dates_cte AS (
                    SELECT TOP 100 PERCENT
                             person_id
                            ,DrugClass
                            ,[start_date]
                            ,CASE 
                                WHEN [start_date] BETWEEN LAG([new_start_date]) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
                                    AND LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
                                THEN DATEADD(DD,1,LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] ))
                                ELSE new_start_date
                             END AS new_start_date
                            ,CASE 
                                WHEN [start_date] BETWEEN LAG([new_start_date]) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
                                    AND LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
                                THEN DATEADD(dd,ISNULL(TRY_CAST(days_supply AS INT),0),LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] ))
                                ELSE new_end_date
                             END AS new_end_date
                    FROM #tempRxDates t
                    ORDER BY person_id
                            ,DrugClass
                            ,[start_date] 

                )     


                UPDATE t
                SET new_start_date = cte.new_start_date
                   ,new_end_date = cte.new_end_date
                FROM #tempRxDates t
                INNER JOIN update_dates_cte cte
                    ON t.person_id = cte.person_id
                        AND t.DrugClass = cte.DrugClass
                        AND t.[start_date] = cte.[start_date]
                WHERE t.person_id = @PersonID
                    AND t.DrugClass = @DrugClass
                    AND t.[start_date] = @StartDate

                FETCH NEXT FROM date_update_csr 
                INTO @PersonID
                    ,@DrugClass
                    ,@StartDate
            END

        CLOSE date_update_csr 
        DEALLOCATE date_update_csr

        -- take just the distinct results for the columns we need:
        INSERT
        INTO dbo.PQAMeasureDateRange
            (
                 PersonID
,EventCodeID
,MedGroup
,DrugClass
                ,StartDate
                ,EndDate
,InsertedTimestamp
            )
        SELECT DISTINCT 
             person_id AS PersonID 
,EventCodeID
,MedGroup 
,DrugClass
            ,new_start_date AS StartDate
            ,new_end_date AS EndDate 
,@GetDate         
        FROM #tempRxDates
        ORDER BY person_id
,DrugClass
                ,new_start_date

        SET @RowsInserted = @@rowcount
        
        DROP TABLE #tempRxDates
        
    END TRY

    BEGIN CATCH

        SET @errmsg = ERROR_MESSAGE()       

        RAISERROR (@errmsg, 16, 1); 
        
        IF OBJECT_ID(N'tempdb..#tempRxDates') IS NOT NULL
            BEGIN
                DROP TABLE #tempRxDates
            END 

    END CATCH
    
END

SET NOCOUNT OFF

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

drew.allen SSC Guru Points: 76998 More actions · Answer 1

This is a packing intervals problem, not a gaps and islands problem. The two are different, but related.

Please provide sample data and expected results. We can't run your query, because we don't have access to your tables.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Frank The Tank SSC Eights! Points: 870 More actions · Answer 2

Thanks for your response, Drew. After some brainstorming with a coworker, we came up with a potential solution that looks like it is going to work and be much, much faster than the cursor based approach. Once I'm done testing it I can post the solution here if people are interested.

Joe

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 3

I'd definitely be interested in how someone might do this to avoid a Cursor and other forms of RBAR.

Even more interesting would be why the requirement to skew data in such a fashion is a requirement to begin with. So far as I can see, the data is the data and should not be artificially skewed to prevent overlaps.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)