August 27, 2019 at 5:04 am
I have an interesting little T-SQL programming conundrum that I'm trying to figure out how to solve without the use of a cursor. I have a solution using a cursor that works, but (as is usually the case with cursors), it's not very performant, especially as the rowcounts grow (and this proc will eventually be churning through a lot of data). I've read a bunch of articles on dealing with gaps and islands in date ranges and have not found a solution that works for my situation.
The problem:
We have data that shows prescriptions for populations. The data is stored at a level of granularity of person (id), NDC (National Drug Code for the medication) and start_date of the prescription. There is also a days_supply column which defines the length of the Rx in days, and a computed column called end_date which is the start_date plus days_supply.
The challenge is that I need to take this data and group the NDC's into "Drug Classes", and then figure out for each combination of PersonID and DrugClass a range of dates of coverage. There can be gaps and there can be overlaps in the source data. The fun part is that if a person has multiple rows for drugs that fall within the same drug class, I need to "push" the dates of subsequent rows for each combination, if there is no more than a 1 day gap between the end_date of one row and the start_date of the next row. So if Person 1 has 3 rows for DrugClass "Anti-Inflammatories", for example:
Person 1, Anti-Inflammatories, 1/1/2018, 30, 1/31/2018
Person1, Anti-Inflammatories, 2/1/2018, 30, 3/2/2018
Person1, Anti-Inflammatories, 2/15/2018, 30, 3/17/2018
Person1, Anti-Inflammatories, 4/1/2018, 30, 5/1/2018
Person1, Anti-Inflammatories, 7/1/2018, 30, 7/31/2018
Because row 3's start_date overlaps with the start_date and end_date of row 2, which is for the same person and drug class, the requirements dictate that the third row above gets "pushed" out, to start after the end date of the row prior to it. Then the end_date of that row needs to be re-calculated based on the DaysSupply. This has a cascade effect on the data above, because the updated row3 now has an end_date that is after the start_date of row4. Which means row4 needs to be pushed out as well. Rows 1 and 2 remain unchanged, because no overlaps, and row 5 stays unchanged because it still has a gap after the immediate prior row is pushed. So, the final output should look like this:
Person 1, Anti-Inflammatories, 1/1/2018, 30, 1/31/2018
Person1, Anti-Inflammatories, 2/1/2018, 30, 3/2/2018
Person1, Anti-Inflammatories, 3/3/2018, 30, 4/2/2018
Person1, Anti-Inflammatories, 4/3/2018, 30, 5/3/2018
Person1, Anti-Inflammatories, 7/1/2018, 30, 7/31/2018
I was able to get most of it working using the LAG window function in a set-based operation, but ran into problems where there were overlaps. The problem is that I need to work through the data (sorting by start_date) in order to properly push the dates on the rows within each grouping. It works until I end up with a new overlap that is created as a consequence of a prior row being updated. I could not figure out a way to access the new LAG(end_date) of the adjacent row within a set-based query, because the end_date hasn't been updated yet, and we can't nest window functions in SQL Server 2012. And I need to know that value in order to properly set the new start_date for the current row.
So I came up with a cursor based solution that works by adding two new columns to my temp table that acts as the source data for the proc to consume, new_start_date and new_end_date. They are initially seeded with the same value as start_date and end_date for that row, and then they are updated by a cursor that iterates datewise through the data for each PersonID/DrugClass grouping. Then when the cursor fetches the next row, it has the "pushed" dates for the prior row (LAG(new_end_date)) available via a CTE to update the current row if needed. Once the temp table is fully updated by the cursor, I exit the cursor and do a single insert from the temp table into the final target table, which is truncated at the start of each run.
This procedure works, but it is slow, especially when I start scaling up the number of rows to process, which could end up being in the millions. I used the usual cursor tricks to try to speed it up as much as possible, (LOCAL and FAST_FORWARD options), but it's still unacceptably slow. Everything I've read about dealing with gaps and overlaps and such with dates does not seem to address this particular problem (basically needing to see a future state of the data, the LAG of a LAG, so to speak). The code is below. If anyone has any ideas about how to solve this conundrum in a set-based fashion, I would love to hear your ideas.
Sorry for the utter lack of brevity, and thanks for your ideas. Please let me know if I can provide any more detail to illuminate the issue further.
-- target table ddl:
CREATE TABLE [dbo].[PQAMeasureDateRange](
[PersonID] [int] NOT NULL,
[EventCodeID] [int] NOT NULL,
[MedGroup] [varchar](255) NOT NULL,
[DrugClass] [varchar](50) NOT NULL,
[StartDate] [date] NOT NULL,
[EndDate] [date] NOT NULL,
[InsertedTimestamp] [datetime] NOT NULL
CONSTRAINT [pk_PQAMeasureDateRange] PRIMARY KEY CLUSTERED
(
[PersonID] ASC,
[StartDate] ASC,
[DrugClass] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = ON, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
-- Stored Proc:
CREATE PROCEDURE dbo.uspPopulatePQAMeasureDateRange
(
@BeginDate DATE = NULL
,@EndDate DATE = NULL
)
AS
SET NOCOUNT ON
BEGIN
BEGIN TRY
DECLARE @ErrMsg NVARCHAR(255)
,@RowsInserted INT
,@GetDate DATETIME = GETDATE()
TRUNCATE TABLE dbo.PQAMeasureDateRange;
IF @BeginDate IS NULL
SET @BeginDate = DATEADD(YY,-1,CAST(GETDATE() AS DATE))
IF @EndDate IS NULL
SET @EndDate = CAST(GETDATE() AS DATE)
-- Get data we need to work with into a temp table
IF OBJECT_ID(N'tempdb..#tempRxDates') IS NOT NULL
BEGIN
DROP TABLE #tempRxDates
END
CREATE TABLE #tempRxDates
(
person_id INT NOT NULL
,EventCodeID INT NOT NULL
,MedGroup VARCHAR(255) NOT NULL
,DrugClass VARCHAR(50) NOT NULL
,[start_date] DATE NOT NULL
,end_date DATE NOT NULL
,new_start_date DATE NOT NULL
,new_end_date DATE NOT NULL
,days_supply VARCHAR(50) NULL
)
INSERT
INTO #tempRxDates
SELECT DISTINCT
pm.person_id
,e.EventCodeID
,p.MedGroup
,p.DrugClass
,CAST(pcr.fill_date AS DATE) as [start_date]
,pcr.end_date
,CAST(pcr.fill_date AS DATE) as new_start_date
,pcr.end_date as new_end_date
,pcr.days_supply
FROM dbo.prescription pcr
INNER JOIN dbo.personMember pm
ON pcr.member_id = pm.member_id
INNER JOIN wrk.PQA_NDC_DrugClass p
ON pcr.ndc = p.NDC
INNER JOIN rpt.EventCode e
ON p.MedGroup = e.CodeValue
WHERE 1=1
AND pm.active_ind = 'Y'
AND pcr.delete_ind = 'N'
AND pcr.fill_date BETWEEN @BeginDate AND @EndDate
ORDER BY pm.person_id
,p.DrugClass
,[start_date]
-- add a clustered index for peformance
CREATE CLUSTERED INDEX ix_tempRxDates ON #tempRxDates(person_id,DrugClass,[start_date])
DECLARE @PersonID INT
,@DrugClass VARCHAR(50)
,@StartDate Date
DECLARE date_update_csr CURSOR
LOCAL FAST_FORWARD FOR
SELECT person_id
,DrugClass
,[start_date]
FROM #tempRxDates
ORDER BY person_id ASC
,DrugClass ASC
,[start_date]
OPEN date_update_csr
FETCH NEXT FROM date_update_csr
INTO @PersonID
,@DrugClass
,@StartDate
WHILE @@FETCH_STATUS = 0
BEGIN
-- update new_start_date and new_end_date for each row in the temp table,
-- working forward, start_date-wise.
WITH update_dates_cte AS (
SELECT TOP 100 PERCENT
person_id
,DrugClass
,[start_date]
,CASE
WHEN [start_date] BETWEEN LAG([new_start_date]) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
AND LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
THEN DATEADD(DD,1,LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] ))
ELSE new_start_date
END AS new_start_date
,CASE
WHEN [start_date] BETWEEN LAG([new_start_date]) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
AND LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] )
THEN DATEADD(dd,ISNULL(TRY_CAST(days_supply AS INT),0),LAG(new_end_date) over (PARTITION BY person_id,DrugClass ORDER BY [new_start_date] ))
ELSE new_end_date
END AS new_end_date
FROM #tempRxDates t
ORDER BY person_id
,DrugClass
,[start_date]
)
UPDATE t
SET new_start_date = cte.new_start_date
,new_end_date = cte.new_end_date
FROM #tempRxDates t
INNER JOIN update_dates_cte cte
ON t.person_id = cte.person_id
AND t.DrugClass = cte.DrugClass
AND t.[start_date] = cte.[start_date]
WHERE t.person_id = @PersonID
AND t.DrugClass = @DrugClass
AND t.[start_date] = @StartDate
FETCH NEXT FROM date_update_csr
INTO @PersonID
,@DrugClass
,@StartDate
END
CLOSE date_update_csr
DEALLOCATE date_update_csr
-- take just the distinct results for the columns we need:
INSERT
INTO dbo.PQAMeasureDateRange
(
PersonID
,EventCodeID
,MedGroup
,DrugClass
,StartDate
,EndDate
,InsertedTimestamp
)
SELECT DISTINCT
person_id AS PersonID
,EventCodeID
,MedGroup
,DrugClass
,new_start_date AS StartDate
,new_end_date AS EndDate
,@GetDate
FROM #tempRxDates
ORDER BY person_id
,DrugClass
,new_start_date
SET @RowsInserted = @@rowcount
DROP TABLE #tempRxDates
END TRY
BEGIN CATCH
SET @errmsg = ERROR_MESSAGE()
RAISERROR (@errmsg, 16, 1);
IF OBJECT_ID(N'tempdb..#tempRxDates') IS NOT NULL
BEGIN
DROP TABLE #tempRxDates
END
END CATCH
END
SET NOCOUNT OFF
August 27, 2019 at 2:46 pm
This is a packing intervals problem, not a gaps and islands problem. The two are different, but related.
Please provide sample data and expected results. We can't run your query, because we don't have access to your tables.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
August 27, 2019 at 6:47 pm
Thanks for your response, Drew. After some brainstorming with a coworker, we came up with a potential solution that looks like it is going to work and be much, much faster than the cursor based approach. Once I'm done testing it I can post the solution here if people are interested.
Joe
August 27, 2019 at 6:54 pm
I'd definitely be interested in how someone might do this to avoid a Cursor and other forms of RBAR.
Even more interesting would be why the requirement to skew data in such a fashion is a requirement to begin with. So far as I can see, the data is the data and should not be artificially skewed to prevent overlaps.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply