December 4, 2015 at 8:28 am
Dear experts
I have a rather difficult record selection problem in a SQL 2012 database which results from sometimes-missing data.
First, the way it is supposed to work…
I have a table with registration details of patients in a population. There are about half a million patients, and each time their details change, a new record is generated. Each record has a patient ID and a From Date and To Date showing the date range in which the details stored on the record were applicable.
In principle therefore it should be straightforward, given a date, to pull out the relevant record for every patient in the population, using something like SELECT field1, …WHERE (DateAsAt >= IndexStartDate ) AND (DateAsAt<=IndexEndDate).
Very often, the From date in the “first” record for a given patient is null, representing some date in the dim and distant past. OK then, lets modify our naïve selection criteria to WHERE ((IndexStartDate IS NULL) OR( IndexStartDate<=DateAsAt)) AND (DateAsAt<=IndexEndDate).
This approach is however spannered by the fact that cases exist where the FromDate may be null multiple time for a given patient. (The To Date is always present, thankfully!) For instance:
PatientIDFromDateToDate
PatientANULL2015-01-19value11,value12, …
PatientANULL2015-06-30Value21,Value22, …
PatientANULL2500-12-31value31,value32, …
I am assuming that if the data were complete, it would read like this:
PatientIDFromDateToDate
PatientANULL2015-01-19value11,value12, …
PatientA2015-01-202015-06-30Value21,Value22, …
PatientA2015-07-012500-12-31value31,value32, …
Suppose I want to pull out the record that was current as at 2015-01-01. All three records fulfil my second WHERE condition, but I am most interested in the one whose ToDate is nearest to the AsAtDate, while still being greater than or equal to it.
So, what is a good way of framing the criteria to get the record I want, while still (if possible) handling the non-deficient cases correctly? I suspect I am into windowing functions and/or subqueries, but need some inspiration.
Yours hopefully
Mark Dalley
December 4, 2015 at 8:47 am
Can you post some DDL and sample data? It would be helpful to understand, too, where the dates in your second result set are coming from.
-- Itzik Ben-Gan 2001
December 4, 2015 at 8:49 am
Something like this?
SELECT * FROM Patients
WHERE @AsAtDate BETWEEN >= ToDate
AND @AsAtDate < LEAD(ToDate,1) OVER (PARTITION BY PatientID ORDER BY ToDate)
As Alan said, please provide full table DDL and sample data for a fully tested solution.
John
December 4, 2015 at 9:09 am
The simplest solution by far would seem to be fix the data. 🙂 Update all NULL start-date values to the minimum for the data type (1/1/1753 for full datetime for example), and make a DEFAULT for that and also change the column to be NOT NULL. Wouldn't that then allow all properly-crafted code work like a charm without resorting to machinations to deal with that NULL-value scenario?
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 7, 2015 at 8:40 am
Hi guys,
Thanks for the replies so far.
I obviously need to brush up my approach to asking questions! Here therefore is some DDL for a test table (thanks to Jeff Moden for your really good article about this).
CREATE TABLE [dbo].[MDTest2](
[PatientID] [varchar](11) NOT NULL,
[IndexStartDate] [datetime] NULL,
[IndexEndDate] [datetime] NULL,
[RecSource] [varchar](8) NOT NULL
To fill it conveniently, use this DML... Since is is "real" data I have changed the IDs to ensure that they cannot be associated with real people:
INSERT INTO dbo.MDTest2 (PatientID, IndexStartDate, IndexEndDate, RecSource)
SELECT 'P5711063012',NULL,'Mar 24 2015 12:00AM','Historic' UNION ALL
SELECT 'P5711063012',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P1063766833',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL
SELECT 'P1063766833',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P0519084455',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL
SELECT 'P0519084455',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P8362712077',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL
SELECT 'P8362712077',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P6421082878',NULL,'Feb 22 2015 12:00AM','Historic' UNION ALL
SELECT 'P6421082878',NULL,'May 22 2015 12:00AM','Deducted' UNION ALL
SELECT 'P9814006870',NULL,'Jan 14 2015 12:00AM','Historic' UNION ALL
SELECT 'P9814006870',NULL,'Jan 19 2015 12:00AM','Historic' UNION ALL
SELECT 'P8293058689',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL
SELECT 'P8293058689',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P8026056091',NULL,'Jan 19 2015 12:00AM','Historic' UNION ALL
SELECT 'P8026056091',NULL,'Feb 18 2015 12:00AM','Historic' UNION ALL
SELECT 'P8026056091',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P9037628490',NULL,'Oct 7 2015 12:00AM','Historic' UNION ALL
SELECT 'P9037628490',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P1276762217',NULL,'Aug 25 2015 12:00AM','Historic' UNION ALL
SELECT 'P1276762217',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P5608863213',NULL,'Jul 13 2015 12:00AM','Historic' UNION ALL
SELECT 'P5608863213',NULL,'Jul 22 2015 12:00AM','Historic' UNION ALL
SELECT 'P5608863213',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P4004258144',NULL,'Jun 29 2015 12:00AM','Historic' UNION ALL
SELECT 'P4004258144',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P2695719501',NULL,'May 18 2015 12:00AM','Historic' UNION ALL
SELECT 'P2695719501',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL
SELECT 'P9460813602',NULL,'Mar 4 2015 12:00AM','Historic' UNION ALL
SELECT 'P9460813602',NULL,'Dec 31 2500 12:00AM','Current' ;
Hope this helps
Mark Dalley
December 7, 2015 at 8:53 am
Hi experts
To answer Kevin's question, the table I am getting the data from is a read-only view maintained by another team and originating from a non-relational database of doubtful integrity! So, no, much as I would like to, I cannot clean up the source, despite the obvious added value for future generations.;-)
I can of course have a go at filling in the null start dates in my MDtest2 table. That throws me back to windowing functions, I think, though I'm a newbie with these. I will play with the ideas suggested in John Mitchell's answer.
Mark Dalley
December 7, 2015 at 9:34 am
Dear experts
Here are the records I would expect a working query to return, highlighted in bold. I am taking an As At date of the 1st of June 2015 (2015-06-01):
PatientID IndexStartDate IndexEndDate RecSource
P5711063012NULL2015-03-24 00:00:00.000Historic
P5711063012NULL2500-12-31 00:00:00.000Current
P1063766833NULL2015-09-21 00:00:00.000Historic
P1063766833NULL2500-12-31 00:00:00.000Current
P0519084455NULL2015-09-21 00:00:00.000Historic
P0519084455NULL2500-12-31 00:00:00.000Current
P8362712077NULL2015-09-29 00:00:00.000Historic
P8362712077NULL2500-12-31 00:00:00.000Current
P6421082878NULL2015-02-22 00:00:00.000Historic
P6421082878NULL2015-05-22 00:00:00.000Deducted
P9814006870NULL2015-01-14 00:00:00.000Historic
P9814006870NULL2015-01-19 00:00:00.000Historic
P8293058689NULL2015-09-29 00:00:00.000Historic
P8293058689NULL2500-12-31 00:00:00.000Current
P8026056091NULL2015-01-19 00:00:00.000Historic
P8026056091NULL2015-02-18 00:00:00.000Historic
P8026056091NULL2500-12-31 00:00:00.000Current
P9037628490NULL2015-10-07 00:00:00.000Historic
P9037628490NULL2500-12-31 00:00:00.000Current
P1276762217NULL2015-08-25 00:00:00.000Historic
P1276762217NULL2500-12-31 00:00:00.000Current
P5608863213NULL2015-07-13 00:00:00.000Historic
P5608863213NULL2015-07-22 00:00:00.000Historic
P5608863213NULL2500-12-31 00:00:00.000Current
P4004258144NULL2015-06-29 00:00:00.000Historic
P4004258144NULL2500-12-31 00:00:00.000Current
P2695719501NULL2015-05-18 00:00:00.000Historic
P2695719501NULL2500-12-31 00:00:00.000Current
P9460813602NULL2015-03-04 00:00:00.000Historic
P9460813602NULL2500-12-31 00:00:00.000Current
In each case, the record I am after is the the record with the smallest index end date which comes after the as at date.
MarkD
December 7, 2015 at 10:21 am
TheSQLGuru (12/4/2015)
The simplest solution by far would seem to be fix the data. 🙂 Update all NULL start-date values to the minimum for the data type (1/1/1753 for full datetime for example), and make a DEFAULT for that and also change the column to be NOT NULL. Wouldn't that then allow all properly-crafted code work like a charm without resorting to machinations to deal with that NULL-value scenario?
But you're corrupting the data rather than fixing it. The specific first date is unknown, but it is definitely not 1753! Later NULL values could be updated once/if it's confirmed that the change is assumed to be immediately after the previous end date. Or, it could be that there was a gap between the two, and the specific start date is still unknown, but you decide to just use the next date to insure faster lookups.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
December 8, 2015 at 2:47 am
Does your table have a primary key constraint, please? I can't see a way of writing a robust query without one.
John
December 8, 2015 at 3:27 am
Try this, should do the trick.
Excuse the sloppy formatting 🙂
declare @AsAtDate Datetime = '2015-01-01'
;with
MinPat as
(
select patientID, indexenddate, M.MinDiff
,ROW_NUMBER() over (partition by patientID order by M.MinDiff) as RowNum
from [dbo].[MDTest2]
cross apply (values(datediff(DAY, @AsAtDate, IndexEndDate))) as M(MinDiff)
where IndexEndDate > @AsAtDate
)
select *
from MinPat
where RowNum = 1
order by patientid
December 8, 2015 at 4:26 am
Provided explicitly defined intervals do not overlap try this to infer unknown start dates
declare @AsAtDate Datetime = '2015-06-01';
with cte as (
select *,
strt = isnull([IndexStartDate],lag([IndexEndDate],1,cast('19710101' as date))
over(partition by [PatientID] order by [IndexStartDate])+1)
from [dbo].[MDTest2]
)
select *
from cte
where @AsAtDate between strt and [IndexEndDate]
I also advise against updating nulls directly in table. This may lead to overlapping intervals when some more data are added later.
December 8, 2015 at 5:00 am
In response to John Mitchell,
There are multiple records for each PatientID, each covering a separate date range. (We know that the date ranges can never overlap, though they will often be contiguous.) Since, as we have seen, the IndexFromDate can be null, but the IndexToDate never is. a possible definition would be :
CREATE TABLE [dbo].[MDTest2](
[PatientID] [varchar](11) NOT NULL,
[IndexStartDate] [datetime] NULL,
[IndexEndDate] [datetime] NOT NULL,
[RecSource] [varchar](8) NOT NULL
CONSTRAINT PrimaryKey PRIMARY KEY (PatientID,IndexEndDate)
)
I note that this also touches on the approach suggested by Kevin Boles. Basically, Kevin was suggesting that we use a specific non-null date (1753-01-01) to represent an unknown date in the distant past. This simplifies determination of the relevant interval, but it would have to be understood as being a convention, and adhered to. Looking at the test data, I see that whoever created the data is already doing something similar with the IndexEndDate - 2025-12-31 is used to represent an undefined and as-yet-totally-unknown date in the future - a sort of high-valued null. And in this case not having a null works well - among other things, it makes it possible to have a somewhat sensible primary key!
Now to try and get my head around Dave Morrisons suggested answer...
MarkD
December 8, 2015 at 5:16 am
In that case, this should also work. It has a very similar execution plan to Dave's. That's only for a very small table, though - your mileage may vary when you start using it on production-size data.
WITH MinDates AS (
SELECT
PatientID
,MIN(IndexEndDate) IndexEndDate
FROM dbo.MDTest2
WHERE IndexEndDate > '2015-06-01'
GROUP BY PatientID
)
SELECT
t.PatientID
,t.IndexStartDate
,t.IndexEndDate
,t.RecSource
FROM MinDates m
JOIN MDTest2 t
ON m.PatientID = t.PatientID AND m.IndexEndDate = t.IndexEndDate;
John
December 8, 2015 at 6:06 pm
In similar situations, I found the following to work pretty well:
SELECT PatientID, ...
from dbo.MDTest tbl1
where tbl1.IndexEndDate > @dParamDate
and not exists (select 1
from dbo.MDTest tbl2
where tbl2.PatientID = tbl1.PatientID
and tbl2.IndexEndDate > @dParamDate
and tbl2.IndexEndDate < tbl1.IndexEndDate);
This will return the latest record where the end date is greater than the specified date, but where there are no other records for the patient where the other record's end date is also greater than the specified date, but earlier than the "candidate" record's end date.
This avoids all functions, data type conversions, etc. This would work especially well with a clustered index on (PatientID, IndexEndDate).
December 9, 2015 at 2:15 am
Kim Crosser (12/8/2015)
In similar situations, I found the following to work pretty well:
SELECT PatientID, ...
from dbo.MDTest tbl1
where tbl1.IndexEndDate > @dParamDate
and not exists (select 1
from dbo.MDTest tbl2
where tbl2.PatientID = tbl1.PatientID
and tbl2.IndexEndDate > @dParamDate
and tbl2.IndexEndDate < tbl1.IndexEndDate);
This will return the latest record where the end date is greater than the specified date, but where there are no other records for the patient where the other record's end date is also greater than the specified date, but earlier than the "candidate" record's end date.
This avoids all functions, data type conversions, etc. This would work especially well with a clustered index on (PatientID, IndexEndDate).
Yes, fine on small data sets. But check out the execution plan - it does two scans of the table and therefore as the table gets larger and larger you are likely to see performance deteriorate.
John
Viewing 15 posts - 1 through 15 (of 19 total)
You must be logged in to reply to this topic. Login to reply