November 13, 2008 at 7:36 am
dfarran (11/13/2008)
It is also possible to do the query with just left joins,
SELECT a.PersonID, a.Version , a.dedate,
[is improper] = CASE WHEN MAX(c.version) IS NOT NULL THEN 1
ELSE 0 END
FROM personrecord a
LEFT JOIN personrecord b ON b.personid = a.personid
LEFT JOIN personrecord c ON c.personid = a.personid
AND c.version >= b.version
AND c.dedate < b.dedate
GROUP BY a.PersonID, a.Version, a.dedate
with the primary key on the table (personid, version) this performs slightly better than the cross join with both the small(8) and medium(circa .75m) rowcount.
I would avoid using the version = version -1 because this does not cater for missing versions (try doubling the version numbers in test data)
Good point about not relying on version = version-1!
November 13, 2008 at 7:59 am
Peter M. (11/12/2008)
I'd be more interested in seeing the execution times where the data size is non-trivial. My suspicion would lead me to believe the row_number() solution might scale better based on my own experience using the related dense_rank() function.You could even use two dense rank functions that are partitioned identically but have an order by that is based on the version number in one and the date field in the other, then using that to compare when the resulting ranks are not equal. It would require a derived table joined to the original table, but you could get the results you are looking for.
and thanks, Jeff Moden, for pointing out the use of the clustered index on PersonId, Version.
I was interested to see how the different solutions would scale, so I expanded the PersonRecord table to 1 million PersonIDs, and expanded the rows of data to 5 records per PersonId, using Jeff's code. And I added the clustered index.
And I added dfarran's query.
Here are the results:
Original Query 153,733 Milliseconds duration
James Goodwin -
Left join on inner join 136,703 Milliseconds duration
Matt C -
SQL2005 row_number()
over (partition 184,280 Milliseconds duration
Adam Haines -
inner joins and CASE 138,906 Milliseconds duration
dfarrans -
just left joins,
no version-1 245,016 Milliseconds duration
Granted, these are total execution times, so include I/O.
Also, I did not alter the queries that used version-1, which is a weakness if the version sequence is not perfect.
November 13, 2008 at 8:30 am
Am I missing something ?
select a.PersonID,a.Version, a.DEDate , 1 AS [Is ImProper]
from PersonRecord a
inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate
November 13, 2008 at 8:36 am
AlexT (11/13/2008)
Am I missing something ?select a.PersonID,a.Version, a.DEDate , 1 AS [Is ImProper]
from PersonRecord a
inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate
Yes, you then need to apply this flag to all the records for that PersonId. (So, if a PersonId has 5 rows and 2 of them are out of order, then all 5 rows will have the ImProper = 1, and those PersonIds with no rows out of order will have ImProper = 0.)
November 13, 2008 at 9:04 am
yes whatever,
select distinct a.PersonID,a.Version, a.DEDate , case when b.PersonID is null then 0 else 1 end AS [Is ImProper]
from PersonRecord a
left join (
select a.PersonID
from PersonRecord a
inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate
) b on a.PersonID = b.PersonID
but initial tast was just get bad data entry dates
November 13, 2008 at 9:19 am
AlexT (11/13/2008)
yes whatever,select distinct a.PersonID,a.Version, a.DEDate , case when b.PersonID is null then 0 else 1 end AS [Is ImProper]
from PersonRecord a
left join (
select a.PersonID
from PersonRecord a
inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate
) b on a.PersonID = b.PersonID
but initial tast was just get bad data entry dates
Thanks, this achieves the desired results, and it's a straight-forward, easy-to-read solution.
When I tested it on my 5 million row table, total execution time was 164580 Milliseconds duration
Edited: Although I think I would put the "select distinct" in the subselect, for cases where one PersonId has multiple rows that are out of order.
November 13, 2008 at 9:33 am
that supposed to be faster, if clustered index applied on (PersonID, Version )
select distinct a.PersonID,a.Version, a.DEDate , case when b.PersonID is null then 0 else 1 end AS [Is ImProper]
from PersonRecord a
left join (
select a.PersonID, a.Version
from PersonRecord a
inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate
) b on a.PersonID = b.PersonID and a.Version=b.Version
thanks
November 13, 2008 at 9:39 am
AlexT (11/13/2008)
that supposed to be faster, if clustered index applied on (PersonID, Version )select distinct a.PersonID,a.Version, a.DEDate , case when b.PersonID is null then 0 else 1 end AS [Is ImProper]
from PersonRecord a
left join (
select a.PersonID, a.Version
from PersonRecord a
inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate
) b on a.PersonID = b.PersonID and a.Version=b.Version
thanks
Good point about the fact that Version is part of the clustered index.
Unfortunately, incorporating version into the subselct breaks the solution. It no longer returns the desired results.
November 13, 2008 at 10:54 am
yes sorry, my mistake. if you don't mind you can try this
select a.PersonID, a.Version, a.DEDate,
case when b.PersonID is null then 0 else 1 end AS [Is ImProper]
from PersonRecord a
left join (
select distinct a.PersonID
from PersonRecord a
inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate
) b on a.PersonID = b.PersonID
thanks
November 13, 2008 at 12:58 pm
AlexT (11/13/2008)
yes sorry, my mistake. if you don't mind you can try thisselect a.PersonID, a.Version, a.DEDate,
case when b.PersonID is null then 0 else 1 end AS [Is ImProper]
from PersonRecord a
left join (
select distinct a.PersonID
from PersonRecord a
inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate
) b on a.PersonID = b.PersonID
This query does produce the desired effect, and yes, moving the distinct inside the subselect does result in a significant performance lift. Total execution time was 144470 Milliseconds! (vs. previous run time of was 164580 Milliseconds)
November 13, 2008 at 2:38 pm
that means that you have a lot "bad" de dates in table. in reality difference probably will be more significant
November 13, 2008 at 2:51 pm
No, actually, only 10,000 PersonIds in my test table of 5 million rows (1 million unique PersonIDs) have records that are out of order; that's about 1% of the ids. That also explains the nice lift, because the subselect has far fewer rows for the distinct to operate on.
Note: the subselect finds roughly 20,000 rows, distinct return 10,000 rows.
Viewing 12 posts - 31 through 41 (of 41 total)
You must be logged in to reply to this topic. Login to reply