List out rows from a duplicate record set that has one or more odd values

Question

List out rows from a duplicate record set that has one or more odd values

SQLCurious

SSCrazy

Points: 2339
More actions
May 20, 2013 at 3:44 pm

#278090

I have a table with data that looks as below...
Each CaseID has multiple records in the table. Each of these rows is expected to have same data in all columns except "CreatedDate". However, in certain cases the StudentID may be different from the other row(s) of the same CaseID. I need to report data for all such cases. Additionally, the output needs to be arranged in the order of caseID with highest created date. i.e., in the result set all records belonging to the CaseID which was modified most recently should appear at the top.
Here is the sample of how the desired result set should look like...
Following is the code snippet to create sample data
DECLARE @mySampleTable TABLE (
CaseIDINT,
StudentIDVARCHAR (20),
StudentSocialVARCHAR (11),
StudentNameVARCHAR (50),
CreatedDateDateTime2
)
INSERT @mySampleTable (CaseID, StudentID, StudentSocial, StudentName, CreatedDate)
VALUES(1001, 'FTS-2134', '111-22-3333', 'Roger Federer', '5/12/2013'),
(1001, 'FTS-2314', '111-22-3333', 'Roger Federer', '5/08/2013'),
(1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/14/2013'),
(1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/2/2013'),
(1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/17/2013'),
(1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/06/2013'),
(1003, 'FTS-7864', '333-44-5555', 'Serena Williams', '4/30/2013'),
(1004, 'FTS-9132', '444-55-6666', 'Novak Djokovic', '5/10/2013'),
(1004, 'FTS-9312', '444-55-6666', 'Novak Djokovic', '4/27/2013'),
(1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/19/2013'),
(1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/12/2013')
select * from @mySampleTable
For now, this is what I did...
SELECTDISTINCT a.CaseID, a.StudentID, b.StudentID, a.StudentSocial, a.StudentName, a.CreatedDate
FROM@mySampleTable a
INNER JOIN @mySampleTable b ON a.CaseID = b.CaseID AND a.StudentSocial = b.StudentSocial AND a.StudentID <> b.StudentID
It doesn't do the sorting, but seems to do the rest. I am not sure if this is the best way to achieve what I am looking for though. My actual production tables have fairly large amount of data. I would appreciate your help and inputs with this.
Thanks in advance...
- Rex

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply

geoff5 Hall of Fame Points: 3560 More actions · Answer 1

My solution avoids joining the table to itself by using a couple of window functions in a CTE named "flagged_data." It adds a "StudentID_Problem" column (1 for a problem, 0 for no problem). Because COUNT(DISTINCT) does not work in a window function, I compared MIN(StudentID) to MAX(StudentID) to see if they are a match. If they are not a match, there is a problem because there are at least two StudentID values for the given CaseID.

Because this flag needs to be a filter, and because window functions are not allowed in the WHERE clause, there has to be an intermediary step in a CTE. AFter the CTE, though, it is a simple SELECT statement ordered by a LastEntryDate (also calculated with a window function in the CTE), CaseID, and CreatedDate.

with

mySampleTable (CaseID, StudentID, StudentSocial, StudentName, CreatedDate) as

(select 1001, 'FTS-2134', '111-22-3333', 'Roger Federer', '5/12/2013' union all

select 1001, 'FTS-2314', '111-22-3333', 'Roger Federer', '5/08/2013'union all

select 1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/14/2013'union all

select 1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/2/2013'union all

select 1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/17/2013'union all

select 1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/06/2013'union all

select 1003, 'FTS-7864', '333-44-5555', 'Serena Williams', '4/30/2013'union all

select 1004, 'FTS-9132', '444-55-6666', 'Novak Djokovic', '5/10/2013'union all

select 1004, 'FTS-9312', '444-55-6666', 'Novak Djokovic', '4/27/2013'union all

select 1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/19/2013'union all

select 1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/12/2013'),

flagged_data as

(SELECT

CaseID,

StudentID,

StudentSocial,

StudentName,

CreatedDate,

StudentID_Problem = Case when MIN(StudentID) over (PARTITION by CaseID) <> MAX(StudentID) over (partition by CaseID) then 1 else 0 end,

LastEntryDate = MAX(CreatedDate) over (partition by CaseID)

FROM

mySampleTable)

select

CaseID,

StudentID,

StudentSocial,

StudentName,

CreatedDate

from

flagged_data

where

StudentID_Problem = 1

order by

LastEntryDate desc, CaseID, CreatedDate desc

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 2

It can be done without using windowed function (so, will have better performance):

SELECT *

FROM @mySampleTable s

WHERE s.CaseID IN (SELECT CaseID

FROM @mySampleTable

GROUP BY CaseID

HAVING MAX(StudentID) != MIN(StudentID))

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Sean Lange SSC Guru Points: 286609 More actions · Answer 3

The next thing you should do is to stop storing SSN in clear text. Get that stuff encrypted!!!!

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

SQLCurious SSCrazy Points: 2339 More actions · Answer 4

Geoff5,

Thanks a lot for your suggested solution. I appreciate that.

Eugene Elutin,

Your solution looks simpler, but it is not sorting the end result the way I want it. Any suggestions on that?

Sean Lange,

Thanks for the encryption suggestion. This was just a sample data set I created to simulate my requirement. Actual data does not have any data as confidential as SSN. I definitely appreciate your suggestion and keep that in mind when I deal with any sensitive data.

- Rex

Sean Lange SSC Guru Points: 286609 More actions · Answer 5

RexHelios (5/21/2013)
Your solution looks simpler, but it is not sorting the end result the way I want it. Any suggestions on that?

His query does not have an order by. If you want it to be sorted in some certain you need to add an order by.

How do you want it sorted?

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Sean Lange SSC Guru Points: 286609 More actions · Answer 6

Ahh I see how you want it sorted. You could modify Eugene's solution slightly like this.

with MySort as

(

select CaseID,

ROW_NUMBER() over(order by Max(CreatedDate) desc) as SortOrder

from @mySampleTable

group by CaseID

)

SELECT *

FROM @mySampleTable s

join MySort ms on ms.CaseID = s.CaseID

WHERE s.CaseID IN (SELECT CaseID

FROM @mySampleTable

GROUP BY CaseID

HAVING MAX(StudentID) != MIN(StudentID))

order by SortOrder, CreatedDate

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

geoff5 Hall of Fame Points: 3560 More actions · Answer 7

Sean,

I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).

In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?

Sean Lange SSC Guru Points: 286609 More actions · Answer 8

geoff5 (5/21/2013)
Sean,
I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).
In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?

Not at all. I was merely posting it as an option. Once you add the sorting that I put in there it is most likely going to be slower.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

SQLCurious SSCrazy Points: 2339 More actions · Answer 9

SQLCurious

SSCrazy

Points: 2339

May 21, 2013 at 1:37 pm

#1617167

Thank you all!

- Rex

kthakon Newbie Points: 5 More actions · Answer 10

kthakon

Newbie

Points: 5

May 21, 2013 at 11:02 pm

#1617259

Thank you!

Wt.

Sergiy SSC Guru Points: 110209 More actions · Answer 11

I believe Sean's solution has an extra unnecessary step.

If to use the filtering subquery in the JOIN it may provide sorting criteria as well.

Then CTE query may be excluded:

SELECT *

FROM @mySampleTable s

INNER JOIN (SELECT CaseID, MAX(CreatedDate) LatestUpdate

FROM @mySampleTable

GROUP BY CaseID

HAVING MAX(StudentID) != MIN(StudentID)) fs ON s.CaseID = fs.CaseID

ORDER BY LatestUpdate DESC, CreatedDate DESC

_____________
Code for TallyGenerator

Sergiy SSC Guru Points: 110209 More actions · Answer 12

And of course, it must say:

SELECT s.*

to be precise. 🙂

_____________
Code for TallyGenerator

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 13

geoff5 (5/21/2013)
Sean,
I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).
In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?

1. Number of time the same table is mentioned in the query has no direct impact on performance. As SQL optimizer will decide how many times table/index should be scanned/seeked.

In case of using windowed function, SQL will need to perform lazy table spool operation, which is not the best on in terms of performance. So, even so you have used table only once in your query, SQL will need to scan it once and perform table spool twice (to calculate two different ROW_NUMBERs). For the query which does simple GROUP BY, SQL will need just to scan table twice. And it's only based on the given example. I guess in OP real table CaseId is most likely to be indexed. In this case, my query will only have one table scan and one index seek. Query with ROW_NUMBER will still do the same: one scan, two lazy table spools.

2. IN or INNER JOIN?

Until you wanted to sort by CaseId with highest CreatedDate, you would use IN as it would win slightly over INNER JOIN in this case, as we wouldn't need anything to take from the table (actually sub-query) which we JOIN to. Here is a very good article from Gail Shaw about it: http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/.

However, as soon as you need this particular order, you need to get calculated MAX (CreatedDate). Therefore - INNER JOIN.

So Sergiy's query does exactly that. However I would also enforce order by CaseIs, just in case there are different CaseId's with the same MAX(CreatedDate):

SELECT s.*

FROM @mySampleTable s

INNER JOIN (SELECT CaseID, MAX(CreatedDate) LatestUpdate

FROM @mySampleTable

GROUP BY CaseID

HAVING MAX(StudentID) != MIN(StudentID)) fs ON s.CaseID = fs.CaseID

ORDER BY LatestUpdate DESC,

CaseID,

CreatedDate DESC -- here it's depends how you want to sort records for the same CaseId

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Sergiy SSC Guru Points: 110209 More actions · Answer 14

Eugene Elutin (5/22/2013)

geoff5 (5/21/2013)
However I would also enforce order by CaseIs, just in case there are different CaseId's with the same MAX(CreatedDate)

Spot on!

_____________
Code for TallyGenerator