May 20, 2013 at 3:44 pm
I have a table with data that looks as below...
Each CaseID has multiple records in the table. Each of these rows is expected to have same data in all columns except "CreatedDate". However, in certain cases the StudentID may be different from the other row(s) of the same CaseID. I need to report data for all such cases. Additionally, the output needs to be arranged in the order of caseID with highest created date. i.e., in the result set all records belonging to the CaseID which was modified most recently should appear at the top.
Here is the sample of how the desired result set should look like...
Following is the code snippet to create sample data
DECLARE @mySampleTable TABLE (
CaseIDINT,
StudentIDVARCHAR (20),
StudentSocialVARCHAR (11),
StudentNameVARCHAR (50),
CreatedDateDateTime2
)
INSERT @mySampleTable (CaseID, StudentID, StudentSocial, StudentName, CreatedDate)
VALUES(1001, 'FTS-2134', '111-22-3333', 'Roger Federer', '5/12/2013'),
(1001, 'FTS-2314', '111-22-3333', 'Roger Federer', '5/08/2013'),
(1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/14/2013'),
(1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/2/2013'),
(1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/17/2013'),
(1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/06/2013'),
(1003, 'FTS-7864', '333-44-5555', 'Serena Williams', '4/30/2013'),
(1004, 'FTS-9132', '444-55-6666', 'Novak Djokovic', '5/10/2013'),
(1004, 'FTS-9312', '444-55-6666', 'Novak Djokovic', '4/27/2013'),
(1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/19/2013'),
(1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/12/2013')
select * from @mySampleTable
For now, this is what I did...
SELECTDISTINCT a.CaseID, a.StudentID, b.StudentID, a.StudentSocial, a.StudentName, a.CreatedDate
FROM@mySampleTable a
INNER JOIN @mySampleTable b ON a.CaseID = b.CaseID AND a.StudentSocial = b.StudentSocial AND a.StudentID <> b.StudentID
It doesn't do the sorting, but seems to do the rest. I am not sure if this is the best way to achieve what I am looking for though. My actual production tables have fairly large amount of data. I would appreciate your help and inputs with this.
Thanks in advance...
- Rex
May 20, 2013 at 7:40 pm
My solution avoids joining the table to itself by using a couple of window functions in a CTE named "flagged_data." It adds a "StudentID_Problem" column (1 for a problem, 0 for no problem). Because COUNT(DISTINCT) does not work in a window function, I compared MIN(StudentID) to MAX(StudentID) to see if they are a match. If they are not a match, there is a problem because there are at least two StudentID values for the given CaseID.
Because this flag needs to be a filter, and because window functions are not allowed in the WHERE clause, there has to be an intermediary step in a CTE. AFter the CTE, though, it is a simple SELECT statement ordered by a LastEntryDate (also calculated with a window function in the CTE), CaseID, and CreatedDate.
with
mySampleTable (CaseID, StudentID, StudentSocial, StudentName, CreatedDate) as
(select 1001, 'FTS-2134', '111-22-3333', 'Roger Federer', '5/12/2013' union all
select 1001, 'FTS-2314', '111-22-3333', 'Roger Federer', '5/08/2013'union all
select 1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/14/2013'union all
select 1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/2/2013'union all
select 1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/17/2013'union all
select 1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/06/2013'union all
select 1003, 'FTS-7864', '333-44-5555', 'Serena Williams', '4/30/2013'union all
select 1004, 'FTS-9132', '444-55-6666', 'Novak Djokovic', '5/10/2013'union all
select 1004, 'FTS-9312', '444-55-6666', 'Novak Djokovic', '4/27/2013'union all
select 1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/19/2013'union all
select 1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/12/2013'),
flagged_data as
(SELECT
CaseID,
StudentID,
StudentSocial,
StudentName,
CreatedDate,
StudentID_Problem = Case when MIN(StudentID) over (PARTITION by CaseID) <> MAX(StudentID) over (partition by CaseID) then 1 else 0 end,
LastEntryDate = MAX(CreatedDate) over (partition by CaseID)
FROM
mySampleTable)
select
CaseID,
StudentID,
StudentSocial,
StudentName,
CreatedDate
from
flagged_data
where
StudentID_Problem = 1
order by
LastEntryDate desc, CaseID, CreatedDate desc
May 21, 2013 at 2:46 am
It can be done without using windowed function (so, will have better performance):
SELECT *
FROM @mySampleTable s
WHERE s.CaseID IN (SELECT CaseID
FROM @mySampleTable
GROUP BY CaseID
HAVING MAX(StudentID) != MIN(StudentID))
May 21, 2013 at 8:15 am
The next thing you should do is to stop storing SSN in clear text. Get that stuff encrypted!!!!
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
May 21, 2013 at 11:44 am
Geoff5,
Thanks a lot for your suggested solution. I appreciate that.
Eugene Elutin,
Your solution looks simpler, but it is not sorting the end result the way I want it. Any suggestions on that?
Sean Lange,
Thanks for the encryption suggestion. This was just a sample data set I created to simulate my requirement. Actual data does not have any data as confidential as SSN. I definitely appreciate your suggestion and keep that in mind when I deal with any sensitive data.
- Rex
May 21, 2013 at 12:55 pm
RexHelios (5/21/2013)
Your solution looks simpler, but it is not sorting the end result the way I want it. Any suggestions on that?
His query does not have an order by. If you want it to be sorted in some certain you need to add an order by.
How do you want it sorted?
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
May 21, 2013 at 1:00 pm
Ahh I see how you want it sorted. You could modify Eugene's solution slightly like this.
with MySort as
(
select CaseID,
ROW_NUMBER() over(order by Max(CreatedDate) desc) as SortOrder
from @mySampleTable
group by CaseID
)
SELECT *
FROM @mySampleTable s
join MySort ms on ms.CaseID = s.CaseID
WHERE s.CaseID IN (SELECT CaseID
FROM @mySampleTable
GROUP BY CaseID
HAVING MAX(StudentID) != MIN(StudentID))
order by SortOrder, CreatedDate
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
May 21, 2013 at 1:28 pm
Sean,
I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).
In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?
May 21, 2013 at 1:35 pm
geoff5 (5/21/2013)
Sean,I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).
In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?
Not at all. I was merely posting it as an option. Once you add the sorting that I put in there it is most likely going to be slower.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
May 21, 2013 at 1:37 pm
Thank you all!
- Rex
May 21, 2013 at 11:02 pm
Thank you!
Wt.
May 21, 2013 at 11:53 pm
I believe Sean's solution has an extra unnecessary step.
If to use the filtering subquery in the JOIN it may provide sorting criteria as well.
Then CTE query may be excluded:
SELECT *
FROM @mySampleTable s
INNER JOIN (SELECT CaseID, MAX(CreatedDate) LatestUpdate
FROM @mySampleTable
GROUP BY CaseID
HAVING MAX(StudentID) != MIN(StudentID)) fs ON s.CaseID = fs.CaseID
ORDER BY LatestUpdate DESC, CreatedDate DESC
_____________
Code for TallyGenerator
May 21, 2013 at 11:56 pm
May 22, 2013 at 3:29 am
geoff5 (5/21/2013)
Sean,I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).
In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?
1. Number of time the same table is mentioned in the query has no direct impact on performance. As SQL optimizer will decide how many times table/index should be scanned/seeked.
In case of using windowed function, SQL will need to perform lazy table spool operation, which is not the best on in terms of performance. So, even so you have used table only once in your query, SQL will need to scan it once and perform table spool twice (to calculate two different ROW_NUMBERs). For the query which does simple GROUP BY, SQL will need just to scan table twice. And it's only based on the given example. I guess in OP real table CaseId is most likely to be indexed. In this case, my query will only have one table scan and one index seek. Query with ROW_NUMBER will still do the same: one scan, two lazy table spools.
2. IN or INNER JOIN?
Until you wanted to sort by CaseId with highest CreatedDate, you would use IN as it would win slightly over INNER JOIN in this case, as we wouldn't need anything to take from the table (actually sub-query) which we JOIN to. Here is a very good article from Gail Shaw about it: http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/.
However, as soon as you need this particular order, you need to get calculated MAX (CreatedDate). Therefore - INNER JOIN.
So Sergiy's query does exactly that. However I would also enforce order by CaseIs, just in case there are different CaseId's with the same MAX(CreatedDate):
SELECT s.*
FROM @mySampleTable s
INNER JOIN (SELECT CaseID, MAX(CreatedDate) LatestUpdate
FROM @mySampleTable
GROUP BY CaseID
HAVING MAX(StudentID) != MIN(StudentID)) fs ON s.CaseID = fs.CaseID
ORDER BY LatestUpdate DESC,
CaseID,
CreatedDate DESC -- here it's depends how you want to sort records for the same CaseId
May 22, 2013 at 3:43 am
Eugene Elutin (5/22/2013)
geoff5 (5/21/2013)
However I would also enforce order by CaseIs, just in case there are different CaseId's with the same MAX(CreatedDate)
Spot on!
_____________
Code for TallyGenerator
Viewing 15 posts - 1 through 15 (of 19 total)
You must be logged in to reply to this topic. Login to reply