February 2, 2009 at 11:45 am
Greetings -
Let me preface this post by saying that I am still new to the SQL world and may communicate as such. I thank you in advance for your patience.
That being said, I have a package that runs every morning at about 3 am. It runs several SQL queries that essentially do a data dump from an external database (a linked server) to our locally managed data warehouse. The data comes from user-input forms that ultimately produce twenty-some columns. Some of these are manual-input columns, but many are system-generated in some fashion (calculated, etc.). When the package runs, it looks for anything that was completed %yesterday% (based on a date/time field), roughly as sketched below. These needs are met perfectly by the existing script/query.
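For context, a minimal sketch of what the %yesterday% filter amounts to (the table and column names here are placeholders, not our real ones):

SELECT *
FROM SourceForms -- hypothetical source table on the linked server
WHERE CompletedDateTime >= DATEADD(dd, DATEDIFF(dd, 0, GETDATE()) - 1, 0) -- midnight yesterday
  AND CompletedDateTime <  DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0);    -- midnight today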
The problem that arises is that some users are making changes to data that has already been pulled in for a report. Due to the nature of both the form and our operational needs, these changes must be permitted and will continue. When a change *is* made, it causes the 'date/time' field to be updated to %today%. As long as the user makes changes to the form within the same day, there is no issue. The problem truly occurs when they make a change to a form they completed prior to %today%: when the package runs, it pulls in the 'new' information and appends it to the bottom of the table.
What I'm looking to do is run a separate query/package that will look for these duplicates. Using MS Access (I know, I know...) I was able to develop a 'Select Duplicates' query that shows me each of these duplications. What I need now is a way to script the deletion of all but the 'newest' form (based on the 'date/time' field). Below is the query that is working for me to produce all of the duplicated forms:
USE [Dev_DB];

SELECT *
FROM Reporting_Table
WHERE Duplicating_Field IN (SELECT Duplicating_Field
                            FROM Reporting_Table
                            GROUP BY Duplicating_Field
                            HAVING COUNT(*) > 1)
ORDER BY Duplicating_Field;
Any assistance or direction on this would be greatly appreciated.
February 2, 2009 at 12:09 pm
I think you could look at two things that would help you greatly...
First, I'd have a talk with the dev team for the front end. Perhaps this whole process could be improved by adding a column that flags the record as new or updated. Your datetime column does this to some extent already, but a little more could make your life easier getting your data into the DW: if you have an updatedflag set to 1, do such-and-such logic; otherwise do something else. You could, for instance, grab a list of all of your updated rows and delete them from your DW before the import, thus eliminating your problem (see the sketch below).
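To illustrate, a hypothetical sketch of that pre-import cleanup, assuming the front end sets an updatedflag column and assuming a staging table for the day's batch (all of these names are made up):

-- delete the DW copies of any rows the source marked as updated,
-- so the import can re-insert the current versions without duplicating them
DELETE dw
FROM Reporting_Table AS dw
JOIN Staging_Table AS stg -- hypothetical: today's batch from the linked server
  ON dw.FormID = stg.FormID
WHERE stg.updatedflag = 1;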
Secondly, and more directly addressing your original question... Do you have a primary identifier for your record? For instance, your user enters info on a company, widget, whatever, and that is record 1; then it gets updated. Is it still record 1 but with a different timestamp, or is it now record 2? If the primary identifier doesn't change, you could just wrap your find-dups query in a delete statement. Of course, depending on how your environment is configured, you could look at something like...
DELETE
FROM myTable -- in your DW
WHERE primaryID IN (SELECT ID FROM ...) -- the rest of your find dups query
Basically this would just delete the records that would otherwise be duplicated from your DW prior to your daily import, so the daily import won't create any dups; a fleshed-out version of the idea is sketched below. For anything more specific, it would help us to help you tremendously if you could post some DDL and sample data for us to work with... see this article for help in getting us what we need to help you.
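For concreteness, one hypothetical way to wrap your find-dups query in a delete that keeps only the newest copy (the DateTimeField name is an assumption on my part):

DELETE R
FROM Reporting_Table AS R
WHERE R.Duplicating_Field IN (SELECT Duplicating_Field
                              FROM Reporting_Table
                              GROUP BY Duplicating_Field
                              HAVING COUNT(*) > 1) -- your find-dups query
  AND R.DateTimeField < (SELECT MAX(R2.DateTimeField)
                         FROM Reporting_Table AS R2
                         WHERE R2.Duplicating_Field = R.Duplicating_Field); -- keep the newest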
-Luke.
February 2, 2009 at 1:14 pm
A common table expression (CTE) should help you.
;WITH Sequencing AS (
    SELECT [Key Columns], DuplicatingField,
           ROW_NUMBER() OVER (PARTITION BY DuplicatingField
                              ORDER BY [DateTimeField] DESC) AS sequence
    FROM Reporting_Table
)
SELECT R.*
--DELETE R
FROM Reporting_Table AS R
JOIN Sequencing AS S
  ON R.[Key Columns] = S.[Key Columns]
 AND S.sequence > 1
The Sequencing CTE makes it easy to identify the most recent row for each DuplicatingField (it has sequence = 1) and the superseded rows (sequence > 1). Once you're sure you're identifying the proper rows, change the SELECT R.* to a DELETE; see the variant below.
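A sketch of the delete form; note that in T-SQL you can also delete straight through the CTE, which removes the matching rows from the underlying Reporting_Table (same column-name assumptions as above):

;WITH Sequencing AS (
    SELECT DuplicatingField,
           ROW_NUMBER() OVER (PARTITION BY DuplicatingField
                              ORDER BY [DateTimeField] DESC) AS sequence
    FROM Reporting_Table
)
DELETE FROM Sequencing
WHERE sequence > 1; -- everything except the newest row per DuplicatingField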
February 4, 2009 at 12:18 pm
Luke L (2/2/2009)
I think you could look at two things that would help you greatly... First, I'd have a talk with the dev team for the front end. Perhaps this whole process could be improved by adding a column that flags the record as new or updated. Your datetime column does this to some extent already, but a little more could make your life easier getting your data into the DW: if you have an updatedflag set to 1, do such-and-such logic; otherwise do something else. You could, for instance, grab a list of all of your updated rows and delete them from your DW before the import, thus eliminating your problem.
I have checked with our local dev team on this issue. Unfortunately, they are too backlogged to take much of a look at it. They did recommend using the Max(Date) function somehow. The way the current process works is that it selects all records whose 'createdate' falls within a specific date range (usually one day, but it also looks at the database to see what the current Max(Date) is, so if it fails on one day it will pick up where it left off); a sketch of that watermark logic is below. All the records that match the other pieces of criteria (the form is completed and not pending, etc.) are appended to the table I use for reporting. I have full control of the final reporting table, so I can write a script against it to look for the duplications.
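For anyone following along, here is roughly what that Max(Date) watermark amounts to (a sketch only; the table, column, and source names are stand-ins for our anonymized ones):

DECLARE @LastLoaded DATETIME;

-- pick up where the last successful run left off
SELECT @LastLoaded = COALESCE(MAX(createdate), '19000101')
FROM Reporting_Table;

INSERT INTO Reporting_Table (FormID, createdate, Score) -- hypothetical column list
SELECT f.FormID, f.createdate, f.Score
FROM LinkedServer.SourceDB.dbo.Forms AS f -- hypothetical linked-server source
WHERE f.createdate > @LastLoaded
  AND f.reviewstatusid = 1; -- completed, not pending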
Luke L (2/2/2009)
Secondly, and more directly addressing your original question... Do you have a primary identifier for your record? For instance, your user enters info on a company, widget, whatever, and that is record 1; then it gets updated. Is it still record 1 but with a different timestamp, or is it now record 2? If the primary identifier doesn't change, you could just wrap your find-dups query in a delete statement. Of course, depending on how your environment is configured, you could look at something like...
DELETE
FROM myTable -- in your DW
WHERE primaryID IN (SELECT ID FROM ...) -- the rest of your find dups query
Basically this would just delete the records that would otherwise be duplicated from your DW prior to your daily import, so the daily import won't create any dups.
When a user updates a submitted form (say, to add more comments or select a different 'choice' value), the 'createdate' is updated to what is effectively a 'lastmodified' date; it's still called 'createdate' in the source database. There is a sort of unique ID that ties the submitted form to the other relational tables throughout the source database. Let's call it the 'FormID'. It's a GUID (which, personally, I don't care for, but whatever). When a user updates a submitted form, it simply modifies the record(s) within the source database(s). However, since there's now a newer 'createdate', our script that pulls everything together into our reporting table sees it as a new record and appends it to the reporting table.
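Since the FormID is stable across edits, one option I'm considering (purely hypothetical, names assumed, and only if we're on SQL Server 2008 where MERGE exists) is to upsert on FormID during the import instead of cleaning up afterwards:

MERGE Reporting_Table AS tgt
USING Staging_Table AS src -- hypothetical: the day's incoming rows
   ON tgt.FormID = src.FormID
WHEN MATCHED THEN
    UPDATE SET tgt.createdate = src.createdate,
               tgt.Score = src.Score
WHEN NOT MATCHED THEN
    INSERT (FormID, createdate, Score)
    VALUES (src.FormID, src.createdate, src.Score);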
I will see, however, if your suggestion will work. 🙂
Luke L (2/2/2009)
For anything more specific, it would help us to help you tremendously if you could post some DDL and sample data for us to work with... see this article for help in getting us what we need to help you. -Luke.
I am working on a way to change the existing SQL used in our packages, to accomplish two things:
* Remove any identifying markers, for internal security purposes
* Insert design intent comments
As mentioned before, I am still new to the SQL world (I've only been working a few months in a DB/SQL environment), so it is taking me a while to verify and re-verify any changes I am making...
I will post the update as soon as this is completed.
Thanks for your help!
February 4, 2009 at 1:07 pm
I think this will work. The code design has been kept true to its original state.
DECLARE @RPTStart INT, @RPTEnd INT, @LOC CHAR(50)

-- initial defaults; overwritten by the watermark logic below
SET @RPTStart = 1090101
SET @RPTEnd = 1090131

SET @RPTStart = (CASE WHEN (SELECT SECONDARYDB.dbo.cSTD2CDATE(MAX([LocalReviewDateTime]))
                            FROM [CC-DEV]..SOURCEDB_RPT_EVAL_VIEW_PRIMARYLOC) IS NULL
                      THEN (SELECT SECONDARYDB.dbo.cSTD2CDATE(GETDATE() - 100))
                      ELSE (SELECT SECONDARYDB.dbo.cSTD2CDATE(MAX([LocalReviewDateTime]))
                            FROM [CC-DEV]..SOURCEDB_RPT_EVAL_VIEW_PRIMARYLOC) END)
SET @RPTEnd = (SELECT SECONDARYDB.dbo.cSTD2CDATE(GETDATE()))

-- "SECONDARYDB.dbo.cSTD2CDATE" is a scalar function (not a stored procedure; it is called inline) that converts a date from 'shortdate' (02/05/2009) to 'century date' (1090205); 'century date' is also referred to as 'UDATE' or 'UDAY'. This is how we tie the dates from the 'MainDB' and 'SecondaryDB' to the 'ReportingSource.SourceDB'. (A sketch of what such a function might look like follows the query below.)

SET @LOC = 'LOCATION_PRIMARY'
USE MainDB
-- MainDB is a database within our locally managed data warehouse. It ties users (agents) to form submitters (supervisors).
SELECT PRIMARYLOC_Evaluator_Track.TNAME AS Eval_TNAME,
       src.[EvaluatorExtension], src.[DBEvaluatorName], src.[DBCallDateTime], src.[LocalCallDateTime], src.Score,
       src.[AgentExtension], src.[DBAgentName], src.[DBReviewDateTime], src.[LocalReviewDateTime],
       src.filename AS Filename, src.training AS [TrainingFlag], src.reviewstatusid AS [ReviewStatus], src.[AgentStatus], src.[CallDuration], src.[ReviewID],
       src.[CallID], src.[FormType], PRIMARYLOC_Evaluator_Track.TEAM_NUM AS Eval_TNUM, PRIMARYLOC_Agent_Track.TEAM_NUM AS Agent_TNUM,
       PRIMARYLOC_Agent_Track.TNAME AS Agent_TNAME, src.[SwitchCalled], src.DBRVWUDate, src.LocalRvwUDate, src.DBCalUDate,
       src.LocalCalUDate, src.[CallMonth], src.[ReviewMonth],
       CASE WHEN (src.[CallMonth] <> src.[ReviewMonth]) THEN 'Error' ELSE 'Clear' END AS [MismatchMonthFlag],
       src.[ApplicationNumber], src.[ApplicationName]
FROM (SELECT review.reviewid AS [ReviewID], review.mediaid AS [CallID], review.percentscore AS Score, reviewtemplate_1.templatedesc AS [FormType],
iqmuser_evaluator.hostid AS [EvaluatorExtension],
iqmuser_evaluator.lastname + ', ' + iqmuser_evaluator.firstname AS [DBEvaluatorName], iqmuser_agent.hostid AS [AgentExtension],
iqmuser_agent.lastname + ', ' + iqmuser_agent.firstname AS [DBAgentName], switch.name AS [SwitchCalled],
review.reviewdate AS [DBReviewDateTime], CASE WHEN (switch.name IN ('LOCATION_PRIMARY', 'LOCATION_SECONDARY')) THEN DATEADD(hh, - 6,
review.reviewdate) ELSE DATEADD(hh, - 5, review.reviewdate) END AS [LocalReviewDateTime], media.starttime AS [DBCallDateTime],
CASE WHEN (switch.name IN ('LOCATION_PRIMARY', 'LOCATION_SECONDARY')) THEN DATEADD(hh, - 6, media.starttime) ELSE DATEADD(hh, - 5,
media.starttime) END AS [LocalCallDateTime], SECONDARYDB.dbo.cSTD2CDATE(DATEADD(hh, - 0, review.reviewdate)) AS DBRVWUDate,
CASE WHEN (switch.name IN ('LOCATION_PRIMARY', 'LOCATION_SECONDARY')) THEN SECONDARYDB.dbo.cSTD2CDATE(DateAdd(hh, - 6, review.reviewdate))
ELSE SECONDARYDB.dbo.cSTD2CDATE(DateAdd(hh, - 6, review.reviewdate)) END AS LocalRvwUDate, SECONDARYDB.dbo.cSTD2CDATE(DATEADD(hh, - 0,
media.starttime)) AS DBCalUDate, CASE WHEN (switch.name IN ('LOCATION_PRIMARY', 'LOCATION_SECONDARY'))
THEN SECONDARYDB.dbo.cSTD2CDATE(DateAdd(hh, - 6, media.starttime)) ELSE SECONDARYDB.dbo.cSTD2CDATE(DateAdd(hh, - 6, media.starttime))
END AS LocalCalUDate, media.filename, review.training, review.reviewstatusid, iqmuser_agent.status AS [AgentStatus],
media.callduration AS [CallDuration], DATEPART(mm, DATEADD(hh, - 6, media.starttime)) AS [CallMonth], DATEPART(mm, DATEADD(hh, - 6, review.reviewdate)) AS [ReviewMonth],
application.hostid AS [ApplicationNumber], application.name AS [ApplicationName]
FROM REPORTINGSOURCE.SOURCEDB.dbo.workgroup_iqmuser AS workgroup_iqmuser INNER JOIN
REPORTINGSOURCE.SOURCEDB.dbo.workgroup AS workgroup ON workgroup_iqmuser.workgroupid = workgroup.workgroupid INNER JOIN
REPORTINGSOURCE.SOURCEDB.dbo.iqmuser AS iqmuser_evaluator ON workgroup_iqmuser.userid = iqmuser_evaluator.userid INNER JOIN
REPORTINGSOURCE.SOURCEDB.dbo.media AS media INNER JOIN
REPORTINGSOURCE.SOURCEDB.dbo.review AS review ON media.mediaid = review.mediaid ON
iqmuser_evaluator.userid = review.owneruserid AND iqmuser_evaluator.userid = review.updateuserid INNER JOIN
REPORTINGSOURCE.SOURCEDB.dbo.iqmuser AS iqmuser_agent ON media.userid = iqmuser_agent.userid INNER JOIN
REPORTINGSOURCE.SOURCEDB.dbo.switch AS switch ON media.switchid = switch.switchid INNER JOIN
REPORTINGSOURCE.SOURCEDB.dbo.reviewtemplate AS reviewtemplate_1 ON review.templateid = reviewtemplate_1.templateid LEFT OUTER JOIN
REPORTINGSOURCE.SOURCEDB.dbo.application ON media.applicationid = application.applicationid AND switch.switchid = application.switchid)
AS src LEFT OUTER JOIN
ACD_USERS_TEAM_ASSIGN_PRIMARYLOC AS PRIMARYLOC_Agent_Track ON src.LocalCalUDate = PRIMARYLOC_Agent_Track.UDAY AND
src.[AgentExtension] = PRIMARYLOC_Agent_Track.EXT_NUM LEFT OUTER JOIN
ACD_USERS_TEAM_ASSIGN_PRIMARYLOC AS PRIMARYLOC_Evaluator_Track ON src.LocalRVWUDate = PRIMARYLOC_Evaluator_Track.UDAY AND
src.[EvaluatorExtension] = PRIMARYLOC_Evaluator_Track.EXT_NUM
WHERE (src.LocalCalUDate BETWEEN @RPTStart AND @RPTEnd) AND (src.[SwitchCalled] = @LOC) AND (src.LocalRvwUDate BETWEEN @RPTStart AND (@RPTEnd + 2)) AND (src.ReviewStatusid = 1)
ORDER BY DBCalUDate
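As promised above, a sketch of what the century-date conversion might look like, assuming 'century date' simply encodes (years past 1900) * 10000 + month * 100 + day, so that 2009-02-05 becomes 1090205. This is my reconstruction, not our actual function, hence the different name:

CREATE FUNCTION dbo.cSTD2CDATE_sketch (@d DATETIME)
RETURNS INT
AS
BEGIN
    -- e.g. 2009-02-05 -> (109 * 10000) + (2 * 100) + 5 = 1090205
    RETURN (YEAR(@d) - 1900) * 10000 + MONTH(@d) * 100 + DAY(@d);
END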
Attached is an example of the data it produces... (Not sure if the attachment will work, so here is the link, as well)
Keep in mind that day-to-day (results from each run of the package) produces no duplications. It's when I go back to do reporting against the reporting table that I am seeing duplications...
Here is the link that shows the duplications I see when I use the query originally posted....
February 10, 2009 at 8:01 am
An internal solution to our problem was found. Below is the code we used...
Thanks for everyone's help!
DELETE DELRCRD
FROM ReportingTable AS DELRCRD
JOIN (SELECT REVIEWID, MAX(LOCALRVWUDATE) AS MAX_DATE
      FROM ReportingTable
      GROUP BY REVIEWID
      HAVING COUNT(REVIEWID) > 1) AS KEEPRCRD
  ON DELRCRD.REVIEWID = KEEPRCRD.REVIEWID
 AND DELRCRD.LOCALRVWUDATE < KEEPRCRD.MAX_DATE
This code deletes all the 'older' versions of the duplicates, keeping the row with the newest LOCALRVWUDATE for each REVIEWID.
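In case it helps anyone else, a quick sanity check you could run afterwards (same table and column names as the delete above) to confirm the cleanup worked:

SELECT REVIEWID, COUNT(*) AS Copies
FROM ReportingTable
GROUP BY REVIEWID
HAVING COUNT(*) > 1; -- should return no rows after the cleanup
                     -- (rows that tie on the max date would still show up)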