November 11, 2009 at 8:02 am
Hello,
I have written the following script which is designed to update what is deemed as a 'duplicate' based on the time difference between the 'duplicate' and other qualifying data. A duplicate is classed a row that has the same UserKey, ProductKey and DateKey as another row and where the TimeKey is within 300 (seconds) of the other row's TimeKey. I should probably give an example:
ExtranetLinkKey UserKey ProductKey DateKey TimeKey
669270 77255 1219 3629 56749
669289 77255 1219 3629 56900
669323 77255 1219 3629 57122
Given the above 3 rows, what should happen is that Row 2 would be classed as a 'duplicate' and would have it's LinkCount set to 0. Row 3 should not be classed as a 'duplicate' because I want it to be compared against the 1st row, not the 2nd row and the difference in TimeKey between the 1st and 3rd rows is > 300.
What is actually happening is that rows 2 and 3 are being treated as 'duplicates'.
I have a feeling that the solution involves the use of ROW_NUMBER - I just need a push in the right direction!
Anyway, here's the script...
/*
Script detects duplicate links and updates the LinkCount to 0 where duplicates are found
Duplicates defined as links that have been submitted for the same user and product within 5 minutes of each other
Script counts the first links and then sets subdequent links to have an LinkCount of 0
*/
DECLARE @Msg VARCHAR(255)
DECLARE @Severity INT
DECLARE @Subj VARCHAR(255)
DECLARE @SecondsDifference SMALLINT-- Time difference searched to detect duplicates
DECLARE @MaxExtranetLinkKey INT
SET @MaxExtranetLinkKey = (SELECT ExtranetLinkKey
FROM tmp_ExtranetLinkKey)
-- Holds details of duplicated links
DECLARE @Working TABLE
(ExtranetLinkKey INT NOT NULL PRIMARY KEY,
DateKey INT NOT NULL,
UserKey INT NOT NULL,
ProductKey INT NOT NULL,
Seconds CHAR(8) NOT NULL,
TimeKey INT NOT NULL)
DECLARE @RemoveDups TABLE (ExtranetLinkKey INT) -- Holds ExtranetLinkKeys to be updated.
SET @SecondsDifference = 300;
BEGIN TRY
BEGIN TRAN
-- Get duplicate links based on user and product
INSERT INTO @Working (ExtranetLinkKey, DateKey, UserKey, ProductKey, Seconds, TimeKey)
SELECT el.ExtranetLinkKey, el.DateKey, el.UserKey, el.ProductKey, Seconds, el.TimeKey
FROM dbo.Fact_ExtranetLinks el
INNER JOIN Dim_Time t ON el.TimeKey = t.TimeKey
INNER JOIN Dim_Date d ON el.DateKey = d.DateKey
INNER JOIN
(SELECT DateKey, UserKey, ProductKey
FROM Fact_ExtranetLinks el
GROUP BY DateKey, UserKey, ProductKey
HAVING COUNT(*)<>1) AS A ON el.UserKey = A.UserKey AND el.DateKey = A.DateKey AND el.ProductKey = A.ProductKey
WHERE el.ExtranetLinkKey > @MaxExtranetLinkKey
-- Join duplicate links to themselves on user and product where the extranetkey is different and the link was submitted within the specified time period i.e. links for the same user and product within the time specified in @SecondsDifference
INSERT INTO @RemoveDups
SELECT DISTINCT CAST(ds1.ExtranetLinkKey AS VARCHAR(10))
FROM @Working ds
INNER JOIN @Working ds1 ON ds.UserKey=ds1.UserKey AND ds.DateKey=ds1.DateKey AND ds.ProductKey=ds1.ProductKey AND ds.ExtranetLinkKey<>ds1.ExtranetLinkKey
--AND ds.TimeKey = ds1.TimeKey
WHERE DATEDIFF(ss, ds.Seconds, ds1.Seconds) BETWEEN 1 AND @SecondsDifference
UPDATE dbo.Fact_ExtranetLinks
SET LinkCount = 0
WHERE ExtranetLinkKey IN
(SELECT ExtranetLinkKey FROM @RemoveDups)
AND LinkCount = 1
COMMIT
END TRY
BEGIN CATCH
IF @@TRANCOUNT<>0
ROLLBACK TRAN
SET @Msg = REPLACE(ERROR_MESSAGE(), '.', '') + ' (error no: ' + CAST(ERROR_NUMBER() AS VARCHAR(10)) + ') on line ' + CAST(ERROR_LINE() AS VARCHAR(10)) + CASE WHEN ERROR_PROCEDURE() IS NOT NULL THEN ' in ' + ERROR_PROCEDURE() ELSE '' END
SET @Severity = ERROR_SEVERITY()
SET @Subj = 'Failed to update LinkCount for ReAP_MI_ExtranetExcursions_History '
RAISERROR(@Msg, @Severity , 1)
EXEC master..sp_sendmail
@recipients= <recipients list>,
@subject=@Subj,
@message=@Msg
END CATCH
RETURN
Thanks
Lempster
November 11, 2009 at 8:09 am
I think the way I'd do it would be to join rows based on UserKey, ProductKey and DateKey, with equality tests on those columns, and a join based on the range for the TimeKey, where there's a row that's within 300 seconds before.
Then add a Row_Number() partitioned by UserKey, ProductKey and DateKey, ordered by TimeKey.
Delete the ones where the Row_Number() = 1, then repeat as long as @@Rowcount > 0.
I'd have to see actual table definitions and some data to make sure that would do what's needed, but it's probably the easiest way to accomplish what you need.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
November 11, 2009 at 8:15 am
Thanks GSquared,
Nice to know I was in the right area. 😉
I must admit I am not very good at using some of the (now not so) new T-SQL functions, so this will force me into it!
Lempster
November 11, 2009 at 2:04 pm
GSquared (11/11/2009)
I think the way I'd do it would be to join rows based on UserKey, ProductKey and DateKey, with equality tests on those columns, and a join based on the range for the TimeKey, where there's a row that's within 300 seconds before.
Hi GSquared,
Is that not what I've done with my two table variables? The equality-based join in the @Working table variable and the TimeKey range join (ok I've used the Seconds column simply because I find it easier to verify the results by looking at the actual time rather an integer) in the @RemoveDups table variable.
Then add a Row_Number() partitioned by UserKey, ProductKey and DateKey, ordered by TimeKey.
Ok, I've modified the script like so:
-- Join duplicate links to themselves on user and product where the extranetkey is different and the link was submitted within the specified time period i.e. links for the same user and product within the time specified in @SecondsDifference
INSERT INTO @RemoveDups
SELECT CAST(ds1.ExtranetLinkKey AS VARCHAR(10)),
ROW_NUMBER() OVER(PARTITION BY ds.UserKey, ds.ProductKey, ds.DateKey
ORDER BY ds.TimeKey)
FROM @Working ds
INNER JOIN @Working ds1 ON ds.UserKey=ds1.UserKey AND ds.DateKey=ds1.DateKey AND ds.ProductKey=ds1.ProductKey AND ds.ExtranetLinkKey<>ds1.ExtranetLinkKey
--AND ds.TimeKey = ds1.TimeKey
WHERE DATEDIFF(ss, ds.Seconds, ds1.Seconds) BETWEEN 1 AND @SecondsDifference
ORDER BY ds1.ExtranetLinkKey
Delete the ones where the Row_Number() = 1, then repeat as long as @@Rowcount > 0.
I'm not sure what you mean by the second part of this statement - the 'repeat' bit?
If you take this set of set test data as being the entire contents of the dbo.Fact_ExtranetLinks table, you can omit the WHERE el.ExtranetLinkKey > @MaxExtranetLinkKey condition. (Sorry about the formatting; I don't know how to get the coulmn values to line up)
ExtranetLinkKeyDateKeyUserKeyProductKeySecondsTimeKey
6702843971184610117310:41:0138461
6702863971178855117310:58:2139501
6702873971184610117310:59:1739557
6702883971184610117310:59:4439584
6702893971184610117310:59:4839588
6702903971184610117310:59:5139591
6702913971178855117311:10:5740257
6702933971183948117315:08:4954529
6702953971186733117314:19:4151581
6702963971186733117314:45:2453124
6702973971186733117315:04:4054280
6702983971186733117314:55:2453724
6702993971186733117314:41:4652906
6703003971186733117314:02:0250522
6703013971186733117314:02:5050570
6703023971186733117315:57:4757467
6703033971186733117315:31:1955879
6703043971186733117315:10:1754617
6703053971186733117315:28:3355713
6703063971178678117316:22:0958929
6703073971178678117316:19:0658746
6703083971178678117316:19:1258752
6703093971178678117316:19:1758757
6703103971179182117317:31:1863078
6703113971185714119412:07:5443674
Out of the above I would want the following ExtranetLinkKeys to be classed as duplicates and subsequently deleted:
670288
670289
670290
670296
670301
670303
670306
670308
670309
Regards
Lempster
November 13, 2009 at 6:52 am
In re-reading your post, I realized you're planning on an update instead of a deletion, so change that part of what I wrote.
The "repeat till @@rowcount=0" is because you could, theoretically, have two rows that are within 300 seconds of a prior row. You need to make sure you get both of them.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
November 16, 2009 at 4:05 am
Got it...I needed to include the TimeKey range in the join clause in the @Working table variable as well as the @RemoveDups table variable like so:
INSERT INTO @Working (ExtranetLinkKey, DateKey, UserKey, ProductKey, Seconds, TimeKey)
SELECT DISTINCT el.ExtranetLinkKey, el.DateKey, el.UserKey, el.ProductKey, Seconds, el.TimeKey
FROM dbo.Fact_ExtranetLinks el
INNER JOIN Dim_Time t ON el.TimeKey = t.TimeKey
INNER JOIN Dim_Date d ON el.DateKey = d.DateKey
INNER JOIN
(SELECT ExtranetLinkKey, DateKey, UserKey, ProductKey, TimeKey
FROM Fact_ExtranetLinks el) AS A
ON el.UserKey = A.UserKey AND el.DateKey = A.DateKey AND el.ProductKey = A.ProductKey
AND el.TimeKey - A.TimeKey < 300 AND el.ExtranetLinkKey <> A.ExtranetLinkKey
WHERE el.ExtranetLinkKey > @MaxExtranetLinkKey
Thanks for your help 🙂
November 16, 2009 at 7:18 am
You're welcome.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply