October 24, 2011 at 11:34 am
The task is to identify duplicated records [Note: NOT true dups but as a PAIR(s) only.]
A Pair is two records with Identical [DataKey] values but different [TransCode], in this example first TransCode = 'A' and another = 'D'
--Test Table
IF OBJECT_ID('tempdb..#TestTBL','U') IS NOT NULL
DROP TABLE #TestTBL
CREATE TABLE #TestTBL
(
RecID int PRIMARY KEY CLUSTERED,
DataKey varchar(20),
TransCode char(1),
DupInd int
)
--Test Data
INSERT INTO #TestTBL
(RecID,DataKey, TransCode, DupInd)
SELECT 1,'Susan','A',0 UNION ALL
SELECT 2,'Susan','D',0 UNION ALL
SELECT 3,'Susan','D',0 UNION ALL
SELECT 4,'Susan','A',0 UNION ALL
SELECT 5,'Charles','A',0 UNION ALL
SELECT 6,'Charles','D',0 UNION ALL
SELECT 7,'Charles','A',0 UNION ALL
SELECT 8,'Charles','D',0 UNION ALL
SELECT 9,'Charles','A',0 UNION ALL
SELECT 10,'Michael','D',0 UNION ALL
SELECT 11,'Michael','A',0 UNION ALL
SELECT 12,'Michael','D',0 UNION ALL
SELECT 13,'Michael','A',0 UNION ALL
SELECT 14,'Michael','A',0 UNION ALL
SELECT 15,'Michael','A',0
In most cases there is more than one pair. So, the loop is used to make sure that ALL pairs are selected...
/*Straightforward Loop*/
--declare and init variables
DECLARE @vintRCI int = (-1), @vintCount int = 101
WHILE @vintRCI <> 0
BEGIN
UPDATE a --set flag for 'A' type
SET a.DupInd = @vintCount
FROM #TestTBL a
INNER JOIN #TestTBL b ON a.DataKey = b.DataKey
INNER JOIN (
SELECT DataKey, MIN(RecID) as minID
FROM #TestTBL
WHERE TransCode = 'A' AND DupInd = 0
GROUP BY DataKey
) x ON a.DataKey = x.DataKey AND a.RecID = x.minID
WHERE a.TransCode = 'A' AND b.TransCode = 'D'
AND a.DupInd = 0 AND b.DupInd = 0
SELECT @vintRCI = @@ROWCOUNT
UPDATE a --Set flag for corresponding 'D' type
SET a.DupInd = @vintCount
FROM #TestTBL a
INNER JOIN #TestTBL b ON a.DataKey = b.DataKey
INNER JOIN (
SELECT DataKey, MIN(RecID) as minID
FROM #TestTBL
WHERE TransCode = 'D' AND DupInd = 0
GROUP BY DataKey
) x ON a.DataKey = x.DataKey AND a.RecID = x.minID
WHERE a.TransCode = 'D' AND b.TransCode = 'A'
AND a.DupInd = 0 AND b.DupInd = @vintCount
SELECT @vintRCI = @@ROWCOUNT
SET @vintCount=@vintCount + 100
END
--set unified flag
UPDATE #TestTBL SET DupInd = -2 WHERE DupInd > 100
--Show result
SELECT RecID, DataKey, TransCode, DupInd,
CASE DupInd WHEN -2 THEN 'This is a member of a DUP_PAIR' ELSE 'NOT A DUP' END 'Notes'
FROM #TestTBL
"Loop based" script above would return correct results, but I have a feeling we can do better. Right?
Thanks
October 24, 2011 at 12:07 pm
If I may, that won't be an easy one to do. You might want to have a look at the cumulative update or running total articles by Jeff Moden for some ideas...
October 24, 2011 at 12:11 pm
What is your expected results when the process is completed based on your sample data?
October 24, 2011 at 12:21 pm
Lynn, There is a select at the bottom of the script...
1SusanA-2This is a member of a DUP_PAIR
2SusanD-2This is a member of a DUP_PAIR
3SusanD-2This is a member of a DUP_PAIR
4SusanA-2This is a member of a DUP_PAIR
5CharlesA-2This is a member of a DUP_PAIR
6CharlesD-2This is a member of a DUP_PAIR
7CharlesA-2This is a member of a DUP_PAIR
8CharlesD-2This is a member of a DUP_PAIR
9CharlesA0NOT A DUP
10MichaelD-2This is a member of a DUP_PAIR
11MichaelA-2This is a member of a DUP_PAIR
12MichaelD-2This is a member of a DUP_PAIR
13MichaelA-2This is a member of a DUP_PAIR
14MichaelA0NOT A DUP
15MichaelA0NOT A DUP
October 24, 2011 at 12:23 pm
hm... not has hard as I might have thought...
UPDATE #TestTBL SET DupInd = -2;
WITH mycte AS (
SELECT
Row = ROW_NUMBER() OVER (PARTITION BY [datakey] ORDER BY [RecID]),
[RecID], [DataKey], [DupInd]
FROM
#TestTBL
WHERE
TransCode = 'A'
),
mycte2 AS (
SELECT Row = ROW_NUMBER() OVER (PARTITION BY [datakey] ORDER BY [RecID]),
[RecID], [DataKey], [DupInd]
FROM
#TestTBL
WHERE
TransCode = 'D'
)
UPDATE
#TestTBL
SET
DupInd = 0
FROM
(
SELECT
[recid1] = mycte.RecID,
[recid2]=mycte2.RecID
FROM
mycte FULL OUTER JOIN
mycte2 ON mycte.datakey = mycte2.datakey AND mycte.Row = mycte2.Row
WHERE mycte.RecID IS NULL OR mycte2.RecID IS NULL) AS a
where
(RecID = a.RecID1 AND a.recid2 IS NULL) OR
(recid = a.recid2 AND a.recid1 IS NULL)
SELECT RecID, DataKey, TransCode, DupInd,
CASE WHEN DupInd = -2 THEN 'This is a member of a DUP_PAIR' ELSE 'NOT A DUP' END 'Notes'
FROM #TestTBL
Edit: changed order by to match loop query in results. Performance: 45 rows - 55 reads, 14 duration. Loop: 461 reads, 24 duration
October 24, 2011 at 1:04 pm
Outstanding !
This is a small part of major data scrubbing/cleansing within 2000 --> 2K8 R2 migration project.
The plan was to leave "dups" behind on a SQL 2000 side. I didn't even think of CTE.
So, now I have options:
Either convert your code to pre-CTE
or may as well move all data to 2008 for processing.
Many thanks
October 24, 2011 at 1:06 pm
Venom, you're using the same theory I am but your method won't deal with generic TransCodes, you'd have to code for each one. I'm assuming this sample is restricted for ease of use.
Mike, I need to understand a few things about your pattern. Also, if you'd be so kind, please change the code="plain" to code="sql" for the two portions, it'll make copy/paste easier. the Plain does strange things to the CR/LFs when I cut/paste.
Regarding this:
1 Susan A -2 This is a member of a DUP_PAIR
2 Susan D -2 This is a member of a DUP_PAIR
3 Susan D -2 This is a member of a DUP_PAIR
4 Susan A -2 This is a member of a DUP_PAIR
5 Charles A -2 This is a member of a DUP_PAIR
6 Charles D -2 This is a member of a DUP_PAIR
7 Charles A -2 This is a member of a DUP_PAIR
8 Charles D -2 This is a member of a DUP_PAIR
9 Charles A 0 NOT A DUP
10 Michael D -2 This is a member of a DUP_PAIR
11 Michael A -2 This is a member of a DUP_PAIR
12 Michael D -2 This is a member of a DUP_PAIR
13 Michael A -2 This is a member of a DUP_PAIR
14 Michael A 0 NOT A DUP
15 Michael A 0 NOT A DUP
Why is Charles on 5 a dupe when there's no previous charles entry yet Charles on 9 is not a dupe?
The reason this comes into play is on Row2, Susan is obviously duping because of Row 1, but Row3 is the same. What is the exact business logic of the dupe selection?
This is the code so far:
;WITH RowNumberOrdering AS
(SELECT
ROW_NUMBER() OVER ( ORDER BY DataKey, RecID) AS rn,
RecID, DataKey, TransCode
FROM
#TestTBL
)
SELECT
TestRec.RecID,
CASE WHEN TestRec.DataKey = CheckRec.DataKey
THEN CASE WHEN TestRec.TransCode <> CheckRec.TransCode
THEN 'This is a member of a DUP_PAIR'
ELSE 'NOT A DUP'
END
ELSE 'NOT A DUP'
END AS DupeCheckResult,
tt.*
FROM
RowNumberOrdering AS TestRec
LEFT JOIN
RowNumberOrdering AS CheckRec
ONTestRec.rn = CheckRec.rn - 1
LEFT JOIN #TestTBL AS tt
ONTestRec.RecID = tt.RecID
ORDER BY
testRec.RecID
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 24, 2011 at 1:12 pm
Kraig,
I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup". you can see this if you expand the original inserts as such
--Test Table
IF OBJECT_ID('tempdb..#TestTBL','U') IS NOT NULL
DROP TABLE #TestTBL
CREATE TABLE #TestTBL
(
RecID int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
DataKey varchar(20),
TransCode char(1),
DupInd int
)
--Test Data
INSERT INTO #TestTBL
(RecID,DataKey, TransCode, DupInd)
SELECT 'Susan','A',0 UNION ALL
SELECT 'Susan','D',0 UNION ALL
SELECT 'Susan','D',0 UNION ALL
SELECT 'Susan','A',0 UNION ALL
SELECT 'Charles','A',0 UNION ALL
SELECT 'Charles','D',0 UNION ALL
SELECT 'Charles','A',0 UNION ALL
SELECT 'Charles','D',0 UNION ALL
SELECT 'Charles','A',0 UNION ALL
SELECT 'Michael','D',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Michael','D',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Susan','A',0 UNION ALL
SELECT 'Susan','D',0 UNION ALL
SELECT 'Susan','D',0 UNION ALL
SELECT 'Susan','A',0 UNION ALL
SELECT 'Charles','A',0 UNION ALL
SELECT 'Charles','D',0 UNION ALL
SELECT 'Charles','A',0 UNION ALL
SELECT 'Charles','D',0 UNION ALL
SELECT 'Charles','A',0 UNION ALL
SELECT 'Michael','D',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Michael','D',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Susan','A',0 UNION ALL
SELECT 'Susan','D',0 UNION ALL
SELECT 'Susan','D',0 UNION ALL
SELECT 'Susan','A',0 UNION ALL
SELECT 'Charles','A',0 UNION ALL
SELECT 'Charles','D',0 UNION ALL
SELECT 'Charles','A',0 UNION ALL
SELECT 'Charles','D',0 UNION ALL
SELECT 'Charles','A',0 UNION ALL
SELECT 'Michael','D',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Michael','D',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Michael','A',0 UNION ALL
SELECT 'Michael','A',0
October 24, 2011 at 1:13 pm
venoym (10/24/2011)
Kraig,I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup".
Then I'm not sure about Rows 13 through 15.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 24, 2011 at 1:16 pm
OP can correct, but it looks like you start at the top. Select the first row that is "D" and the first row of "A" that have matching DataKeys. If one is missing, it's not a dup. If both exist, both are dups.
So in Rows 13-15
Row 13 has Row 12 it's a dup with.
Row 14 has no Row with a "D", so not a Dup
Row 15 has no Row with a "D", so not a Dup
October 24, 2011 at 1:18 pm
Mike,
For option 1, Use a view in place of the CTE. That's all it's doing is providing a query on the Data.
October 24, 2011 at 1:26 pm
Evil Kraig F (10/24/2011)
venoym (10/24/2011)
Kraig,I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup".
Then I'm not sure about Rows 13 through 15.
Yeah, I agree with what the code appears to do, but I wanted to make sure that was the intent. Susan gets herself knocked out completely this way... there is no non-dupe left, so I was looking to confirm the intent from the business rule perspective.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 24, 2011 at 1:53 pm
Mike Usoffski (10/24/2011)
Outstanding !This is a small part of major data scrubbing/cleansing within 2000 --> 2K8 R2 migration project.
The plan was to leave "dups" behind on a SQL 2000 side. I didn't even think of CTE.
So, now I have options:
Either convert your code to pre-CTE
or may as well move all data to 2008 for processing.
Many thanks
CTE's are pretty easy to convert in most cases... they just become "Derived Tables" in a FROM clause.
Unfortunately, it's not quite as easy to duplicate what ROW_NUMBER does but it's also not so hard. Just have the query from each CTE insert into a Temp Table with an IDENTITY column using an ORDER BY.
Of course, cleansing the data on the 2k8 side like you said is also a simple option.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 24, 2011 at 2:07 pm
Evil Kraig F (10/24/2011)
Evil Kraig F (10/24/2011)
venoym (10/24/2011)
Kraig,I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup".
Then I'm not sure about Rows 13 through 15.
Yeah, I agree with what the code appears to do, but I wanted to make sure that was the intent. Susan gets herself knocked out completely this way... there is no non-dupe left, so I was looking to confirm the intent from the business rule perspective.
Craig,
Sorry, late to the party. Setting my test-bed to see performance on a 12-15mil records. venoym is absolutely right...
venoym (10/24/2011) Any "orphan" or "single" DataKey is "Not a dup".
For this snippet it doesn't really matter which record is selected to be a Dup and which one is "orphan". In reality bus rules are implemented by using few datetime columns as a tiebreaker.
Thanks again guys and girls
October 24, 2011 at 2:45 pm
If in girls you were referring to me, hate to tell you but I'm a guy.
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply