Help Converting While Loop to true Set Based

Question

Help Converting While Loop to true Set Based

Mike Usoffski

SSC Rookie

Points: 38
More actions
October 24, 2011 at 11:34 am

#246823

The task is to identify duplicated records [Note: NOT true dups but as a PAIR(s) only.]
A Pair is two records with Identical [DataKey] values but different [TransCode], in this example first TransCode = 'A' and another = 'D'
--Test Table
IF OBJECT_ID('tempdb..#TestTBL','U') IS NOT NULL
DROP TABLE #TestTBL
CREATE TABLE #TestTBL
(
RecID int PRIMARY KEY CLUSTERED,
DataKey varchar(20),
TransCode char(1),
DupInd int
)
--Test Data
INSERT INTO #TestTBL
(RecID,DataKey, TransCode, DupInd)
SELECT 1,'Susan','A',0 UNION ALL
SELECT 2,'Susan','D',0 UNION ALL
SELECT 3,'Susan','D',0 UNION ALL
SELECT 4,'Susan','A',0 UNION ALL
SELECT 5,'Charles','A',0 UNION ALL
SELECT 6,'Charles','D',0 UNION ALL
SELECT 7,'Charles','A',0 UNION ALL
SELECT 8,'Charles','D',0 UNION ALL
SELECT 9,'Charles','A',0 UNION ALL
SELECT 10,'Michael','D',0 UNION ALL
SELECT 11,'Michael','A',0 UNION ALL
SELECT 12,'Michael','D',0 UNION ALL
SELECT 13,'Michael','A',0 UNION ALL
SELECT 14,'Michael','A',0 UNION ALL
SELECT 15,'Michael','A',0
In most cases there is more than one pair. So, the loop is used to make sure that ALL pairs are selected...
/*Straightforward Loop*/
--declare and init variables
DECLARE @vintRCI int = (-1), @vintCount int = 101
WHILE @vintRCI <> 0
BEGIN
UPDATE a --set flag for 'A' type
SET a.DupInd = @vintCount
FROM #TestTBL a
INNER JOIN #TestTBL b ON a.DataKey = b.DataKey
INNER JOIN (
SELECT DataKey, MIN(RecID) as minID
FROM #TestTBL
WHERE TransCode = 'A' AND DupInd = 0
GROUP BY DataKey
) x ON a.DataKey = x.DataKey AND a.RecID = x.minID
WHERE a.TransCode = 'A' AND b.TransCode = 'D'
AND a.DupInd = 0 AND b.DupInd = 0
SELECT @vintRCI = @@ROWCOUNT
UPDATE a --Set flag for corresponding 'D' type
SET a.DupInd = @vintCount
FROM #TestTBL a
INNER JOIN #TestTBL b ON a.DataKey = b.DataKey
INNER JOIN (
SELECT DataKey, MIN(RecID) as minID
FROM #TestTBL
WHERE TransCode = 'D' AND DupInd = 0
GROUP BY DataKey
) x ON a.DataKey = x.DataKey AND a.RecID = x.minID
WHERE a.TransCode = 'D' AND b.TransCode = 'A'
AND a.DupInd = 0 AND b.DupInd = @vintCount
SELECT @vintRCI = @@ROWCOUNT
SET @vintCount=@vintCount + 100
END
--set unified flag
UPDATE #TestTBL SET DupInd = -2 WHERE DupInd > 100
--Show result
SELECT RecID, DataKey, TransCode, DupInd,
CASE DupInd WHEN -2 THEN 'This is a member of a DUP_PAIR' ELSE 'NOT A DUP' END 'Notes'
FROM #TestTBL
"Loop based" script above would return correct results, but I have a feeling we can do better. Right?
Thanks

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

venoym SSCarpal Tunnel Points: 4161 More actions · Answer 1

If I may, that won't be an easy one to do. You might want to have a look at the cumulative update or running total articles by Jeff Moden for some ideas...

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 2

What is your expected results when the process is completed based on your sample data?

Mike Usoffski SSC Rookie Points: 38 More actions · Answer 3

Lynn, There is a select at the bottom of the script...

1SusanA-2This is a member of a DUP_PAIR

2SusanD-2This is a member of a DUP_PAIR

3SusanD-2This is a member of a DUP_PAIR

4SusanA-2This is a member of a DUP_PAIR

5CharlesA-2This is a member of a DUP_PAIR

6CharlesD-2This is a member of a DUP_PAIR

7CharlesA-2This is a member of a DUP_PAIR

8CharlesD-2This is a member of a DUP_PAIR

9CharlesA0NOT A DUP

10MichaelD-2This is a member of a DUP_PAIR

11MichaelA-2This is a member of a DUP_PAIR

12MichaelD-2This is a member of a DUP_PAIR

13MichaelA-2This is a member of a DUP_PAIR

14MichaelA0NOT A DUP

15MichaelA0NOT A DUP

venoym SSCarpal Tunnel Points: 4161 More actions · Answer 4

hm... not has hard as I might have thought...

UPDATE #TestTBL SET DupInd = -2;

WITH mycte AS (

SELECT

Row = ROW_NUMBER() OVER (PARTITION BY [datakey] ORDER BY [RecID]),

[RecID], [DataKey], [DupInd]

FROM

#TestTBL

WHERE

TransCode = 'A'

),

mycte2 AS (

SELECT Row = ROW_NUMBER() OVER (PARTITION BY [datakey] ORDER BY [RecID]),

[RecID], [DataKey], [DupInd]

FROM

#TestTBL

WHERE

TransCode = 'D'

)

UPDATE

#TestTBL

SET

DupInd = 0

FROM

(

SELECT

[recid1] = mycte.RecID,

[recid2]=mycte2.RecID

FROM

mycte FULL OUTER JOIN

mycte2 ON mycte.datakey = mycte2.datakey AND mycte.Row = mycte2.Row

WHERE mycte.RecID IS NULL OR mycte2.RecID IS NULL) AS a

where

(RecID = a.RecID1 AND a.recid2 IS NULL) OR

(recid = a.recid2 AND a.recid1 IS NULL)

SELECT RecID, DataKey, TransCode, DupInd,

CASE WHEN DupInd = -2 THEN 'This is a member of a DUP_PAIR' ELSE 'NOT A DUP' END 'Notes'

FROM #TestTBL

Edit: changed order by to match loop query in results. Performance: 45 rows - 55 reads, 14 duration. Loop: 461 reads, 24 duration

Mike Usoffski SSC Rookie Points: 38 More actions · Answer 5

Outstanding !

This is a small part of major data scrubbing/cleansing within 2000 --> 2K8 R2 migration project.

The plan was to leave "dups" behind on a SQL 2000 side. I didn't even think of CTE.

So, now I have options:

Either convert your code to pre-CTE

or may as well move all data to 2008 for processing.

Many thanks

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 6

Venom, you're using the same theory I am but your method won't deal with generic TransCodes, you'd have to code for each one. I'm assuming this sample is restricted for ease of use.

Mike, I need to understand a few things about your pattern. Also, if you'd be so kind, please change the code="plain" to code="sql" for the two portions, it'll make copy/paste easier. the Plain does strange things to the CR/LFs when I cut/paste.

Regarding this:

1 Susan A -2 This is a member of a DUP_PAIR

2 Susan D -2 This is a member of a DUP_PAIR

3 Susan D -2 This is a member of a DUP_PAIR

4 Susan A -2 This is a member of a DUP_PAIR

5 Charles A -2 This is a member of a DUP_PAIR

6 Charles D -2 This is a member of a DUP_PAIR

7 Charles A -2 This is a member of a DUP_PAIR

8 Charles D -2 This is a member of a DUP_PAIR

9 Charles A 0 NOT A DUP

10 Michael D -2 This is a member of a DUP_PAIR

11 Michael A -2 This is a member of a DUP_PAIR

12 Michael D -2 This is a member of a DUP_PAIR

13 Michael A -2 This is a member of a DUP_PAIR

14 Michael A 0 NOT A DUP

15 Michael A 0 NOT A DUP

Why is Charles on 5 a dupe when there's no previous charles entry yet Charles on 9 is not a dupe?

The reason this comes into play is on Row2, Susan is obviously duping because of Row 1, but Row3 is the same. What is the exact business logic of the dupe selection?

This is the code so far:

;WITH RowNumberOrdering AS

(SELECT

ROW_NUMBER() OVER ( ORDER BY DataKey, RecID) AS rn,

RecID, DataKey, TransCode

FROM

#TestTBL

)

SELECT

TestRec.RecID,

CASE WHEN TestRec.DataKey = CheckRec.DataKey

THEN CASE WHEN TestRec.TransCode <> CheckRec.TransCode

THEN 'This is a member of a DUP_PAIR'

ELSE 'NOT A DUP'

END

ELSE 'NOT A DUP'

END AS DupeCheckResult,

tt.*

FROM

RowNumberOrdering AS TestRec

LEFT JOIN

RowNumberOrdering AS CheckRec

ONTestRec.rn = CheckRec.rn - 1

LEFT JOIN #TestTBL AS tt

ONTestRec.RecID = tt.RecID

ORDER BY

testRec.RecID

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

venoym SSCarpal Tunnel Points: 4161 More actions · Answer 7

Kraig,

I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup". you can see this if you expand the original inserts as such

--Test Table

IF OBJECT_ID('tempdb..#TestTBL','U') IS NOT NULL

DROP TABLE #TestTBL

CREATE TABLE #TestTBL

(

RecID int IDENTITY(1,1) PRIMARY KEY CLUSTERED,

DataKey varchar(20),

TransCode char(1),

DupInd int

)

--Test Data

INSERT INTO #TestTBL

(RecID,DataKey, TransCode, DupInd)

SELECT 'Susan','A',0 UNION ALL

SELECT 'Susan','D',0 UNION ALL

SELECT 'Susan','A',0 UNION ALL

SELECT 'Charles','A',0 UNION ALL

SELECT 'Charles','D',0 UNION ALL

SELECT 'Charles','A',0 UNION ALL

SELECT 'Charles','D',0 UNION ALL

SELECT 'Charles','A',0 UNION ALL

SELECT 'Michael','D',0 UNION ALL

SELECT 'Michael','A',0 UNION ALL

SELECT 'Michael','D',0 UNION ALL

SELECT 'Michael','A',0 UNION ALL

SELECT 'Susan','A',0 UNION ALL

SELECT 'Susan','D',0 UNION ALL

SELECT 'Susan','A',0 UNION ALL

SELECT 'Charles','A',0 UNION ALL

SELECT 'Charles','D',0 UNION ALL

SELECT 'Charles','A',0 UNION ALL

SELECT 'Charles','D',0 UNION ALL

SELECT 'Charles','A',0 UNION ALL

SELECT 'Michael','D',0 UNION ALL

SELECT 'Michael','A',0 UNION ALL

SELECT 'Michael','D',0 UNION ALL

SELECT 'Michael','A',0 UNION ALL

SELECT 'Susan','A',0 UNION ALL

SELECT 'Susan','D',0 UNION ALL

SELECT 'Susan','A',0 UNION ALL

SELECT 'Charles','A',0 UNION ALL

SELECT 'Charles','D',0 UNION ALL

SELECT 'Charles','A',0 UNION ALL

SELECT 'Charles','D',0 UNION ALL

SELECT 'Charles','A',0 UNION ALL

SELECT 'Michael','D',0 UNION ALL

SELECT 'Michael','A',0 UNION ALL

SELECT 'Michael','D',0 UNION ALL

SELECT 'Michael','A',0 UNION ALL

SELECT 'Michael','A',0

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 8

venoym (10/24/2011)
Kraig,
I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup".

Then I'm not sure about Rows 13 through 15.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

venoym SSCarpal Tunnel Points: 4161 More actions · Answer 9

OP can correct, but it looks like you start at the top. Select the first row that is "D" and the first row of "A" that have matching DataKeys. If one is missing, it's not a dup. If both exist, both are dups.

So in Rows 13-15

Row 13 has Row 12 it's a dup with.

Row 14 has no Row with a "D", so not a Dup

Row 15 has no Row with a "D", so not a Dup

venoym SSCarpal Tunnel Points: 4161 More actions · Answer 10

Mike,

For option 1, Use a view in place of the CTE. That's all it's doing is providing a query on the Data.

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 11

Evil Kraig F (10/24/2011)
venoym (10/24/2011)
Kraig,
I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup".
Then I'm not sure about Rows 13 through 15.

Yeah, I agree with what the code appears to do, but I wanted to make sure that was the intent. Susan gets herself knocked out completely this way... there is no non-dupe left, so I was looking to confirm the intent from the business rule perspective.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 12

Mike Usoffski (10/24/2011)
Outstanding !
This is a small part of major data scrubbing/cleansing within 2000 --> 2K8 R2 migration project.
The plan was to leave "dups" behind on a SQL 2000 side. I didn't even think of CTE.
So, now I have options:
Either convert your code to pre-CTE
or may as well move all data to 2008 for processing.
Many thanks

CTE's are pretty easy to convert in most cases... they just become "Derived Tables" in a FROM clause.

Unfortunately, it's not quite as easy to duplicate what ROW_NUMBER does but it's also not so hard. Just have the query from each CTE insert into a Temp Table with an IDENTITY column using an ORDER BY.

Of course, cleansing the data on the 2k8 side like you said is also a simple option.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Mike Usoffski SSC Rookie Points: 38 More actions · Answer 13

Evil Kraig F (10/24/2011)
Evil Kraig F (10/24/2011)
venoym (10/24/2011)
Kraig,
I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup".
Then I'm not sure about Rows 13 through 15.
Yeah, I agree with what the code appears to do, but I wanted to make sure that was the intent. Susan gets herself knocked out completely this way... there is no non-dupe left, so I was looking to confirm the intent from the business rule perspective.

Craig,

Sorry, late to the party. Setting my test-bed to see performance on a 12-15mil records. venoym is absolutely right...

venoym (10/24/2011) Any "orphan" or "single" DataKey is "Not a dup".

For this snippet it doesn't really matter which record is selected to be a Dup and which one is "orphan". In reality bus rules are implemented by using few datetime columns as a tiebreaker.

Thanks again guys and girls

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 14

If in girls you were referring to me, hate to tell you but I'm a guy.