Help Converting While Loop to true Set Based

  • The task is to identify duplicated records [Note: NOT true dups but as a PAIR(s) only.]

    A Pair is two records with Identical [DataKey] values but different [TransCode], in this example first TransCode = 'A' and another = 'D'

    --Test Table

    IF OBJECT_ID('tempdb..#TestTBL','U') IS NOT NULL

    DROP TABLE #TestTBL

    CREATE TABLE #TestTBL

    (

    RecID int PRIMARY KEY CLUSTERED,

    DataKey varchar(20),

    TransCode char(1),

    DupInd int

    )

    --Test Data

    INSERT INTO #TestTBL

    (RecID,DataKey, TransCode, DupInd)

    SELECT 1,'Susan','A',0 UNION ALL

    SELECT 2,'Susan','D',0 UNION ALL

    SELECT 3,'Susan','D',0 UNION ALL

    SELECT 4,'Susan','A',0 UNION ALL

    SELECT 5,'Charles','A',0 UNION ALL

    SELECT 6,'Charles','D',0 UNION ALL

    SELECT 7,'Charles','A',0 UNION ALL

    SELECT 8,'Charles','D',0 UNION ALL

    SELECT 9,'Charles','A',0 UNION ALL

    SELECT 10,'Michael','D',0 UNION ALL

    SELECT 11,'Michael','A',0 UNION ALL

    SELECT 12,'Michael','D',0 UNION ALL

    SELECT 13,'Michael','A',0 UNION ALL

    SELECT 14,'Michael','A',0 UNION ALL

    SELECT 15,'Michael','A',0

    In most cases there is more than one pair. So, the loop is used to make sure that ALL pairs are selected...

    /*Straightforward Loop*/

    --declare and init variables

    DECLARE @vintRCI int = (-1), @vintCount int = 101

    WHILE @vintRCI <> 0

    BEGIN

    UPDATE a --set flag for 'A' type

    SET a.DupInd = @vintCount

    FROM #TestTBL a

    INNER JOIN #TestTBL b ON a.DataKey = b.DataKey

    INNER JOIN (

    SELECT DataKey, MIN(RecID) as minID

    FROM #TestTBL

    WHERE TransCode = 'A' AND DupInd = 0

    GROUP BY DataKey

    ) x ON a.DataKey = x.DataKey AND a.RecID = x.minID

    WHERE a.TransCode = 'A' AND b.TransCode = 'D'

    AND a.DupInd = 0 AND b.DupInd = 0

    SELECT @vintRCI = @@ROWCOUNT

    UPDATE a --Set flag for corresponding 'D' type

    SET a.DupInd = @vintCount

    FROM #TestTBL a

    INNER JOIN #TestTBL b ON a.DataKey = b.DataKey

    INNER JOIN (

    SELECT DataKey, MIN(RecID) as minID

    FROM #TestTBL

    WHERE TransCode = 'D' AND DupInd = 0

    GROUP BY DataKey

    ) x ON a.DataKey = x.DataKey AND a.RecID = x.minID

    WHERE a.TransCode = 'D' AND b.TransCode = 'A'

    AND a.DupInd = 0 AND b.DupInd = @vintCount

    SELECT @vintRCI = @@ROWCOUNT

    SET @vintCount=@vintCount + 100

    END

    --set unified flag

    UPDATE #TestTBL SET DupInd = -2 WHERE DupInd > 100

    --Show result

    SELECT RecID, DataKey, TransCode, DupInd,

    CASE DupInd WHEN -2 THEN 'This is a member of a DUP_PAIR' ELSE 'NOT A DUP' END 'Notes'

    FROM #TestTBL

    "Loop based" script above would return correct results, but I have a feeling we can do better. Right?

    Thanks

  • If I may, that won't be an easy one to do. You might want to have a look at the cumulative update or running total articles by Jeff Moden for some ideas...

  • What is your expected results when the process is completed based on your sample data?

  • Lynn, There is a select at the bottom of the script...

    1SusanA-2This is a member of a DUP_PAIR

    2SusanD-2This is a member of a DUP_PAIR

    3SusanD-2This is a member of a DUP_PAIR

    4SusanA-2This is a member of a DUP_PAIR

    5CharlesA-2This is a member of a DUP_PAIR

    6CharlesD-2This is a member of a DUP_PAIR

    7CharlesA-2This is a member of a DUP_PAIR

    8CharlesD-2This is a member of a DUP_PAIR

    9CharlesA0NOT A DUP

    10MichaelD-2This is a member of a DUP_PAIR

    11MichaelA-2This is a member of a DUP_PAIR

    12MichaelD-2This is a member of a DUP_PAIR

    13MichaelA-2This is a member of a DUP_PAIR

    14MichaelA0NOT A DUP

    15MichaelA0NOT A DUP

  • hm... not has hard as I might have thought...

    UPDATE #TestTBL SET DupInd = -2;

    WITH mycte AS (

    SELECT

    Row = ROW_NUMBER() OVER (PARTITION BY [datakey] ORDER BY [RecID]),

    [RecID], [DataKey], [DupInd]

    FROM

    #TestTBL

    WHERE

    TransCode = 'A'

    ),

    mycte2 AS (

    SELECT Row = ROW_NUMBER() OVER (PARTITION BY [datakey] ORDER BY [RecID]),

    [RecID], [DataKey], [DupInd]

    FROM

    #TestTBL

    WHERE

    TransCode = 'D'

    )

    UPDATE

    #TestTBL

    SET

    DupInd = 0

    FROM

    (

    SELECT

    [recid1] = mycte.RecID,

    [recid2]=mycte2.RecID

    FROM

    mycte FULL OUTER JOIN

    mycte2 ON mycte.datakey = mycte2.datakey AND mycte.Row = mycte2.Row

    WHERE mycte.RecID IS NULL OR mycte2.RecID IS NULL) AS a

    where

    (RecID = a.RecID1 AND a.recid2 IS NULL) OR

    (recid = a.recid2 AND a.recid1 IS NULL)

    SELECT RecID, DataKey, TransCode, DupInd,

    CASE WHEN DupInd = -2 THEN 'This is a member of a DUP_PAIR' ELSE 'NOT A DUP' END 'Notes'

    FROM #TestTBL

    Edit: changed order by to match loop query in results. Performance: 45 rows - 55 reads, 14 duration. Loop: 461 reads, 24 duration

  • Outstanding !

    This is a small part of major data scrubbing/cleansing within 2000 --> 2K8 R2 migration project.

    The plan was to leave "dups" behind on a SQL 2000 side. I didn't even think of CTE.

    So, now I have options:

    Either convert your code to pre-CTE

    or may as well move all data to 2008 for processing.

    Many thanks

  • Venom, you're using the same theory I am but your method won't deal with generic TransCodes, you'd have to code for each one. I'm assuming this sample is restricted for ease of use.

    Mike, I need to understand a few things about your pattern. Also, if you'd be so kind, please change the code="plain" to code="sql" for the two portions, it'll make copy/paste easier. the Plain does strange things to the CR/LFs when I cut/paste.

    Regarding this:

    1 Susan A -2 This is a member of a DUP_PAIR

    2 Susan D -2 This is a member of a DUP_PAIR

    3 Susan D -2 This is a member of a DUP_PAIR

    4 Susan A -2 This is a member of a DUP_PAIR

    5 Charles A -2 This is a member of a DUP_PAIR

    6 Charles D -2 This is a member of a DUP_PAIR

    7 Charles A -2 This is a member of a DUP_PAIR

    8 Charles D -2 This is a member of a DUP_PAIR

    9 Charles A 0 NOT A DUP

    10 Michael D -2 This is a member of a DUP_PAIR

    11 Michael A -2 This is a member of a DUP_PAIR

    12 Michael D -2 This is a member of a DUP_PAIR

    13 Michael A -2 This is a member of a DUP_PAIR

    14 Michael A 0 NOT A DUP

    15 Michael A 0 NOT A DUP

    Why is Charles on 5 a dupe when there's no previous charles entry yet Charles on 9 is not a dupe?

    The reason this comes into play is on Row2, Susan is obviously duping because of Row 1, but Row3 is the same. What is the exact business logic of the dupe selection?

    This is the code so far:

    ;WITH RowNumberOrdering AS

    (SELECT

    ROW_NUMBER() OVER ( ORDER BY DataKey, RecID) AS rn,

    RecID, DataKey, TransCode

    FROM

    #TestTBL

    )

    SELECT

    TestRec.RecID,

    CASE WHEN TestRec.DataKey = CheckRec.DataKey

    THEN CASE WHEN TestRec.TransCode <> CheckRec.TransCode

    THEN 'This is a member of a DUP_PAIR'

    ELSE 'NOT A DUP'

    END

    ELSE 'NOT A DUP'

    END AS DupeCheckResult,

    tt.*

    FROM

    RowNumberOrdering AS TestRec

    LEFT JOIN

    RowNumberOrdering AS CheckRec

    ONTestRec.rn = CheckRec.rn - 1

    LEFT JOIN #TestTBL AS tt

    ONTestRec.RecID = tt.RecID

    ORDER BY

    testRec.RecID


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Kraig,

    I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup". you can see this if you expand the original inserts as such

    --Test Table

    IF OBJECT_ID('tempdb..#TestTBL','U') IS NOT NULL

    DROP TABLE #TestTBL

    CREATE TABLE #TestTBL

    (

    RecID int IDENTITY(1,1) PRIMARY KEY CLUSTERED,

    DataKey varchar(20),

    TransCode char(1),

    DupInd int

    )

    --Test Data

    INSERT INTO #TestTBL

    (RecID,DataKey, TransCode, DupInd)

    SELECT 'Susan','A',0 UNION ALL

    SELECT 'Susan','D',0 UNION ALL

    SELECT 'Susan','D',0 UNION ALL

    SELECT 'Susan','A',0 UNION ALL

    SELECT 'Charles','A',0 UNION ALL

    SELECT 'Charles','D',0 UNION ALL

    SELECT 'Charles','A',0 UNION ALL

    SELECT 'Charles','D',0 UNION ALL

    SELECT 'Charles','A',0 UNION ALL

    SELECT 'Michael','D',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Michael','D',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Susan','A',0 UNION ALL

    SELECT 'Susan','D',0 UNION ALL

    SELECT 'Susan','D',0 UNION ALL

    SELECT 'Susan','A',0 UNION ALL

    SELECT 'Charles','A',0 UNION ALL

    SELECT 'Charles','D',0 UNION ALL

    SELECT 'Charles','A',0 UNION ALL

    SELECT 'Charles','D',0 UNION ALL

    SELECT 'Charles','A',0 UNION ALL

    SELECT 'Michael','D',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Michael','D',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Susan','A',0 UNION ALL

    SELECT 'Susan','D',0 UNION ALL

    SELECT 'Susan','D',0 UNION ALL

    SELECT 'Susan','A',0 UNION ALL

    SELECT 'Charles','A',0 UNION ALL

    SELECT 'Charles','D',0 UNION ALL

    SELECT 'Charles','A',0 UNION ALL

    SELECT 'Charles','D',0 UNION ALL

    SELECT 'Charles','A',0 UNION ALL

    SELECT 'Michael','D',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Michael','D',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Michael','A',0 UNION ALL

    SELECT 'Michael','A',0

  • venoym (10/24/2011)


    Kraig,

    I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup".

    Then I'm not sure about Rows 13 through 15.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • OP can correct, but it looks like you start at the top. Select the first row that is "D" and the first row of "A" that have matching DataKeys. If one is missing, it's not a dup. If both exist, both are dups.

    So in Rows 13-15

    Row 13 has Row 12 it's a dup with.

    Row 14 has no Row with a "D", so not a Dup

    Row 15 has no Row with a "D", so not a Dup

  • Mike,

    For option 1, Use a view in place of the CTE. That's all it's doing is providing a query on the Data.

  • Evil Kraig F (10/24/2011)


    venoym (10/24/2011)


    Kraig,

    I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup".

    Then I'm not sure about Rows 13 through 15.

    Yeah, I agree with what the code appears to do, but I wanted to make sure that was the intent. Susan gets herself knocked out completely this way... there is no non-dupe left, so I was looking to confirm the intent from the business rule perspective.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Mike Usoffski (10/24/2011)


    Outstanding !

    This is a small part of major data scrubbing/cleansing within 2000 --> 2K8 R2 migration project.

    The plan was to leave "dups" behind on a SQL 2000 side. I didn't even think of CTE.

    So, now I have options:

    Either convert your code to pre-CTE

    or may as well move all data to 2008 for processing.

    Many thanks

    CTE's are pretty easy to convert in most cases... they just become "Derived Tables" in a FROM clause.

    Unfortunately, it's not quite as easy to duplicate what ROW_NUMBER does but it's also not so hard. Just have the query from each CTE insert into a Temp Table with an IDENTITY column using an ORDER BY.

    Of course, cleansing the data on the 2k8 side like you said is also a simple option.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Evil Kraig F (10/24/2011)


    Evil Kraig F (10/24/2011)


    venoym (10/24/2011)


    Kraig,

    I think what it's analyzing is on the whole find all pairs that match DataKey and have opposite Types. Any "orphan" or "single" DataKey is "Not a dup".

    Then I'm not sure about Rows 13 through 15.

    Yeah, I agree with what the code appears to do, but I wanted to make sure that was the intent. Susan gets herself knocked out completely this way... there is no non-dupe left, so I was looking to confirm the intent from the business rule perspective.

    Craig,

    Sorry, late to the party. Setting my test-bed to see performance on a 12-15mil records. venoym is absolutely right...

    venoym (10/24/2011) Any "orphan" or "single" DataKey is "Not a dup".

    For this snippet it doesn't really matter which record is selected to be a Dup and which one is "orphan". In reality bus rules are implemented by using few datetime columns as a tiebreaker.

    Thanks again guys and girls

  • If in girls you were referring to me, hate to tell you but I'm a guy.

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply