Trying to not use Cursors - Eliminating Duplicates

  • I have written the following T-SQL and it works fine... just slowly.

    I am think its from the cursors in it.. but do not know how to do the same thing without it.

    the ll_dup_Sender table is a table i populated that has all the duplicate and a new column called 'rn'

    if a sender is duplicated (based on columns: local_part, domain_part, is_local)then the one with the 'rn' = 1 is the one to keep. the other ones need to be removed. hence, the trick part. any data in the Sender_data table that references the sender_id key in the ll_dup_sender table need to be updated to begin to reference the one that will remain. then finally all the duplicates can be deleted.

    for example....

    ll_dup_sender table

    sender_id | local_part | Domain_part | is_local | rn

    -----------------------------------------------------

    1 | someone | domain.com | 0 | 2

    2 | someone | domain.com | 0 | 1

    3 | someone | domain.com | 0 | 3

    44 | someone | domain.com | 0 | 4

    Sender_data table

    stat_id | sender_id | More columns ->

    -----------------------------------------------------

    1 | 1 |

    2 | 44 |

    3 | 44 |

    44 | 3 |

    so basically the Sender_data table needs to be updated so any rows with sender_ID of 3,44, 2 are updated to sender_id = 1 (the one to keep, FYI -- always the one with the lowest ID number) no info to be removed from the MT_monitor_fact_scanned table and no sender_ID can be orphaned from the ll_dup_sender table. and not all in the ll_dup_sender table are duplicated and some may be duplicated upto 8 times. what a mess!

    i just cannot figure out how to do this without a cursor and it will just take to long with the way i am doing it

    Thanks

    here is the T-SQL i have been using:

    ----------------------------------------

    declare @DupCount int, declare @Real_sender_id int, declare @sender_id int

    declare @local_part varchar(255), declare @domain_Part varchar(255), declare @Is_Local bit

    declare @strSQL varchar(255), declare @Query1 varchar(255), declare @Query2 varchar(255)

    declare @fixedrows int, declare @rn int, declare @Cursor_rowcount int

    SET NOCOUNT ON

    select @DupCount = Count(sender_id) from ll_dup_sender

    while @DupCount != 0

    begin

    set @strSQL = '('

    select top(1) @local_part = Local_part, @domain_Part = domain_Part, @Is_Local = Is_Local from ll_dup_sender

    DECLARE Bad_Sender_ID CURSOR LOCAL STATIC FOR

    select sender_id, rn from ll_dup_sender where local_part = @local_part and domain_Part = @domain_Part and Is_Local = @Is_Local

    open Bad_sender_ID

    fetch next from Bad_sender_ID into @sender_id, @rn

    set @Cursor_rowcount = @@cursor_rows

    WHILE @@FETCH_STATUS = 0

    BEGIN

    if @Cursor_rowcount = 1

    Begin

    set @Real_sender_id = @sender_id

    goto OnlyDelete

    end

    if @rn > 1 set @strSQL = @strSQL + cast(@sender_id as varchar) + ', '

    if @rn = 1 set @Real_sender_id = @sender_id

    FETCH NEXT FROM Bad_Sender_ID into @sender_id, @rn

    END

    set @strSQL = left(@strSQL, len(@strSQL) - 1) + ')'

    set @Query1 = 'update Sender_data set sender_id = ' + cast(@Real_sender_id as varchar) + ' where sender_id IN ' + @strSQL

    EXEC (@Query1)

    OnlyDelete:

    if @Cursor_rowcount = 1 set @Query2 = 'delete from ll_dup_sender where sender_id = ' + cast(@Real_sender_id as varchar)

    if @Cursor_rowcount > 1 set @Query2 = 'delete from ll_dup_sender where sender_id IN ' + @strSQL + ' or sender_id = ' + cast(@Real_sender_id as varchar)

    EXEC (@Query2)

    set @DupCount = @DupCount -@@rowcount

    CLOSE Bad_Sender_ID

    DEALLOCATE Bad_Sender_ID

    end

  • Thanks for providing your code. However, if you could provide sample data and the results you would like to get (see the link in my signature on how to provide sample data for the correct way to do this), it would make it much easier for us to help you write some set based code.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • If I understand your cursor code correctly, then the following 2 statements will do the equivalent. The self join on the ll_dup_sender table is effectvely creating a map linking duplicate sender_id values (where column rn > 1) to their associated "actual" sender_id values (where column rn = 1). The TRUNCATE statement just removes all values from the ll_dup_sender table since you don't appear to need it after updating the Sender_data table.

    UPDATE SD SET sender_id = ACTUAL.sender_id

    FROM ll_dup_sender ACTUAL

    INNER JOIN ll_dup_sender DUP ON (

    ACTUAL.local_part = DUP.local_part

    AND ACTUAL.Domain_part = DUP.Domain_part

    AND ACTUAL.is_local = DUP.is_local

    AND ACTUAL.rn = 1

    AND DUP.rn > 1

    )

    INNER JOIN Sender_data SD ON (SD.sender_id = DUP.sender_id)

    TRUNCATE TABLE ll_dup_sender

    The following version will be identical in terms of result and will very likely perform similarly, but the mapping from duplicate to actual sender_id values is made explicit in the CTE.

    ;WITH cteMAP AS (

    SELECT

    ACTUAL.sender_id AS actual_sender_id,

    DUP.sender_id AS dup_sender_id

    FROM ll_dup_sender ACTUAL

    INNER JOIN ll_dup_sender DUP ON (

    ACTUAL.local_part = DUP.local_part

    AND ACTUAL.Domain_part = DUP.Domain_part

    AND ACTUAL.is_local = DUP.is_local

    AND ACTUAL.rn = 1

    AND DUP.rn > 1

    )

    )

    UPDATE SD SET sender_id = MAP.actual_sender_id

    FROM Sender_data SD

    INNER JOIN cteMAP MAP ON (SD.sender_id = MAP.dup_sender_id)

    TRUNCATE TABLE ll_dup_sender

  • Thanks for the reply..

    i will setup some test data and give the code a try, i will let you know how it goes.

    Thanks

    LL

  • As a side bar, if you had setup some test data and provided the scripts to gen that test data, any solutions folks gave you would probably already be tested. Please see the first link in my signature line below for how and what the format for the data should be.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    Thanks for pointing out the link on posting etiquette.

    i have read the link when Garadin posted the same link in a reply to this post.

    i was preparing some data to post to this thread but got sidetracked with some other things, when i had time to get back on this task a possible solution seemed to be posted (preliminary tests look good, thanks again andrewed)

    so right now at the moment since it looks as if the puzzle has been solved it will hold on putting together a test data and table structure info since it does not appear to be needed anymore.

    in the future i will follow the info as in the link provided by you and Garadin.

    Thanks

    Leroy L

  • andrewd,

    I have tested your suggestion on sample data and it worked perfectly as is.

    running the T-SQL on the actual data worked successfully, and I appreciate you for your time and thought.

    Leroy L

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply