Trying to not use Cursors - Eliminating Duplicates

Question

Trying to not use Cursors - Eliminating Duplicates

leroy-1092048

Ten Centuries

Points: 1054
More actions
October 22, 2009 at 1:38 pm

#136879

I have written the following T-SQL and it works fine... just slowly.
I am think its from the cursors in it.. but do not know how to do the same thing without it.
the ll_dup_Sender table is a table i populated that has all the duplicate and a new column called 'rn'
if a sender is duplicated (based on columns: local_part, domain_part, is_local)then the one with the 'rn' = 1 is the one to keep. the other ones need to be removed. hence, the trick part. any data in the Sender_data table that references the sender_id key in the ll_dup_sender table need to be updated to begin to reference the one that will remain. then finally all the duplicates can be deleted.
for example....
ll_dup_sender table
sender_id | local_part | Domain_part | is_local | rn
-----------------------------------------------------
1 | someone | domain.com | 0 | 2
2 | someone | domain.com | 0 | 1
3 | someone | domain.com | 0 | 3
44 | someone | domain.com | 0 | 4
Sender_data table
stat_id | sender_id | More columns ->
-----------------------------------------------------
1 | 1 |
2 | 44 |
3 | 44 |
44 | 3 |
so basically the Sender_data table needs to be updated so any rows with sender_ID of 3,44, 2 are updated to sender_id = 1 (the one to keep, FYI -- always the one with the lowest ID number) no info to be removed from the MT_monitor_fact_scanned table and no sender_ID can be orphaned from the ll_dup_sender table. and not all in the ll_dup_sender table are duplicated and some may be duplicated upto 8 times. what a mess!
i just cannot figure out how to do this without a cursor and it will just take to long with the way i am doing it
Thanks
here is the T-SQL i have been using:
----------------------------------------
declare @DupCount int, declare @Real_sender_id int, declare @sender_id int
declare @local_part varchar(255), declare @domain_Part varchar(255), declare @Is_Local bit
declare @strSQL varchar(255), declare @Query1 varchar(255), declare @Query2 varchar(255)
declare @fixedrows int, declare @rn int, declare @Cursor_rowcount int
SET NOCOUNT ON
select @DupCount = Count(sender_id) from ll_dup_sender
while @DupCount != 0
begin
set @strSQL = '('
select top(1) @local_part = Local_part, @domain_Part = domain_Part, @Is_Local = Is_Local from ll_dup_sender
DECLARE Bad_Sender_ID CURSOR LOCAL STATIC FOR
select sender_id, rn from ll_dup_sender where local_part = @local_part and domain_Part = @domain_Part and Is_Local = @Is_Local
open Bad_sender_ID
fetch next from Bad_sender_ID into @sender_id, @rn
set @Cursor_rowcount = @@cursor_rows
WHILE @@FETCH_STATUS = 0
BEGIN
if @Cursor_rowcount = 1
Begin
set @Real_sender_id = @sender_id
goto OnlyDelete
end
if @rn > 1 set @strSQL = @strSQL + cast(@sender_id as varchar) + ', '
if @rn = 1 set @Real_sender_id = @sender_id
FETCH NEXT FROM Bad_Sender_ID into @sender_id, @rn
END
set @strSQL = left(@strSQL, len(@strSQL) - 1) + ')'
set @Query1 = 'update Sender_data set sender_id = ' + cast(@Real_sender_id as varchar) + ' where sender_id IN ' + @strSQL
EXEC (@Query1)
OnlyDelete:
if @Cursor_rowcount = 1 set @Query2 = 'delete from ll_dup_sender where sender_id = ' + cast(@Real_sender_id as varchar)
if @Cursor_rowcount > 1 set @Query2 = 'delete from ll_dup_sender where sender_id IN ' + @strSQL + ' or sender_id = ' + cast(@Real_sender_id as varchar)
EXEC (@Query2)
set @DupCount = @DupCount -@@rowcount
CLOSE Bad_Sender_ID
DEALLOCATE Bad_Sender_ID
end

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Garadin One Orange Chip Points: 29613 More actions · Answer 1

Thanks for providing your code. However, if you could provide sample data and the results you would like to get (see the link in my signature on how to provide sample data for the correct way to do this), it would make it much easier for us to help you write some set based code.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

andrewd.smith SSCrazy Eights Points: 8444 More actions · Answer 2

If I understand your cursor code correctly, then the following 2 statements will do the equivalent. The self join on the ll_dup_sender table is effectvely creating a map linking duplicate sender_id values (where column rn > 1) to their associated "actual" sender_id values (where column rn = 1). The TRUNCATE statement just removes all values from the ll_dup_sender table since you don't appear to need it after updating the Sender_data table.

UPDATE SD SET sender_id = ACTUAL.sender_id

FROM ll_dup_sender ACTUAL

INNER JOIN ll_dup_sender DUP ON (

ACTUAL.local_part = DUP.local_part

AND ACTUAL.Domain_part = DUP.Domain_part

AND ACTUAL.is_local = DUP.is_local

AND ACTUAL.rn = 1

AND DUP.rn > 1

)

INNER JOIN Sender_data SD ON (SD.sender_id = DUP.sender_id)

TRUNCATE TABLE ll_dup_sender

The following version will be identical in terms of result and will very likely perform similarly, but the mapping from duplicate to actual sender_id values is made explicit in the CTE.

;WITH cteMAP AS (

SELECT

ACTUAL.sender_id AS actual_sender_id,

DUP.sender_id AS dup_sender_id

FROM ll_dup_sender ACTUAL

INNER JOIN ll_dup_sender DUP ON (

ACTUAL.local_part = DUP.local_part

AND ACTUAL.Domain_part = DUP.Domain_part

AND ACTUAL.is_local = DUP.is_local

AND ACTUAL.rn = 1

AND DUP.rn > 1

)

UPDATE SD SET sender_id = MAP.actual_sender_id

FROM Sender_data SD

INNER JOIN cteMAP MAP ON (SD.sender_id = MAP.dup_sender_id)

TRUNCATE TABLE ll_dup_sender

leroy-1092048 Ten Centuries Points: 1054 More actions · Answer 3

Thanks for the reply..

i will setup some test data and give the code a try, i will let you know how it goes.

Thanks

LL

Jeff Moden SSC Guru Points: 1004312 More actions · Answer 4

As a side bar, if you had setup some test data and provided the scripts to gen that test data, any solutions folks gave you would probably already be tested. Please see the first link in my signature line below for how and what the format for the data should be.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

leroy-1092048 Ten Centuries Points: 1054 More actions · Answer 5

Jeff,

Thanks for pointing out the link on posting etiquette.

i have read the link when Garadin posted the same link in a reply to this post.

i was preparing some data to post to this thread but got sidetracked with some other things, when i had time to get back on this task a possible solution seemed to be posted (preliminary tests look good, thanks again andrewed)

so right now at the moment since it looks as if the puzzle has been solved it will hold on putting together a test data and table structure info since it does not appear to be needed anymore.

in the future i will follow the info as in the link provided by you and Garadin.

Thanks

Leroy L

leroy-1092048 Ten Centuries Points: 1054 More actions · Answer 6

andrewd,

I have tested your suggestion on sample data and it worked perfectly as is.

running the T-SQL on the actual data worked successfully, and I appreciate you for your time and thought.

Leroy L