October 9, 2003 at 1:17 pm
I have an interesting scenario. I am getting ASCII files from our mainframe system that contain duplicate records (they say they can't weed them out without causing severe system slowdowns), so I am looking for the best way to remove the dups. I was reading an old article about the IGNORE_DUP_KEY option for an index. Does that option automatically delete dups among newly added rows, or does it just throw an error saying I am violating the primary key? I think I am going to create a new table, populate it with the data, and weed the extras out.
Matt
October 9, 2003 at 1:28 pm
I think you are headed in the right direction.
October 10, 2003 at 2:33 am
You could try using code like this to weed out duplicates. You need to ensure that the table is ordered on the field that you are taking the variables from.
It is usually recommended that cursors be avoided if possible; this is the only circumstance under which I would normally use them.
Hope this helps
-- 1. Declare variables used to compare Vendor keys
DECLARE @kdnr_var1 char(6), @kdnr_var2 char(6)

-- 2. Declare and run cursor against Vendor work table
DECLARE dedupe_wrkVM3 CURSOR
FOR SELECT KDNR FROM wrkVM3 ORDER BY KDNR FOR UPDATE

OPEN dedupe_wrkVM3
FETCH NEXT FROM dedupe_wrkVM3 INTO @kdnr_var1

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Remember the key just read, then fetch the next row
    SELECT @kdnr_var2 = @kdnr_var1
    FETCH NEXT FROM dedupe_wrkVM3 INTO @kdnr_var1

    -- If the new key matches the previous one, log it and delete the duplicate row
    IF @@FETCH_STATUS = 0 AND @kdnr_var2 = @kdnr_var1
    BEGIN
        INSERT INTO errSourceData
        VALUES (GETDATE(), 'wrkVM3', 'KDNR', @kdnr_var2, 'Duplicate Vendor Record')

        DELETE FROM wrkVM3 WHERE CURRENT OF dedupe_wrkVM3
    END
END

CLOSE dedupe_wrkVM3
DEALLOCATE dedupe_wrkVM3
GO
October 10, 2003 at 1:39 pm
Hi Matt101,
IGNORE_DUP_KEY will check each attempted insert or update to verify that its key(s) are unique. Rows that violate the constraint are ignored, i.e. they are not inserted into the table; a warning is raised, but the batch proceeds. You will take a performance hit when using this option. Also, you will be unable to acquire a BU (bulk update) lock during bulk inserts if there are indexes on the table.
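If it helps, here is roughly what that behaviour looks like; the table and column names below are just placeholders of my own, not anything from your system:

-- Placeholder table and key column, for illustration only
CREATE TABLE wrkVendor (KDNR char(6) NOT NULL, VendorName varchar(50) NULL)

-- Unique index that discards duplicate keys instead of failing the batch
CREATE UNIQUE INDEX ix_wrkVendor_KDNR ON wrkVendor (KDNR) WITH IGNORE_DUP_KEY

INSERT INTO wrkVendor (KDNR, VendorName) VALUES ('000001', 'Acme')
-- Inserting the same key again raises the "Duplicate key was ignored." warning;
-- the row is not inserted, but the batch keeps running.
INSERT INTO wrkVendor (KDNR, VendorName) VALUES ('000001', 'Acme')

SELECT * FROM wrkVendor  -- returns a single row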
Deleting the duplicate records after the fact is an option. Depending on the duplicates (row size & quantity), the delete may exercise the log dramatically.
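For example, something along these lines (just a sketch, reusing the wrkVM3/KDNR names from the cursor post above, and assuming you are free to add and drop a helper column):

-- Add a temporary identity column so otherwise identical rows can be told apart
ALTER TABLE wrkVM3 ADD rowid int IDENTITY(1,1)

-- Keep the first row per key and delete the rest (fully logged, so watch the log on big loads)
DELETE FROM wrkVM3
WHERE rowid NOT IN (SELECT MIN(rowid) FROM wrkVM3 GROUP BY KDNR)

-- Remove the helper column when done
ALTER TABLE wrkVM3 DROP COLUMN rowid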
Another option might be to allow the duplicates. Following the load, put an index on the table and build a view that uses DISTINCT for accessing the data in the table.
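Something like this, again with made-up names:

-- View that hides the duplicates from readers (names are placeholders)
CREATE VIEW vVendorClean
AS
SELECT DISTINCT KDNR
FROM wrkVM3
GO

-- Queries go against the view instead of the base table
SELECT KDNR FROM vVendorClean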
And another option might be to load the data into a staging table and then use SELECT DISTINCT to insert into the final table. Remember to drop or truncate your staging table once done. Bulk loading without indexes & using the “tablock” hint will be your fastest load option.
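Roughly like this; the file path, terminators, and table names are placeholders you would replace with your own:

-- Staging table with no indexes so the bulk load stays fast
CREATE TABLE stgVendor (KDNR char(6) NOT NULL, VendorName varchar(50) NULL)

BULK INSERT stgVendor
FROM 'C:\feeds\vendor.txt'
WITH (TABLOCK, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')

-- Move only the distinct rows into the final table, then clear the stage
INSERT INTO Vendor (KDNR, VendorName)
SELECT DISTINCT KDNR, VendorName
FROM stgVendor

TRUNCATE TABLE stgVendor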
-- “it depends” on your data (types, columns, record count)
Regards,
October 10, 2003 at 1:53 pm
Matt,
I've had to do something similar, and the staging table approach worked best for me. It's very clear, easy to validate, and easy to clean up.
-Kel
K Leb
October 10, 2003 at 5:18 pm
I've used the stage table technique for a long time to weed out dupes. It is by far the easiest way to go. Plus you can archive the stage table for safekeeping in case you need to redo the import or validate that you got the correct data. I'm actually in the middle of exactly this scenario right now as I upgrade my database to no longer use managed identities. Such fun! 🙂
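In case it is useful, archiving the stage table can be as simple as something like this (table names are just examples of mine):

-- Snapshot the stage table into a dated archive copy before clearing it
SELECT *
INTO stgVendor_archive_20031010
FROM stgVendor

TRUNCATE TABLE stgVendor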
Gary Johnson
Microsoft Natural Language Group
DBA, Sr. DB Engineer
This posting is provided "AS IS" with no warranties, and confers no rights. The opinions expressed in this post are my own and may not reflect that of my employer.