October 9, 2003 at 1:17 pm
I have an interesting scenario. I am getting ASCII files from our mainframe system that contain duplicate records (they say they can't weed them out without causing severe system slowdowns), so I am looking for the best way to remove the dups. I was reading an old article about the IGNORE_DUP_KEY option for an index. Does that option automatically delete dups among newly added rows, or does it just throw an error saying I am violating the primary key? I think I am going to create a new table, populate it with the data, and weed the extras out.
Matt
October 9, 2003 at 1:28 pm
I think you are headed in the right direction.
October 10, 2003 at 2:33 am
You could try using code like this to weed out duplicates. You need to ensure that the table is ordered on the field that you are taking the variables from.
It is usually recommended that cursors be avoided if possible; this is the only circumstance under which I would normally use them.
Hope this helps
-- 1. Declare variables used to compare Vendor keys
DECLARE @kdnr_var1 char(6), @kdnr_var2 char(6)

-- 2. Declare and run cursor against Vendor work table
DECLARE dedupe_wrkVM3 CURSOR
FOR SELECT KDNR FROM wrkVM3 ORDER BY KDNR FOR UPDATE

OPEN dedupe_wrkVM3
FETCH NEXT FROM dedupe_wrkVM3 INTO @kdnr_var1

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Remember the key just read, then fetch the next row
    SELECT @kdnr_var2 = @kdnr_var1
    FETCH NEXT FROM dedupe_wrkVM3 INTO @kdnr_var1

    -- If the new key matches the previous one, log it and delete the duplicate row
    IF @@FETCH_STATUS = 0 AND @kdnr_var2 = @kdnr_var1
    BEGIN
        INSERT INTO errSourceData
        VALUES (GETDATE(), 'wrkVM3', 'KDNR', @kdnr_var2, 'Duplicate Vendor Record')

        DELETE FROM wrkVM3 WHERE CURRENT OF dedupe_wrkVM3
    END
END

CLOSE dedupe_wrkVM3
DEALLOCATE dedupe_wrkVM3
GO
October 10, 2003 at 1:39 pm
Hi Matt101,
IGNORE_DUP_KEY will check each attempted insert or update to verify that its key(s) are unique. Rows that violate the constraint are ignored, i.e. they are not inserted into the table; a warning is raised, but the batch proceeds. You will take a performance hit when using this option. Also, you will be unable to acquire a BU (bulk update) lock during bulk inserts if there are indexes on the table.
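If it helps, here is roughly what that behaviour looks like; the table and column names below are just placeholders of my own, not anything from your system:

-- Placeholder table and key column, for illustration only
CREATE TABLE wrkVendor (KDNR char(6) NOT NULL, VendorName varchar(50) NULL)

-- Unique index that discards duplicate keys instead of failing the batch
CREATE UNIQUE INDEX ix_wrkVendor_KDNR ON wrkVendor (KDNR) WITH IGNORE_DUP_KEY

INSERT INTO wrkVendor (KDNR, VendorName) VALUES ('000001', 'Acme')
-- Inserting the same key again raises the "Duplicate key was ignored." warning;
-- the row is not inserted, but the batch keeps running.
INSERT INTO wrkVendor (KDNR, VendorName) VALUES ('000001', 'Acme')

SELECT * FROM wrkVendor  -- returns a single row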
Deleting the duplicate records after the fact is an option. Depending on the duplicates (row size & quantity), the delete may exercise the log dramatically.
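For example, something along these lines (just a sketch, reusing the wrkVM3/KDNR names from the cursor post above, and assuming you are free to add and drop a helper column):

-- Add a temporary identity column so otherwise identical rows can be told apart
ALTER TABLE wrkVM3 ADD rowid int IDENTITY(1,1)

-- Keep the first row per key and delete the rest (fully logged, so watch the log on big loads)
DELETE FROM wrkVM3
WHERE rowid NOT IN (SELECT MIN(rowid) FROM wrkVM3 GROUP BY KDNR)

-- Remove the helper column when done
ALTER TABLE wrkVM3 DROP COLUMN rowid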
Another option might be to allow the duplicates. Following the load, put an index on the table and build a view that uses DISTINCT for accessing the data in the table.
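Something like this, again with made-up names:

-- View that hides the duplicates from readers (names are placeholders)
CREATE VIEW vVendorClean
AS
SELECT DISTINCT KDNR
FROM wrkVM3
GO

-- Queries go against the view instead of the base table
SELECT KDNR FROM vVendorClean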
And another option might be to load the data into a staging table and then use SELECT DISTINCT to insert into the final table. Remember to drop or truncate your staging table once done. Bulk loading without indexes & using the “tablock” hint will be your fastest load option.
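Roughly like this; the file path, terminators, and table names are placeholders you would replace with your own:

-- Staging table with no indexes so the bulk load stays fast
CREATE TABLE stgVendor (KDNR char(6) NOT NULL, VendorName varchar(50) NULL)

BULK INSERT stgVendor
FROM 'C:\feeds\vendor.txt'
WITH (TABLOCK, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')

-- Move only the distinct rows into the final table, then clear the stage
INSERT INTO Vendor (KDNR, VendorName)
SELECT DISTINCT KDNR, VendorName
FROM stgVendor

TRUNCATE TABLE stgVendor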
-- “it depends” on your data (types, columns, record count)
Regards,
October 10, 2003 at 1:53 pm
Matt,
I've had to do something similar, and the staging table approach worked best for me. It's very clear, easy to validate, and easy to clean up.
-Kel
K Leb
October 10, 2003 at 5:18 pm
I've used the stage table technique for a long time to weed out dupes. It is by far the easiest way to go. Plus you can archive the stage table for safekeeping in case you need to redo the import or validate that you got the correct data. I'm actually in the middle of exactly this scenario right now as I upgrade my database to no longer use managed identities. Such fun! 🙂
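In case it is useful, archiving the stage table can be as simple as something like this (table names are just examples of mine):

-- Snapshot the stage table into a dated archive copy before clearing it
SELECT *
INTO stgVendor_archive_20031010
FROM stgVendor

TRUNCATE TABLE stgVendor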
Gary Johnson
Microsoft Natural Language Group
DBA, Sr. DB Engineer
This posting is provided "AS IS" with no warranties, and confers no rights. The opinions expressed in this post are my own and may not reflect that of my employer.