Avoiding duplication within the same file

  • I have an SSIS package set up which processes about 50 files, each containing anywhere from 4.5MM to 6.2MM records. I'm importing these records into a highly normalized database; for example, there is a table for first/middle names and each record should be a unique pairing. In other words, there would be only one record in this table for "John Paul", no matter how many million people from our files had this as their first and middle name.

    I had originally set up SSIS to call a generic OLE DB Command, which fired a SProc to determine if the name pair already exists, e.g.,

    "IF NOT EXISTS (SELECT PK_FirstMiddleName FROM FirstMiddleNames WHERE FirstName = @FirstName AND MiddleName = @MiddleName)", and then do the insert only when no match was found. This definitely helped filter out any dupes but, for obvious reasons, it was VERY slow, especially when it came to normalizing things like parsed-out address info.
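    For reference, the row-by-row pattern described here looks roughly like the following sketch; the procedure name, column types, and table layout are illustrative, since the poster's actual code isn't shown:

    ```sql
    -- Illustrative sketch of the lookup-then-insert SProc fired per row by
    -- the OLE DB Command. Names and data types are assumptions.
    CREATE PROCEDURE dbo.InsertFirstMiddleName
        @FirstName  varchar(50),
        @MiddleName varchar(50)
    AS
    BEGIN
        IF NOT EXISTS (SELECT PK_FirstMiddleName
                       FROM dbo.FirstMiddleNames
                       WHERE FirstName = @FirstName
                         AND MiddleName = @MiddleName)
        BEGIN
            INSERT INTO dbo.FirstMiddleNames (FirstName, MiddleName)
            VALUES (@FirstName, @MiddleName);
        END
    END
    ```

    Because this runs once per incoming row, every record pays for a lookup round-trip, which is why it crawls on multi-million-row files.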

    I then determined that it would be quicker to change it back to a regular OLE DB destination (INSERT) and then put an INSTEAD OF INSERT trigger on the tables that essentially did the same thing. I was right, it was a hell of a lot faster. Unfortunately, there's some oddity occurring where, out of my 6MM records, I'm ending up with about 50 duplicate pairs of first/middle names. I made sure that the OLE DB destination was set to FIRE_TRIGGERS and, in fact, even processed the same file again to make sure that it didn't insert any more records, and it didn't. So the duplicate check is working in the trigger; it just seems to be something peculiar about the first run. I would chalk this up to the way that I'm parsing the names inside of my data transformation task, but when I query the data afterwards, such as:

    SELECT COUNT(*), FirstName, MiddleName FROM FirstMiddleNames GROUP BY FirstName, MiddleName HAVING COUNT(*) > 1

    it definitely sees them as duplications and will return the same 50 or so records as having COUNT(*) = 2. So I would have to assume that if SQL sees them as duplicates after they're in the database, it should be seeing them as duplicates at the INSTEAD OF INSERT trigger point... or would it?

    I dunno. If anyone has any ideas, let me know. And please, no critical comments. If you can't be helpful, don't respond.

  • It might have to do with the fact that the trigger is acting on a large batch of rows: the existence check runs against rows already in the table, so if the batch itself contains duplicate pairs that don't yet exist in the table, both copies pass the test and both get inserted.
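    A minimal sketch of a trigger that covers both cases, de-duplicating within the batch (the `inserted` pseudo-table) as well as against existing rows; the trigger body is an assumption, since the poster's actual trigger isn't shown:

    ```sql
    -- Hypothetical INSTEAD OF INSERT trigger. DISTINCT removes duplicates
    -- that arrive within the same batch; NOT EXISTS filters out pairs that
    -- are already in the table. An INSTEAD OF trigger's own INSERT against
    -- the same table does not re-fire the trigger.
    CREATE TRIGGER trg_FirstMiddleNames_Insert
    ON dbo.FirstMiddleNames
    INSTEAD OF INSERT
    AS
    BEGIN
        INSERT INTO dbo.FirstMiddleNames (FirstName, MiddleName)
        SELECT DISTINCT i.FirstName, i.MiddleName
        FROM inserted AS i
        WHERE NOT EXISTS (SELECT 1
                          FROM dbo.FirstMiddleNames AS t
                          WHERE t.FirstName = i.FirstName
                            AND t.MiddleName = i.MiddleName);
    END
    ```

    A trigger that only checks NOT EXISTS against the base table would let in-batch duplicates through, which matches the ~50 stray pairs described above.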

    Have you thought of putting a unique index on the key columns and using IGNORE_DUP_KEY = ON?

    From BOL:

    Specifies the error response when an insert operation attempts to insert duplicate key values into a unique index. The IGNORE_DUP_KEY option applies only to insert operations after the index is created or rebuilt. The option has no effect when executing CREATE INDEX, ALTER INDEX, or UPDATE. The default is OFF.

    ON

    A warning message will occur when duplicate key values are inserted into a unique index. Only the rows violating the uniqueness constraint will fail.

    OFF

    An error message will occur when duplicate key values are inserted into a unique index. The entire INSERT operation will be rolled back.
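    Applied to the table in this thread, the suggestion would look something like this (index name assumed):

    ```sql
    -- With IGNORE_DUP_KEY = ON, duplicate pairs in an insert batch are
    -- silently dropped with a warning instead of rolling back the whole
    -- INSERT, so no trigger is needed for de-duplication.
    CREATE UNIQUE NONCLUSTERED INDEX UX_FirstMiddleNames_FirstMiddle
    ON dbo.FirstMiddleNames (FirstName, MiddleName)
    WITH (IGNORE_DUP_KEY = ON);
    ```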

    jg

  • You know, I figured it probably had to do with duplicates being in the same batch. Anyhow, that's a terrific idea, thanks!

  • I'd load all the data into staging tables and then do an outer join to load new rows.

    SELECT DISTINCT
        A.firstname,
        A.middlename
    FROM stagingtable AS A
    LEFT JOIN firstnamemiddlename AS FM
        ON A.firstname = FM.firstname
        AND A.middlename = FM.middlename
    WHERE FM.firstname IS NULL

    Or you can do it similarly using a WHERE NOT EXISTS instead of a LEFT JOIN.
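    The NOT EXISTS variant of the same "new rows only" load, using the table names from the example above:

    ```sql
    -- Insert only staged name pairs that are not already in the target
    -- table; DISTINCT guards against duplicates within the staging data.
    INSERT INTO firstnamemiddlename (firstname, middlename)
    SELECT DISTINCT A.firstname, A.middlename
    FROM stagingtable AS A
    WHERE NOT EXISTS (SELECT 1
                      FROM firstnamemiddlename AS FM
                      WHERE FM.firstname = A.firstname
                        AND FM.middlename = A.middlename);
    ```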
