Identifying Duplicates in Flat File Source

  • My SSIS package has a flat file source with an eventual flat file destination output.

    I'm looking for suggestions on the best approach to identify duplicate records in the flat file source. Please note that I still want to allow those duplicate records into my destination file, but I also want the package to identify the duplicates and spit them out into a separate output file for review.

  • I always use the rownumber - partition query to remove duplicates .

    Get all the data to a temporary table. use the row-number partition query and conditional split to seperate original and duplicates

  • Thank you for your reply.

    However, I'm not looking to "remove" duplicates through a conditional split. I just want to identify them and export them to some kind of output for review, all while still allowing them to show up in the destination file.

  • U can use conditional split to split the data into two seperate destinations based on a conditions.

    get the data from the file into a temporary table.

    use a select query like

    Select

    Row_Number() Over(Partition By column1,column2,column3.. Order By column1,column2,column3.. ) As 'RowNum',*

    from dbo.tmp_table

    In partition by use the column based on which u want the duplicates to be decided.

    use order to order records so that the top one is considered original.

    (If you are dealing with true duplicates( all columns are duplicated) you can use anything in partition by and order by)

    then add a conditional split. use rownum=1 in one condition and rownum>1 as other.

    your duplicates will be seperated.

  • A variant of the temp table method is probably still the easiest for you.

    Bring all the data into a temp table, run a query to identify & process the dups as you wish, then output the contents of the temp table to the flat file.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Since my goal was to never remove the duplicates, I realized that the Multicast transformation was the best fit for my needs.

    I have one Multicast output that sends ALL data (including duplicates) to my destination file and another Multicast output that sends DUPLICATE DATA ONLY to a "review" file. Thanks to those that took time to reply.

  • Well done on solving that one by yourself. My solution would have worked too though - at no point did I mention deleting dups, all I said was 'processing' them, by which I meant a query & output results to file. If your input file is large, this method is likely to be faster too.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply