November 19, 2009 at 11:08 am
My SSIS package has a flat file source and, ultimately, a flat file destination.
I'm looking for suggestions on the best approach to identify duplicate records in the flat file source. Please note that I still want to allow those duplicate records into my destination file, but I also want the package to identify the duplicates and spit them out into a separate output file for review.
November 19, 2009 at 11:18 am
I always use a ROW_NUMBER() ... PARTITION BY query to remove duplicates.
Load all the data into a temporary table, then use the ROW_NUMBER partition query and a Conditional Split to separate the originals from the duplicates.
November 19, 2009 at 11:21 am
Thank you for your reply.
However, I'm not looking to "remove" duplicates through a conditional split. I just want to identify them and export them to some kind of output for review, all while still allowing them to show up in the destination file.
November 19, 2009 at 11:34 am
You can use a Conditional Split to route the data to two separate destinations based on a condition.
Get the data from the file into a temporary table, then use a SELECT query like:
Select
Row_Number() Over(Partition By column1, column2, column3... Order By column1, column2, column3...) As RowNum, *
from dbo.tmp_table
In PARTITION BY, use the columns that determine whether two rows count as duplicates.
Use ORDER BY to order the records so that the top row in each group is considered the original.
(If you are dealing with true duplicates, where every column is duplicated, any columns will do in PARTITION BY and ORDER BY.)
Then add a Conditional Split: use RowNum == 1 as one condition and RowNum > 1 as the other.
Your duplicates will be separated.
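A minimal T-SQL sketch of the approach above; the table name dbo.tmp_table comes from the post, but the CustomerID and OrderDate columns are placeholder assumptions - substitute the columns that define a duplicate in your file:

```sql
-- Assumes the flat file has already been loaded into dbo.tmp_table.
-- Rows are treated as duplicates when CustomerID and OrderDate match
-- (placeholder columns; adjust to your file's layout).
SELECT
    ROW_NUMBER() OVER (
        PARTITION BY CustomerID, OrderDate   -- columns that define a duplicate
        ORDER BY CustomerID, OrderDate       -- first row per group is the "original"
    ) AS RowNum,
    *
FROM dbo.tmp_table;
```

In the Conditional Split that follows, an expression like RowNum == 1 routes the originals and RowNum > 1 routes the duplicates (note that SSIS expressions use == for equality, not =).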
November 19, 2009 at 11:40 am
A variant of the temp table method is probably still the easiest for you.
Bring all the data into a temp table, run a query to identify & process the dups as you wish, then output the contents of the temp table to the flat file.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
November 19, 2009 at 12:43 pm
Since my goal was to never remove the duplicates, I realized that the Multicast transformation was the best fit for my needs.
I have one Multicast output that sends ALL data (including duplicates) to my destination file and another Multicast output that sends DUPLICATE DATA ONLY to a "review" file. Thanks to those who took the time to reply.
November 19, 2009 at 2:26 pm
Well done on solving that one by yourself. My solution would have worked too though - at no point did I mention deleting dups, all I said was 'processing' them, by which I meant a query & output results to file. If your input file is large, this method is likely to be faster too.