November 19, 2009 at 11:08 am
My SSIS package has a flat file source with an eventual flat file destination output.
I'm looking for suggestions on the best approach to identify duplicate records in the flat file source. Please note that I still want to allow those duplicate records into my destination file, but I also want the package to identify the duplicates and spit them out into a separate output file for review.
November 19, 2009 at 11:18 am
I always use the rownumber - partition query to remove duplicates .
Get all the data to a temporary table. use the row-number partition query and conditional split to seperate original and duplicates
November 19, 2009 at 11:21 am
Thank you for your reply.
However, I'm not looking to "remove" duplicates through a conditional split. I just want to identify them and export them to some kind of output for review, all while still allowing them to show up in the destination file.
November 19, 2009 at 11:34 am
U can use conditional split to split the data into two seperate destinations based on a conditions.
get the data from the file into a temporary table.
use a select query like
Select
Row_Number() Over(Partition By column1,column2,column3.. Order By column1,column2,column3.. ) As 'RowNum',*
from dbo.tmp_table
In partition by use the column based on which u want the duplicates to be decided.
use order to order records so that the top one is considered original.
(If you are dealing with true duplicates( all columns are duplicated) you can use anything in partition by and order by)
then add a conditional split. use rownum=1 in one condition and rownum>1 as other.
your duplicates will be seperated.
November 19, 2009 at 11:40 am
A variant of the temp table method is probably still the easiest for you.
Bring all the data into a temp table, run a query to identify & process the dups as you wish, then output the contents of the temp table to the flat file.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
November 19, 2009 at 12:43 pm
Since my goal was to never remove the duplicates, I realized that the Multicast transformation was the best fit for my needs.
I have one Multicast output that sends ALL data (including duplicates) to my destination file and another Multicast output that sends DUPLICATE DATA ONLY to a "review" file. Thanks to those that took time to reply.
November 19, 2009 at 2:26 pm
Well done on solving that one by yourself. My solution would have worked too though - at no point did I mention deleting dups, all I said was 'processing' them, by which I meant a query & output results to file. If your input file is large, this method is likely to be faster too.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply