Redirect duplicate records using ssis

  • Hi,

    We have a requirement like the source is excel and there are duplicates in the file we need to redirect the duplicates into one table and the unique records will be in another table. There may be 20+ columns in the excel sheet.

  • Easiest option: dump records in a staging table. Then select again and use the ROW_NUMBER function to indicate duplicates. Everything with row number 1 goes into the unique table, all other row number go to the duplicate table.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • We should not use a staging table while loading from source we should pull unique records in single table and the duplicate in next table without using row_number,rank & dense rank.

  • nhimabindhu (12/4/2014)


    We should not use a staging table while loading from source we should pull unique records in single table and the duplicate in next table without using row_number,rank & dense rank.

    Why not?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Is there any other approach other than this which we achieve the result as per requirement.:-)

  • If we don't understand your requirement, it's difficult to advise you. Such arbitrary stipulations make this sound a bit like a homework assignment. So, what makes you so hostile to staging tables and windowing functions?

    John

  • nhimabindhu (12/4/2014)


    Is there any other approach other than this which we achieve the result as per requirement.:-)

    If your requirement is such that terrible performance is acceptable, there is. Are you happy to accept performance which is quite likely 10x worse than Koen's suggestion?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Well, I don't know how big your spreadsheets are, so don't know what the performance will be like for you with this, but how about you have two Excel Source data flow source components in the package. One for the duplicates and one for the unique records. Set the data access mode to each to be SQL command and then have the SQL command text like the two examples below (obviously you would need to add the extra columns and maybe change the sheet name).

    Duplicate rows

    SELECT F1,F2,F3,F4,F5

    FROM `Sheet1$`

    GROUP BY F1,F2,F3,F4,F5

    HAVING COUNT(*) > 1

    Unique rows

    SELECT F1,F2,F3,F4,F5

    FROM `Sheet1$`

    GROUP BY F1,F2,F3,F4,F5

    HAVING COUNT(*) = 1

    Personally, I'd still go with the staging table suggestions made earlier.

  • And if it *is* a homework problem, a quick google search produces some approaches that use SSIS transforms to handle this problem.

    Steve.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply