SSIS upsert from CSV File Source

  • I'm trying to determine the most effective way to do an upsert in SSIS with a CSV source (100+ rows).

    1 x CSV source (looping through multiple CSVs with the same structure, 100+ columns)
    1 x staging table destination

    I have come up with 3 approaches:

    1) Using the SSIS Lookup component: if the record doesn't exist, insert it; otherwise call a SQL script to do the update. This runs row by row, so it could be slow.

    2) Calling a stored procedure with a MERGE statement, passing each CSV row to the sproc. This is also row by row (so slow), and there would be 100+ parameters.

    3) Using another staging table that gets cleared out before the insert, then calling a sproc to merge the two staging tables.

    I'm leaning toward approach 3 as it's the simplest and cleanest.

    Thoughts?

  • I would go with a staging table, and upload the contents of your CSV file(s) to it. SSIS can do this part easily using a data flow task.

    If applicable, use an Execute SQL Task to clean the data, probably inserting it into a second staging table. This step may not be required; however, when importing data from a CSV you're likely pulling every field through as a string (unless you have transformations in your data flow). Plus, depending on the source, the data could be quite "dirty": odd spellings, dates in various formats (e.g. one row has 01-May-2016 while another has 01/05/2016), and so on. You therefore probably want to get those conversions out of the way if you haven't already, as it makes your UPDATE much easier (matching on dates, numbers, or even nvarchars against a varchar isn't quick if you have a lot of records) and lets you clean those records up to something standardised.
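
    As a minimal sketch of that conversion step (the table and column names here are hypothetical, assuming the raw staging table holds every field as a string):

        INSERT INTO dbo.Staging_Typed (CustomerID, OrderDate, Amount)
        SELECT
            TRY_CONVERT(int, CustomerID),            -- non-numeric IDs become NULL
            COALESCE(
                TRY_CONVERT(date, OrderDate, 103),   -- dd/mm/yyyy, e.g. 01/05/2016
                TRY_CONVERT(date, OrderDate)         -- fallback for formats like 01-May-2016
            ),
            TRY_CONVERT(decimal(18, 2), Amount)
        FROM dbo.Staging_Raw;

    TRY_CONVERT returns NULL when a value won't parse, so chaining attempts with COALESCE copes with mixed date formats, and any rows still NULL afterwards can be treated as rejects.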

    Then, using an Execute SQL Task, use SQL to do the decision making on whether it should be an INSERT or an UPDATE. You could even use the MERGE syntax if you prefer it. I find that SQL is much better at this kind of decision making, and much more forgiving.
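
    For instance, a MERGE along these lines does the whole upsert in one set-based statement (target table, key, and columns again hypothetical):

        MERGE dbo.TargetTable AS tgt
        USING dbo.Staging_Typed AS src
            ON tgt.CustomerID = src.CustomerID       -- the business key
        WHEN MATCHED THEN
            UPDATE SET tgt.OrderDate = src.OrderDate,
                       tgt.Amount    = src.Amount
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (CustomerID, OrderDate, Amount)
            VALUES (src.CustomerID, src.OrderDate, src.Amount);

    Being set-based, this avoids the row-by-row cost of options 1 and 2 in the original post.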

    Obviously, don't forget to clear out your staging tables either at the start or the end of your package; otherwise you'll end up duplicating your data each time you run the process. 🙂

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A - Friday, March 17, 2017 4:46 AM


    Great, yes, I think having the two staging tables is the way to go.
    The first staging table will get cleared down before receiving the CSV content.
    The second staging table I will SQL MERGE with the first, taking any new or updated content; a sketch of that sproc is below.
    Then various other processes can pick up any new/updated content from this second staging table.
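
    As a rough sketch of how that merge sproc might look (table, key, and column names are placeholders; the real MERGE would list all 100+ columns):

        CREATE PROCEDURE dbo.usp_MergeStagingTables
        AS
        BEGIN
            SET NOCOUNT ON;

            -- Take new or changed rows from the load table (Staging1)
            -- into the persistent staging table (Staging2)
            MERGE dbo.Staging2 AS tgt
            USING dbo.Staging1 AS src
                ON tgt.BusinessKey = src.BusinessKey
            WHEN MATCHED THEN
                UPDATE SET tgt.Col1 = src.Col1       -- repeat for the remaining columns
            WHEN NOT MATCHED BY TARGET THEN
                INSERT (BusinessKey, Col1)
                VALUES (src.BusinessKey, src.Col1);

            -- Clear the load table so the next package run starts empty
            TRUNCATE TABLE dbo.Staging1;
        END;

    An Execute SQL Task after the data flow would then just run EXEC dbo.usp_MergeStagingTables;.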
