Can SSIS import a CSV and update a table all in one Data Flow process?

  • Is this even possible with SSIS? I'm very new to it, but I can see there's a wealth of objects in the toolbox.

    I've created a connection and a Flat File Source. What I need to do next is:

    1. Do some basic validation of the data. Would that be the Data Conversion object or ???

    2. Update one table with data, matching on the table's primary integer key, which is one of my imported data fields. Is there an object that will let me do that via a SQL script or some other mechanism?

    Can anyone suggest which objects I ought to look at to accomplish these tasks? All I need is some high-level help to nudge me toward the basic data flow model.

    Thanks in advance!

    Sigerson

    "No pressure, no diamonds." - Thomas Carlyle

  • Updates aren't done in a particularly efficient fashion in SSIS.

    SSIS functions as a pipeline stream, with rows being read from the source and fed to the transformations and then the destination. Normal transformations are non-blocking, which means the rows flow out almost as fast as they flow in.

    Putting an update operation into the flow effectively turns the transformations into a blocking set, which defeats the main benefit of SSIS.

    That being said, it's not necessarily a bad thing to do.

    You could take your flat file source, read it in, and then add either a Derived Column transformation or a Conditional Split transformation, depending on what you want to happen to the rows that don't pass validation. The Derived Column option lets you keep a failing row and substitute other values for the fields that fail the checks, while the Conditional Split lets you drop the rows that don't pass validation.
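    For example, a Conditional Split condition that catches rows with a missing or non-positive key might look like the line below (ImportKey is a made-up column name; the syntax is SSIS's own expression language):

        ISNULL(ImportKey) || ImportKey <= 0

    Rows matching that condition go down a "bad rows" output, and everything else carries on to the destination.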

    For the update part, you have two options. You can use the OLE DB Command transformation to run a statement, such as an UPDATE, once for each row. The problem with this is that if your source is large, you could be looking at a very large number of individual updates, and if there's any sort of locking from other sources going on, your process could take a very long time.

    Your other option, which is probably the recommended approach, is to take your flat file source, pipeline the records through, and land them in a staging (temp) table. Then use an Execute SQL Task to perform a single set-based UPDATE, joining your staging table to the target table on the primary key field.
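    As a rough sketch, assuming a staging table named dbo.Staging_Upload and a target table keyed on an integer TargetID (all of these names are invented for illustration), the Execute SQL Task statement would look something like:

        UPDATE t
        SET    t.SomeColumn = s.SomeColumn
        FROM   dbo.TargetTable AS t
        INNER JOIN dbo.Staging_Upload AS s
                ON s.TargetID = t.TargetID;

    That's one set-based statement instead of one round trip per imported row.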

  • kramaswamy,

    Thank you for all that practical advice. I'm taking your advice to use the temp table and do the update with a SQL statement. One more question, please. Looking at the Execute SQL Task, it's not clear whether I can use it to fire a stored procedure (which would give me more control). Is that possible? Or do I embed the whole text of the s/p, including cursor logic, in the Enter SQL Query box?

    Sigerson

    "No pressure, no diamonds." - Thomas Carlyle

  • The Execute SQL Task can be used either way. You can call the stored proc, or you can paste in the code directly, whichever you prefer.
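    For example, with an OLE DB connection you'd set the SQLStatement to a call like the one below (the procedure name is made up) and map any inputs on the Parameter Mapping page, with ? as the placeholder for each parameter:

        EXEC dbo.usp_UpdateFromUpload ?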

  • I particularly like this sentence:

    ... which means the rows flow in almost as fast as they flow out.

    presumably creating some sort of data vacuum which may implode the pipeline?

    😀

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Lol - okay, got those two reversed, but you get the gist of it 😛

  • All right contestants, here's a 5-point tossup for anyone:

    Let's say I want to import a CSV file with this process every day, and this file comes to me with the date as part of its name, as in: "Upload_20111221.csv"

    Every day, I can copy the new day's file to something like "DailyUpload.csv" and hardwire my Flat File Source connection manager to look only for that filename.

    But I'm incurably lazy--I'd like to automate this process. Is there a way to add a scriptlet or function that would modify the filename expected by the connection manager on the fly?

    For example, today the SSIS package would somehow "know" to open the CSV file named "Upload_20111221.csv". Is this even possible? Can SSIS jobs be run on a schedule automatically, like SQL Server jobs?

    Sigerson

    "No pressure, no diamonds." - Thomas Carlyle

  • Yes to all counts 😛

    SSIS has a wonderful feature called "Variables". They work pretty similarly to variables in normal programming environments.

    What you can do is have a variable which is set through an Expression. For example, if you know your file names are always going to be "Upload_[yyyymmdd].csv", then you can make a variable called FileName and have it evaluated as an expression. The expression would be:

    "Upload_" + REPLACE(SUBSTRING((DT_WSTR, 30) GETDATE(), 1, 10), "-", "") + ".csv"

    Set your flat file connection manager up with a normal file first, since it needs a template. Then select the connection manager, find the Expressions property in the Properties window, choose the ConnectionString property, and set it to @[User::FileName].

    Then you're set!

    As for the schedule, you can schedule SSIS packages to run through SQL Server Agent. There are a few ways to do it: you can choose SSIS as the job step type and run the package directly by passing it the .dtsx file name, or you can run it through a command-line step using DTEXEC. Just note that if you want to schedule it, you'll need to change the ProtectionLevel of the package, since the default of EncryptSensitiveWithUserKey won't work when a SQL Server Agent account executes it. I usually use EncryptAllWithPassword.
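    For example, from a command line (or a CmdExec job step), something along these lines runs the package and supplies the password for an EncryptAllWithPassword package -- the path and password here are placeholders:

        dtexec /File "C:\SSIS\DailyUpload.dtsx" /Decrypt "yourPassword"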

  • Thanks, kramaswamy, this is very cool stuff; I would have spent a week putting this all together the way you've laid it out. Thanks for the advice on running a job, and also thanks for including that filename code for me. I know how to do that, but your way is more elegant.

    Sigerson

    "No pressure, no diamonds." - Thomas Carlyle

  • No probs. Let us know if everything works out properly!

  • I don't think that the solution proposed is necessarily the best one, as it depends heavily on you being able to derive the name of the file.

    What I would do is use a Foreach loop with a file spec of *.csv in your chosen folder.

    This will loop around all of the CSV files in the nominated folder (and its sub-folders too, if needed). As part of the looping, the name of the file currently being processed is assigned to a variable. You can use that variable when specifying your data source location -- you don't ever even need to know the file's name.
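    A minimal sketch of the Foreach Loop setup, assuming the files land in C:\Uploads and using a made-up variable name:

        Collection page:
            Enumerator:          Foreach File Enumerator
            Folder:              C:\Uploads
            Files:               *.csv
            Retrieve file name:  Fully qualified
        Variable Mappings page:
            User::CurrentFile    Index 0

    Then point the flat file connection manager's ConnectionString expression at @[User::CurrentFile], just as with the date-based expression above.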

    Of course, you'll need to archive the files somewhere after processing (so that they don't get picked up repeatedly), but that's just good practice anyway.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Yeah - also a good solution, although that one depends on the fact that the only files in that folder will be the ones you want to process. If you need to do any filtering, then you're back to the first solution 😛

  • It's a good point about how to manage the volume of files in that folder. These are files that can be discarded after upload, but it's a problem that could well surface over a long weekend or after a disaster.

    Sigerson

    "No pressure, no diamonds." - Thomas Carlyle

  • Capt. Sigerson (12/21/2011)


    It's a good point about how to manage the volume of files in that folder. These are files that can be discarded after upload, but it's a problem that could well surface over a long weekend or after a disaster.

    One common solution is to have an \archive folder and to move files there once they are processed -- possibly adding the processed date to the file name as part of the move (for example, xyz.csv becomes 2011-12-21xyz.csv) -- just in case you ever need to refer back.
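    For instance, a File System Task doing the move could compute its destination from an expression like this one, in the same vein as the filename expression earlier in the thread (ArchiveFolder and FileName are hypothetical variables holding the archive path and the bare file name):

        @[User::ArchiveFolder] + "\\" + SUBSTRING((DT_WSTR, 30) GETDATE(), 1, 10) + @[User::FileName]

    which turns xyz.csv into something like 2011-12-21xyz.csv inside the archive folder.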

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • If anybody's still listening, I've run into a small wall in SSIS. I get a quote-delimited CSV file that I want to upload to a SQL Server temp table.

    I'm using a Flat File Source and a SQL Server Destination, but I'd like to convert three of the columns from numeric strings into integers before they go into the database. Is there an object I can insert in the data flow stream to do that?

    Thanks.

    Sigerson

    "No pressure, no diamonds." - Thomas Carlyle
