SSIS 2008 – Read roughly 50 CSV files from a folder, create SQL tables from them dynamically, and load the data.

  • Hello everyone,

    I’ve been assigned a requirement to read around 50 CSV files from a specified folder.

    In step 1, I would like to create the schema for these files: take the CSV files one by one and create a SQL table for each, if it does not already exist at the destination.

    In step 2, I would like to append the data from these 50 CSV files into their respective tables.

    In step 3, I would like to purge data older than a given date.

    Please note that the data in these CSV files will be very bulky; I would like to know the best way to insert bulky data into the SQL tables.

    Also, some of the CSV files will have 4 rows at the top containing the header details/header rows.

    As far as I know, I will be asked to implement this in SSIS 2008, but I'm not 100% sure of that.

    So please feel free to suggest multiple approaches, including ones that achieve these requirements more elegantly in newer versions such as SSIS 2012.

    Any help would be much appreciated.

    Thanks,

    Ankit

  • Have you ever used SSIS before to automatically import multiple identical files of a known structure?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There is going to be no decent way of auto-creating the tables. A major reason for this is that columns in CSV files have no data types, unless you interrogate the data in the files first and then have some sort of datatype-guessing algorithm ready to go. There must be one somewhere, because the Excel OLEDB driver does something similar (and frequently cocks it up).

    Assuming that the tables can be created first, using whatever means, the data import is one or more data flows (depending on the table structures). Every different table structure requires its own data flow.

    For the purging bit, I would invoke a stored proc from SSIS using an Execute SQL task.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • This looks like a typical situation for ELT: load everything as text into a staging area and then do the conversions afterwards.

    😎

  • I think that the points that Phil and Eirikur have brought up are really good ones.

    To amplify, yes, it is possible and really not so difficult to have something auto-magically read through, say, the first ten rows to find the "first longest row", which should be the column header row, interrogate that row for the header names and delimiters, and then have the code build a staging table with all text columns.
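    For illustration, here's a minimal C# sketch of that sniffing step as it might look inside a Script task. Everything here (the comma delimiter, the ten-row window, the names) is an assumption drawn from the description above, not tested package code:

    // Sketch only: treat the "first longest row" among the first ten lines
    // as the column header row and return its trimmed column names.
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class HeaderSniffer
    {
        public static string[] SniffHeader(string path, char delimiter)
        {
            // Read at most ten lines without pulling the whole (bulky) file in.
            List<string> firstTen = new List<string>();
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while (firstTen.Count < 10 && (line = reader.ReadLine()) != null)
                    firstTen.Add(line);
            }

            // OrderByDescending is a stable sort, so among ties for the most
            // fields we keep the earliest line: the "first longest row".
            string header = firstTen
                .OrderByDescending(l => l.Split(delimiter).Length)
                .First();

            return header.Split(delimiter)
                         .Select(name => name.Trim().Trim('"'))
                         .ToArray();
        }
    }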

    The problem is, what do you do with 50 tables that no one knows anything about? Even if you have code that correctly guesses datatypes, who knows what each column in each table actually means? It's like loading an XML file. Yes... relatively easy to split out the elements but what the hell do they mean and how should they be used?

    There has to be a plan here. Someone has to have a record layout of these files in a spreadsheet somewhere, because someone built the files. Someone has to know what these files contain and have a plan for what to do with the data. Use that layout to generate a handful of BCP format files automatically in the spreadsheet (or Word document or text file or whatever); it's much quicker than manual mapping in SSIS.

    Importing 50 unknown content data files into equally obscure tables just doesn't strike me as the right way to do any of this.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hello Chris and Jeff,

    Thank you for the responses.

    The data type for every column of all 50 files will always be varchar(255).

    The requirement is that the client would like to create the schema for all 50 files via the SSIS package itself, because the staging tables change from client to client.

    The details of each column for the respective CSV files are also specified in a PDF file.

  • ankithimmatlalshah (4/20/2015)


    Hello Chris and Jeff,

    Thank you for the responses.

    The data type for every column of all 50 files will always be varchar(255).

    The requirement is that the client would like to create the schema for all 50 files via the SSIS package itself, because the staging tables change from client to client.

    The details of each column for the respective CSV files are also specified in a PDF file.

    Whoever Chris is, I see no posts from him in this thread.

    Are we supposed to understand what you want, based on "The details of each column for the respective CSV files are also specified in a PDF file"? You want the package to read a PDF file?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Hello Phil,

    I'm really sorry, I meant you.

    I wrote Chris instead of Phil in my previous reply.

    The client has provided one PDF file that contains the details of each column for every CSV file.

    Sorry for the confusion; I do not want to read the PDF file via the SSIS package.

    Actually, I'm planning to prepare the SQL scripts manually, based on those details, to generate the tables in the staging database.

    I'm going to propose the following plan to the customer:

    Step 1 – a SQL script task, which will prepare the schema for the 50 CSV files.

    Step 2 – a Foreach Loop container, which will pick up each CSV file and insert/append its data into the respective SQL table.

    I think I will need to consider various variables, roughly 50 of them, to manage things dynamically.

    Please advise if you know any other, more elegant way to achieve the aforementioned steps.

    Thanks,

    Ankit

  • I have done this with a script task, but it depends on the file structures and naming conventions.

    In my case, I got a zip file with an unknown number of CSV files.

    They all contained headers, so I knew the column names, and I grouped the files based on a known quality of my files: they followed the pattern {name}_{date}.csv, so my code creates a DataTable for each {name} and then bulk inserts it, using the name as the table name; my process would either create the table or append to it if it existed.

    So say the zip file contained Invoices_20150401.csv, Invoices_20150402.csv, Customers_20150401.csv, and Products_20150401.csv.

    My process would create three tables, Invoices, Customers and Products, dynamically, and multiple files following the same naming convention get appended to the right table.

    I just inserted into a staging table and allowed everything to go in as strings, so downstream I had to cast varchar to money or int or bit or whatever was appropriate.

    Lemme see if I can find a decent example.

    The problem with this process is, say the first file creates Customers and the longest length of the current customer name is varchar(17)... that's what gets created, so the next day any customer with a longer name might raise an error.
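    In the meantime, here's a rough C# sketch of the shape of it. The folder, connection string and naive comma split are placeholders, and it assumes the staging tables already exist as all-varchar (which sidesteps the varchar(17) problem above at the cost of fixed widths):

    // Sketch only: one all-string DataTable per file, bulk copied into a
    // staging table named after the part of the file name before the first
    // underscore, e.g. Invoices_20150402.csv -> Invoices.
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using System.Linq;

    class CsvStagingLoader
    {
        public static void LoadFolder(string folder, string connectionString)
        {
            foreach (string file in Directory.GetFiles(folder, "*.csv"))
            {
                string tableName = Path.GetFileNameWithoutExtension(file).Split('_')[0];

                // Fine for a sketch; stream the file instead if it is truly bulky.
                string[] lines = File.ReadAllLines(file);
                if (lines.Length == 0)
                    continue;

                DataTable table = new DataTable(tableName);
                foreach (string col in lines[0].Split(','))
                    table.Columns.Add(col.Trim(), typeof(string));
                foreach (string line in lines.Skip(1))
                    table.Rows.Add(line.Split(',')); // naive split: no quoted commas

                using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString))
                {
                    bulk.DestinationTableName = tableName;
                    bulk.WriteToServer(table);
                }
            }
        }
    }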

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Hello Lowell,

    Thank you very much for the response.

    I have the same type of requirement for this client as you described, and I'm going to handle it in more or less the same manner.

    I'm going to use a Script task to read the header columns of the CSV files and prepare one SSIS variable holding the SQL script that creates the required table, with an IF EXISTS check, inside the Script task itself.

    An "Execute SQL task" will follow the Script task, and this will create the actual table for a CSV.

    Both of these components will be inside a Foreach Loop container, processing all 50 CSV files one by one.
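    For illustration, here is a rough C# sketch of the DDL string the Script task could hand to the Execute SQL task; the class and method names are made up, and every column is varchar(255) as mentioned earlier in the thread:

    // Sketch only: build an idempotent CREATE TABLE statement from the CSV
    // header names. The dbo schema and NULLable columns are assumptions.
    using System.Linq;

    class DdlBuilder
    {
        static string Quote(string name)
        {
            return "[" + name.Replace("]", "]]") + "]";
        }

        public static string BuildCreateTable(string tableName, string[] headers)
        {
            string columns = string.Join(", ",
                headers.Select(h => Quote(h) + " VARCHAR(255) NULL").ToArray());

            return "IF OBJECT_ID(N'dbo." + Quote(tableName) + "', N'U') IS NULL " +
                   "CREATE TABLE dbo." + Quote(tableName) + " (" + columns + ");";
        }
    }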

    Some points to be clarified:

    1. Among this bunch of 50 CSV files there will be some exceptions, for which we first need to purge the tables and then insert the data. That is, for 2 files out of the 50 we need to first clean the tables and then perform the data insert, while the other 48 files should be appended to on a daily basis. Can you please advise on the best way to achieve this? Where should we configure such exceptional cases in the package?

    2. For some of the CSVs we will have more than one file with the same name. For example, the 2nd file out of the 50 is divided into 10 different CSV files, so in total we have 60 files, of which 10 share a repeated file name. How can we manage this within the same loop? Do we need another Foreach loop inside the parent one? What is the best way to achieve this requirement?

    3. There will be another package, which will be used to purge data from the SQL tables. Unlike the above package, this one will not run on a daily basis; at some point we would like these 50 tables to be purged on an "older than" criterion, say remove data older than 1st Jan 2015. What is the best way to achieve this requirement?

    Please note, I'm very new to the SSIS world and would like to develop these packages for the client using best package-development practices.

    Any help would be greatly appreciated.

  • There are a lot of questions here, and several of them require detailed answers in their own right.

    But for me, one thing really stands out. Best practice is not to create the tables dynamically as part of the package. Create them in a separate script. Then let the package run in the knowledge that the tables already exist.

    Second: if the package is creating tables, they will not already contain data. Yet you mention purging and appending data. Either you are creating the tables, or not.

    Third: multiple files in a folder cannot have the same name. Your point (2) is impossible. The names may, of course, be similar.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Hello Phil,

    Thank you for the quick response.

    On the first point, I will reconfirm with the customer whether we can have a SQL script that first creates all the tables at the destination, and then a Foreach loop to migrate the data from the CSV files into the SQL tables.

    On the second point, the tables will be created the very first time, if they are not present; on subsequent runs (daily) the package will append data for roughly 48 files, while for 2 files it will purge some existing data, if present, and then append the new data. The purge criteria are yet to be discussed with us.

    On the third point, by "same file name" I meant the same name for a particular CSV file out of the 50, with an incremental number: if I have history.csv, the complete data will be divided into 10 different files, like history_01.csv, history_02.csv and so on. Sorry for the confusion.

    Please feel free to ask if any of the points are still not clear.

  • ankithimmatlalshah (4/21/2015)


    Hello Phil,

    Thank you for the quick response.

    On the first point, I will reconfirm with the customer whether we can have a SQL script that first creates all the tables at the destination, and then a Foreach loop to migrate the data from the CSV files into the SQL tables.

    On the second point, the tables will be created the very first time, if they are not present; on subsequent runs (daily) the package will append data for roughly 48 files, while for 2 files it will purge some existing data, if present, and then append the new data. The purge criteria are yet to be discussed with us.

    On the third point, by "same file name" I meant the same name for a particular CSV file out of the 50, with an incremental number: if I have history.csv, the complete data will be divided into 10 different files, like history_01.csv, history_02.csv and so on. Sorry for the confusion.

    Please feel free to ask if any of the points are still not clear.

    OK, things are becoming clearer.

    First point: agreed, though 'migrate' is probably better described as 'import'.

    Second: for the two special files, I would suggest calling a stored proc (via an Execute SQL task) after doing the load, which performs the purge; there's a sketch of the idea after these points.

    Third: a foreach loop should take care of all of the files.
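    For illustration only, here is a C# sketch of what that purge step amounts to. dbo.PurgeStaleRows and its parameters are invented names; in the package itself this would be an Execute SQL task running EXEC dbo.PurgeStaleRows ?, ? with two parameter mappings, not hand-written code:

    // Sketch only: call a (hypothetical) purge proc with a cutoff date,
    // exactly what an Execute SQL task would do after the load.
    using System;
    using System.Data;
    using System.Data.SqlClient;

    class PurgeStep
    {
        public static void Purge(string connectionString, string tableName, DateTime cutoff)
        {
            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlCommand cmd = new SqlCommand("dbo.PurgeStaleRows", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.AddWithValue("@TableName", tableName);
                cmd.Parameters.AddWithValue("@CutoffDate", cutoff);
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }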

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • ankithimmatlalshah (4/20/2015)


    Hello Phil,

    I'm really sorry, I meant you.

    I wrote Chris instead of Phil in my previous reply.

    The client has provided one PDF file that contains the details of each column for every CSV file.

    Sorry for the confusion; I do not want to read the PDF file via the SSIS package.

    Actually, I'm planning to prepare the SQL scripts manually, based on those details, to generate the tables in the staging database.

    I'm going to propose the following plan to the customer:

    Step 1 – a SQL script task, which will prepare the schema for the 50 CSV files.

    Step 2 – a Foreach Loop container, which will pick up each CSV file and insert/append its data into the respective SQL table.

    I think I will need to consider various variables, roughly 50 of them, to manage things dynamically.

    Please advise if you know any other, more elegant way to achieve the aforementioned steps.

    Thanks,

    Ankit

    Is there any duplication of the format and column names among these 50 files?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • OK, here's a link to an example package I put together for you:

    GenericCSVDynamicImport.zip

    It's in Visual Studio 2008, since you posted about SSIS 2008, but I've got the same thing written in VS2012 if needed.

    The usual caveats about downloading strange code from the internet apply, of course, but this should give you a decent example to get started.

    Just point it at your sandbox and a folder full of files, and step through the code so you understand it.

    Like I said, the script task processes the whole folder full of files and organizes them into tables based on an underscore; no underscore = each file gets its own table.

    There's also an OverRideTableName variable which, if populated, forces all files into a single target table; yet another option.
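    In rough C#, the naming rule boils down to something like this (a sketch; the method and variable names are illustrative, not the actual code in the zip):

    // Sketch of the file-to-table naming rule: a populated OverRideTableName
    // wins; otherwise everything before the first underscore; a file with no
    // underscore keeps its whole base name as its own table.
    using System.IO;

    class TableNamer
    {
        public static string TargetTable(string fileName, string overrideTableName)
        {
            if (!string.IsNullOrEmpty(overrideTableName))
                return overrideTableName;

            string baseName = Path.GetFileNameWithoutExtension(fileName);
            int underscore = baseName.IndexOf('_');
            return underscore > 0 ? baseName.Substring(0, underscore) : baseName;
        }
    }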

    The advantage of this is flexibility; you can get stuff into a database pretty darn quickly, substantially faster than if you had to define the tables first and then point SSIS at them to learn their data types and such; but it's not as good as typed data, which is a big tradeoff.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!
