Loading x number of CSV files into x number of unique tables of a database

  • Hi,

    This is my first time here, so I hope I am posting this in the right place. I recently started a new job which involves SQL scripting and loading tables into a database using SSIS within SSDT.

    Is there a way to load multiple CSV files into a database such that a new table is created for every CSV file loaded? I hope the following explanation of the scenario paints a clearer picture.

    Every month or so I receive a large number of Excel files (anywhere from 40-100) from the client. I convert these Excel files into CSVs after removing some unwanted header rows.

    The composition of the files is such that there are different groups of files, where each group contains information for a different department or business. The structure of the files within a group is the same but varies from group to group, i.e., Group1 files all have the same structure, and so on. The files are named:

    Filename1_Group1.csv

    Filename2_Group1.csv

    Filename3_Group1.csv

    Filename1_Group2.csv

    Filename2_Group2.csv

    Filename3_Group2.csv ......

    Once I have the CSVs ready, the following are the steps I perform using SSIS in SSDT. (Note that I repeat these for each and every file; it has now become muscle memory, but it is extremely time consuming!)

    Step 1: I add a single Data Flow Task and click on the Data Flow tab in SSDT.

    Step 2: I then drag and drop a "Flat File Source" and an "OLE DB Destination" onto the canvas. I create a connection by joining the blue line from the Flat File Source to the OLE DB Destination. I name the destination "Filename1_Group1".

    Step 3: I then double-click on the Flat File Source, add Filename1_Group1.csv, and change the Text Qualifier to ". Next, I go to the Advanced properties in the Flat File Connection Manager Editor, change the output column width to 800 for all the columns, and then click OK.

    Step 4: In the Advanced Editor for Flat File Source window, I click on the Component Properties tab, enter "OrgFilename" in the FileNameColumnName property, and then click OK.

    Step 5: I open the OLE DB Destination Editor, click New, and then select the relevant database (the database remains the same for all the files). I then select "Table or view - fast load" from the Data access mode drop-down and click on New, which opens the Create Table window.

    Step 6: In the Create Table window, I add an identifier column at the beginning of the query by writing Id_num int IDENTITY(1,1), and click OK (see the sketch after these steps).

    Step 7: I check the Mappings page to see if the columns are mapped correctly and then click OK.

    Step 8: I click on Start, which executes the package and loads Filename1_Group1.csv into the specified database. This creates a new table named Filename1_Group1 in the database.
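
    For reference, here is roughly what the generated CREATE TABLE looks like after my Step 6 edit (the data column names here are just placeholders; the real ones come from each CSV's header row):

        CREATE TABLE [Filename1_Group1] (
            Id_num int IDENTITY(1,1),   -- the identifier column I add by hand in Step 6
            Column1 varchar(800),       -- widths come from the 800 I set in Step 3
            Column2 varchar(800),
            OrgFilename varchar(800)    -- filled by the FileNameColumnName property from Step 4
        )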

    What I have described here is just for one file. I have to do this for a very large number of files and, as you can see, it is simply not a feasible way to work.

    Is there a way to improve this process such that the data flow performs these steps for all files within a specific folder? As different groups have different structures, would this still be possible? Even performing these tasks in batches would save me a lot of time and effort.

    Thanks in advance!

  • EDIT

    As this might be a long read, here is the TL;DR version

    I want to upload multiple CSV files into an SQL database such that a new table is created for every file that has been uploaded. The CSVs are grouped such that files belonging to the same group have the same structure. For every file that is loaded, a FileNameColumnName called OrgUnit is added in the Advanced Editor section of the Flat File Source, and an identity column is added in the OLE DB destination.

    Discounting the Excel thing for now, a little work in T-SQL will allow you to read the header row from each file, figure out what's there, and assign it to the "structure" group you speak of, to do a BULK INSERT into a table based on whatever naming convention you need, including the name of the file itself along with a date and time or whatever.  You don't even need to have SSIS installed, IMHO (and I've never had to).  And, it can do all the things you ask, like applying your OrgUnit and IDENTITY columns auto-magically.
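
    As a rough sketch of the idea only (the folder and file name are made-up examples, the target table is assumed to already exist, and FORMAT = 'CSV' needs SQL Server 2017 or later):

        -- Sketch: build the table name from the file name, then BULK INSERT into it.
        DECLARE  @FolderPath nvarchar(260) = N'C:\Imports\'           -- hypothetical folder
                ,@FileName   sysname       = N'Filename1_Group1.csv'  -- hypothetical file
                ,@SQL        nvarchar(MAX)
        ;
        DECLARE @TableName sysname = REPLACE(@FileName, N'.csv', N'');

        SELECT @SQL = N'BULK INSERT dbo.' + QUOTENAME(@TableName)
                    + N' FROM ''' + @FolderPath + @FileName + N''''
                    + N' WITH (FORMAT = ''CSV'', FIRSTROW = 2, FIELDQUOTE = ''"'');'
        ;
        EXEC sys.sp_executesql @SQL;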

    Right now, after you've uploaded (imported) a file to a named table, what do you want done with the file?

    Getting back to the Excel thing, with a thing known as the ACE Drivers, we can make that an intelligent automatic process, as well.  You'd never even have to open a spreadsheet manually, never mind modify it and save it in what Microsoft thinks is a CSV format.  SQL Server could read it directly and scrape out just the stuff it needs.
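
    For example, something like this (the path and sheet name are made up, the ACE provider has to be installed on the server, and 'Ad Hoc Distributed Queries' has to be enabled):

        -- Read a worksheet straight out of the .xlsx... no manual CSV conversion needed.
        SELECT *
          FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0'
                         ,'Excel 12.0; Database=C:\Imports\Filename1_Group1.xlsx; HDR=YES'
                         ,'SELECT * FROM [Sheet1$]');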

    How many file groups do you have?  And do they ever violate the structure of the file(s)?  (Heh... of course they do and will, but I have to ask because some clients actually do have their act together.)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Heh... in fact, we could make the system figure out what's wrong with a file, produce an errata list in an email, and send it to the client along with the original file... all auto-magically.
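
    That part is just Database Mail. Something like this (the profile name, address, and path are all made-up examples):

        -- Made-up example: mail the errata back to the client with the original file attached.
        EXEC msdb.dbo.sp_send_dbmail
             @profile_name     = 'ImportAlerts'          -- hypothetical mail profile
            ,@recipients       = 'client@example.com'
            ,@subject          = 'Errata for Filename1_Group1.csv'
            ,@body             = 'Please see the attached errata list and original file.'
            ,@file_attachments = 'C:\Imports\Filename1_Group1.csv';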

    I won't kid you, though.  It's not as easy as falling off a slippery log, but it's not as difficult as making a 5-course meal, either.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff

    I haven't heard of the ACE driver. Is it a tool? If so, can you share a link, please? I am not sure how to use BULK INSERT for my task. I had a look online and one suggestion was using the Foreach Loop Container in SSIS.

    As the file structures for the groups are different, I can run it in batches. But I am not sure how to use the Foreach Loop Container such that every individual CSV file gets its own table.

  • masterelaichi wrote:

    Thanks Jeff

    I haven't heard of the ACE driver. Is it a tool? If so, can you share a link, please? I am not sure how to use BULK INSERT for my task. I had a look online and one suggestion was using the Foreach Loop Container in SSIS.

    As the file structures for the groups are different, I can run it in batches. But I am not sure how to use the Foreach Loop Container such that every individual CSV file gets its own table.

    The ACE driver provides a way of connecting to Excel files which makes them machine readable. As your requirement here is very dynamic, Jeff's way of doing things is a better option than SSIS, which works best with known formats and static tables.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

    Sorry, how do I go about Jeff's method of using BULK INSERT? Is it an SQL query?

    I was playing around with the Foreach Loop Container to try and understand it. I partially succeeded in uploading a folder full of CSV files. But what I want to do is create a new table for each of the input source CSVs. I am guessing it is something to do with using a variable for the CREATE TABLE statement?

  • masterelaichi wrote:

    Sorry, how do I go about Jeff's method of using BULK INSERT? Is it an SQL query?

    I was playing around with the Foreach Loop Container to try and understand it. I partially succeeded in uploading a folder full of CSV files. But what I want to do is create a new table for each of the input source CSVs. I am guessing it is something to do with using a variable for the CREATE TABLE statement?

    BULK INSERT is T-SQL and is described here.
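
    In its simplest form it looks something like this (the table must already exist, and the path and options are examples only):

        BULK INSERT dbo.Filename1_Group1
        FROM 'C:\Imports\Filename1_Group1.csv'
        WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');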

    The T-SQL for creating a table is simple enough ... if you know the columns and datatypes at the time of creation. But SSIS won't do that work for you. Within a data flow, mapping columns from source to target is easy enough ... as long as you know the source and target columns at design time.

    In your situation, this is not going to be easy at all in SSIS. Please wait for Jeff to respond again.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • masterelaichi wrote:

    Thanks Jeff

    I haven't heard of the ACE driver. Is it a tool? If so, can you share a link, please? I am not sure how to use BULK INSERT for my task. I had a look online and one suggestion was using the Foreach Loop Container in SSIS.

    As the file structures for the groups are different, I can run it in batches. But I am not sure how to use the Foreach Loop Container such that every individual CSV file gets its own table.

    Heh... I can't actually help unless you answer some of the questions I've asked.  Go back to my first post on this thread, find the questions and answer them, please.

    Just out of curiosity, you said that "I recently started a new job which involves SQL scripting and loading tables into a database using SSIS within SSDT".  With absolutely no snarkiness intended (things like this happen to the best of us), how did you end up being placed in this position?  I'm just curious about such events leading up to such things.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Right now, after you've uploaded (imported) a file to a named table, what do you want done with the file?

    Once the files are uploaded, there are numerous procedures in the database that perform transformations on them, applying business rules, etc.  As for the raw CSV files that have been uploaded, there are no further actions on them. They just sit there in a folder.

    How many file groups do you have?  And do they ever violate the structure of the file(s)?  (Heh... of course they do and will, but I have to ask because some clients actually do have their act together.)

    I have around 7 different groups. The file structure within each group is pretty consistent, with the same datatypes, number of columns, etc. The only thing that changes is the number of rows.

    And as for how I got the role? Beats me! I have Power BI experience and had very limited SQL knowledge - knowledge which I gained through self-learning from resources such as DataCamp, YouTube, etc., something I mentioned to the panel during the interview.

  • I have a couple of questions:

    1. Why do you need a new table created for each file - every time?  Could this be done with a single table for each group - with a column that denotes which file it came from?
    2. Are the files consistent within each group?  You say they are 'pretty consistent' - does that mean file1 of group1 may or may not have the same columns defined as file2 of group1?
    3. How often does a group's set of files change?  If they are consistent - how often would you say they change by adding/removing/renaming a column in the Excel spreadsheet?

    It doesn't make sense to create a new table every time you import a file.  If these are consistent per group - and you don't have any drift on the columns - then you could use SSIS and create a separate package for each group's files.  Then set up the Foreach in each package to only pull that group's files - and load them into a pre-defined table for that group, adding the source file as one of the columns.

    You would need an additional column - or columns - on that group's table to identify when a row has been processed, but then you would have a single set of code for each group instead of having to create/modify new code for each file - or use dynamic code to identify each 'new' table to be processed.  See the sketch below.
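
    As an illustration of what one group's table might look like (all names here are illustrative only):

        CREATE TABLE dbo.Group1_Data (
            Id_num         int IDENTITY(1,1) NOT NULL,
            -- ... the group's actual data columns go here ...
            SourceFileName varchar(260) NOT NULL,                  -- which CSV the row came from
            LoadedDate     datetime     NOT NULL DEFAULT GETDATE(),
            ProcessedDate  datetime     NULL                       -- set once downstream processing is done
        );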

    It is possible - but it will take some time and thought into how the overall process should work.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • masterelaichi wrote:

    Right now, after you've uploaded (imported) a file to a named table, what do you want done with the file?

    Once the files are uploaded, there are numerous procedures in the database that perform transformations on them, applying business rules, etc.  As for the raw CSV files that have been uploaded, there are no further actions on them. They just sit there in a folder.

    How many file groups do you have?  And do they ever violate the structure of the file(s)?  (Heh... of course they do and will, but I have to ask because some clients actually do have their act together.)

    I have around 7 different groups. The file structure within each group is pretty consistent, with the same datatypes, number of columns, etc. The only thing that changes is the number of rows.

    And as for how I got the role? Beats me! I have Power BI experience and had very limited SQL knowledge - knowledge which I gained through self-learning from resources such as DataCamp, YouTube, etc., something I mentioned to the panel during the interview.

    Good stuff and totally understood.  The only question I have now is: how do you decide which group any given file is in?  I don't know of a reason why that couldn't be automated, as well.  I just need to know what process (manual or otherwise) you're using for that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Good stuff and totally understood.  The only question I have now is: how do you decide which group any given file is in?  I don't know of a reason why that couldn't be automated, as well.  I just need to know what process (manual or otherwise) you're using for that.

    I was thinking, to keep it simple, of doing it in batches instead of all at once. Each group has anywhere from 15-20 files. Sure, automating the entire bulk upload would be a lot more efficient, but it might be harder to troubleshoot. Anything so that I don't have to upload 80-odd files individually. It becomes a lot more painful when I lose my work, as my saved files never seem to work; I get all sorts of errors (another thing I need to figure out!).

    Thanks

  • masterelaichi wrote:

    Good stuff and totally understood.  The only question I have now is: how do you decide which group any given file is in?  I don't know of a reason why that couldn't be automated, as well.  I just need to know what process (manual or otherwise) you're using for that.

    I was thinking, to keep it simple, of doing it in batches instead of all at once. Each group has anywhere from 15-20 files. Sure, automating the entire bulk upload would be a lot more efficient, but it might be harder to troubleshoot. Anything so that I don't have to upload 80-odd files individually. It becomes a lot more painful when I lose my work, as my saved files never seem to work; I get all sorts of errors (another thing I need to figure out!).

    Thanks

    OK, but you still haven't answered Jeff's question. If you pick file 1 out of a folder containing 80 files, how do you identify the 'batch' to which it belongs? By name? By file type? Or do you have to interrogate the file to find out the number of columns and the column names?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

    The filenames indicate the groups they are in - Filename1_Group1.csv, Filename2_Group1.csv, Filename3_Group1.csv; Filename1_Group2.csv, Filename2_Group2.csv, Filename3_Group2.csv, and so on.

    When I create the data flow task, I add the name of the file as the name of the OLE DB Destination component. So when I go to the OLE DB Destination -> "Table or view - fast load" -> New, the SQL code is CREATE TABLE Filename1_Group1. I edit this step by adding an identifier column using IDENTITY(1,1).

    Once I do this, the resulting tables in SQL are named the same as the raw files mentioned above. To answer your question as to how I identify them: it is manual, as the name of the file is the name of the SQL table that is meant to be created. I can do these in batches, i.e., load all of Group1, followed by Group2, and so on.

    Hope that helps

