Challenge - dynamically load any file

  • I have challenge where I received data from multiple sources in different formats such as excel, csv, flat file.

    Now the challenge is I have to load any received file dynamically using SSIS in flat structure.

    I achieved the same for excel files using script task...but not sure how to do it for csv and flat file.

    Abhijit - http://abhijitmore.wordpress.com

  • Abhijit More (7/13/2015)


    I have challenge where I received data from multiple sources in different formats such as excel, csv, flat file.

    Now the challenge is I have to load any received file dynamically using SSIS in flat structure.

    I achieved the same for excel files using script task...but not sure how to do it for csv and flat file.

    Describe what you mean by 'flat structure'.

    Are the files all in different format (columns/datatypes)?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Describe what you mean by 'flat structure'.

    Flat Table with all headers from files, if no header it should come with Column0, Column1, Column2...& so on.

    Are the files all in different format (columns/datatypes)?

    yes each file will have different headers with varying data size.

    Abhijit - http://abhijitmore.wordpress.com

  • Abhijit More (7/13/2015)


    Describe what you mean by 'flat structure'.

    Flat Table with all headers from files, if no header it should come with Column0, Column1, Column2...& so on.

    Are the files all in different format (columns/datatypes)?

    yes each file will have different headers with varying data size.

    So you somehow want to parse a flat file and return all the column names and appropriate data types & then handle that in SSIS?

    I have no idea how to do that without writing a ton of code. It will also, of course, not be able to use many of SSIS's optimisations, because you will not be able to code this as a dataflow.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • i've done this with a script task, but i assume the file always has a header row for the column names, and that i know the delimiters, ie, comma/CrLf, tab/CrLf or pipe/crlf for example. files without a header row ends up making the first row of the data the column names...yuck.

    i have to assume all file types are strings, but at least it's stuffed into SQL.

    i have a variation of it as well, due to one datasource having lines 2 and lines 3 of the file defining data type and size, so i can actually define the datatypes of things other than strings.

    my logic also assumes that the filename left of an underscore become sthe table name, so two similarly named files, like Invoices_20150712.txt and Invoices_20150713.txt, have the same format, and get stuffed in the same table.

    i read the file in, parsing it for quoted identifiers, and stick it in a list of strings.

    then i load the strings into a datatable, once in the data table, i use that to determine the script for creating the table if it does not exist, create the destination table, and then use SQL bulk insert to insert the table; if the table already existed, then the SQL bulk insert would execute.

    the sweet part of this is it's so generic; unzip something that has 50 or 100 differnet files in it,and they all go into SQL in matching tables.

    i should throw an article out there, it's pretty handy.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (7/13/2015)


    i've done this with a script task, but i assume the file always has a header row for the column names, and that i know the delimiters, ie, comma/CrLf, tab/CrLf or pipe/crlf for example. files without a header row ends up making the first row of the data the column names...yuck.

    i have to assume all file types are strings, but at least it's stuffed into SQL.

    i have a variation of it as well, due to one datasource having lines 2 and lines 3 of the file defining data type and size, so i can actually define the datatypes of things other than strings.

    my logic also assumes that the filename left of an underscore become sthe table name, so two similarly named files, like Invoices_20150712.txt and Invoices_20150713.txt, have the same format, and get stuffed in the same table.

    i read the file in, parsing it for quoted identifiers, and stick it in a list of strings.

    then i load the strings into a datatable, once in the data table, i use that to determine the script for creating the table if it does not exist, create the destination table, and then use SQL bulk insert to insert the table; if the table already existed, then the SQL bulk insert would execute.

    the sweet part of this is it's so generic; unzip something that has 50 or 100 differnet files in it,and they all go into SQL in matching tables.

    i should throw an article out there, it's pretty handy.

    Good stuff. Importing everything as a string makes the task a lot easier, but what do you do about column length?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Abhijit More (7/13/2015)


    I have challenge where I received data from multiple sources in different formats such as excel, csv, flat file.

    Now the challenge is I have to load any received file dynamically using SSIS in flat structure.

    I achieved the same for excel files using script task...but not sure how to do it for csv and flat file.

    "Dynamically load any file". And you say you've done that for Excel files?

    With absolutely no knowledge of the attached 3 Excel files, will your script file successfully create the necessary table and load them with absolutely no modifications to any code anywhere, script file or not?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Phil Parkin (7/13/2015)


    Good stuff. Importing everything as a string makes the task a lot easier, but what do you do about column length?

    i iternate through the datatable and determine the longest string in each column.

    with that in place, i get things like CityName has a length of 17 in one datasource, and 34 in another, which is a headache, so i have another version that rounds up to the next highest increment if ten, with a minimum of 30 for the size; so a column with 17 gets rounded up to 30, but a source that has len=34 gets rounded up to 40.

    if i process all similar files otgether, ie there was 40 Invoice_xxx files, the datatypes are pretty stable, but if you do one off files, you might have to tweak the destination table manually afterwards...

    like i said, if the table existed, it would just bulk insert.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • lol Jeff that's evil;

    i could do it with Excel Automation in a script task, but all the columns would end up being "column1" "column2" because the first row does not contain column names. event he used ranges don't contain column names; the header row of the used range doinks that up.

    so i could stuff that into a sql table, but it would be ugly. i'll wait to see how Abhijit would handle it, but, as with all things programming, yeah it's possible, but you might not like the result.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Jeff I have considered excel as simple table format columns and rows...no formatting is consider in my scri[t. probably after some tweaking it may work:-)

    Abhijit - http://abhijitmore.wordpress.com

  • thxs Lowell the only concern I have in my files it may or may not have headers as well some file has Row Delimiter "CR LF"" and some has "LF". there is no consistent format. However in excel being a row & columnar data I can mange the same every easily.

    I am looking for below options..

    1. Dynamically look for Row & Column delimiters

    Or

    2. Configure Row & Column delimiters, if not found in file reject the file.

    Or

    3. In case there are any data shifting, reject the file (do not load it).

    Abhijit - http://abhijitmore.wordpress.com

  • Abhijit More (7/13/2015)


    thxs Lowell the only concern I have in my files it may or may not have headers as well some file has Row Delimiter "CR LF"" and some has "LF". there is no consistent format. However in excel being a row & columnar data I can mange the same every easily.

    I am looking for below options..

    1. Dynamically look for Row & Column delimiters

    Or

    2. Configure Row & Column delimiters, if not found in file reject the file.

    Or

    3. In case there are any data shifting, reject the file (do not load it).

    wow that makes it awfully difficult, to try and determine delimiters. if you don't know up front what the column delimiters are, i can't imagine how to do it; the CrLf vs Lf, yeah you could make a test for that i guess.

    you'd have to load the whole file into a single string, and maybe you could split on a best guess delimiter, and see if you get a decent row count;

    rinse and repeat for your top 3 delimiters? i'd still go for known assumptions, and change the delimiter on a per-source package.

    ie, if i know BankOfAmerica sends me piped data, the param for them is pipe, and each source might be a per package, or a parameter in a table that gets looked up, if it's "One SSIS Package To Rule Them All"

    i'd love to hear how you might tackle that.

    the data shifting, you could compare columns against the destination (assuming the destination exists)after you've read it but before you insert, and reject if new columns appeared, or the column order changed.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (7/13/2015)


    lol Jeff that's evil;

    i could do it with Excel Automation in a script task, but all the columns would end up being "column1" "column2" because the first row does not contain column names. event he used ranges don't contain column names; the header row of the used range doinks that up.

    so i could stuff that into a sql table, but it would be ugly. i'll wait to see how Abhijit would handle it, but, as with all things programming, yeah it's possible, but you might not like the result.

    Heh... I actually just gave a presentation in Cleveland on how to do all of those spreadsheets and more with zero code changes. And, it comes up with meaningful column names. It's all done using two fairly short stored procedures. And, yeah... working on the article for it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Abhijit More (7/13/2015)


    Jeff I have considered excel as simple table format columns and rows...no formatting is consider in my scri[t. probably after some tweaking it may work:-)

    Thanks for the feedback, Abhijit . That's pretty much what I thought.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Abhijit More (7/13/2015)


    thxs Lowell the only concern I have in my files it may or may not have headers as well some file has Row Delimiter "CR LF"" and some has "LF". there is no consistent format. However in excel being a row & columnar data I can mange the same every easily.

    I am looking for below options..

    1. Dynamically look for Row & Column delimiters

    Or

    2. Configure Row & Column delimiters, if not found in file reject the file.

    Or

    3. In case there are any data shifting, reject the file (do not load it).

    I've actually solved that problem for work and working on an article for that, as well. It's not a total panacea but it will auto-magically create a staging table and a BCP Format file to do the imports of most "reasonable" text files.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply