SSIS Flat File Import with different file layouts

  • I have a flat file which I need to import into three different tables in SQL Server, and I want to use SSIS and SQL 2008 R2. The flat file is pipe delimited. The first record could be one of three values (WITH DIFFERENT COLUMNS), which determines the table I need to put it in - and it will tell me the layout of that record as well. So each line of the flat file will be one of three different layouts. I'm wondering if this is possible - and how I would do this using SSIS. Thanks in advance.

    For example, let's say

    F1 has Name, City, State, Zip;

    F2 has Account#, TotalSales, RepName; and

    F3 has ID#, Color

    Here is what the file I want to import and transform looks like:

    F3|313|Blue

    F3|112|Red

    F1|John Smith|New York|NY|10001

    F3|415|Yellow

    F2|123300|210000.00|Barney

    F1|Fred Johnson|Boston|MA|01234

    F2|331206|300500.00|Silvia

  • Read the whole file in as one column, then use a conditional split to split the data stream into 3 flows.

    For each flow, use a script component to split out a row into it's respective columns.

    Finally write each flow to its own destination.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (3/1/2011)


    Read the whole file in as one column, then use a conditional split to split the data stream into 3 flows.

    For each flow, use a script component to split out a row into it's respective columns.

    Finally write each flow to its own destination.

    Koen... this is no reflection on you. Your suggestion is perfect.

    I'm just ticked at the lack of programming in SSIS and I'm amazed that something like SSIS, which is supposedly the definitive tool for ETL in SQL Server, still requires the use of a bloody script component. If that's what's necessary, folks might just as well do it all in T-SQL. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/1/2011)


    Koen... this is no reflection on you. Your suggestion is perfect.

    I know. 😎

    😀

    Jeff Moden (3/1/2011)


    I'm just ticked at the lack of programming in SSIS and I'm amazed that something like SSIS, which is supposedly the definitive tool for ETL in SQL Server, still requires the use of a bloody script component. If that's what's necessary, folks might just as well do it all in T-SQL. 😉

    You can look at it from a different perspective:

    <salesmode on>you can extend SSIS beyond the limits with the full .NET capacity! </salesmode off>

    I suggested the script component because it will be non-blocking and thus blazingly fast, as each row is parsed in memory.

    But because it is you, I shall propose a solution without scripting :-D:

    Again, split out the flow using the conditional split, but this time, write each flow to a flat file destination.

    Then have a second data flow, where you read the 3 files using 3 seperate Flat File Sources.

    Since you can configure each source seperately, you can exactly determine the correct number of columns.

    Advantages: no scripting.

    Disadvantages: more I/O.

    Another alternative approach is to do the splitting in a derived column instead of a script component, using findchar and substring, but that would turn ugly real quick.

    (and .NET has a built-in Split() function, so why not use it?)

    But I get your point. If something becomes a tidbit complex, you usually have to resolve to either .NET scripting or TSQL scripting.

    But that's OK, I make my living out of it 😀

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Heh... thanks for the feedback, Koen. I may just have to teach myself C# so I can write a conditional splitter that would make it all a whole lot easier.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/1/2011)


    Heh... thanks for the feedback, Koen. I may just have to teach myself C# so I can write a conditional splitter that would make it all a whole lot easier.

    I'm just teaching myself C#. I already know a fair bit of VB.NET, but for some reason my coworkers laugh at me when I mention it 🙂

    Anyway, if you ever want to start on writing your custom splitter, here's your starting point:

    http://msdn.microsoft.com/en-us/library/system.string.split(v=VS.100).aspx

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I'd get on to the numptie that decided to put three types of data in a random fashion into a flat file.. give em a quick kick and get them to supply three files instead!

    but in the likely hood you can't find the numptie , Koen's plan seems like a good alternative. 😀

  • nahk.fussuy (3/1/2011)


    I'd get on to the numptie that decided to put three types of data in a random fashion into a flat file.. give em a quick kick and get them to supply three files instead!

    but in the likely hood you can't find the numptie , Koen's plan seems like a good alternative. 😀

    Most of the time it goes like this:

    "Who are you looking for? The guy who created the flat file dumps? Nah, he doesn't work here anymore. And we don't know how the system works..."

    or

    "There are flat file dumps?"

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • There's another option, and it's in stream without a VB component, but does require some repetitive coding and usage of expressions. You'd have to used the derived column component after you do the conditional split.

    Using a substring and a series of Findstrings (think substring and charindex), you can manually locate each column. This gets a lot uglier if you've got quoted strings, and I'd recommend escaping out commas in the middle of quotes if that's the case (I personally like pipes, but that's me). It's more annoying then using VB because it's not as simple to loop, but it does work. I wouldn't tackle anything this way with more then 10-11 columns though, it's just painful. At that point I'd probably feed it into a staging table and work with it there out of self defense. I'm not really a fan of the VB Script dataflow component as a transformation, but it can certainly work here.

    EDIT: Apologies, Koen, I didn't see this when I first read through the thread:

    Another alternative approach is to do the splitting in a derived column instead of a script component, using findchar and substring, but that would turn ugly real quick.

    (and .NET has a built-in Split() function, so why not use it?)

    So, um, this was mentioned before. Oops.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Craig Farrell (3/1/2011)


    There's another option, and it's in stream without a VB component, but does require some repetitive coding and usage of expressions. You'd have to used the derived column component after you do the conditional split.

    Using a substring and a series of Findstrings (think substring and charindex), you can manually locate each column. This gets a lot uglier if you've got quoted strings, and I'd recommend escaping out commas in the middle of quotes if that's the case (I personally like pipes, but that's me). It's more annoying then using VB because it's not as simple to loop, but it does work. I wouldn't tackle anything this way with more then 10-11 columns though, it's just painful. At that point I'd probably feed it into a staging table and work with it there out of self defense. I'm not really a fan of the VB Script dataflow component as a transformation, but it can certainly work here.

    EDIT: Apologies, Koen, I didn't see this when I first read through the thread:

    Another alternative approach is to do the splitting in a derived column instead of a script component, using findchar and substring, but that would turn ugly real quick.

    (and .NET has a built-in Split() function, so why not use it?)

    So, um, this was mentioned before. Oops.

    😀

    It would be a lot less uglier if the derived column wasn't a one-line-editor.

    edit: corrected very stupid typo

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (3/2/2011)


    I would be a lot less uglier if the derived column wasn't a one-line-editor.

    Don't be hard on yourself, I'm sure you're not that ugly 😛

    Script component is definitely the way to go.

    I don't think it's a failing in the platform really. It would be pretty difficult to deal with multiple different table structures (potentially with different data types) within the same flat file in a generic way that didn't require any scripting.

    What I would like to see though is a 1st party dynamic flat file source that allowed you to identify columns through column headers rather than ordinal position and allowed optional columns...

  • HowardW (3/2/2011)


    Koen Verbeeck (3/2/2011)


    I would be a lot less uglier if the derived column wasn't a one-line-editor.

    Don't be hard on yourself, I'm sure you're not that ugly 😛

    Corrected that mistake 🙂

    HowardW (3/2/2011)


    What I would like to see though is a 1st party dynamic flat file source that allowed you to identify columns through column headers rather than ordinal position and allowed optional columns...

    Well, I wrote some time ago a generic import package, that basically reads data from various sources based on metadata stored in a table.

    The trick was that you can write an OLE DB query to a flat file, allowing you to read the contents with an ordinary SQL statement (so you can drop columns, reference them by name, do some basic formatting, apply a where clause et cetera).

    The catch is that this approach is usually slower (albeit more flexible) and that for flat files you usually have to generate a .INI file to describe the file's contents.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • try without ssis

    bulk insert into FlatBuffer ...

    and parsing into tables

    if Object_ID('tempdb..#F1') is not null drop table #F1

    if Object_ID('tempdb..#F2') is not null drop table #F2

    if Object_ID('tempdb..#F3') is not null drop table #F3

    if Object_ID('tempdb..#Buffer') is not null drop table #Buffer

    create table #F1(Name varchar(50), City varchar(50), State varchar(50), Zip varchar(50))

    create table #F2(Account bigint, TotalSales money, RepName varchar(50))

    create table #F3(ID int, Color varchar(50))

    create table #Buffer(Data varchar(max))

    insert into #Buffer

    select '#F3|313|Blue' as Dat?

    union select '#F3|112|Red'

    union select '#F1|John Smith|New York|NY|10001'

    union select '#F3|415|Yellow'

    union select '#F2|123300|210000.00|Barney'

    union select '#F1|Fred Johnson|Boston|MA|01234'

    union select '#F2|331206|300500.00|Silvia'

    declare @sql varchar(max)

    declare t cursor local forward_only for

    select i.[Insert] + ' ' + d.[Data]

    from

    (select left(Data,charindex('|',Data)-1) as TABLE_NAME

    , stuff((select '' + ' union select '''+replace(substring(Data,charindex('|',Data)+1,len(Data)-charindex('|',Data)),'|',''',''')+''''

    from #Buffer where left(Data,charindex('|',Data)-1) = left(b.Data,charindex('|',b.Data)-1) for xml path('')),1,6,'') as [Data]

    from #Buffer b

    group by left(Data,charindex('|',Data)-1))d

    join

    (select left(t.TABLE_NAME,3) as TABLE_NAME,

    'insert into '+left(t.TABLE_NAME,3) + '('

    + stuff((select ',' + COLUMN_NAME from TEMPDB.INFORMATION_SCHEMA.COLUMNS

    where left(TABLE_NAME,3) = left(t.TABLE_NAME,3) for xml path('')),1,1,'')+')' as [INSERT]

    from TEMPDB.INFORMATION_SCHEMA.COLUMNS as t

    where left(t.TABLE_NAME,3)in ('#F1','#F2','#F3')

    group by t.TABLE_NAME) i on i.TABLE_NAME = d.TABLE_NAME

    open t

    fetch next from t into @sql

    while @@fetch_status=0

    begin

    exec(@sql)

    fetch next from t into @sql

    end

    close t

    deallocate t

    select * from #F1

    select * from #F2

    select * from #F3

  • LexusR (3/3/2011)


    try without ssis

    bulk insert into FlatBuffer ...

    and parsing into tables

    if Object_ID('tempdb..#F1') is not null drop table #F1

    if Object_ID('tempdb..#F2') is not null drop table #F2

    if Object_ID('tempdb..#F3') is not null drop table #F3

    if Object_ID('tempdb..#Buffer') is not null drop table #Buffer

    create table #F1(Name varchar(50), City varchar(50), State varchar(50), Zip varchar(50))

    create table #F2(Account bigint, TotalSales money, RepName varchar(50))

    create table #F3(ID int, Color varchar(50))

    create table #Buffer(Data varchar(max))

    insert into #Buffer

    select '#F3|313|Blue' as Dat?

    union select '#F3|112|Red'

    union select '#F1|John Smith|New York|NY|10001'

    union select '#F3|415|Yellow'

    union select '#F2|123300|210000.00|Barney'

    union select '#F1|Fred Johnson|Boston|MA|01234'

    union select '#F2|331206|300500.00|Silvia'

    declare @sql varchar(max)

    declare t cursor local forward_only for

    select i.[Insert] + ' ' + d.[Data]

    from

    (select left(Data,charindex('|',Data)-1) as TABLE_NAME

    , stuff((select '' + ' union select '''+replace(substring(Data,charindex('|',Data)+1,len(Data)-charindex('|',Data)),'|',''',''')+''''

    from #Buffer where left(Data,charindex('|',Data)-1) = left(b.Data,charindex('|',b.Data)-1) for xml path('')),1,6,'') as [Data]

    from #Buffer b

    group by left(Data,charindex('|',Data)-1))d

    join

    (select left(t.TABLE_NAME,3) as TABLE_NAME,

    'insert into '+left(t.TABLE_NAME,3) + '('

    + stuff((select ',' + COLUMN_NAME from TEMPDB.INFORMATION_SCHEMA.COLUMNS

    where left(TABLE_NAME,3) = left(t.TABLE_NAME,3) for xml path('')),1,1,'')+')' as [INSERT]

    from TEMPDB.INFORMATION_SCHEMA.COLUMNS as t

    where left(t.TABLE_NAME,3)in ('#F1','#F2','#F3')

    group by t.TABLE_NAME) i on i.TABLE_NAME = d.TABLE_NAME

    open t

    fetch next from t into @sql

    while @@fetch_status=0

    begin

    exec(@sql)

    fetch next from t into @sql

    end

    close t

    deallocate t

    select * from #F1

    select * from #F2

    select * from #F3

    I would like to see your solution for 1 million+ rows.

    I think the cursor and the fact that you need to store everything in a temp table will kill your tempdb and I/O.

    If you do it in SSIS, you can stream it in memory, so you can already write to destination while you are still reading the source. Much more performant.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (3/3/2011)


    I would like to see your solution for 1 million+ rows.

    I think the cursor and the fact that you need to store everything in a temp table will kill your tempdb and I/O.

    If you do it in SSIS, you can stream it in memory, so you can already write to destination while you are still reading the source. Much more performant.

    I agree that the cursor will likely be a problem but the Temp Table won't be. It's just a different kind of memory usage. I do stuff similar to this all the time (thanks to some whacko 3rd party data vendors) in T-SQL with million row inputs. Of course, I whack'n'stack the data quite a bit differently... I don't build a million SELECT/UNION statements which would take comparatively forever. 😉

    So far as streaming it in SSIS, I've not tried that nor seen it done (at least not properly) so I can't comment on that at all.

    I do know that I replaced some nasty ol' Pearl Scripts (not that Pearl is nasty... the script was) used in a script task that were taking 45 minutes just to get 1 variable length file READY for import in a previous job. Between a good Tally Table splitter and some other SQL prestidigitation, I got the job down to actually importing 8 files with error checking and data validation in under 2 minutes. Each file had unpredictable widths anywhere from 11 to 800 (a million "elements", not rows, was typical) highly denormalized "fields" with column pairs that had to stay together and every other denormalization sin you can think of. All of the files, regardless of width, all needed to be imported into a table that was only 11 columns wide.

    Of course, a well written C# preprocessor or SQLCLR would probably do the job much quicker than any of that (as has been proven multiple times with "splitters").

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply