ETL with non-linear data files

  • Everyone,

    I'm in the middle of populating a SQL Server database with data extracted from an old proprietary system.  The only way to get data out of this system is to pull reports, which are very ugly.  Each record consumes several lines, there are blank lines interspersed throughout, and the header appears in the document every 100 lines or so.  The only saving grace is that the file format is consistent from record to record, and there is a clear point of delineation between records (New records start with "ID: ").

    Up to this point, I have been writing a C# application for each of these file types, and parsing each file individually.  Although this works well, it is very time-consuming - and there are many more files to come.

    What I'm wondering is if there are any tools to make this easier?  This is not a typical ETL project, but I'm sure this problem is not unique either.  I have considered creating a .NET class to assist with this - or even a full blown application to pull the records into a database or at least a more usable text format.  However, if this has already been done, I don't want to reinvent the wheel.

    Any input would be appreciated.

    Thanks,

    Tim

    Tim Mitchell, Microsoft Data Platform MVP
    Data Warehouse and ETL Consultant
    TimMitchell.net | @Tim_Mitchell | Tyleris.com
    ETL Best Practices

  • Tim,

    I'm working on a project which had a similar problem and I ended up writing my own SSIS custom data source. This consumed the file(s) and spat out a tabular representation of the data it contained. Surprisingly it's very fast and as a toolbox item in SSIS participates in all the usual looping constructs ets which you would need to process large numbers of files.

    My file format was something like

    fileheader_class,fileheader_type,number_of_subheadings

    subheading_startdate,subheading_enddate,number_of_items

    item

    item

    item

    item

    item

    subheading_startdate,subheading_enddate,number_of_items

    etc.

    etc.

    Adding three outputs to the data source (fileheader, subheader and item) enables me to point each at a separate table.

    there are plenty of examples of SSIS custom data sources so it's pretty straight forward.

    Cheers

  • I have also used the ActiveX script component in SSIS as a data source for consuming some oddly formatted files.  So far, I have been amazed at the performance I was able to get with very little code.

    A custom data source is a better long-term solution, but in a pinch, the ActiveX script will work and is a bit less complicated.

  • If this is something you will reuse then the custom data source component is the best approach.

    For a quick-and-dirty reformatting problems I use Textpad, and other editors exist that could do the same thing.  Textpad can use regular expressions to do search-and-replace in large text files, or even in multiple files, very quickly.  Multi-line expressions can merge data lines and strip out blank lines.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply