SSIS: Importing Flat/Text Structured, Semi-Structured, Unstructured

  • I have a number of text files that are produced daily that can be loaded to a SQL Db table and reported on. The problem is it requires a lot of manual scrubbing. I'm a bit new to the SSIS world and need to understand how to automate this load. I have had some success in file connection and formatting the input as Ragged Right. BUT the problem lies in the Page Breakes with headers through out each text file.

    Is there any standard method to pars through a file with data records, column headings, fixed width/Ragged right but has page breaks interspersed through out the file?

    Example:

    --------------

    Wed Jan 06 06:00:01 CST 2010 Page 1

    Daily Job Report

    Job Started Job Finished Loc Jobid Module Chain Job Status

    ------------------- ------------------- --- --------- ------------------------------ ------------------------------ ------------

    01-06-2010 06:00:01 Q 104226.00 OMNI_SPEC0011D_HOMEOWNERSHIPCO L1_CBS_SPEC_CHAIN PRED WAIT

    01-06-2010 06:00:01 Q 104225.00 OMNI_SPEC0015D_OVERDRAWNLOCKRE L1_CBS_SPEC_CHAIN PRED WAIT

    01-06-2010 06:00:01 01-06-2010 06:00:01 Q 104233.00 CBS_KEEP_ALIVE LAUNCHED

    01-06-2010 06:00:01 01-06-2010 06:00:01 Q 104234.00 AML_KEEP_ALIVE LAUNCHED

    01-06-2010 06:00:01 01-06-2010 06:00:01 Q 104238.00 VMBT3_KEEPALIVE LAUNCHED

    01-06-2010 06:00:01 01-06-2010 06:00:01 Q 104236.00 VMBT1_KEEPALIVE LAUNCHED

    01-06-2010 06:00:01 01-06-2010 06:00:01 Q 104237.00 VMBT2_KEEPALIVE LAUNCHED

    01-06-2010 06:00:01 01-06-2010 06:00:01 Q 104235.00 DAILY_JOB_REPORT_2

    Wed Jan 06 06:00:01 CST 2010 Page 2

    Daily Job Report

    Job Started Job Finished Loc Jobid Module Chain Job Status

    ------------------- ------------------- --- --------- ------------------------------ ------------------------------ ------------

    01-06-2010 05:01:05 01-06-2010 05:01:05 H 104155.00 OMNI_FINA0018M_FRB-COLLATERAL L1_CBS_FINA_CHAIN Skip!RunCal

    01-06-2010 05:01:05 01-06-2010 05:01:06 H 104152.00 AH_CLEAR_399-0 OSI_M_50CK_PROMO_POST_T1841 INACTIVE

    01-06-2010 05:01:05 01-06-2010 05:01:06 H 104151.00 PS_FILELOADER_9656 OSI_M_50CK_PROMO_POST_T1841 INACTIVE

    01-06-2010 05:01:05 01-06-2010 05:01:08 H 104149.00 OSI_M_50CK_PROMO_POST_T1841 L1_CBS_MARK_CHAIN Skip!RunCal

    01-06-2010 05:01:05 01-06-2010 05:01:05 H 104148.00 OMNI_MARK0001W_50CKGROUPPROMO L1_CBS_MARK_CHAIN Skip!RunCal

    Wed Jan 06 06:00:01 CST 2010 Page 3

    Daily Job Report

    Job Started Job Finished Loc Jobid Module Chain Job Status

    ------------------- ------------------- --- --------- ------------------------------ ------------------------------ ------------

    01-06-2010 05:01:02 01-06-2010 05:01:07 H 104123.00 L1_CBS_BOPS_CHAIN L0_CBS_MASTER_CHAIN_MORNING FINISHED

    01-06-2010 05:01:02 01-06-2010 05:01:07 H 104124.00 L1_CBS_COMM_CHAIN L0_CBS_MASTER_CHAIN_MORNING FINISHED

    01-06-2010 05:01:02 01-06-2010 05:10:07 H 104125.00 L1_CBS_DEPS_CHAIN L0_CBS_MASTER_CHAIN_MORNING FINISHED

    01-06-2010 05:01:02 01-06-2010 05:01:14 H 104126.00 L1_CBS_INFO_CHAIN L0_CBS_MASTER_CHAIN_MORNING FINISHED

    01-06-2010 05:01:02 01-06-2010 05:02:29 H 104127.00 L1_CBS_RISK_CHAIN L0_CBS_MASTER_CHAIN_MORNING FINISHED

    01-06-2010 05:01:02 01-06-2010 05:01:07 H 104129.00 L1_CBS_TELE_CHAIN L0_CBS_MASTER_CHAIN_MORNING FINISHED

  • I don't know if there is a best practice on this but there are two areas you could look into.

    You could use a script task and strip out the column header information from the file before passing it through your data flow. I am not strong in scripting, and I would be hard pressed to come up with a working sample quickly.

    That being said I would try to attack this with a conditional split. I think all the tools you need are in there to send only rows with real data downstream.

    Good luck.

  • hey dbowlin,

    Thanks muchos for your reply. I got a good lead from JAYDAVISUS that gave me enough info to create the solution. So in the spirit of sharing back here's how I created the solution:

    This actually skips all the repeating header sections through out the file and writes the rows of data to the output columns.

    1. set up a flat file connector with a ragged right fixed width.

    2. eliminated the leading headers to the first line of data.

    3. added a script component that examines the first character on the first predefined column for a CHR(12) page break symbol

    4. set a looping 5 line count to skip that header section

    5. Wrote the lines of good data to the output column buffer.

    6. the key here is for the output columns in the script component you have to set the "synchronousInput" property to NONE. This allows you to to map the columns from the input source to the output columns programatically AFTER you've weeded out the unwanted lines (repeating header sections within the file).

    7 After that life was bliss. And it too FAR less lines of code than I originally conceived.

    Thanks for the direction! The best problem solving comes not from knowing all the answers. But knowing where to look for the answer!

    Code below:

    ------------------------------------------

    ' Microsoft SQL Server Integration Services user script component

    ' This is your new script component in Microsoft Visual Basic .NET

    ' ScriptMain is the entrypoint class for script components

    Imports System

    Imports System.Data

    Imports System.IO

    Imports System.Math

    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

    Public Class ScriptMain

    Inherits UserComponent

    Public Overrides Sub inputFileSource_ProcessInputRow(ByVal Row As inputFileSourceBuffer)

    Dim WantedValue As Integer

    Dim i As Integer

    Dim x As Integer

    WantedValue = Int(Asc(Row.JobStarted.Substring(1, 1)))

    If WantedValue = 12 Then

    For i = 1 To 5

    Row.NextRow()

    Next

    Else

    With Me.outputDelPgBrksHdrsBuffer

    .AddRow()

    .START = CDate(Row.JobStarted)

    .STOP = CDate(Row.JobFinished)

    .LOC = RTrim(Row.Loc)

    .JOBID = CInt(Row.JobID)

    .MODULE = RTrim(Row.Module)

    .CHAIN = RTrim(Row.Chain)

    .FILENAME = RTrim(Row.AWImportFileName)

    .STATUS = Row.JobStatus

    End With

    Row.NextRow()

    End If

    End Sub

    End Class

  • I am glad you were able to work it out. Congratulations. I also spend a fair amount of time over on a SQL Server BI site, and they just had an article about this very subject. You may want to give it a look to see what other tidbits you can pick up.

    http://www.bidn.com/articles/integration-services/112/handling-headers-in-flat-files-with-ssis

  • Thanks.

    I'll check that out.

    Although I have a change to the above code. Same steps but the VB portion changed. I started debugging and ran into so little data conversion issues. this piece of code ran to completion.

    ' Microsoft SQL Server Integration Services user script component

    ' This is your new script component in Microsoft Visual Basic .NET

    ' ScriptMain is the entrypoint class for script components

    Imports System

    Imports System.Data

    Imports System.IO

    Imports System.Math

    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

    Public Class ScriptMain

    Inherits UserComponent

    Public Overrides Sub inputFileSource_ProcessInputRow(ByVal Row As inputFileSourceBuffer)

    Dim WantedValue As Integer

    Dim i As Integer

    Dim x As Integer

    Dim JobStat As Boolean

    Dim RowCount As Integer

    If Row.JobStatus.Contains("Page") = True Then

    For i = 1 To 7

    Row.NextRow()

    Next

    ElseIf Row.JobStatus.Contains("--------") = True Then

    Row.NextRow()

    ElseIf Row.JobStarted.Contains("Job Started") = True Then

    Row.NextRow()

    Else

    With Me.outputDelPgBrksHdrsBuffer

    .AddRow()

    'handle nulls for the start date

    If Trim(Row.JobStarted) = "" Then

    .START = CDate("12/30/1899 12:00:00 AM")

    Else

    .START = CDate(Trim(Row.JobStarted))

    End If

    'handle nulls for the Finish date

    If Trim(Row.JobFinished) = "" Then

    .STOP = CDate("12/30/1899 12:00:00 AM")

    Else

    .STOP = CDate(Trim(Row.JobFinished))

    End If

    .LOC = Trim(Row.Loc)

    'handle nulls for the JobID

    If Trim(Row.JobID) = "" Then

    .JOBID = CInt("-9")

    Else

    .JOBID = CInt(Trim(Row.JobID))

    End If

    .MODULE = RTrim(Row.Module)

    .CHAIN = RTrim(Row.Chain)

    .FILENAME = RTrim(Row.AWImportFileName)

    .STATUS = RTrim(Row.JobStatus)

    End With

    End If

    End Sub

    End Class

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply