Importing variable-length text files into SQL

  • Hi
    I regularly import a set of Windows text files into SQL. The files are fixed length, with the data in fixed-width columns. In general I can import them using a series of SSIS data flow text-file-to-SQL stages.
    Unfortunately, one of the files can come in two different set lengths: 331 or 351 characters. All the lines in a given file will be the same length. On the short-length file, the SSIS import breaks by pulling the start of the next line into the end of the first.
    I wrote a VB script task that reads each line, checks the line length and pads short lines to the full length; however, I've obviously made some error, as it only writes every other line.
    I was wondering if anyone out there can help me?
    Thanks
    Steve

    VB Code (extracted from SSIS Edit Script)
    Dim FILE_NAME As String = "C:\For adding to Collection\NAT00080.txt"
    Dim TextLine As String
    Dim putLine As String

    Try
        Dim sr As StreamReader = New StreamReader(FILE_NAME)
        Dim Writer As System.IO.StreamWriter
        Writer = IO.File.CreateText("C:\For adding to Collection\NAT00080.txt")

        Do While sr.Peek() >= 0
            TextLine = sr.ReadLine()
            ' Pinched and inserted pad string code
            If Len(TextLine) = 331 Then
                ' Have to check to make sure that the len of the string is less than 351, otherwise you will get an error.
                putLine = TextLine & Space(20)
            Else
                putLine = TextLine
            End If
            Writer.WriteLine(putLine)
        Loop

        Writer.Close()
        sr.Close()
    Catch e As Exception
        Console.WriteLine("The process failed: {0}", e.ToString())
    End Try


  • Try changing the format of the input file from fixed width to ragged right; you shouldn't need a pre-process step.

  • Thanks for the quick response, ZZartin.
    It's already set as ragged right. I've tried most variations of format and encoding; it doesn't seem to make a difference.
    Steve

  • steve.alston - Wednesday, November 8, 2017 3:57 PM

    Can you post the record layout for the two files?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • steve.alston - Wednesday, November 8, 2017 3:57 PM


    There is a possible alternative approach.
    Build a task in your SSIS Control Flow (inside a FOREACH container, assuming you're using that to pick up the various files) which interrogates the 'current' file to determine the number of columns and sets a package variable accordingly ('Cols331' or 'Cols351' or similar).
    Add another dataflow task – so now you have one for 331 cols and one for 351 cols.
    Change the precedence constraints going into the dataflow tasks to be 'Expression and Constraint', with the expression being along the lines of
    varname == "Cols331" or
    varname == "Cols351"
    Thus you are able to control the execution path depending on the type of file being processed.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Hi guys, thanks for the support and suggestions.
    The file spec is:

    Fields – Client (NAT00080) file

    Field                                          Position  Length  Type
    Client identifier                                     1      10    A
    Name for encryption                                  11      60    A
    Highest school level completed identifier            71       2    A
    Year highest school level completed                  73       4    A
    Sex                                                  77       1    A
    Date of birth                                        78       8    A
    Postcode                                             86       4    A
    Indigenous status identifier                         90       1    A
    Language identifier                                  91       4    A
    Labour force status identifier                       95       2    A
    Country identifier                                   97       4    A
    Disability flag                                     101       1    A
    Prior educational achievement flag                  102       1    A
    At school flag                                      103       1    A
    Proficiency in spoken English identifier            104       1    A
    Address location – suburb, locality or town         105      50    A
    Unique student identifier                           155      10    A
    State identifier                                    165       2    A
    Address building/property name                      167      50    A
    Address flat/unit details                           217      30    A
    Address street number                               247      15    A
    Address street name                                 262      70    A
    Record length for national data collection for training organisations: 331
    Statistical area level 1 identifier                 332      11    A
    Statistical area level 2 identifier                 343       9    A
    Record length for national data collection for state and territory training authorities: 351
    Carriage return/line feed (ASCII 13/10): 2

    It's a simple text file and, if data is missing, spaces fill the appropriate columns.
    Most files are the full length (351) and the last two fields are always empty, so adding 20 spaces to the end of the short files for import is fine. Unfortunately, I can't get the suppliers to send the full-length files, which would be the nice solution!

    I'll give Phil Parkin's idea a go too. I don't use a foreach loop (yet) but I can see that might work in my flow.

    The full process flow is: unzip the 9 text files, import them into SQL, fix up some of the data, identify and count which of the fields are missing, identify the clients where data is missing so it can be rectified, append this data set to those previously imported (having first made a copy of those tables), then export and zip the current modified set, and export and zip the "total so far" fileset into the correct directories.

    Eventually I'll get around to reading in all the source filesets in a foreach loop, but the process has gone from a manual 45 minutes per set (if all goes well) to around 45 seconds each, and with only around a hundred to do every 3 months I'm pleased with what I've almost done so far.

    My current workaround for this file is to pass the file through Access (which is too dumb to complain about short line lengths), breakpoint, and check it's worked OK. If it hasn't, I use the Access internal import, which works every time, and continue.
    It's just annoying that this step stops the process midway!

    Sorry for the rant.
    Steve

  • Another alternative is to import the file into a staging table with one column, then SUBSTRING the column to get all of the fields above. A sketch of this is below.
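
    In case it's useful, a minimal sketch of that approach. The table and column names are made up for illustration; the positions and lengths come from the spec above, and only a few of the fields are shown.

    -- One wide character column to hold each raw line of the file
    CREATE TABLE dbo.NAT00080_Stage (RawLine varchar(351) NOT NULL);

    -- Load each line into RawLine (e.g. an SSIS flat file source defined
    -- as a single column, or BULK INSERT), then slice it by position:
    SELECT
        SUBSTRING(RawLine,   1, 10) AS ClientIdentifier,
        SUBSTRING(RawLine,  11, 60) AS NameForEncryption,
        SUBSTRING(RawLine,  71,  2) AS HighestSchoolLevelId,
        -- remaining fields follow the same pattern from the spec above
        SUBSTRING(RawLine, 262, 70) AS AddressStreetName,
        -- On 331-character lines these start past the end of the string,
        -- so SUBSTRING just returns an empty string instead of failing:
        SUBSTRING(RawLine, 332, 11) AS StatisticalAreaLevel1Id,
        SUBSTRING(RawLine, 343,  9) AS StatisticalAreaLevel2Id
    FROM dbo.NAT00080_Stage;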

  • Thanks to everyone who helped me.

    brad.mason's answer works great. I don't know why I didn't think of it earlier. SQL is happy to ignore the SUBSTRING selects for the last two fields when the files are short (see the snippet below). I assumed that, since importing caused issues, it wouldn't handle processing data that wasn't there.
    With files this size, any extra processing time is almost zero.
    Sometimes you can't see the wood for the trees!
    Still, along the way I did learn something about VB, even if it didn't work properly!
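
    A quick way to see why the short rows don't error (the column alias is just for illustration): SUBSTRING returns an empty string when the start position is past the end of the value.

    -- A short literal stands in for a 331-character row; asking for
    -- characters 332 onward returns '' rather than raising an error:
    SELECT SUBSTRING('SHORT LINE', 332, 11) AS StatisticalAreaLevel1Id;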

    Thanks again guys
    Steve

  • steve.alston - Thursday, November 9, 2017 5:51 PM


    The substring trick works fine for this, but if you ever need that extra burst of speed, it would be worthwhile to make 2 BCP format files (1 for each record layout) to do the imports. The system could be set up to read just one row from a given file, determine its length, and then decide from that which BCP format file to use to import the full file using BULK INSERT. If the target table were properly set up with the correct datatypes, it would even do row/column validation auto-magically instead of having to do something during or after the substrings.
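
    A minimal sketch of that idea, assuming a dbo.NAT00080 target table and two pre-built format files; the format-file names are hypothetical, and the file path (taken from earlier in the thread) must be visible to the SQL Server service account:

    -- Read just the first line into a one-column temp table to measure it
    CREATE TABLE #FirstRow (RawLine varchar(8000));
    BULK INSERT #FirstRow
    FROM 'C:\For adding to Collection\NAT00080.txt'
    WITH (LASTROW = 1, ROWTERMINATOR = '\r\n');

    -- DATALENGTH (unlike LEN) counts trailing spaces, so a padded
    -- fixed-width record measures its true 331 or 351 characters
    DECLARE @len int = (SELECT MAX(DATALENGTH(RawLine)) FROM #FirstRow);

    IF @len = 331
    BEGIN
        BULK INSERT dbo.NAT00080
        FROM 'C:\For adding to Collection\NAT00080.txt'
        WITH (FORMATFILE = 'C:\Formats\NAT00080_331.fmt');
    END
    ELSE
    BEGIN
        BULK INSERT dbo.NAT00080
        FROM 'C:\For adding to Collection\NAT00080.txt'
        WITH (FORMATFILE = 'C:\Formats\NAT00080_351.fmt');
    END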

    --Jeff Moden


