April 2, 2007 at 5:25 pm
Hi Guys,
I have a flat file that I want to import. The problem is that the destination table has 32 columns, but the flat file sometimes has 32 and sometimes has 29, within the same file. Previously, in SQL 2000 DTS, the data pump would handle this and pad the additional columns with NULL values if they weren't present. In SSIS the data flow task keeps looking for the next delimiter (which is pipe in this case), even if it finds the end-of-row delimiter first.
To clarify: if the flat file row has 29 columns and SSIS is expecting 32, it looks for the next delimiter, which occurs on the NEXT ROW, and includes that next row's information in column 30; then it takes the next column and puts it into col 31; and finally it takes the rest of that row (pipes and all) and puts it into col 32.
Here is a 4-column example. The flat file looks like this:
Col1a|Col2a{LF}
Col1b|Col2b|Col3b|Col4b{LF}
Col1c|Col2c|Col3c|Col4c{LF}
And when imported into a 4-column table, we get this:
Col1 Col2 Col3 Col4
Col1a Col2a {LF}Col1b Col2b|Col3b|Col4b{LF}
Col1c Col2c Col3c Col4c{LF}
But what we want is
Col1 Col2 Col3 Col4
Col1a Col2a NULL NULL
Col1b Col2b Col3b Col4b
Col1c Col2c Col3c Col4c
Is there a task that will handle these exceptions? Or am I going to have to write some VB?
Obviously we have pushed back to try and get the additional delimiters added to rows with fewer than 32 columns in the text file, but this is not possible.
I guess the options are to cleanse the input file before it gets into the data flow, or to correct it on the fly using transformation scripts...
Any suggestions on what I should do here?
April 9, 2007 at 8:20 pm
Ok, since no one seems to know what to do with these file imports, I thought I should post my findings.
There were a couple of options I could have gone for. The simple one would be to use the "ragged right" file format. The problem with this is that it only handles one additional column past the number of defined columns. We could then take this "extra" column and split it using a conditional split. The conditional split would have to be able to handle an arbitrary number of extra columns, which is where this method comes unstuck.
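For anyone curious, here is a rough, untested sketch of what that split might look like in a data flow Script Component if the overflow only ever contained a couple of known extra columns (the column names Extra, Col3 and Col4 are made up for illustration, not from my package):
Public Class ScriptMain
    Inherits UserComponent
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        'Split whatever the ragged-right format left in the overflow column
        Dim parts() As String = New String() {}
        If Not Row.Extra_IsNull Then
            parts = Row.Extra.Split("|"c)
        End If
        'Hand each piece to its own output column, padding with NULL where the piece is missing
        If parts.Length > 0 AndAlso parts(0).Length > 0 Then
            Row.Col3 = parts(0)
        Else
            Row.Col3_IsNull = True
        End If
        If parts.Length > 1 Then
            Row.Col4 = parts(1)
        Else
            Row.Col4_IsNull = True
        End If
        'Every additional destination column needs another block like these,
        'which is why this approach does not scale to an arbitrary number of extra columns
    End Sub
End Class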
Instead I opted for a script at the control flow level. The script parses the input file, pads the rows that are short of columns with NULL columns, and truncates the rows that have too many columns. (It also logs these to an external file for checking.) It knows how many columns there should be by counting the number of columns in the SQL destination table.
Once the input file has been cleansed, it is passed to the data flow section, where it will always have the same number of columns.
Script Below:
Option Strict Off
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO
Public Class ScriptMain
'L.Mason 2007
'Working Directory (trailing backslash required - the file names below are appended directly)
Dim Dir As String = "\\myUNCinputDirectory\"
'Name of Input file
Dim InputFile As String = "filetoparse.txt"
'Name of repaired Input file
Dim OutputFile As String = "workingtempfile.txt"
'Name of SQL table the data is going in to
Dim DestinationTable As String = "Destination Table"
'Name of Error file for logging erroneous lines from the input file. ErrorsYYYYMMDD.txt
Dim ErrorFile As String = "Errors" & Year(Now) & Right("0" & Month(Now), 2) & Right("0" & Day(Now), 2) & ".txt"
'Define delimiter String
Dim strDelim As String = "|"
Public Sub Main()
'Declare and Set Database Objects
Dim rstResults = CreateObject("ADODB.Recordset")
Dim conn = CreateObject("ADODB.Connection")
conn.Open("Driver={SQL Server};" & _
"Server={SQLServer};" & _
"Database={SQLDatabase};" & _
"UID=;" & _
"PWD=;")
'Set up SQL Query to populate resultset with column data using sproc SP_Columns
Dim strSQL As String = "exec sp_columns @table_name='" & DestinationTable & "'"
'Declare file Variables and Objects
Dim arr() As String 'Array for storage of the parsed tokens from each line
Dim ErrorOutput As String = Dir & ErrorFile
Dim Input As String = Dir & InputFile
Dim Output As String = Dir & OutputFile
Dim fFile1 As New FileInfo(Input)
Dim fFile2 As New FileInfo(Output)
Dim fErrorFile As New FileInfo(ErrorOutput)
Dim oWriteError As New System.IO.StreamWriter(ErrorOutput)
Dim oWrite As New System.IO.StreamWriter(Output)
Dim oRead As System.IO.StreamReader
'Declare general variables
Dim ColAmt As Integer 'Number of columns in destination table
Dim i As Integer 'loop counter i
Dim strErrorFileLine As String
Dim strInputString As String
Dim OutputLine As String
Dim arrLength As Integer
Dim intLineNumber As Integer = 0 'Line number counter
Dim ErrorCounterPadded As Integer = 0
Dim ErrorCounterTruncated As Integer = 0
'****************'
'Begin Processing'
'****************'
'populate results set with data from strSQL query
rstResults.open(strSQL, conn)
'this will get the number of columns in the desired destination table
ColAmt = 0
While Not rstResults.Eof
ColAmt = ColAmt + 1
rstResults.movenext()
End While
conn.close()
conn = Nothing
rstResults = Nothing
'Load file into oRead.
oRead = File.OpenText(Input)
While oRead.Peek() > -1
'increment line number counter (is there a .net type for this?)
intLineNumber = intLineNumber + 1
'get next line from file
strInputString = oRead.ReadLine
'Parse line according to delimiter
arr = strInputString.Split(strDelim.ToCharArray())
'Check if the current line has less columns than expected.
If arr.Length < ColAmt Then
arrLength = arr.Length
'make the short array = to the expected length
ReDim Preserve arr(ColAmt - 1)
'Send a line to error file
strErrorFileLine = "Line " & intLineNumber & " has less Cols than expected:" & strDelim & strInputString
oWriteError.WriteLine(strErrorFileLine)
'increment the error counter variable
ErrorCounterPadded = ErrorCounterPadded + 1
'pad added cols with NULLs
For i = (arrLength) To (ColAmt - 1)
arr(i) = Nothing
Next
End If
'check for input line with MORE cols than expected
If arr.Length > ColAmt Then
'write the line with too many cols to error file
strErrorFileLine = "Line " & intLineNumber & " has more Cols than expected:" & strDelim & strInputString
oWriteError.WriteLine(strErrorFileLine)
'increment the error counter variable
ErrorCounterTruncated = ErrorCounterTruncated + 1
'then remove the extra cols and insert as per normal
ReDim Preserve arr(ColAmt - 1)
End If
'build output line from array
For i = 0 To ColAmt - 2
OutputLine = OutputLine & CStr(arr(i)) & strDelim
Next
OutputLine = OutputLine & CStr(arr(ColAmt - 1))
'write line to output file
oWrite.WriteLine(OutputLine)
'clear line variable for next pass
OutputLine = Nothing
End While
'write out summary lines to error file
If ErrorCounterTruncated > 0 Then
strErrorFileLine = "Total number of TRUNCATED input rows: " & CStr(ErrorCounterTruncated) & " These will need to be investiaged. The data will be in, but the extra columns will be truncated"
oWriteError.WriteLine(strErrorFileLine)
End If
If ErrorCounterPadded > 0 Then
strErrorFileLine = "Total number of PADDED input rows: " & CStr(ErrorCounterPadded) & " These should be OK, the missing cols were padded with NULLs"
oWriteError.WriteLine(strErrorFileLine)
End If
'close file system read/write objects
oRead.Close()
oWrite.Close()
oWriteError.Close()
'delete the original input file, then rename the working file to the original filename
fFile1.Delete()
fFile2.MoveTo(Input)
'if there were no errors, we can delete the error file
If ErrorCounterTruncated + ErrorCounterPadded = 0 Then
fErrorFile.Delete()
End If
'done!
Dts.TaskResult = Dts.Results.Success
End Sub
End Class
An additional question I now have is: can the number of columns in the destination table be found in an easier way? At the moment I use an ADODB object to pass a query to the SQL DB, which executes the sp_columns stored proc and stores the result in a recordset. The recordset is then looped until EOF, incrementing a counter each time (the While Not rstResults.Eof loop in the code). This counter gives me the number of columns, since the number of rows returned by sp_columns equals the number of columns in the queried table. It seems like a very crude way of doing this, and I would have thought there must be a better way. I tried using a SQL task at the control flow level and found that I could get the column count out OK, but I couldn't pass the variable on to the script to make use of it.
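One possibility I have been considering (an untested sketch - the connection string and the use of INFORMATION_SCHEMA are my own assumptions, not something from the package above) would be to swap the late-bound ADODB recordset loop for a single ADO.NET scalar query:
Imports System.Data.SqlClient
'...
Dim ColAmt As Integer
'Placeholder connection string - replace server and database with the real ones
Using conn As New SqlConnection("Server=SQLServer;Database=SQLDatabase;Integrated Security=SSPI;")
    Dim cmd As New SqlCommand( _
        "SELECT COUNT(*) FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @tbl", conn)
    cmd.Parameters.AddWithValue("@tbl", DestinationTable)
    conn.Open()
    'INFORMATION_SCHEMA.COLUMNS returns one row per column in the table
    ColAmt = CInt(cmd.ExecuteScalar())
End Using
That would remove the ADODB objects and the counting loop entirely, but I haven't tried it from inside the Script Task myself.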