November 8, 2017 at 3:58 pm
Hi
I regularly import a set of Windows text files into SQL. These files are fixed length and the data is column delimited. In general I can import them using a series of SSIS data flow text-file-to-SQL stages.
Unfortunately, one of the files can come in two different set record lengths: 331 or 351 characters. All the lines in a given file will be the same length. On the short-length file, the SSIS import mangles the result by bringing the start of the next line into the end of the first.
I wrote a VB script task that reads each line, checks the line length and pads short lines to the full length, but I've obviously made some error, as it only writes every other line.
I was wondering if anyone out there can help me?
Thanks
Steve
VB Code (extracted from SSIS Edit Script)
' Requires Imports System.IO at the top of the script.
Dim FILE_NAME As String = "C:\For adding to Collection\NAT00080.txt"
Dim TEMP_NAME As String = FILE_NAME & ".tmp"
Dim TextLine As String
Try
    ' The original version pointed File.CreateText() at the same file the
    ' StreamReader was still reading, which is what garbled the output.
    ' Write to a temp file instead, then swap it in afterwards.
    Using sr As New StreamReader(FILE_NAME)
        Using Writer As StreamWriter = IO.File.CreateText(TEMP_NAME)
            Do While sr.Peek() >= 0
                TextLine = sr.ReadLine()
                ' PadRight only appends spaces when the line is shorter than
                ' 351, so the 331 and 351 cases need no separate If test.
                Writer.WriteLine(TextLine.PadRight(351))
            Loop
        End Using
    End Using
    ' Replace the original file with the padded copy.
    IO.File.Delete(FILE_NAME)
    IO.File.Move(TEMP_NAME, FILE_NAME)
Catch e As Exception
    Console.WriteLine("The process failed: {0}", e.ToString())
End Try
November 8, 2017 at 4:10 pm
Try changing the format of the input file from fixed width to ragged right; you shouldn't need a pre-process step.
November 8, 2017 at 4:18 pm
Thanks for the quick response, ZZartin.
It's already set as ragged right. I've tried most variations of format and encoding! It doesn't seem to make a difference.
Steve
November 8, 2017 at 8:36 pm
steve.alston - Wednesday, November 8, 2017 3:57 PM
Can you post the record layout for the two files?
--Jeff Moden
Change is inevitable... Change for the better is not.
November 9, 2017 at 6:10 am
steve.alston - Wednesday, November 8, 2017 3:57 PM
There is a possible alternative approach.
Build a task in your SSIS Control Flow (inside a FOREACH container, assuming you're using that to pick up the various files) which interrogates the 'current' file to determine the number of columns and set a package variable accordingly ('Cols331' or 'Cols351' or similar).
Add another dataflow task – so now you have one for 331 cols and one for 351 cols.
Change the precedence constraints going into the dataflow tasks to be 'Expression and Constraint', with the expressions being along the lines of
varname == "Cols331"
varname == "Cols351"
(one expression on each constraint).
Thus you are able to control the execution path depending on the type of file being processed.
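As a rough sketch of the 'interrogate the current file' step, assuming you first bulk-load the raw file into a one-column holding table (dbo.NAT00080_Stage and the FileType alias are invented names, not something from your package), an Execute SQL Task could run this and map the single-row result to the package variable used in the expressions above:

-- Classify the staged file by its record length.
-- DATALENGTH rather than LEN, because LEN ignores trailing spaces
-- and these fixed-width lines are space-padded.
SELECT CASE MAX(DATALENGTH(RawLine))
           WHEN 331 THEN 'Cols331'
           ELSE 'Cols351'
       END AS FileType
FROM dbo.NAT00080_Stage;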
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
November 9, 2017 at 3:42 pm
Hi Guys thanks for the support and suggestions.
The file spec is:
Fields – Client (NAT00080) file | Position | Length | Type |
Client identifier | 1 | 10 | A |
Name for encryption | 11 | 60 | A |
Highest school level completed identifier | 71 | 2 | A |
Year highest school level completed | 73 | 4 | A |
Sex | 77 | 1 | A |
Date of birth | 78 | 8 | A |
Postcode | 86 | 4 | A |
Indigenous status identifier | 90 | 1 | A |
Language identifier | 91 | 4 | A |
Labour force status identifier | 95 | 2 | A |
Country identifier | 97 | 4 | A |
Disability flag | 101 | 1 | A |
Prior educational achievement flag | 102 | 1 | A |
At school flag | 103 | 1 | A |
Proficiency in spoken English identifier | 104 | 1 | A |
Address location – suburb, locality or town | 105 | 50 | A |
Unique student identifier | 155 | 10 | A |
State identifier | 165 | 2 | A |
Address building/property name | 167 | 50 | A |
Address flat/unit details | 217 | 30 | A |
Address street number | 247 | 15 | A |
Address street name | 262 | 70 | A |
Record length for national data collection for training organisations: 331
Statistical area level 1 identifier | 332 | 11 | A |
Statistical area level 2 identifier | 343 | 9 | A |
Record length for national data collection for state and territory training authorities: 351
Carriage return/line feed (ASCII 13/10): 2
It's a simple text file and if data is missing spaces fill the appropriate columns.
Most files are the full-length (351) and the last two fields are always empty so adding 20 spaces to the end of the short files for import is fine. Unfortunately, I can't get the suppliers to send the full-length files which would be the nice solution!
I'll give Phil Parkin's idea a go too. I don't use a foreach loop (yet) but I can see that might work in my flow.
The full process flow is: unzip the 9 text files, import them into SQL, fix up some of the data, identify and count which of the fields are missing, identify the clients where data is missing so it can be rectified, append this data set to those previously imported (having first made a copy of those tables), then export and zip the current modified set and the "total so far" fileset into the correct directories.
Eventually I'll get around to reading in all the source filesets in a foreach loop, but the process has gone from a manual 45-minutes-per-set job (if all goes well) to around 45 seconds each, and with only around a hundred to do every 3 months I'm pleased with what I've almost done so far.
My current workaround for this file is to pass the file through Access (which is too dumb to complain about short line lengths), set a breakpoint, and check it's worked OK. If it hasn't, I use the Access internal import, which works every time, and continue.
It's just annoying that this step stops the process midway!
Sorry for the rant.
Steve
November 9, 2017 at 4:23 pm
Another alternative is to import the file into a staging table with one column, then SUBSTRING that column to get all the columns above.
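Something along these lines, using the positions from the spec you posted (the staging table and column names are just examples, and only some of the fields are shown):

-- One-column staging table; each row holds a whole 331- or 351-character line.
CREATE TABLE dbo.NAT00080_Stage (RawLine varchar(400) NOT NULL);

-- Carve out the fixed-width fields with SUBSTRING. On a short (331) line,
-- SUBSTRING past the end of the string just returns an empty string, so the
-- two trailing fields don't error out.
SELECT
    SUBSTRING(RawLine,   1, 10) AS ClientIdentifier,
    SUBSTRING(RawLine,  11, 60) AS NameForEncryption,
    SUBSTRING(RawLine,  71,  2) AS HighestSchoolLevelId,
    SUBSTRING(RawLine,  73,  4) AS YearHighestSchoolCompleted,
    SUBSTRING(RawLine,  77,  1) AS Sex,
    SUBSTRING(RawLine,  78,  8) AS DateOfBirth,
    SUBSTRING(RawLine,  86,  4) AS Postcode,
    -- ...the remaining fields follow the same Position/Length pattern...
    SUBSTRING(RawLine, 332, 11) AS StatisticalAreaLevel1Id,
    SUBSTRING(RawLine, 343,  9) AS StatisticalAreaLevel2Id
FROM dbo.NAT00080_Stage;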
November 9, 2017 at 5:51 pm
Thanks to everyone who helped me.
brad.mason's answer works great. I don't know why I didn't think of it earlier. SQL is happy to ignore the SUBSTRING selects for the last two fields when the files are short; I assumed that since importing caused issues, it wouldn't handle processing data that wasn't there.
With files this size any extra processing time is almost zero.
Sometimes you can't see the wood for the trees!
Still, along the way I did learn something about VB, even if it didn't work properly!
Thanks again guys
Steve
November 9, 2017 at 6:24 pm
steve.alston - Thursday, November 9, 2017 5:51 PM
The substring trick works fine for this, but if you ever need that extra burst of speed, it would be worthwhile to make 2 BCP format files (1 for each record layout) to do the imports. The system could be set up to read just one row from a given file, determine its length, and then decide from that which BCP format file to use to import the full file using BULK INSERT. If the target table were properly set up with the correct datatypes, it would even do row/column validation auto-magically instead of having to do something during or after the substrings.
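Just to sketch what I mean (assuming the two format files already exist and dbo.NAT00080 matches the record layout; every name and path here other than the data file's is illustrative):

-- Read just the first row to measure the record length.
CREATE TABLE #OneLine (RawLine varchar(400));
BULK INSERT #OneLine
FROM 'C:\For adding to Collection\NAT00080.txt'
WITH (ROWTERMINATOR = '\r\n', LASTROW = 1);

-- Pick the matching BCP format file (DATALENGTH, because LEN ignores trailing spaces).
DECLARE @fmt varchar(260) =
    CASE (SELECT MAX(DATALENGTH(RawLine)) FROM #OneLine)
        WHEN 331 THEN 'C:\For adding to Collection\NAT00080_331.fmt'
        ELSE 'C:\For adding to Collection\NAT00080_351.fmt'
    END;

-- FORMATFILE has to be a literal in BULK INSERT, so build the statement dynamically.
DECLARE @sql nvarchar(max) = N'BULK INSERT dbo.NAT00080
    FROM ''C:\For adding to Collection\NAT00080.txt''
    WITH (FORMATFILE = ''' + @fmt + N''');';
EXEC (@sql);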
--Jeff Moden
Change is inevitable... Change for the better is not.