October 5, 2015 at 8:34 pm
Comments posted to this topic are about the item SSIS Design Pattern - Staging Fixed Width Flat Files
Sam Vanga
http://SamuelVanga.com
October 6, 2015 at 12:49 am
In almost all cases, parsing in the data flow pefroms (sic) better than parsing at the source, if not at the same performance level.
Sam, I'm sorry, this sentence doesn't make any sense to me. If something performs better, say so. But adding the qualifier "if not at the same performance level" negates any meaning.
I'm not saying your methodology is wrong - in fact, in terms of human effort, it's vastly superior IMO, but I can't work out whether you've added the qualifier because the total processing time is increased (is it?) or for some other reason.
October 6, 2015 at 6:06 am
Moreover, if the value is unknown or unavailable you’ll see a blank string, such as ‘ ‘. It is often better to convert these blanks to NULL. You can, of course, load the blank value into the database column without any transformations, but again, my preference is to convert it to NULL.
Just to give a different view. I would convert it to '' if the value was to indicate that there was no value. I would convert it to NULL if the value was to indicate the value was not known or that it's in some ways not certain. If that's not able to be determined, it would be a judgement call.
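For what it's worth, if the conversion were done at the destination rather than in the data flow, a minimal T-SQL sketch might look like this (the table and column names here are hypothetical):

UPDATE dbo.StgCustomer
SET MiddleName = NULLIF(LTRIM(RTRIM(MiddleName)), '')  -- blank or all-space becomes NULL
WHERE MiddleName IS NOT NULL;

NULLIF returns NULL when the trimmed value equals the empty string, which covers both '' and all-blank values in one pass.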
October 6, 2015 at 6:28 am
Thanks for the article. We do a lot of these sorts of imports and run into many of the same issues, such as dealing with blanks. But we have already determined what to do with them when we created our destination tables by setting whether or not to allow nulls.
October 6, 2015 at 8:36 am
My suggestion: using two Derived Column transformations and one Data Conversion between the source and destination adds a performance cost. When you know the length of each column, use Fixed Width in the Flat File Connection Manager, add the columns in the Advanced tab in the order they appear in the file, set the data type and length for each, and map them directly to the destination table.
This reduces the number of transformations used and gives high performance when loading billions of records from a source file.
October 6, 2015 at 9:22 am
This is a good start; however, I do not see any error handling. Do you just let the package fail? If so, how would you troubleshoot a failure?
Gerald Britton, Pluralsight courses
October 6, 2015 at 4:30 pm
We do a lot of file imports and exports. One thing that really helps is to add a last-update column to your staging table. We've had many cases where the package did not send an email error notice when it failed. By looking at the staging table's update column, we can quickly see whether the data was processed or not.
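A minimal sketch of that approach, assuming a staging table named dbo.StgCustomer (the table and column names are hypothetical):

ALTER TABLE dbo.StgCustomer
    ADD LoadDate DATETIME NOT NULL
        CONSTRAINT DF_StgCustomer_LoadDate DEFAULT (GETDATE());

-- After a scheduled run, check when rows last arrived and how many:
SELECT MAX(LoadDate) AS LastLoad, COUNT(*) AS RowsLoaded
FROM dbo.StgCustomer;

Because the default fills the column automatically, the package itself doesn't need to change.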
October 6, 2015 at 6:15 pm
t.pinder (10/6/2015)
In almost all cases, parsing in the data flow pefroms (sic) better than parsing at the source, if not at the same performance level.
Sam, I'm sorry, this sentence doesn't make any sense to me. If something performs better, say so. But adding the qualifier "if not at the same performance level" negates any meaning.
I'm not saying your methodology is wrong - in fact, in terms of human effort, it's vastly superior IMO, but I can't work out whether you've added the qualifier because the total processing time is increased (is it?) or for some other reason.
t.pinder - Thanks for this comment, I really like what you said. Definitely something for me to learn as I write more!
Sam Vanga
http://SamuelVanga.com
October 6, 2015 at 6:17 pm
Honny (10/6/2015)
My suggestion: using two Derived Column transformations and one Data Conversion between the source and destination adds a performance cost. When you know the length of each column, use Fixed Width in the Flat File Connection Manager, add the columns in the Advanced tab in the order they appear in the file, set the data type and length for each, and map them directly to the destination table. This reduces the number of transformations used and gives high performance when loading billions of records from a source file.
Both Derived Column and Data Conversion are synchronous, non-blocking transformations, meaning they shouldn't be a drag on performance. Having said that, I do plan on doing a performance test and blogging about it. It could take more than a couple of weeks though.
Sam Vanga
http://SamuelVanga.com
October 6, 2015 at 6:24 pm
g.britton (10/6/2015)
This is a good start, however I do not see any error handling. Do you just let the package fail? If so, how would you troubleshoot a failure?
As I stated in the article, error handling is out of scope - I'm not suggesting that it's not required. I wanted to focus on the actual pattern.
Sam Vanga
http://SamuelVanga.com
October 6, 2015 at 8:40 pm
This is a very good point. To me it is unforgivable to let a process abort. Better to have a table for error rows and/or groups of rows, then send notifications to the data owner and even return the invalid data that was not processed. You must never create processes that require IT personnel to intervene. Bad data must be rejected and returned to its creator. Depending on the nature of the data, you may need to reject a whole file batch or complete documents or whatever the application requires, but be sure YOU don't have to handle problems.
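A minimal sketch of such an error-row table (the names are hypothetical, and the columns would depend on the application):

CREATE TABLE dbo.StgCustomer_Errors
(
    ErrorId      INT IDENTITY(1,1) PRIMARY KEY,
    SourceFile   VARCHAR(260)  NOT NULL,  -- file the row came from
    RawRow       VARCHAR(8000) NOT NULL,  -- the unparsed fixed-width line
    ErrorMessage VARCHAR(4000) NULL,      -- why the row was rejected
    LoggedAt     DATETIME      NOT NULL DEFAULT (GETDATE())
);

The error output of the parsing transformation can be redirected here, and a follow-up task can notify the data owner with a summary of the rejected rows.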
Rick
Disaster Recovery = Backup ( Backup ( Your Backup ) )
October 7, 2015 at 11:55 am
if a field is a 6 character number, you’d need 8 bytes to store this numeric field as a string using VARCHAR(6) data type (one byte for each character plus 2 bytes for overhead, according to Books Online). Instead, if you convert it to an INT data type, you’d need only 4 bytes
This tends to be something to approach with caution. If the value isn't going to have math done on it, then it probably isn't a number. A US ZIP code is technically a numeric value, but it has leading zeros, so converting it to a numeric data type will make the data incorrect.
I'm not saying to never convert numeric-type strings to numeric data types. I'm saying you need to understand your data and plan your fields correctly.
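A quick T-SQL illustration of how the leading zero is lost (the ZIP code here is just a sample value):

DECLARE @zip VARCHAR(5) = '01234';

SELECT CAST(@zip AS INT) AS AsInt,                          -- 1234: the leading zero is gone
       CAST(CAST(@zip AS INT) AS VARCHAR(5)) AS RoundTrip;  -- '1234', not '01234'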
October 7, 2015 at 12:14 pm
This tends to be something to approach with caution. If the value isn't going to have math done on it, then it probably isn't a number. A US ZIP code is technically a numeric value, but it has leading zeros, so converting it to a numeric data type will make the data incorrect.
I'm not saying to never convert numeric-type strings to numeric data types. I'm saying you need to understand your data and plan your fields correctly.
Agree completely. Some consultants from outside the US wanted us to use an int field for SSNs. They didn't know that many SSNs begin with zero. The math rule is a good one, especially for numbers that originate outside the local enterprise. I have used numeric types for internal numbers such as SKUs, where I'm assured they will always start with a non-zero digit.
October 7, 2015 at 1:15 pm
I'm guilty of using Int for SSN a long time ago (and I'm not from the US). You can imagine that was short-lived!
Sam Vanga
http://SamuelVanga.com
October 15, 2015 at 12:26 pm
I read this a few days ago and it looked very similar to my workflow, but the one thing that popped out at me was how you read your file in as a single column.
"Really?!" I thought, "That sounds kind of fiddly to me."
This morning, I get an email from the business line: "If our vendor adds a column to the end of that production file, will it break your import?"
Yes, it will. Wouldn't it be great if there were some way to avoid that .... oh, yeah ....