October 9, 2009 at 10:36 am
I am working on a data quality data flow where I want to redirect rows that fail the check.
Currently I am using multicasts: one copy of the rows goes down a path where a check is performed, and another copy feeds the next multicast. This is repeated for each check I need to make.
The idea is to have the full dataset (all rows) continue while splitting off rows that fail the check.
I want to perform all checks on all rows; I don't want a row to drop out of the flow just because it failed the first check.
See the attached screen shot.
My concern is the memory usage for all the copies I am making with the multicasts.
Does anyone have a suggestion for an approach that might perform better?
October 9, 2009 at 6:26 pm
You could use a derived column transform and add one new field for each edit (check). A script component could then write rows out to the error file based on those values.
You could also skip the derived column and put the edits in the script component itself. The script component would need to be made asynchronous so that the number of input and output rows can differ.
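For example (the column name here is just a placeholder), each new field could be a simple pass/fail flag built from an expression such as:

    FailedPriceCheck:   ISNULL(UnitPrice) || UnitPrice < 0

The script component would then only need to look at those flags to decide which rows go to the error file.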
October 9, 2009 at 6:43 pm
I thought about the derived column idea; that would actually be good, since I would only be adding one extra column for each column that needs a quality check.
On the second idea of using the script component, I'm unclear on what you mean by asynchronous with a different number of input and output rows.
I could use a script component that writes directly to the error table through System.Data.OleDb whenever a row fails a check.
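Something like this is what I have in mind; the connection string, table, and column names below are just placeholders, and in a real package the connection would be opened once in PreExecute rather than per row:

    using System.Data.OleDb;   // at the top of the script

    // inside Input0_ProcessInputRow, when a row fails a check
    using (OleDbConnection conn = new OleDbConnection(errorDbConnectionString))
    {
        conn.Open();
        OleDbCommand cmd = new OleDbCommand(
            "INSERT INTO dbo.DataQualityErrors (RowKey, ErrorMessage) VALUES (?, ?)", conn);
        cmd.Parameters.AddWithValue("RowKey", Row.RowKey);          // RowKey is a placeholder key column
        cmd.Parameters.AddWithValue("ErrorMessage", errorMessage);  // errorMessage built up by the checks
        cmd.ExecuteNonQuery();
    }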
Thanks for the suggestions.
tom
October 10, 2009 at 9:47 pm
You could use one or more conditional splits (one for each quality check) that would either allow the row to pass through to the primary output or be redirected to the "failed" pipeline. Assuming the metadata remains the same, or at least very similar, you could collect all of the failed records using a UNION ALL transform, and send them to your output file or table.
I've done something similar in the past, but I'll typically use a derived column transform to add a column that indicates which of the quality checks failed.
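As a rough illustration (the column names are made up), the conditional split condition for a "missing customer name" check might be:

    ISNULL(CustomerName) || TRIM(CustomerName) == ""

and the derived column in that failed branch could add something like:

    ErrorMessage:   "CustomerName is missing or blank"

so the UNION ALL carries both the bad row and the reason it was redirected.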
hth,
Tim
Tim Mitchell, Microsoft Data Platform MVP
Data Warehouse and ETL Consultant
TimMitchell.net | @Tim_Mitchell | Tyleris.com
ETL Best Practices
October 11, 2009 at 7:07 am
Tim,
Let's say you have 30 columns in the data flow and you are doing checks on all of them; would you have a single derived column component with 30 additional columns? My objective is to run all checks on all rows. I also need to generate an error message that will be sent back to the business units in a report so they can correct the data.
October 12, 2009 at 2:21 pm
Take a look at the attached and see if it might do what you need. Each quality check is done in turn, and the rows that fail a check are multicast back into the data stream as well as sent to the error output. Each failed branch has a corresponding derived column transform that adds a custom message indicating which of the quality checks failed.
Using this method, data that fails a check is sent to an error output for review, while a copy of those rows still moves on to the next check, so every row gets the full set of quality checks.
There is some data duplication, but only the rows that fail a check will be duplicated in the error pipeline.
Hope this helps,
Tim
Tim Mitchell, Microsoft Data Platform MVP
Data Warehouse and ETL Consultant
TimMitchell.net | @Tim_Mitchell | Tyleris.com
ETL Best Practices
October 12, 2009 at 4:46 pm
Thanks for the ideas.
Makes sense; I like the idea of only duplicating a few rows (the ones that fail).
October 19, 2009 at 4:08 pm
Attached is a screenshot of my data quality flow. I like this because no matter how many fields you need to check, it doesn't get cluttered. All the checks are done in the script component using a series of IF statements. I also like this approach because you get the power of a wide range of .NET methods to check the data. I didn't see any performance degradation either; in fact it ran faster than my previous attempt (shown in the screenshot in an earlier post). I ran 2.3 million rows through it and liked the way it performed.
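In rough terms, the check logic inside the script component looks like this; the column names, the asynchronous ErrorRows output, and the messages are only placeholders, not my actual package:

    // Inside the generated ScriptMain class (SSIS 2008, C#).
    // Input0 is synchronous with the main output; ErrorRows is a second,
    // asynchronous output with an ErrorMessage column added to it.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        string message = "";

        if (Row.CustomerName_IsNull || Row.CustomerName.Trim().Length == 0)
            message += "CustomerName is missing; ";

        if (!Row.OrderDate_IsNull && Row.OrderDate > DateTime.Today)
            message += "OrderDate is in the future; ";

        // Every row stays in the main pipeline; rows that failed any check
        // are also copied to the ErrorRows output with the accumulated message.
        if (message.Length > 0)
        {
            ErrorRowsBuffer.AddRow();
            ErrorRowsBuffer.CustomerName = Row.CustomerName_IsNull ? "" : Row.CustomerName;
            ErrorRowsBuffer.ErrorMessage = message;
        }
    }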
This is a bit of a hybrid of all the great comments I received.
Thanks Again,
Tom
October 19, 2009 at 5:59 pm
Thanks for the update. It looks very clear and easy to understand.
Kay