When cleansing data from source systems to store in a data mart or warehouse, we often need to remove or replace characters. I recently used the Script Component in SSIS to cleanse some ticket tracking data and thought I’d write a blog post that outlines the implementation.
The data from the ticket tracking system contained characters that needed to be replaced. Here’s a sample of the data directly from the source system.
- Social__bMedia
- Walk__uUp
- Phone
- From__bAgent__bEmail
- Internal__bIdentification
- E__uMail
We need to replace the __u with a dash and the __b with a space, so the cleansed data looks like this:
- Social Media
- Walk-Up
- Phone
- From Agent Email
- Internal Identification
The first step is to create an SSIS variable to store a character replacement string, the string stores each unwanted character and the associated replacement character separated by a pipe and contained within curly braces:
{__b| } {__u|-}
Here’s an example from SSIS (with some additional characters included). The variable name is ‘Replacements’.
Next, in the SSIS data flow, we need to add a Script Component Task. The Script Component Task should be configured as follows:
Set the Script Component Type to ‘Transformation’.
On the Script tab, set ReadOnlyVariables to the user variable that stores the character replacement string (User::Replacements in this example).
In the ‘Input Columns’ tab, select the column that contains the characters to be replaced. In this example, the column is called ‘SubmissionMethod’. The ‘Usage Type’ should be set to ReadWrite.
After updating the ‘Input Columns’ tab, return to the ‘Script’ tab and click the ‘Edit Script’ button. (Note: I am using C# in the upcoming sample code)
In the public class ScriptMain : UserComponent, we will add a Dictionary object to store the <Key, Value> pairs representing the original characters and related replacement characters. In this example, the Dictionary object is called ‘replacements’.
We then modify the PreExecute method to load the replacements dictionary with the <Key, Value> character pairs stored in the SSIS variable we created earlier named Replacements.
Next, update the Input0_ProcessInputRow method to load the text from the SubmissionMethod column into a StringBuilder object. We can then loop through all the characters stored in the replacement dictionary and use the StringBuilder replace method to perform any necessary character replacements.
The complete code is included below: