Introduction
I was recently working on a data pipeline project where I had to check data quality for the source data stream. Azure Data Factory provides Assert transformation to perform data quality and data validation checks. You can add expressions that validate to meet data quality. You can also check for duplicate values.
Pipeline development
I used the below sample sales data which is available in the SQL table. This table has multiple columns including customer address details like City, State, and Country. Now our job is to check whether the country field has values 'Japan':
Let's go to the Azure Data Factory author page, create a new data flow and add a source then click on New to create a new dataset:
Create a new dataset for the Azure SQL database:
Add the linked service and select the source SQL table. In import schema option select 'From connection/store':
Now, add an assert transformation, click on Add to create a new assert configuration:
- Assert type:
- Expect true: The result must be a boolean value
- Expect unique: Use this option to identify duplicates in the data, it defines uniqueness in a column or expression.
- Expect exists: When you have two source streams, you can select this option to check whether the row exists in both sources.
- Assert Id: Provide a string value to name this assertion. You will be able to use the identifier later downstream in your data flow using hasError() or to output the assertion failure code.
- Assert Description: You can add a description for this assertion here.
- Filter: Optionally add a filter so that only a subset of the rows will be used in the assertion.
- Expression: Provide an expression to evaluate this assertion. You can have multiple assertions for each assert transformation.
- Ignore nulls: The assert transformation includes NULLs while evaluating the expression values. You can choose to ignore NULLs with this property.
I used the below configuration to check for records having the country as 'Japan':
- Assert type: Expect true
- Assert Id: assertExcludeJapan
- Assert description: Excluding Japan Country
- Filter: Left Blank
- Expression: COUNTRY!='Japan'
- Ignore nulls: Unchecked
We can use the data preview tab to check the data, the error tab shows the rows which evaluate to true for the specified assert rules. In the below screen, it shows 4 error rows from the assert configuration that have a country value as 'Japan':
I added a derived column transformation, then created two columns with IsError() and HasError(), both of the function returns the boolean value True or False.
The preview tab of the derived column shows the mark which evaluates the assert expression:
Now added a sink to load the data in a SQL Server table:
Let's run the pipeline:
The below result set shows the highlighted rows with the Country value as Japan.IsError and HasError both have values of 1:
Conclusion
In this article, we discussed assert transformation. It is a very useful transformation to check for data quality, data validation, and row uniqueness in the data pipeline.