Data lineage is the process of understanding the lifecycle of data, from origin to destination. It tracks where data originates, how it flows through an organisation's systems and how it changes along the way.
Why is data lineage important?
The information gained from data lineage is crucial for data management, metadata management and data analytics. Lineage helps you understand data and use it effectively. Without an overview of the data flow, it becomes much more tedious for analysts to find data and to recognise its potential for the business.
What are the immediate benefits of data lineage?
- Better and more accurate analytics. By letting analytics teams and business users know where data comes from and what it means, data lineage improves their ability to find the data they need for BI and data science uses. That leads to better analytics results and makes it more likely that data analysis work will deliver meaningful information to drive business decision-making.
- Better data security and privacy overview. Organisations can use data lineage information to identify sensitive data that requires particularly strong security and assess potential risks.
- Stronger data governance. Data lineage also aids in tracking data and carrying out other key parts of the governance process.
- Improved data management. In addition to data quality improvement, data lineage improves data engineering and IT tasks, like data migration, data consolidation, and detecting potential data-related problems (missing values, skewed data distribution, ...).
What data lineage script brings
- uses native T-SQL to analyse a SQL query and collect information about its sources and data flow
- provides a simplified view of the SQL query
- helps you better document end-to-end mappings and data flows through your organisation's systems.
Data lineage script structure
The Data lineage script consists of three main parts:
- standalone function for removing unnecessary or irrelevant characters for lineage
- removing comments
- extracting the predicates and tables
Removing specific characters
This function is called by the data lineage script, but you can also use it on its own in any other scenario. It strips unneeded characters from a string before the query is analysed further.
CREATE OR ALTER FUNCTION dbo.fn_removelistChars
/*
Author: Tomaz Kastrun
Created: 06.JUN.2022
Desc: Function for removing list of unwanted characters
Usage:
    SELECT dbo.fn_removelistChars('Tol~99""''''j\e.j/e[,t&eks]t,ki')
*/
(
    @txt AS VARCHAR(max)
)
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- PATINDEX needs the % wildcards and a complete [^...] set to find the first unwanted character
    DECLARE @list VARCHAR(200) = '%[^a-zA-Z0-9+@#\/_?!:.''-]%'
    WHILE PATINDEX(@list,@txt) > 0
        SET @txt = REPLACE(@txt,SUBSTRING(@txt,PATINDEX(@list,@txt),1),'')
    RETURN @txt
END;
GO
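The same PATINDEX-and-REPLACE technique can be tried inline, without creating the function. The following is a minimal sketch; the input string and the shorter allow-list in @pattern are made up for illustration and are not the list used by the script.

```sql
-- Sketch: remove every character that is not in a (hypothetical) allow-list.
DECLARE @txt     VARCHAR(MAX) = 'dbo.[Sales_2022];';
DECLARE @pattern VARCHAR(50)  = '%[^a-zA-Z0-9_.]%';   -- assumption: keep letters, digits, _ and .

-- PATINDEX returns the position of the first unwanted character (0 when none is left);
-- REPLACE then removes every occurrence of that character in one pass.
WHILE PATINDEX(@pattern, @txt) > 0
    SET @txt = REPLACE(@txt, SUBSTRING(@txt, PATINDEX(@pattern, @txt), 1), '');

SELECT @txt AS cleaned;   -- dbo.Sales_2022
```

Each iteration removes one distinct unwanted character, so the loop runs at most once per distinct character rather than once per position in the string.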
Data Lineage script
The data lineage script has two parts. The first part removes every kind of comment: single-line comments introduced by two hyphens (--) and multi-line comments delimited by slash-star (/* */). Nested comments are also supported and are removed from the script before the data lineage is created.
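The idea behind removing nested multi-line comments can be sketched with a depth counter: the counter goes up at every opening delimiter, down at every closing delimiter, and characters are kept only while it is zero. This is a simplified illustration with a made-up input string, not the exact logic of the procedure.

```sql
-- Sketch: strip (possibly nested) /* */ comments with a depth counter.
DECLARE @src   NVARCHAR(MAX) = N'SELECT 1 /* outer /* inner */ still outer */ AS one';
DECLARE @out   NVARCHAR(MAX) = N'';
DECLARE @depth INT = 0;              -- how many comments are currently open
DECLARE @i     INT = 1;
DECLARE @n     INT = LEN(@src);

WHILE @i <= @n
BEGIN
    IF SUBSTRING(@src, @i, 2) = N'/*'
    BEGIN
        SET @depth += 1;             -- a comment opens
        SET @i += 2;
    END
    ELSE IF SUBSTRING(@src, @i, 2) = N'*/' AND @depth > 0
    BEGIN
        SET @depth -= 1;             -- the innermost open comment closes
        SET @i += 2;
    END
    ELSE
    BEGIN
        -- keep the character only when no comment is open
        IF @depth = 0 SET @out += SUBSTRING(@src, @i, 1);
        SET @i += 1;
    END
END;

SELECT @out AS without_comments;
```

The sketch does not treat string literals specially, so a comment delimiter inside a quoted string would also be removed.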
In the second part, you will find the while loop that will iterate through the lines of code and analyse the data sources and the corresponding clauses. The script can also analyse the columns relevant for data streamlining and data flow. At the end, the script will return all the relevant information regarding data sources for your query.
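The core of this second part can be sketched as follows: split the query into tokens in their original order, then report every token that directly follows a keyword introducing a data source. The sample query, the table variable @tokens and the reduced keyword list are assumptions made for this illustration; the procedure itself tracks more keywords and more state.

```sql
-- Sketch: find objects that directly follow FROM, JOIN or INTO.
DECLARE @q NVARCHAR(MAX) =
    N'SELECT o.id FROM dbo.Orders o JOIN dbo.Customers c ON c.id = o.customer_id';

DECLARE @tokens TABLE (rn INT IDENTITY(1,1) PRIMARY KEY, tok NVARCHAR(200));
DECLARE @rest NVARCHAR(MAX) = @q + N' ';        -- trailing space closes the last token
DECLARE @sp   INT = CHARINDEX(N' ', @rest);

-- Tokenise on spaces one token at a time, so rn reflects the original order.
WHILE @sp > 0
BEGIN
    IF @sp > 1
        INSERT INTO @tokens (tok) VALUES (LEFT(@rest, @sp - 1));
    SET @rest = SUBSTRING(@rest, @sp + 1, LEN(@rest));
    SET @sp   = CHARINDEX(N' ', @rest);
END;

-- A token is reported as a source when the previous token is one of the keywords.
SELECT prev.tok AS clause_name
      ,cur.tok  AS source_object
      ,cur.rn   AS token_position
FROM @tokens AS cur
JOIN @tokens AS prev
    ON prev.rn = cur.rn - 1
WHERE LOWER(prev.tok) IN (N'from', N'join', N'into')
ORDER BY cur.rn;
```

For the sample query this returns dbo.Orders for the FROM clause and dbo.Customers for the JOIN clause.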
CREATE OR ALTER PROCEDURE dbo.TSQL_data_lineage
/*
Author: Tomaz Kastrun
Date: August 2022
GitHub: github.com/tomaztk
Blogpost:
Description:
    Removing all comments from your T-SQL Query for a given procedure for better code visibility and readability - separate function
    Remove all unused characters.
    Create data lineage for inputed T-SQL query
Usage:
    EXEC dbo.TSQL_data_lineage
        @InputQuery = N'
    SELECT * FROM master.dbo.spt_values
    '
*/
(
    @InputQuery NVARCHAR(MAX)
)
AS
BEGIN

    /*
    ******************************
    *
    * 2. Remove comments characters
    *
    ********************************
    */

    DROP TABLE IF EXISTS dbo.SQL_query_table
    CREATE TABLE dbo.SQL_query_table (
         id INT IDENTITY(1,1) NOT NULL
        ,query_txt NVARCHAR(4000)
    )

    -- Breaks the procedure into lines with linebreak
    -- INSERT INTO dbo.SQL_query_table
    -- EXEC sp_helptext
    --     @objname = @InputQuery

    -- Breaks the query into lines with linebreak
    DECLARE @MAX_nof_break INT = (select len(@InputQuery) - len(REPLACE(@InputQuery, CHAR(10), '')))
    DECLARE @start_nof_break INT = 1
    declare @iq2 NVARCHAR(max) = @InputQuery
    declare @max_len int = (SELECT len(@InputQuery))
    declare @start_pos int = 0
    declare @br_pos int = 0

    while (@MAX_nof_break >= @start_nof_break)
    BEGIN
        SET @br_pos = (SELECT charindex( char(10), @iq2) )
        INSERT INTO dbo.SQL_query_table(query_txt)
        SELECT substring(@InputQuery,@start_pos, @br_pos )
        SET @start_pos = @start_pos + @br_pos
        SET @iq2 = SUBSTRING(@InputQuery, @start_pos, @max_len)
        SET @start_nof_break = @start_nof_break + 1
    END

    --- STart removing comments
    DECLARE @proc_text varchar(MAX) = ''
    DECLARE @proc_text_row varchar(MAX)
    DECLARE @proc_no_comment varchar(MAX) = ''
    DECLARE @comment_count INT = 0

    SELECT @proc_text = @proc_text + CASE WHEN LEN(@proc_text) > 0 THEN '\n' ELSE '' END + query_txt
    FROM dbo.SQL_query_table

    DECLARE @i INT = 1
    DECLARE @rowcount INT = (SELECT LEN(@proc_text))

    WHILE (@i <= @rowcount)
    BEGIN
        IF SUBSTRING(@proc_text,@i,2) = '/*'
        BEGIN
            SELECT @comment_count = @comment_count + 1
        END
        ELSE IF SUBSTRING(@proc_text,@i,2) = '*/'
        BEGIN
            SELECT @comment_count = @comment_count - 1
        END
        ELSE IF @comment_count = 0
            SELECT @proc_no_comment = @proc_no_comment + SUBSTRING(@proc_text,@i,1)

        IF SUBSTRING(@proc_text,@i,2) = '*/'
            SELECT @i = @i + 2
        ELSE
            SELECT @i = @i + 1
    END

    WHILE (@i <= @rowcount)
    BEGIN
        IF SUBSTRING(@proc_text,@i,4) = '/*/*'
        BEGIN
            SELECT @comment_count = @comment_count + 2
        END
        ELSE IF SUBSTRING(@proc_text,@i,4) = '*/*/'
        BEGIN
            SELECT @comment_count = @comment_count - 2
        END
        ELSE IF @comment_count = 0
            SELECT @proc_no_comment = @proc_no_comment + SUBSTRING(@proc_text,@i,1)

        IF SUBSTRING(@proc_text,@i,4) = '*/*/'
            SELECT @i = @i + 2
        ELSE
            SELECT @i = @i + 1
    END

    DROP TABLE IF EXISTS #tbl_sp_no_comments
    CREATE TABLE #tbl_sp_no_comments (
         rn INT IDENTITY(1,1)
        ,sp_text VARCHAR(8000)
    )

    WHILE (LEN(@proc_no_comment) > 0)
    BEGIN
        INSERT INTO #tbl_sp_no_comments (sp_text)
        SELECT SUBSTRING( @proc_no_comment, 0, CHARINDEX('\n', @proc_no_comment))

        SELECT @proc_no_comment = SUBSTRING(@proc_no_comment, CHARINDEX('\n',@proc_no_comment) + 2, LEN(@proc_no_comment))
    END

    DROP TABLE IF EXISTS #tbl_sp_no_comments_fin
    CREATE TABLE #tbl_sp_no_comments_fin (
         rn_orig INT IDENTITY(1,1)
        ,rn INT
        ,sp_text_fin VARCHAR(8000)
    )

    DECLARE @nofRows INT = (SELECT COUNT(*) FROM #tbl_sp_no_comments)
    DECLARE @ii INT = 1

    WHILE (@nofRows >= @ii)
    BEGIN
        DECLARE @LastLB INT = 0
        DECLARE @Com INT = 0
        SET @Com = (SELECT CHARINDEX('--', sp_text,@com) FROM #tbl_sp_no_comments WHERE rn = @ii)
        SET @LastLB = (SELECT CHARINDEX(CHAR(10), sp_text, @LastLB) FROM #tbl_sp_no_comments WHERE rn = @ii)

        INSERT INTO #tbl_sp_no_comments_fin (rn, sp_text_fin)
        SELECT
             rn
            ,CASE WHEN @Com = 0 THEN sp_text
                  WHEN @Com <> 0 THEN SUBSTRING(sp_text, 0, @Com) END as new_sp_text
        FROM #tbl_sp_no_comments
        WHERE rn = @ii

        SET @ii = @ii + 1
    END

    DROP TABLE IF EXISTS dbo.Query_results_no_comment
    SELECT
         rn
        ,sp_text_fin
    INTO dbo.Query_results_no_comment
    FROM #tbl_sp_no_comments_fin
    WHERE DATALENGTH(sp_text_fin) > 0
      AND LEN(sp_text_fin) > 0

    /*
    ******************************
    *
    * 3. Create data lineage
    *
    ********************************
    */

    DECLARE @orig_q VARCHAR(MAX)
    SELECT @orig_q = COALESCE(@orig_q + ', ', '') + sp_text_fin
    FROM dbo.Query_results_no_comment
    order by rn asc

    DROP TABLE IF EXISTS dbo.LN_Query

    DECLARE @stmt2 NVARCHAR(MAX)
    SET @stmt2 = REPLACE(REPLACE(@orig_q, CHAR(13), ' '), CHAR(10), ' ')

    SELECT
         TRIM(REPLACE(value, ' ', '')) as val
        ,dbo.fn_removelistChars(value) as val_f
        ,row_number() over (ORDER BY (SELECT 1)) as rn
    INTO dbo.LN_Query
    from string_split(REPLACE(@stmt2, CHAR(13), ' '), ' ' )
    WHERE REPLACE(value, ' ', '') <> ' '
       OR REPLACE(value, ' ', '') <> ' '

    DECLARE @table TABLE (command_ VARCHAR(200), location_ VARCHAR(200), order_ INT)
    DECLARE @command_i VARCHAR(200) = ''
    DECLARE @next_step BIT = 0 -- FALSE (1 = TRUE)
    DECLARE @previous VARCHAR(200) = ''
    DECLARE @order INT = 1
    DECLARE @previous_cmd VARCHAR(200) = ''
    DECLARE @previous_step BIT = 0 -- FALSE
    DECLARE @ttok VARCHAR(100) = ''
    DECLARE @i_row INT = 1
    DECLARE @max_row INT = (SELECT MAX(rn) FROM dbo.LN_Query)
    DECLARE @row_commands_1 NVARCHAR(1000) = 'select,delete,insert,drop,create,select,truncate,exec,execute'
    DECLARE @row_commands_2 NVARCHAR(1000) = 'select,not,if,exists,select'
    DECLARE @row_commands_3 NVARCHAR(1000) = 'from,join,into,table,exists,sys.dm_exec_sql_text,sys.dm_exec_cursors,exec,execute'

    WHILE (@max_row >= @i_row)
    BEGIN
        DECLARE @command VARCHAR(1000) = (SELECT val FROM dbo.LN_Query WHERE rn = @i_row)

        IF @command IN (SELECT REPLACE(TRIM(LOWER(value)), ' ', '') FROM STRING_SPLIT(@row_commands_1, ','))
        BEGIN
            IF LOWER(@command) = 'select'
            BEGIN
                SET @command = 'select'
            END
            SET @command_i = @command
        END

        IF (@next_step = 1)
        BEGIN
            IF @command NOT IN (SELECT REPLACE(TRIM(LOWER(value)), ' ', ' ') FROM STRING_SPLIT(@row_commands_2,','))
            BEGIN
                IF (LOWER(@previous) = 'into')
                    SET @command_i = 'select into'
                IF (@command NOT LIKE '' OR @command NOT LIKE '')
                    SET @ttok = ' ' + @command + ' as ('
                IF (@ttok NOT IN (SELECT @stmt2))
                    INSERT INTO @table (command_, location_, order_)
                    SELECT
                         @command_i
                        ,@command
                        ,@order
                SET @command_i = @command_i
            END
            SET @next_step = 0
            IF @command IN ('sys.dm_exec_sql_text','sys.dm_exec_cursors')
            BEGIN
                SET @next_step = 1
            END
        END

        IF (@command IN (SELECT REPLACE(TRIM(LOWER(value)), ' ', '') FROM STRING_SPLIT(@row_commands_3,',')))
        BEGIN
            SET @next_step = 1
        END

        SET @previous_cmd = @command_i
        SET @previous = @command
        SET @i_row = @i_row + 1
    END

    DROP TABLE IF EXISTS dbo.final_result

    -- Final results
    SELECT
         *
        ,row_number() over (order by (select 1)) as rn
    INTO dbo.final_result
    FROM @table

    SELECT
         [command_] AS Clause_name
        ,[location_] AS Object_Name
        ,rn AS order_DL
    FROM dbo.final_result

END;
GO
Running the script
Once you create the procedure for data lineage, have your T-SQL query ready, including all the comments, object names, CTE tables and any other objects.
DECLARE @test_query VARCHAR(MAX) = '

-- This is a sample query to test data lineage
SELECT
     s.[BusinessEntityID]
    ,p.[Title]
    ,p.[FirstName]
    ,p.[MiddleName]
    -- ,p.[LastName]
    ,p.[Suffix]
    ,e.[JobTitle] as JobName
    ,p.[EmailPromotion]
    ,s.[SalesQuota]
    ,s.[SalesYTD]
    ,s.[SalesLastYear]
    ,( SELECT GETDATE() ) AS DateNow
    ,( select count(*) FROM [AdventureWorks2014].sales.[SalesPerson] ) as totalSales
/*
Testing some additional comments!
*/
FROM [AdventureWorks2014].sales.[SalesPerson] s
LEFT JOIN [AdventureWorks2014].[HumanResources].[Employee] e
    ON e.[BusinessEntityID] = s.[BusinessEntityID]
INNER JOIN [AdventureWorks2014].[Person].[Person] AS p
    ON p.[BusinessEntityID] = s.[BusinessEntityID]
'
Then execute the procedure, in the same batch as the declaration, with a single input parameter.
EXEC dbo.TSQL_data_lineage @InputQuery = @test_query
The data lineage script returns the tables (and columns) used in the query, together with the clause in which each object appears.
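Besides returning a result set, the procedure also persists its output in the table dbo.final_result (columns command_, location_, order_ and rn), so the lineage can be queried again after the call. As a small sketch, the following summarises how often each object is referenced; the column aliases are chosen for this example only.

```sql
-- Sketch: summarise the persisted lineage from the last procedure call.
SELECT
     location_ AS referenced_object
    ,COUNT(*)  AS times_referenced
FROM dbo.final_result
GROUP BY location_
ORDER BY times_referenced DESC, referenced_object;
```

Note that dbo.final_result is dropped and recreated on every execution, so it always reflects only the most recent query.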
About the script
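The script is written entirely in T-SQL and therefore does not need any additional scripting language. You can run the script on SQL Server 2016 and later. Note that the script calls TRIM, which was introduced in SQL Server 2017, so on SQL Server 2016 you would need to replace it with LTRIM(RTRIM(...)); STRING_SPLIT additionally requires database compatibility level 130 or higher. All editions are supported. Furthermore, you can run the script on Azure SQL Server, Azure SQL Database, Azure MI and Azure Synapse.
When you run the script, the query is neither validated nor executed against the server. It is treated as a string and analysed as such.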
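A quick way to see whether a given database meets these requirements is to check the server version and the compatibility level of the current database. This is a minimal sketch using standard system metadata; the version numbers in the comment apply to on-premises SQL Server and may differ on Azure services.

```sql
-- Sketch: check the server version and the current database's compatibility level.
SELECT
     SERVERPROPERTY('ProductVersion')      AS product_version
    ,SERVERPROPERTY('ProductMajorVersion') AS major_version      -- 13 = SQL Server 2016, 14 = SQL Server 2017
    ,d.compatibility_level                                       -- STRING_SPLIT needs 130 or higher
FROM sys.databases AS d
WHERE d.name = DB_NAME();
```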
Conclusion
Enterprises and organisations struggle with the quality of data analysis and with potential security risks. Both usually stem from complex data silos, data engineering and low visibility of data flows. By governing the data flow and understanding data origins, some of these issues can be addressed more easily. You can start today with this script.
You can track future updates on the GitHub repository.