Introduction
As DBA's and developers, we come across situations where the data to be processed needs to have a number of required elements. In a large set of data, some of the data may not be a complete unit and hence cannot be processed. For example, an employee record may not be complete without a LastName, a geographic record may not be complete without longitude and so on.
Once bad records are identified, they may need to be eliminated and good ones may need to be processed in bulk. Depending on the format and volume of the overall data set, identifying such bad records may be an arduous task. This article demonstrates the use of Common Table Expression and the OUTPUT clause in identifying bad records for a specific scenario that I recently encountered.
Building up the Scenario
A company that organizes music events receives data on prospective performers from one or more vendors. A vendor sends details of musicians available to perform at such events for the company as part of the company's orchestra.
Data is received from the vendor in a specific format in a flat file. If the flat file doesn't contain the rate-quote for a specific musician, then the entire record for that musician is considered bad and is sent back to the vendor for correction. It's only the complete records that are sent to the company's management team for approval. Once the management approves the quotes, the musicians are hired by the company to perform at a company-sponsored event.
Then the vendor gets paid by the company for all musicians that have been hired. The payment to the vendor is as per the rate quoted by the vendor for each musician hired by the company. So the data received from the vendor must contain the rate-quote (hourly rate) for each musician.
The situation would have been simple had the vendor been sending data for all musicians in columns as shown below:
Name Address Primary Attribute Rate Quote ----- --------------------- ----------------- ------------ John X City, Y State, Z Zip Drummer $500 Mary A City, B State, C Zip Flutist null ...
In such a case, the rows where the Rate Quote value was null or blank could have been deleted.
The data from the vendor is received in rows. A set of rows may be considered a record for one musician. There are several such records for musicians in the same flat file: A sample record for a musician is shown below:
Row#1 '*********************************************' Row#2 Name Row#3 'John' Row#4 Address Row#5 'X City, Y State, Z Zip' Row#6 Primary Attribute Row#7 'Drummer' Row#8 RateQuote Row#9 '$500' Row#10 '--------------------------------------------'
The above data is considered one good record for the musician, John. Data for musicians can contain more than ten rows in cases where they have a secondary attribute like Flute or Guitar. In short, the number of rows for each musician may vary. However, it is mandatory to have a value for the RateQuote.
A sample bad record is shown below. It is bad because the RateQuote doesn't have a value:
Row#1 '*********************************************' Row#2 Name Row#3 'Pierre' Row#4 Address Row#5 'F City, G State, H Zip' Row#6 Primary Attribute Row#7 'Flutist' Row#8 Secondary Attribute Row#9 'Violinist' Row#10 RateQuote Row#11 ' ' Row#12 '--------------------------------------------'
Deleting a bad record means deleting all the rows from Row#1 through Row#12 for this musician named Pierre.
The goal of this article is to lay out a method to identify the bad records. It is not the aim of this article to show how good records are processed down the line.
Setting up the situation
The data from the vendor is stored in a table called dbo.Test_Table. This table has an identity column Seq_No; the second column CData hosts the actual vendor data. The script below creates the table, dbo.Test_Table:
if exists( select 1 from sys.tables where type = 'U' and name = 'Test_Table' and USER_NAME(schema_id) = 'dbo') drop table dbo.Test_Table go create table dbo.Test_Table (Seq_No int identity(1,1) Primary Key clustered ,CData varchar(50) not null) go
The T-SQL script below populates the table, dbo.Test_Table with sample data. Here six records are being inserted, one per musician. The script below is compatible with Microsoft SQL Server 2008 but not with earlier versions. It may be noted that two of the records (for musicians Bob and Spencer) are bad as they do not have a value for the RateQuote:
insert into dbo.Test_Table (CData) values ('*********************************************') ,('Name') ,('James') ,('Address') ,('123 Lane1, City1 ST 12341') ,('Primary Attribute') ,('Pianist') ,('Secondaty Attribute') ,('Flutist') ,('RateQuote') ,('$500') ,('--------------------------------------------') ,('*********************************************') ,('Name') ,('Veronica') ,('Address') ,('321 Lane2, City2 ST 12342') ,('Primary Attribute') ,('Violinist') ,('RateQuote') ,('$600') ,('--------------------------------------------') ,('*********************************************') ,('Name') ,('Bob') ,('Address') ,('213 Lane3, City3 ST 12343') ,('Primary Attribute') ,('Drummer') ,('RateQuote') ,(' ') ,('--------------------------------------------') ,('*********************************************') ,('Name') ,('Harry') ,('Address') ,('312 Lane4, City4 ST 12344') ,('Primary Attribute') ,('Guitarist') ,('RateQuote') ,('$500') ,('--------------------------------------------') ,('*********************************************') ,('Name') ,('Spencer') ,('Address') ,('456 Lane5, City5 ST 12345') ,('Primary Attribute') ,('Saxophonist') ,('RateQuote') ,(' ') ,('--------------------------------------------') ,('*********************************************') ,('Name') ,('Jessica') ,('Address') ,('456 Lane6, City6 ST 12346') ,('Primary Attribute') ,('Flutist') ,('RateQuote') ,('$600') ,('--------------------------------------------') ,('*********************************************')
Note the data:
select * from dbo.Test_Table
The bad data (marked in red below) is within the Seq_No values 23 to 32 and 43 to 52.
Seq_No CData ------ ----------------------------------------------------- 1 ********************************************* 2 Name 3 James 4 Address 5 123 Lane1, City1 ST 12341 6 Primary Attribute 7 Pianist 8 Secondaty Attribute 9 Flutist 10 RateQuote 11 $500 12 -------------------------------------------- 13 ********************************************* 14 Name 15 Veronica 16 Address 17 321 Lane2, City2 ST 12342 18 Primary Attribute 19 Violinist 20 RateQuote 21 $600 22 -------------------------------------------- 23 ********************************************* 24 Name 25 Bob 26 Address 27 213 Lane3, City3 ST 12343 28 Primary Attribute 29 Drummer 30 RateQuote 31 32 -------------------------------------------- 33 ********************************************* 34 Name 35 Harry 36 Address 37 312 Lane4, City4 ST 12344 38 Primary Attribute 39 Guitarist 40 RateQuote 41 $500 42 -------------------------------------------- 43 ********************************************* 44 Name 45 Spencer 46 Address 47 456 Lane5, City5 ST 12345 48 Primary Attribute 49 Saxophonist 50 RateQuote 51 52 -------------------------------------------- 53 ********************************************* 54 Name 55 Jessica 56 Address 57 456 Lane6, City6 ST 12346 58 Primary Attribute 59 Flutist 60 RateQuote 61 $600 62 -------------------------------------------- 63 *********************************************
As is apparent, each record for a musician starts with stars ('******') and ends with '--------...'. There could be multiple entries (rows) between the starting row and the ending row for a musician's record. The number of rows for each musician may not be the same. Also, the final entry in the flat file is always stars ('******') as if it were the beginning of a new record.
Please note that I have changed the domain to music and the values as well to bring out the problem. In our practical situation, we have around 16 million records in a single flat file from the vendor and the domain is finance.
Removing the Bad Data
The script below identifies the bad records, deletes them from the table, dbo.Test_Table and then inserts those deleted bad records into the table, dbo.Test_Table_Bad_Records So the table, dbo.Test_Table_Bad_Records would ultimately host the bad records that have been identified and would be sent back to the vendor.
I have used the method below because it sounded simple to me, executes fast for a few tens of millions of records and is short. It also brings out the usage of CTE (Common Table Expression) and the OUTPUT clause. Please execute the following T-SQL code to remove bad records:
if exists( select 1 from sys.tables where type = 'U' and name = 'Test_Table_Bad_Records' and USER_NAME(schema_id) = 'dbo') drop table dbo.Test_Table_Bad_Records go create table dbo.Test_Table_Bad_Records (Seq_No int not null primary key clustered ,CData varchar(50) not null); With CTE_Find_Start_Rows (CData, Seq_No, Finder) as (select CData, Seq_No, Finder = case when ltrim(rtrim(CData)) like '*********************************************' then 1 else 0 end from dbo.Test_Table) ,CTE_Last_Row_To_Delete (CData, Seq_No, Finder) as (select CData, Seq_No, Finder from CTE_Find_Start_Rows a where Finder = 0 and ltrim(rtrim(CData)) = '--------------------------------------------' and exists (select 1 from CTE_Find_Start_Rows b where b.Finder = 1 and a.Seq_No + 1 = b.Seq_No) and exists (select 1 from CTE_Find_Start_Rows c where c.Finder = 0 and a.Seq_No - 1 = c.Seq_No and ltrim(rtrim(c.CData)) = '')) ,CTE_Range_to_Delete (Delete_Start, Delete_End) as (select max(a.Seq_No) Delete_Start, b.Seq_No Delete_End from CTE_Find_Start_Rows a inner join CTE_Last_Row_To_Delete b on a.Seq_No < b.Seq_No where a.Finder = 1 group by b.Seq_No) delete from a output deleted.* into dbo.Test_Table_Bad_Records --select a.* from dbo.Test_Table a inner join CTE_Range_to_Delete b on a.Seq_No between b.Delete_Start and b.Delete_End go
Explanation of the code
First the table, dbo.Test_Table_Bad_Records is being created for storing bad records that have been deleted from the table, dbo.Test_Table.
The purpose of the common table expression, CTE_Find_Start_Rows is to identify the starting row in a record for each musician. This is accomplished through the field called Finder. Finder is assigned a value of 1 for those rows that have all stars ('******') in them. For all other rows, the value of Finder would be 0.
The purpose of the common table expression, CTE_Last_Row_To_Delete is to identify the ending row in a bad record for each musician. The identity of such a record is established by three facts. The record needs to have only dashes ('---------'), the next record after that must have all stars ('********') and the prior record must be blank or null. The value of Seq_No from this CTE would then be used in the next CTE to find the starting and ending positions (Seq_No) of bad records.
Having found the starting position of each musician's record in CTE_Find_Start_Rows and the ending position of each bad record in CTE_Last_Row_To_Delete, it is now time to find the starting position of each bad record and ending position of each bad record by joining the first two CTE's. This is accomplished in the common table expression, CTE_Range_to_Delete. Since the Seq_No value of the ending row of a bad record is known from CTE_Last_Row_To_Delete, all that needs to done is to find the maximum value of a Seq_No value with all starts ('*******') in the CData column that is less that the Seq_no value of the ending row of a bad record. This gives us the range of Seq_No values that need to be deleted. Please note that we need this range to delete the entire bad record no matter how many rows it contains.
The final step in the code is to delete all rows falling within the start and end of a bad record. In other words, all Seq_No's are deleted from our main table, dbo.Test_Table that fall within the Delete_Start and Delete_End values from the common Table Expression, CTE_Range_To_Delete. At the same time, OUTPUT clause is being used within the DELETE statement to insert the bad deleted records into the table, dbo.Test_Table_Bad_Records. The contents of the table, dbo.Test_Table_Bad_Records would be sent back to the vendor for corrections in RateQuote for the musicians Bob and Spencer.
Let's verify the contents of the table, dbo.Test_Table_Bad_Records by executing the following TSQL code:
select * from dbo.Test_Table_Bad_Records ---Bad records select * from dbo.Test_Table ---Good records
Please note that the records for musicians Bob (Seq_No from 23 to 32) and Spencer (Seq_No from 43 to 52) have been deleted from dbo.Test_Table and have been stored in the table, dbo.Test_Table_Bad_Records.
Conclusion
CTE's (common table expressions) are a nice way of navigating through different data sets constructed out of one data set. They help analytics by looking at the same data from multiple different angles in the same piece of code. OUTPUT clause is a nice option for auditing altered data as shown in this example. In the next article, we would use recursive CTE's to solve another TSQL puzzle.