Introduction
One of the problems facing DBAs in companies that deal with personally identifiable information (PII) is that there are often either laws or contractual obligations that require that this information remain in a secure environment. This presents a problem when trying to restore databases from production down to development and QA environments because these systems are often not as secure as production. In order to be able to restore databases to these less secure environments, the PII has to be scrubbed.
If you are fortunate enough to have a small number of well-designed databases to manage, this may not be a problem. If, however, you have a large number of databases of varying age and disparate design, you can quickly end up with dozens or hundreds of tables with multiple columns in each one needing to be scrubbed.
An additional challenge with scrubbing data is that often, replacing every first name with “John” and every last name with “Doe” is not useful because it makes development and testing harder. How do you tell if you have the right “John Doe”? What we really want to do is replace PII with a reasonable substitute. For example, something that is not the same for every row but that follows the same basic rules: names are letters, phone number are 10 digits, SSNs are 9 digits, etc. This article describes a template for a generic and extendable solution for managing data that needs to be scrubbed and includes a set of sample scripts that set up a new database to demonstrate scrubbing test data.
General Overview
The scrubbing process will work like this: Start with a database backup from production that needs to be restored to an unsecure environment, and restore it to a server in a secure environment. Run an obfuscation process on it to scrub all PII from it, and then back it up again. This scrubbed backup can then be restored to unsecure environments.
We want the scrubbing process to be simple to manage, so we’re going to break PII up into categories and create a view for each category that gives us the replacement data. If we ever have to change how the scrubbed data is generated, we just need to change the views. Then we’re going to create a couple of control tables that determine what data gets scrubbed. The first tells us the database and table names that contain PII, and the second tells us which columns have PII and what kind it is. The heart of the scrubbing process is a stored procedure that can read these control tables and can automatically generate an SQL statement for each table using the views we’ve created to update all the appropriate columns. Once we have the statements, scrubbing the data is accomplished by running a stored procedure that iterates over each one and runs it.
A Note on Naming Conventions
There is some debate on the usefulness of naming conventions for SQL objects, but I’m required to use them at my job and I’ve gotten used to them. Let me explain the prefixes briefly:
- td – indicates that a table describes a domain (an enumeration in programming)
- te – indicates that a table describes an entity (anything not a domain)
- v – view
- p – stored procedure
Categorizing PII
First we’re going to create a couple of tables to define the categories of PII and how they should be handled. In some cases, it may be acceptable to use a single static value for every row. For example, we may be able to use a static last name of “Doe” as long as we have unique first names. When we create the statements used to scrub data, static values will use a CROSS JOIN to a derived table with a fixed value. The rest of the time, we’re going to INNER JOIN to a view containing the replacement data. We’re going to create two tables that tell us what we need to know.
Script 2a creates the JoinType table, which tells us what kind of join we need.
Script 2a
CREATE TABLE tdJoinType ( JoinTypeCD TINYINT NOT NULL CONSTRAINT pkJoinType PRIMARY KEY CLUSTERED , Descrip VARCHAR(20) NOT NULL ); GO INSERT INTO tdJoinType( JoinTypeCD, Descrip ) SELECT 1, 'INNER JOIN' UNION ALL SELECT 2, 'CROSS JOIN';
Script 2b creates the ColumnType table that tells us what kind of PII we’re dealing with. The first three columns are fairly self-explanatory. The ViewOrFixedValue column tells us the view name to join to if we’re using an INNER JOIN, or it contains the fixed value that we’re using if we’re using a CROSS JOIN. Finally, the Alias column tells us the table alias that will be used in the scrubbing SQL statement.
Script 2b
CREATE TABLE tdColumnType ( ColumnTypeCD TINYINT NOT NULL CONSTRAINT pkColumnType PRIMARY KEY CLUSTERED , Descrip VARCHAR(50) NOT NULL , JoinTypeCD TINYINT NOT NULL , ViewOrFixedValue VARCHAR(50) NOT NULL , Alias VARCHAR(5) NOT NULL , CONSTRAINT fkColumnType_JoinType FOREIGN KEY ( JoinTypeCD ) REFERENCES tdJoinType ( JoinTypeCD ) ); GO INSERT INTO tdColumnType SELECT 1 , 'FirstName' , 1 , 'vLetters' , 'A' UNION ALL SELECT 2 , 'LastName' , 2 , '''Doe''' , 'B' UNION ALL SELECT 3 , 'Address1' , 1 , 'vAddresses' , 'C' UNION ALL SELECT 4 , 'AddressOther' , 2 , 'NULL' , 'D' UNION ALL SELECT 5 , 'City' , 2 , '''Albany''' , 'E' UNION ALL SELECT 6 , 'State' , 2 , '''NY''' , 'F' UNION ALL SELECT 7 , 'Zip' , 2 , '''12203''' , 'G' UNION ALL SELECT 8 , 'DOB' , 2 , '''1980-01-24''' , 'H' UNION ALL SELECT 9 , 'SSN' , 1 , 'vSSNs' , 'I' UNION ALL SELECT 10 , 'Phone' , 1 , 'vPhones' , 'J';
Creating Replacement Data
All of our replacement data is either going to come from views or will be static. Each view needs to have the same two columns: ID and Value. In the sample implementation provided, we’re using sequential data generated using a Tally table. The idea to use tally tables and the script to create it came from Jeff Moden’s excellent article, The “Numbers” or “Tally” table: What it is and how it replaces a loop. If you need more realistic data, you could create some base tables that contain your data and reference them in the views instead. In the sample scripts provided, all of these views begin with 04. Let’s take a look at a couple of them. For Social Security numbers, the view looks like this:
CREATE VIEW vSSNs AS SELECT ID = T.N, Value = CONVERT( CHAR(1), ( T.N / POWER( 10, 8 ) % 10 ) ) + CONVERT( CHAR(1), ( T.N / POWER( 10, 7 ) % 10 ) ) + CONVERT( CHAR(1), ( T.N / POWER( 10, 6 ) % 10 ) ) + CONVERT( CHAR(1), ( T.N / POWER( 10, 5 ) % 10 ) ) + CONVERT( CHAR(1), ( T.N / POWER( 10, 4 ) % 10 ) ) + CONVERT( CHAR(1), ( T.N / POWER( 10, 3 ) % 10 ) ) + CONVERT( CHAR(1), ( T.N / POWER( 10, 2 ) % 10 ) ) + CONVERT( CHAR(1), ( T.N / 10 % 10 ) ) + CONVERT( CHAR(1), ( T.N % 10 ) ) FROM Tally T
This will give us a sequential set of SSNs starting with 000-00-0000 (without the formatting) and ending wherever our Tally table stops.
For addresses, the view uses a combination of dynamic and static values that is hopefully short enough compared to production data to avoid filling up the page and creating forwarded rows.
CREATE VIEW vAddresses AS SELECT ID = T.N, Value = A.Value + B.Value + C.Value + D.Value + E.Value + ' 1st St.' FROM Tally T JOIN tdNumberLetter A ON A.ID = ( T.N / POWER( 36, 4 ) % 36 ) JOIN tdNumberLetter B ON B.ID = ( T.N / POWER( 36, 3 ) % 36 ) JOIN tdNumberLetter C ON C.ID = ( T.N / POWER( 36, 2 ) % 36 ) JOIN tdNumberLetter D ON D.ID = ( T.N / 36 % 36 ) JOIN tdNumberLetter E ON E.ID = ( T.N % 36 )
This will give us addresses in the form “09A3N 1st St.” This can easily be changed to use only digits for the street number, but if you need a large number of unique addresses, it will lengthen the string. Using both letters and digits converts the street number into a base 36 number and gives you over 60,000,000 unique numbers using only 5 characters.
Controlling the Scrubbing Process
The scrubbing process is controlled by a pair of tables. The first one tells us which databases and tables contain PII.
CREATE TABLE teObfuscationTable( TableID INT NOT NULL CONSTRAINT pkObfuscationTable PRIMARY KEY CLUSTERED, DatabaseName VARCHAR(100) NOT NULL, TableName VARCHAR(100) NOT NULL, AutoGenerate BIT NOT NULL CONSTRAINT dfObfuscationTable_AutoGenerate DEFAULT(1), ObfStatement VARCHAR(MAX) );
The AutoGenerate column indicates whether or not the statement used to scrub the table should be automatically generated by the stored procedure when we run the scrubbing process, and the ObfStatement column contains the actual SQL statement that will be run to scrub the data. In cases where a table needs a manually generated scrubbing statement, set AutoGenerated to 0 and update ObfStatement with the scrubbing statement.
The second table tells us what columns have PII and what kind of data it is.
CREATE TABLE teObfuscationColumn( ColumnID INT NOT NULL CONSTRAINT pkObfuscationColumn PRIMARY KEY CLUSTERED, TableID INT NOT NULL, ColumnName VARCHAR(100) NOT NULL, ColumnTypeCD TINYINT NOT NULL, CONSTRAINT fkObfuscationColumn_ObfuscationTable FOREIGN KEY( TableID ) REFERENCES teObfuscationTable( TableID ), CONSTRAINT fkObfuscation_ColumnType FOREIGN KEY( ColumnTypeCD ) REFERENCES tdColumnType( ColumnTypeCD ) );
Examples of how to set up these tables are in the TestHarness/02 – Obfuscate Data folder of the sample scripts.
Generating the Scrubbing Statements
Once we have the control tables and views set up, we’re ready to generate the scrubbing statements. This is done by the procedure, pGenerateObfuscationStatements. The procedure could have used a cursor to iterate through the control tables and concatenate strings, but using the FOR XML option on a couple of the subqueries allows us to generate the scrubbing statements in a single update. We’ll look at it in pieces and explain what is happening in each piece.
UPDATE O SET O.ObfStatement = -- begin the update statement 'UPDATE X ' + CHAR(13) + 'SET ' + CHAR(13) +
This is the easy part. It just begins the update statement.
-- get all the columns we need to update SUBSTRING( REPLACE( ( SELECT ',' + CHAR(13) + ' X.' + C.ColumnName + ' = ' + T.Alias + '.Value' AS [text()] FROM teObfuscationColumn C JOIN tdColumnType T ON T.ColumnTypeCD = C.ColumnTypeCD WHERE C.TableID = O.TableID FOR XML PATH('') ), '
' /* this is a newline in xml */, CHAR(13) ), 3, 10000 ) + CHAR(13) +
In the SET part of the UPDATE statement, we need a list of all the columns that have to be updated. Since we can only return a single row from a subquery, we use the FOR XML option to return all the rows as an XML string, which can be added to our statement. The REPLACE statement is not strictly necessary if you don’t insert newlines into the statement, but I like nicely formatted SQL.
-- create the derived table that includes the sequence number 'FROM ( ' + CHAR(13) + ' SELECT ' + CHAR(13) +
In order to be able to join the table we’re scrubbing to our views, we need to add a unique number column to our table. We do this by creating a derived table that adds this column using the ROW_NUMBER function. SQL Server is smart enough to update the base table.
-- add the columns we need to select SUBSTRING( REPLACE( ( SELECT CHAR(13) + ' X.' + C.ColumnName + ',' AS [text()] FROM teObfuscationColumn C WHERE C.TableID = O.TableID FOR XML PATH('') ), '
', CHAR(13) ), 2, 10000 ) + CHAR(13) +
This is pretty much the same as above. We’re just adding the columns into the derived table.
-- add the row number so we can join to the sequential tables ' Num = ROW_NUMBER() OVER( ORDER BY (SELECT 1) ) ' + CHAR(13) + ' FROM ' + O.DatabaseName + '..' + O.TableName + ' X ' + CHAR(13) + ') AS X ' + CHAR(13) +
Here we add the number column to the derived table. We don’t care what the order is, as long as we have a unique number.
-- add all the joins to the sequential tables REPLACE( ( SELECT DISTINCT CASE T.JoinTypeCD WHEN 1 THEN 'JOIN ' + T.ViewOrFixedValue + ' ' + T.Alias + CHAR(13) + ' ON ' + T.Alias + '.ID = X.Num' + CHAR(13) ELSE 'CROSS JOIN ( SELECT Value = ' + T.ViewOrFixedValue + ' ) AS ' + T.Alias + CHAR(13) END [text()] FROM teObfuscationColumn C JOIN tdColumnType T ON T.ColumnTypeCD = C.ColumnTypeCD WHERE C.TableID = O.TableID FOR XML PATH('') ), '
', CHAR(13) ) FROM teObfuscationTable O WHERE O.AutoGenerate = 1
Once we have all the columns and the tables we’re updating, we need to add all the join statements that give us the replacement data. We need to handle both joining to a view to get replacement data, as well as using a static value, and this is done by the CASE statement. Again, we’re using the FOR XML option to concatenate multiple rows into a single XML string.
The final product of this procedure is a statement that will look something like this:
UPDATE X SET X.PhoneNbr = J.Value FROM ( SELECT X.PhoneNbr, Num = ROW_NUMBER() OVER( ORDER BY (SELECT 1) ) FROM Obfuscation..tePhone X ) AS X JOIN vPhones J ON J.ID = X.Num
Running the Scrubbing Process
With all the pieces in place, the only thing left to do is run the scrubbing process. This is done simply by calling the procedure pObfuscateData. In the sample scripts, this procedure just calls pGenerateObfuscationStatements, iterates over each of the statements in teObfuscationtable.ObfStatement using a cursor, and then executes each one using sp_executesql.
Running the Sample
When running the sample scripts, the only thing you should have to change is the location where the Obfuscation database is created, which is done in script 00. These scripts were tested on SQL 2012. They may very well work on other versions, but I have not tested them.
The Tally table in the sample scripts is fairly large. It is created with 10,000,000 rows in order to allow for a large set of replacement data. This makes the sample database around 300MB in size. You can shrink the Tally table if you like, but it may require modifying some of the scripts in the test harness that create the sample data that will be scrubbed.
Conclusion
The sample scripts provided are not meant to be taken and used directly in production, but rather to give a concrete example of how the process works. I hope that this solution is modular and flexible enough so that you can take it and easily transform it into something useful for your specific needs. I’m interested to hear your comments and suggestions, as well as any ideas you have for making it better. Thanks for reading.