Using Data Masker on Chinese Surnames

A customer had a question recently on masking Chinese characters. I thought that was interesting, so decided to test this out. This is a short post on using SQL Data Masker to accomplish this task, but I’ll a longer one on the Redgate Product Learning site.

Setting Up A Table

The first thing here was to get some test data. I was looking for Chinese names, since that was the request. I found this page on the most popular Chinese surnames. With that in mind, I build a small table and a few insert statements with this code. I only used the names Chén, Yáng, Zhào, Huáng, Zhōu, Wú, and one Western name for the demo

 CREATE TABLE dbo.CustomerFromChina
( customerid INT NOT NULL CONSTRAINT CustomerFromChinaPK PRIMARY KEY ,
   customersurname nvarchar(100)
)
GO
INSERT dbo.CustomerFromChina
   (customerid, customersurname)
VALUES
   (1, N'陈'),
   (2, N'杨'),
   (3, N'赵'),
   (4, N'黄'),
   (5, N'周'),
   (6, N'吴'),
   (7, N'Joe')
GO

I ran this and saw the results I needed.

query results with Chinese surnames

Creating a New Data Set

SQL Data Masker ships with a number of masking sets, but you can add your own. There is a process, but essentially you create a text file with the data in it and the udef extension.

The masking sets are in Program Files below the Redgate folder. This is an administative folder, so you need to have admin rights to make a new file. I did that and opened my file in VSCode, which defaults to UTF-8 format. Since I want to use Chinese characters, I need to use a text file that supports unicode.

I entered four names into my test file and saved it. You can see the entries here.

I made the file name, chinesesurnames.udef. Make sure that this doesn’t have the .txt extension at the end.

Setting The Masking Set

I opened SQL Data Masker and created a new masking set. I connected this to my database and then went to the Misc. Setup tab. I didn’t see my set (I had this open), so I clicked “Refresh” at the lower left. This brought the data set into the list view. As you can see, I should have capitalized the file name.

If I clicked “Sample” at the bottom, I see my data:

Note that the sample window shows a bunch of rows, but they are repeating the same four values.

Next, I added a new Substitution rule. I picked the CustomerFromChina table and the ChineseSurname column. I also selected my custom data set. This is noted on the right of the image below.

I saved this masking set and I was ready to test.

Testing

I first connected to the database and ran the query above in one window. Then I opened a vertical tab set, which moved this window to the right.

Next, I ran the masking set, which was quick on 7 rows of data. I then opened a new query window on the left side of SSMS and ran the same query I’d run on the right. You can see the results below:

In the image, the right has the original data. CustomerID 1 is Chén on the right, but on the left, this is now Liú. The others are also changed, with Joe also being masked to Liú.

This worked.

You can do the same thing with custom masking sets, in any Unicode format or language.

SQL Data Masker is a great product for managing a set of update scripts that will remote PII data from a database. It is clunky, but it works well for building and updating a project across time and with a team. Way better than T-SQL scripts.

Give it a try today.