June 30, 2004 at 10:00 am
I need some direction on how to scramble my production data for testing purposes. We have a compensation system that we need to copy down from production to UAT and DEV. Not the entire database needs to be scrambled, but a lot of it. What is the best way to do this?
Thanks...
June 30, 2004 at 11:40 am
What do you mean by "scrambled"?
1. Encrypted?
2. Rows mixed up (ie. original 1,2,3,4,5 - scrambled 4, 5, 1, 3, 2)?
3. Something else?
-SQLBill
June 30, 2004 at 11:54 am
I have SSN, salary, customer names and etc that I need to scramble the data within those fields so a third party vendor can't read it.
July 1, 2004 at 6:07 pm
SQL Server allows data sent between the client and the server to be encrypted. This ensures that any application or user intercepting the data packets on the network cannot view confidential or sensitive data (for example, passwords sent across the network as a user logs into an instance of SQL Server). SQL Server can use the Secure Sockets Layer (SSL) to encrypt all data transmitted between an application computer and an instance of SQL Server. The SSL encryption is performed within the Super Socket Net-Library (Dbnetlib.dll and Ssnetlib.dll) and applies to all inter-computer protocols supported by SQL Server 2000. Enabling encryption slows the performance of the Net-Libraries. Encryption forces the following actions in addition to all of the work for an unencrypted connection:
August 24, 2004 at 9:55 am
I am struggling with the same issue. Did anyone have a good solution to mixing up or changing names, SSNs, etc. in development environments?
April 7, 2005 at 12:25 pm
Create a table with id and val columns
select * from test
1 one
2 two
3 three
4 four
-- Scramble the data with next record value
select a.empno,max(b.val) VAL
from (select * from test) a,
(select * from test) b
where b.empno > a.empno
group by a.empno
1 two
2 three
3 four
April 7, 2005 at 1:18 pm
The problem with that solution is that it is a very simple algorithm based on the data itself, and would not pass HIPAA requirements.
A couple of things come to mind...
1. Use a phone book (or similar database) to pull random names to replace the existing ones.
2. Create a table of the names you have, and randomly replace from that table (I'd carry it further and select first and last from different records.)
You could use 1 and/or 2 for addresses as well.
3. Use a random number algorithm to replace social security numbers, phone numbers, etc.
4. There are commercial products available that will sanitize your data for you. I'm not familiar with any of them off the top of my head, but it is available.
Steve
April 7, 2005 at 6:46 pm
Send me a PM if you don't find an answer to this.
I wrote a few generic utility functions for this specific task (scrambling data for testing) and they're all ready to go. I can send you a copy.
(I thought I posted them here sometime back but I'm to lazy right now to see if they are in the library.)
April 26, 2005 at 9:32 am
John,
I could really use a utility function like that as well. Would you mind posting the file or referencing which post it is in?
Thanks a bunch!
~Joe
April 26, 2005 at 2:58 pm
Just had a PM about this--here it is, quick and dirty scramble utility. Free to all-- you'll never see me copyright anything. Feel free to fix/make better (and send me --PM-- a copy when you do).
-----------------------
DROP FUNCTION string_field
go
CREATE FUNCTION string_field
( @string VARCHAR(4000) = NULL
,@delimiter VARCHAR(20)
,@position INT
) RETURNS VARCHAR(255)
AS
BEGIN
---NOTE: modified because prior version didn't handle space as delimiter
DECLARE @result VARCHAR(255)
,@work VARCHAR(4000)
,@pattern VARCHAR(255)
,@i INT
,@j INT
,@ld INT
SELECT @result = ""
,@ld = LEN( REPLACE( @delimiter, ' ', 'X' ) ) --if delim is space then len = 0 so this fixes that
IF @position > 0 BEGIN
SELECT @pattern = "%" + @delimiter + "%"
,@work = @string
,@j = 0 --init
WHILE ( @j-2 < @position ) BEGIN
SELECT @i = PATINDEX( @pattern, @work )
,@j = @j-2 + 1
IF @i > 0 BEGIN
SELECT @result = SUBSTRING( @work, 1, @i - 1 )
,@work = SUBSTRING( @work, @i + @Ld, 4000 )
END ELSE BEGIN
IF @j-2 = @position
SELECT @result = @work
ELSE
SELECT @result = ""
,@j = @position
END
END
END
RETURN (@result)
END
GO
DROP PROC isoScramber
GO
CREATE PROC isoScramber
( @table_name VARCHAR(30),
@col_name VARCHAR(30),
@scramble_type VARCHAR(1) = 'S', --S=sequential#, W=word replace
@seq_seed INT = 1, --seed for scramble type S ( as IDENTITY function)
@seq_incr INT = 1, --incr for scramble type S ( as IDENTITY function)
@word_pos INT = 0, --0 = all --word position scramble type W
@word_delimiter VARCHAR(5) = ' ', --word position scramble type W
@debug INT = 0
) AS
SET NOCOUNT ON
--NOTE that word pos 0 type W not implemented!!!!
DECLARE @sql VARCHAR(6000)
,@cwidth VARCHAR(7)
,@crlf varchar(2)
SELECT @cwidth = CONVERT( VARCHAR(7) , col_length( @table_name, @col_name ) )
,@crlf = ''
IF @cwidth IS NULL BEGIN
PRINT 'INVALID TABLE/COLUMN NAME '
RETURN
END
IF @debug > 0 SELECT @crlf = CHAR(13) + CHAR(10)
IF @scramble_type = 'S' BEGIN
SELECT @sql = ' SELECT IDENTITY( INT, ' + CONVERT(VARCHAR(10), @seq_seed ) + ', ' + CONVERT(VARCHAR(6), @seq_incr ) + ') as seq' + @crlf
+ ' , w.orig_val ' + @crlf
+ ' ,CONVERT( VARCHAR( ' + @cwidth + ' ), '''' ) AS new_val ' + @crlf
+ ' INTO #txscramble ' + @crlf
+ ' FROM ( SELECT DISTINCT ' + @col_name + ' AS orig_val FROM ' + @table_name + ') w ' + @crlf
+ ' ' + @crlf
+ ' UPDATE #txscramble SET new_val = CONVERT( VARCHAR( ' + @cwidth + '), seq ) ' + @crlf
+ ' ' + @crlf
+ ' UPDATE a ' + @crlf
+ ' SET ' + @col_name + ' = b.new_val ' + @crlf
+ ' FROM ' + @table_name + ' a ' + @crlf
+ ' JOIN #txscramble b ON a.' + @col_name + ' = b.orig_val ' + @crlf
END ELSE IF @scramble_type = 'W' BEGIN
SELECT @sql = ' SELECT IDENTITY( INT, 1, 1 ) as seq' + @crlf
+ ' , w.orig_val ' + @crlf
+ ' , 0 AS new_seq ' + @crlf
+ ' ,CONVERT( VARCHAR( ' + @cwidth + ' ), '''' ) AS new_val ' + @crlf
+ ' INTO #txscramble ' + @crlf
+ ' FROM ( SELECT DISTINCT dbo.string_field( ' + @col_name + ', ''' + @word_delimiter + ''', ' + CONVERT( VARCHAR(3), @word_pos ) + ' ) AS orig_val ' + @crlf
+ ' FROM ' + @table_name + ') w ' + @crlf
+ ' ' + @crlf
+ ' UPDATE a ' + @crlf
+ ' SET new_seq = CASE WHEN CONVERT( INT, shift ) + seq > maxseq THEN (CONVERT( INT, shift )+ seq ) - Maxseq ELSE shift + seq END ' + @crlf
+ ' FROM #txscramble a ' + @crlf
+ ' ,( SELECT CONVERT( INT, (rand() * (maxseq - 1) ) + 1 ) AS shift, maxseq ' + @crlf
+ ' FROM ( SELECT MAX( seq ) AS maxseq FROM #txscramble ) w1 ) w ' + @crlf
+ ' ' + @crlf
+ ' UPDATE a ' + @crlf
+ ' SET new_val = b.orig_val ' + @crlf
+ ' FROM #txscramble a ' + @crlf
+ ' JOIN #txscramble b ON b.seq = a.new_seq ' + @crlf
+ ' ' + @crlf
+ ' UPDATE a ' + @crlf
+ ' SET ' + @col_name + ' ' + @crlf
+ ' = CONVERT( VARCHAR( ' + @cwidth + '), SUBSTRING( ' + @col_name + ', 1, CHARINDEX( b.orig_val, ' + @col_name + ' ) - 1 ) + b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' ) + LEN( orig_val ), len(' + @col_name + ') ) ) ' + @crlf
+ ' FROM ' + @table_name + ' a ' + @crlf
+ ' JOIN #txscramble b ON dbo.string_field( a.' + @col_name + ', ''' + @word_delimiter + ''', ' + CONVERT( VARCHAR(3), @word_pos ) + ' ) = b.orig_val ' + @crlf
--SET col = SUBSTRING( col, 1, CHARINDEX( orig_value, col ) - 1 ) + newval + SUBSTRING( col, CHARINDEX( orig_value, col ) + LEN( orig_value ), len(col) )
END
IF @debug > 0 PRINT @sql
IF @debug < 10 EXEC (@sql)
RETURN
GO
/*
-----------------------------------------------------------------
----DEMO /TEST SCRIPT
-----------------------------------------------------------------
DROP TABLE testscramble
SELECT '123-45-6789' AS SSN
,'Smith, John H. ' AS name
,'123 East 2nd Street, apt 1 ' AS addr
INTO testscramble
INSERT INTO testscramble SELECT '123-45-6789', 'Smith, John H. ', '123 Sweetwater Drive '
INSERT INTO testscramble SELECT '234-44-6789', 'Jones, James hawthorn ', '456 North 5th street, '
INSERT INTO testscramble SELECT '123-99-6789', 'Sillius, seymore q. ', '789 East 2nd Street, apt 1 '
INSERT INTO testscramble SELECT '123-45-1234', 'Jameson, mary ', '01234 West Street, apt 1 '
INSERT INTO testscramble SELECT '123-33-6666', 'Bergeron, harrison ', '5678 East 2nd Street, apt 1 '
INSERT INTO testscramble SELECT '123-23-2342', 'Smith, Jane D. ', '9012 North 3rd Street, apt 1 '
INSERT INTO testscramble SELECT '456-45-6789', 'Smith, John H. number2 ', '345 East 8th Street, '
INSERT INTO testscramble SELECT '789-45-6789', 'Tennison, A. L. ', '678 South 1std Street, apt 1 '
INSERT INTO testscramble SELECT '222-45-6789', 'Dilbert, Jack ', '910 East Maple Street '
INSERT INTO testscramble SELECT '333-45-6789', 'Doofus, donal ', '111 Elm Street '
INSERT INTO testscramble SELECT '444-45-6789', 'Mouse, Mickey ', '999 East 111th Street, apt 1 '
select * from testscramble
select * into savescramble
from testscramble
select * from testscramble
EXEC isoScramber
@table_name = 'testscramble'
,@col_name = 'ssn'
,@scramble_type = 'S' --S=sequential#, W=word replace
,@seq_seed =100000000
,@seq_incr =1
,@word_pos =0 --0 = all
,@word_delimiter = ' '
,@debug = 9
select * from testscramble
select * from testscramble
EXEC isoScramber
@table_name = 'testscramble'
,@col_name = 'name'
,@scramble_type = 'W' --S=sequential#, W=word replace
,@seq_seed =1
,@seq_incr =1
,@word_pos =1 --0 = all
,@word_delimiter = ','
,@debug = 9
select * from testscramble
select * from testscramble
EXEC isoScramber
@table_name = 'testscramble'
,@col_name = 'addr'
,@scramble_type = 'W' --S=sequential#, W=word replace
,@seq_seed =1
,@seq_incr =1
,@word_pos =3 --0 = all
,@word_delimiter = ' '
,@debug = 9
select * from testscramble
drop table testscramble
drop table savescramble
select * into testscramble from savescramble
*/
DROP FUNCTION count_char
go
CREATE FUNCTION count_char
( @string VARCHAR(255)
,@character CHAR(1)
) RETURNS INT
AS
BEGIN
RETURN LEN( @string ) - LEN( REPLACE( @string, @character, '' ))
END
April 27, 2005 at 6:56 am
A few questions:
1) How many records are you talking.
2) Does the data need to be unique on some or all of the columns, which ones?
If the data doesn't need to be unique on some of the columns then you can use simple substitutions.
If it needs to be unique you can create object tables with enough data to support the number of records and apply identity column to each table to use with. Then join on the numbers and replace the values between them, this way no direct correlation can be made between real world data and the test data. If you try to obvascate you leave the potential for discovery in the test data.
For Example if I need 10000 SSN values I would do like so.
CREATE TABLE Numbers (
Val char(1) not null primary key
)
GO
INSERT Numbers (Val) Values('0')
INSERT Numbers (Val) Values('1')
INSERT Numbers (Val) Values('2')
INSERT Numbers (Val) Values('3')
INSERT Numbers (Val) Values('4')
INSERT Numbers (Val) Values('5')
INSERT Numbers (Val) Values('6')
INSERT Numbers (Val) Values('7')
INSERT Numbers (Val) Values('8')
INSERT Numbers (Val) Values('9')
GO
CREATE TABLE SSNs (
Idx int identity(1,1) Not null,
SSN_Value Char(11) Not null Primary Key
)
GO
INSERT SSNs (SSN_Value)
SELECT
TOP 10000 (A.Val + B.Val + C.Val + '-' + D.Val + E.Val + '-' + F.Val + G.Val + H.Val + I.Val) SSN
FROM
dbo.Numbers A
CROSS JOIN
dbo.Numbers B
CROSS JOIN
dbo.Numbers C
CROSS JOIN
dbo.Numbers D
CROSS JOIN
dbo.Numbers E
CROSS JOIN
dbo.Numbers F
CROSS JOIN
dbo.Numbers G
CROSS JOIN
dbo.Numbers H
CROSS JOIN
dbo.Numbers I
GO
This provides me a large enough replacement value whch has no real world interpretation to the actual data unless I choose to build one.
May 17, 2005 at 5:11 pm
Thanks for the post, Antares! I modified it slightly to use table variables (I'm becoming a huge fan of those!) and have spread it around a bit.
Alternatively, if you needed more SSNs, you could leave the first three digits zero and use the last four.
-----
[font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]
July 6, 2006 at 8:24 am
"Invalid length parameter passed to the substring function. The statement has been terminated." error appears for the statement below: (possibly because the childlastname field may well be null or blank for some reason. It may sound dumb, but I haven't been able to figure how to fix this: [Tried isnull(ChildLastName,' ') to fix it]
Help?
UPDATE a
SET ChildLastName
=
CONVERT( VARCHAR( 25), SUBSTRING( ChildLastName, 1, CHARINDEX( b.orig_val, ChildLastName ) - 1 ) + b.new_val + SUBSTRING( ChildLastName, CHARINDEX( b.orig_val, ChildLastName ) + LEN( orig_val ), len(ChildLastName) ) )
FROM Child a
JOIN #txscramble b ON dbo.string_field( a.ChildLastName, ',', 1 ) = b.orig_val
which is originally (not debugged)
UPDATE a
SET new_seq = CASE WHEN CONVERT( INT, shift ) + seq > maxseq THEN (CONVERT( INT, shift )+ seq ) - Maxseq ELSE shift + seq END
FROM #txscramble a
,( SELECT CONVERT( INT, (rand() * (maxseq - 1) ) + 1 ) AS shift, maxseq
FROM ( SELECT MAX( seq ) AS maxseq FROM #txscramble ) w1 ) w
July 6, 2006 at 9:41 am
This should do it. A case statement that checks to see if the Charindex is zero.
CASE WHEN CHARINDEX( b.orig_val, ' + @col_name + ' )=0
THEN CONVERT( VARCHAR( ' + @cwidth + '),b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' )
+ LEN( orig_val ), len(' + @col_name + ') ) )
ELSE CONVERT( VARCHAR( ' + @cwidth + '),
SUBSTRING( ' + @col_name + ', 1, CHARINDEX( b.orig_val, ' + @col_name + ' )
- 1 ) + b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' )
+ LEN( orig_val ), len(' + @col_name + ') ) ) END'
-----------------------------------------
SET NOCOUNT ON
--NOTE that word pos 0 type W not implemented!!!!
DECLARE @sql VARCHAR(6000)
,@cwidth VARCHAR(7)
,@crlf varchar(2)
declare @nsql nvarchar(1000)
SELECT @nsql= N'Select @cwidth= col_length ('+ quotename(@table_name,'''')+','+ quotename(@col_name,'''')+N')'
print @nsql--exec( @sql)
EXEC sp_executesql @nsql, N'@cwidth varchar(50) OUTPUT', @cwidth OUTPUT
print @cwidth
--SELECT @cwidth = CONVERT( VARCHAR(7) , col_length( quotename(@table_name,''''), quotename(@col_name,'''') ) )
-- ,@crlf = ''
IF @cwidth IS NULL BEGIN
PRINT 'INVALID TABLE/COLUMN NAME '
RETURN
END
IF @debug > 0 SELECT @crlf = CHAR(13) + CHAR(10)
IF @scramble_type = 'S' BEGIN
SELECT @sql = ' SELECT IDENTITY( INT, ' + CONVERT(VARCHAR(10), @seq_seed ) + ', ' + CONVERT(VARCHAR(6), @seq_incr ) + ') as seq' + @crlf
+ ' , w.orig_val ' + @crlf
+ ' ,CONVERT( VARCHAR( ' + @cwidth + ' ), '''' ) AS new_val ' + @crlf
+ ' INTO #txscramble ' + @crlf
+ ' FROM ( SELECT DISTINCT ' + @col_name + ' AS orig_val FROM ' + @table_name + ') w ' + @crlf
+ ' ' + @crlf
+ ' UPDATE #txscramble SET new_val = CONVERT( VARCHAR( ' + @cwidth + '), seq ) ' + @crlf
+ ' ' + @crlf
+ ' UPDATE a ' + @crlf
+ ' SET ' + @col_name + ' = b.new_val ' + @crlf
+ ' FROM ' + @table_name + ' a ' + @crlf
+ ' JOIN #txscramble b ON a.' + @col_name + ' = b.orig_val ' + @crlf
END
ELSE IF @scramble_type = 'W' BEGIN
SELECT @sql = ' SELECT IDENTITY( INT, 1, 1 ) as seq' + @crlf
+ ' , w.orig_val ' + @crlf
+ ' , 0 AS new_seq ' + @crlf
+ ' ,CONVERT( VARCHAR( ' + @cwidth + ' ), '''' ) AS new_val ' + @crlf
+ ' INTO #txscramble ' + @crlf
+ ' FROM ( SELECT DISTINCT dbo.string_field( ' + @col_name + ', ''' + @word_delimiter + ''', ' + CONVERT( VARCHAR(3), @word_pos ) + ' ) AS orig_val ' + @crlf
+ ' FROM ' + @table_name + ') w ' + @crlf
+ ' ' + @crlf
+ ' UPDATE a ' + @crlf
+ ' SET new_seq = CASE WHEN CONVERT( INT, shift ) + seq > maxseq THEN (CONVERT( INT, shift )+ seq ) - Maxseq ELSE shift + seq END ' + @crlf
+ ' FROM #txscramble a ' + @crlf
+ ' ,( SELECT CONVERT( INT, (rand() * (maxseq - 1) ) + 1 ) AS shift, maxseq ' + @crlf
+ ' FROM ( SELECT MAX( seq ) AS maxseq FROM #txscramble ) w1 ) w ' + @crlf
+ ' ' + @crlf
+ ' UPDATE a ' + @crlf
+ ' SET new_val = b.orig_val ' + @crlf
+ ' FROM #txscramble a ' + @crlf
+ ' JOIN #txscramble b ON b.seq = a.new_seq ' + @crlf
+ ' ' + @crlf
+ ' UPDATE a ' + @crlf
+ ' SET ' + @col_name + ' ' + @crlf
+ ' = CASE WHEN CHARINDEX( b.orig_val, ' + @col_name + ' )=0
THEN CONVERT( VARCHAR( ' + @cwidth + '),b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' )
+ LEN( orig_val ), len(' + @col_name + ') ) )
ELSE CONVERT( VARCHAR( ' + @cwidth + '),
SUBSTRING( ' + @col_name + ', 1, CHARINDEX( b.orig_val, ' + @col_name + ' )
- 1 ) + b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' )
+ LEN( orig_val ), len(' + @col_name + ') ) ) END' + @crlf
+ ' FROM ' + @table_name + ' a ' + @crlf
+ ' JOIN #txscramble b ON dbo.string_field( a.' + @col_name + ', ''' + @word_delimiter + ''', ' + CONVERT( VARCHAR(3), @word_pos ) + ' ) = b.orig_val ' + @crlf
--SET col = SUBSTRING( col, 1, CHARINDEX( orig_value, col ) - 1 ) + newval + SUBSTRING( col, CHARINDEX( orig_value, col ) + LEN( orig_value ), len(col) )
print @sql
END
IF @debug > 0 PRINT @sql
IF @debug < 10 EXEC (@sql)
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply