Scramble Production Data for testing purposes

  • I need some direction on how to scramble my production data for testing purposes.  We have a compensation system that we need to copy down from production to UAT and DEV.  Not the entire database needs to be scrambled, but a lot of it.  What is the best way to do this?

     

    Thanks...

  • What do you mean by "scrambled"?

    1. Encrypted?

    2. Rows mixed up (ie. original 1,2,3,4,5 - scrambled 4, 5, 1, 3, 2)?

    3. Something else?

    -SQLBill

  • I have SSN, salary, customer names and etc that I need to scramble the data within those fields so a third party vendor can't read it.

  • SQL Server allows data sent between the client and the server to be encrypted. This ensures that any application or user intercepting the data packets on the network cannot view confidential or sensitive data (for example, passwords sent across the network as a user logs into an instance of SQL Server). SQL Server can use the Secure Sockets Layer (SSL) to encrypt all data transmitted between an application computer and an instance of SQL Server. The SSL encryption is performed within the Super Socket Net-Library (Dbnetlib.dll and Ssnetlib.dll) and applies to all inter-computer protocols supported by SQL Server 2000. Enabling encryption slows the performance of the Net-Libraries. Encryption forces the following actions in addition to all of the work for an unencrypted connection:

  • I am struggling with the same issue.  Did anyone have a good solution to mixing up or changing names, SSNs, etc. in development environments?

  • Create a table with id and val columns

    select * from test

    1 one      

    2 two      

    3 three    

    4 four     

    -- Scramble the data with next record value

    select a.empno,max(b.val) VAL

    from (select * from test) a,

         (select * from test) b

    where b.empno > a.empno

    group by a.empno

    1 two      

    2 three    

    3 four     

     

     

     

  • The problem with that solution is that it is a very simple algorithm based on the data itself, and would not pass HIPAA requirements.

    A couple of things come to mind...

    1. Use a phone book (or similar database) to pull random names to replace the existing ones. 

    2. Create a table of the names you have, and randomly replace from that table (I'd carry it further and select first and last from different records.)

    You could use 1 and/or 2 for addresses as well.

    3. Use a random number algorithm to replace social security numbers, phone numbers, etc.

    4. There are commercial products available that will sanitize your data for you.  I'm not familiar with any of them off the top of my head, but it is available.

    Steve

  • Send me a PM if you don't find an answer to this.

    I wrote a few generic utility functions for this specific task (scrambling data for testing) and they're all ready to go.  I can send you a copy.

    (I thought I posted them here sometime back but I'm to lazy right now to see if they are in the library.)

     

  • John,

    I could really use a utility function like that as well.  Would you mind posting the file or referencing which post it is in? 

    Thanks a bunch! 

    ~Joe

  • Just had a PM about this--here it is, quick and dirty scramble utility.  Free to all-- you'll never see me copyright anything.  Feel free to fix/make better (and send me --PM-- a copy when you do).

    -----------------------

    DROP FUNCTION string_field

    go

    CREATE FUNCTION string_field

    (  @string        VARCHAR(4000) = NULL

       ,@delimiter    VARCHAR(20)

       ,@position     INT

    ) RETURNS VARCHAR(255)

    AS

    BEGIN

    ---NOTE: modified because prior version didn't handle space as delimiter

       DECLARE @result        VARCHAR(255)

              ,@work          VARCHAR(4000)

              ,@pattern       VARCHAR(255)

              ,@i             INT

              ,@j             INT

              ,@ld            INT

             

       SELECT @result = ""

             ,@ld = LEN( REPLACE( @delimiter, ' ', 'X' ) )  --if delim is space then len = 0 so this fixes that

       IF @position > 0 BEGIN

          SELECT @pattern = "%" + @delimiter + "%"

                ,@work = @string

                ,@j = 0 --init

         

          WHILE ( @j-2 < @position ) BEGIN

             SELECT @i = PATINDEX( @pattern, @work )

                   ,@j = @j-2 + 1

             IF @i > 0 BEGIN

                SELECT @result = SUBSTRING( @work, 1, @i - 1 )

                      ,@work   = SUBSTRING( @work, @i + @Ld, 4000 )

             END ELSE BEGIN

                IF @j-2 = @position

                   SELECT @result = @work

                ELSE

                   SELECT @result = ""

                         ,@j = @position

             END

          END                           

       END              

       RETURN (@result)

    END

    GO

     

    DROP PROC isoScramber

    GO

    CREATE PROC isoScramber

    ( @table_name    VARCHAR(30),

      @col_name      VARCHAR(30),

      @scramble_type VARCHAR(1) = 'S',          --S=sequential#, W=word replace

      @seq_seed      INT = 1,              --seed for scramble type S ( as IDENTITY function)

      @seq_incr      INT = 1,              --incr for scramble type S ( as IDENTITY function)

      @word_pos      INT = 0,  --0 = all   --word position scramble type W

      @word_delimiter VARCHAR(5) = ' ',    --word position scramble type W

      @debug         INT = 0

    ) AS

    SET NOCOUNT ON

    --NOTE that word pos 0 type W not implemented!!!!

    DECLARE @sql VARCHAR(6000)

           ,@cwidth  VARCHAR(7)

           ,@crlf varchar(2)

    SELECT @cwidth = CONVERT( VARCHAR(7) , col_length( @table_name, @col_name ) )

          ,@crlf = ''

    IF @cwidth IS NULL BEGIN

       PRINT 'INVALID TABLE/COLUMN NAME '

       RETURN

    END

    IF @debug > 0 SELECT @crlf = CHAR(13) + CHAR(10)

    IF @scramble_type = 'S' BEGIN

       SELECT @sql = ' SELECT IDENTITY( INT, ' + CONVERT(VARCHAR(10), @seq_seed ) + ', ' + CONVERT(VARCHAR(6), @seq_incr ) + ') as seq' + @crlf

                  + '       , w.orig_val           ' + @crlf

                  + '       ,CONVERT( VARCHAR( ' + @cwidth + ' ), '''' ) AS new_val ' + @crlf

                  + '   INTO #txscramble ' + @crlf

                  + '   FROM ( SELECT DISTINCT ' + @col_name + ' AS orig_val FROM ' + @table_name + ') w ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE #txscramble SET new_val = CONVERT( VARCHAR( ' + @cwidth + '), seq ) ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE a ' + @crlf

                  + '    SET ' + @col_name + ' = b.new_val ' + @crlf

                  + '   FROM ' + @table_name + ' a ' + @crlf

                  + '        JOIN #txscramble b ON a.' + @col_name + ' = b.orig_val ' + @crlf

    END ELSE IF @scramble_type = 'W' BEGIN

       SELECT @sql = ' SELECT IDENTITY( INT, 1, 1 ) as seq' + @crlf

                  + '       , w.orig_val           ' + @crlf

                  + '       , 0 AS new_seq         ' + @crlf

                  + '       ,CONVERT( VARCHAR( ' + @cwidth + ' ), '''' ) AS new_val ' + @crlf

                  + '   INTO #txscramble ' + @crlf

                  + '   FROM ( SELECT DISTINCT dbo.string_field( ' + @col_name + ', ''' + @word_delimiter + ''', ' + CONVERT( VARCHAR(3), @word_pos ) + ' )  AS orig_val ' + @crlf

                  + '            FROM ' + @table_name + ') w ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE a ' + @crlf

                  + '    SET new_seq = CASE WHEN CONVERT( INT, shift ) + seq > maxseq THEN (CONVERT( INT, shift )+ seq ) - Maxseq ELSE shift + seq END ' + @crlf

                  + '   FROM #txscramble a ' + @crlf

                  + '       ,( SELECT CONVERT( INT, (rand() * (maxseq - 1) ) + 1 ) AS shift, maxseq ' + @crlf

                  + '            FROM ( SELECT MAX( seq ) AS maxseq FROM #txscramble ) w1 ) w ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE a ' + @crlf

                  + '    SET new_val = b.orig_val ' + @crlf

                  + '   FROM #txscramble a ' + @crlf

                  + '        JOIN #txscramble b ON b.seq = a.new_seq ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE a ' + @crlf

                  + '    SET ' + @col_name + ' '  + @crlf

                  + '        = CONVERT( VARCHAR( ' + @cwidth + '), SUBSTRING( ' + @col_name + ', 1, CHARINDEX( b.orig_val, ' + @col_name + ' ) - 1 ) + b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' ) + LEN( orig_val ), len(' + @col_name + ') ) ) ' + @crlf

                  + '   FROM ' + @table_name + ' a ' + @crlf

                  + '        JOIN #txscramble b ON dbo.string_field( a.' + @col_name + ', ''' + @word_delimiter + ''', ' + CONVERT( VARCHAR(3), @word_pos ) + ' ) = b.orig_val ' + @crlf

    --SET col = SUBSTRING( col, 1, CHARINDEX( orig_value, col ) - 1 ) + newval + SUBSTRING( col, CHARINDEX( orig_value, col ) + LEN( orig_value ), len(col) ) 

    END

    IF @debug > 0 PRINT @sql

    IF @debug < 10 EXEC (@sql)

    RETURN

    GO

    /* 

    -----------------------------------------------------------------

    ----DEMO /TEST SCRIPT

    -----------------------------------------------------------------

    DROP TABLE testscramble

    SELECT '123-45-6789' AS SSN

          ,'Smith, John H.            ' AS name

          ,'123 East 2nd Street, apt 1     ' AS addr

      INTO testscramble

    INSERT INTO testscramble SELECT '123-45-6789', 'Smith, John H.            ', '123 Sweetwater Drive           '

    INSERT INTO testscramble SELECT '234-44-6789', 'Jones, James hawthorn     ', '456 North 5th street,     '

    INSERT INTO testscramble SELECT '123-99-6789', 'Sillius, seymore q.       ', '789 East 2nd Street, apt 1     '

    INSERT INTO testscramble SELECT '123-45-1234', 'Jameson, mary             ', '01234 West Street, apt 1     '

    INSERT INTO testscramble SELECT '123-33-6666', 'Bergeron, harrison        ', '5678 East 2nd Street, apt 1     '

    INSERT INTO testscramble SELECT '123-23-2342', 'Smith, Jane D.            ', '9012 North 3rd Street, apt 1     '

    INSERT INTO testscramble SELECT '456-45-6789', 'Smith, John H. number2    ', '345 East 8th Street,   '

    INSERT INTO testscramble SELECT '789-45-6789', 'Tennison, A. L.           ', '678 South 1std Street, apt 1     '

    INSERT INTO testscramble SELECT '222-45-6789', 'Dilbert, Jack             ', '910 East Maple Street '

    INSERT INTO testscramble SELECT '333-45-6789', 'Doofus, donal             ', '111 Elm Street '

    INSERT INTO testscramble SELECT '444-45-6789', 'Mouse, Mickey             ', '999 East 111th Street, apt 1     '

    select * from testscramble

    select * into savescramble

    from testscramble

    select * from testscramble

    EXEC isoScramber

      @table_name    = 'testscramble'

     ,@col_name      = 'ssn'

     ,@scramble_type = 'S'          --S=sequential#, W=word replace

     ,@seq_seed      =100000000

     ,@seq_incr      =1

     ,@word_pos      =0  --0 = all

     ,@word_delimiter = ' '

     ,@debug         = 9

    select * from testscramble

    select * from testscramble

    EXEC isoScramber

      @table_name    = 'testscramble'

     ,@col_name      = 'name'

     ,@scramble_type = 'W'          --S=sequential#, W=word replace

     ,@seq_seed      =1

     ,@seq_incr      =1

     ,@word_pos      =1  --0 = all

     ,@word_delimiter = ','

     ,@debug         = 9

    select * from testscramble

    select * from testscramble

    EXEC isoScramber

      @table_name    = 'testscramble'

     ,@col_name      = 'addr'

     ,@scramble_type = 'W'          --S=sequential#, W=word replace

     ,@seq_seed      =1

     ,@seq_incr      =1

     ,@word_pos      =3  --0 = all

     ,@word_delimiter = ' '

     ,@debug         = 9

    select * from testscramble

    drop table testscramble  

    drop table savescramble  

    select * into testscramble from savescramble

     

    */

    DROP FUNCTION count_char

    go

    CREATE FUNCTION count_char

    (  @string    VARCHAR(255)

      ,@character CHAR(1)

    ) RETURNS INT

    AS

    BEGIN

    RETURN LEN( @string ) - LEN( REPLACE( @string, @character, '' ))

    END

  • A few questions:

    1) How many records are you talking.

    2) Does the data need to be unique on some or all of the columns, which ones?

    If the data doesn't need to be unique on some of the columns then you can use simple substitutions.

    If it needs to be unique you can create object tables with enough data to support the number of records and apply identity column to each table to use with. Then join on the numbers and replace the values between them, this way no direct correlation can be made between real world data and the test data. If you try to obvascate you leave the potential for discovery in the test data.

    For Example if I need 10000 SSN values I would do like so.

    CREATE TABLE Numbers (

    Val char(1) not null primary key

    )

    GO

    INSERT Numbers (Val) Values('0')

    INSERT Numbers (Val) Values('1')

    INSERT Numbers (Val) Values('2')

    INSERT Numbers (Val) Values('3')

    INSERT Numbers (Val) Values('4')

    INSERT Numbers (Val) Values('5')

    INSERT Numbers (Val) Values('6')

    INSERT Numbers (Val) Values('7')

    INSERT Numbers (Val) Values('8')

    INSERT Numbers (Val) Values('9')

    GO

    CREATE TABLE SSNs (

     Idx int identity(1,1) Not null,

     SSN_Value Char(11) Not null Primary Key

    )

    GO

    INSERT SSNs (SSN_Value)

    SELECT

     TOP 10000 (A.Val + B.Val + C.Val + '-' + D.Val + E.Val + '-' + F.Val + G.Val + H.Val + I.Val) SSN

    FROM

     dbo.Numbers A

    CROSS JOIN

     dbo.Numbers B

    CROSS JOIN

     dbo.Numbers C

    CROSS JOIN

     dbo.Numbers D

    CROSS JOIN

     dbo.Numbers E

    CROSS JOIN

     dbo.Numbers F

    CROSS JOIN

     dbo.Numbers G

    CROSS JOIN

     dbo.Numbers H

    CROSS JOIN

     dbo.Numbers I

    GO

    This provides me a large enough replacement value whch has no real world interpretation to the actual data unless I choose to build one.

  • Thanks for the post, Antares!  I modified it slightly to use table variables (I'm becoming a huge fan of those!) and have spread it around a bit.

    Alternatively, if you needed more SSNs, you could leave the first three digits zero and use the last four.

    -----
    [font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]

  • "Invalid length parameter passed to the substring function. The statement has been terminated."  error appears for the statement below:  (possibly because the childlastname field may well be null or blank for some reason.   It may sound dumb, but I haven't been able to figure how to fix this:  [Tried isnull(ChildLastName,' ') to fix it]

    Help?

     UPDATE a

        SET ChildLastName

            =

     CONVERT( VARCHAR( 25), SUBSTRING( ChildLastName, 1, CHARINDEX( b.orig_val, ChildLastName ) - 1 ) + b.new_val + SUBSTRING( ChildLastName, CHARINDEX( b.orig_val, ChildLastName ) + LEN( orig_val ), len(ChildLastName) ) ) 

             

       FROM Child a

            JOIN #txscramble b ON dbo.string_field( a.ChildLastName, ',', 1 ) = b.orig_val

     

    which is originally (not debugged)

     UPDATE a

        SET new_seq = CASE WHEN CONVERT( INT, shift ) + seq > maxseq THEN (CONVERT( INT, shift )+ seq ) - Maxseq ELSE shift + seq END

       FROM #txscramble a

           ,( SELECT CONVERT( INT, (rand() * (maxseq - 1) ) + 1 ) AS shift, maxseq

                FROM ( SELECT MAX( seq ) AS maxseq FROM #txscramble ) w1 ) w

     

  • This should do it.  A case statement that checks to see if the Charindex is zero. 

     

    CASE WHEN CHARINDEX( b.orig_val, ' + @col_name + ' )=0

          THEN CONVERT( VARCHAR( ' + @cwidth + '),b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' )

          + LEN( orig_val ), len(' + @col_name + ') ) )

          ELSE CONVERT( VARCHAR( ' + @cwidth + '),

          SUBSTRING( ' + @col_name + ', 1, CHARINDEX( b.orig_val, ' + @col_name + ' )

          - 1 ) + b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' )

          + LEN( orig_val ), len(' + @col_name + ') ) ) END'

     

     

    -----------------------------------------

     

    SET NOCOUNT ON

    --NOTE that word pos 0 type W not implemented!!!!

    DECLARE @sql VARCHAR(6000)

           ,@cwidth  VARCHAR(7)

           ,@crlf varchar(2)

    declare @nsql nvarchar(1000)

     SELECT @nsql= N'Select @cwidth=  col_length ('+ quotename(@table_name,'''')+','+ quotename(@col_name,'''')+N')'

    print @nsql--exec( @sql)

     EXEC sp_executesql @nsql, N'@cwidth varchar(50) OUTPUT', @cwidth OUTPUT

    print @cwidth

    --SELECT @cwidth = CONVERT( VARCHAR(7) , col_length( quotename(@table_name,''''), quotename(@col_name,'''') ) )

      --    ,@crlf = ''

    IF @cwidth IS NULL BEGIN

       PRINT 'INVALID TABLE/COLUMN NAME '

       RETURN

    END

    IF @debug > 0 SELECT @crlf = CHAR(13) + CHAR(10)

    IF @scramble_type = 'S' BEGIN

       SELECT @sql = ' SELECT IDENTITY( INT, ' + CONVERT(VARCHAR(10), @seq_seed ) + ', ' + CONVERT(VARCHAR(6), @seq_incr ) + ') as seq' + @crlf

                  + '       , w.orig_val           ' + @crlf

                  + '       ,CONVERT( VARCHAR( ' + @cwidth + ' ), '''' ) AS new_val ' + @crlf

                  + '   INTO #txscramble ' + @crlf

                  + '   FROM ( SELECT DISTINCT ' + @col_name + ' AS orig_val FROM ' + @table_name + ') w ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE #txscramble SET new_val = CONVERT( VARCHAR( ' + @cwidth + '), seq ) ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE a ' + @crlf

                  + '    SET ' + @col_name + ' = b.new_val ' + @crlf

                  + '   FROM ' + @table_name + ' a ' + @crlf

                  + '        JOIN #txscramble b ON a.' + @col_name + ' = b.orig_val ' + @crlf

       END

    ELSE IF @scramble_type = 'W' BEGIN

       SELECT @sql = ' SELECT IDENTITY( INT, 1, 1 ) as seq' + @crlf

                  + '       , w.orig_val           ' + @crlf

                  + '       , 0 AS new_seq         ' + @crlf

                  + '       ,CONVERT( VARCHAR( ' + @cwidth + ' ), '''' ) AS new_val ' + @crlf

                  + '   INTO #txscramble ' + @crlf

                  + '   FROM ( SELECT DISTINCT dbo.string_field( ' + @col_name + ', ''' + @word_delimiter + ''', ' + CONVERT( VARCHAR(3), @word_pos ) + ' )  AS orig_val ' + @crlf

                  + '            FROM ' + @table_name + ') w ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE a ' + @crlf

                  + '    SET new_seq = CASE WHEN CONVERT( INT, shift ) + seq > maxseq THEN (CONVERT( INT, shift )+ seq ) - Maxseq ELSE shift + seq END ' + @crlf

                  + '   FROM #txscramble a ' + @crlf

                  + '       ,( SELECT CONVERT( INT, (rand() * (maxseq - 1) ) + 1 ) AS shift, maxseq ' + @crlf

                  + '            FROM ( SELECT MAX( seq ) AS maxseq FROM #txscramble ) w1 ) w ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE a ' + @crlf

                  + '    SET new_val = b.orig_val ' + @crlf

                  + '   FROM #txscramble a ' + @crlf

                  + '        JOIN #txscramble b ON b.seq = a.new_seq ' + @crlf

                  + '   ' + @crlf

                  + ' UPDATE a ' + @crlf

                  + '    SET ' + @col_name + ' '  + @crlf

                  + '        = CASE WHEN CHARINDEX( b.orig_val, ' + @col_name + ' )=0

          THEN CONVERT( VARCHAR( ' + @cwidth + '),b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' )

          + LEN( orig_val ), len(' + @col_name + ') ) )

          ELSE CONVERT( VARCHAR( ' + @cwidth + '),

          SUBSTRING( ' + @col_name + ', 1, CHARINDEX( b.orig_val, ' + @col_name + ' )

          - 1 ) + b.new_val + SUBSTRING( ' + @col_name + ', CHARINDEX( b.orig_val, ' + @col_name + ' )

          + LEN( orig_val ), len(' + @col_name + ') ) ) END' + @crlf

                  + '   FROM ' + @table_name + ' a ' + @crlf

                  + '        JOIN #txscramble b ON dbo.string_field( a.' + @col_name + ', ''' + @word_delimiter + ''', ' + CONVERT( VARCHAR(3), @word_pos ) + ' ) = b.orig_val ' + @crlf

    --SET col = SUBSTRING( col, 1, CHARINDEX( orig_value, col ) - 1 ) + newval + SUBSTRING( col, CHARINDEX( orig_value, col ) + LEN( orig_value ), len(col) ) 

    print @sql

    END

    IF @debug > 0 PRINT @sql

    IF @debug < 10 EXEC (@sql)

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply