Simple script to de-identify data

Stewart "Arturius" Campbell

SSC Guru

Points: 72855
May 19, 2023 at 8:26 pm

#4193279

Comments posted to this topic are about the item Simple script to de-identify data
____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
“libera tute vulgaris ex”

frederico_fonseca SSCoach Points: 16309 · Answer 1

quoting just to keep original text - reporting original due to spam signature links

wrote:

simple script in Python that can be used to de-identify data by replacing sensitive information with placeholders:

import re

def deidentify_text(text):
    # Replace email addresses with [EMAIL]
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    
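    # Replace phone numbers with [PHONE]
    # (?<!\w) is used instead of a leading \b, which fails before "+" or "("
    # when the number follows a space or starts the text
    text = re.sub(r'(?<!\w)(\+\d{1,2}\s?)?(\()?(\d{3})(?(2)\))[-.\s]?(\d{3})[-.\s]?(\d{4})\b', '[PHONE]', text)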
    
    # Replace names with [NAME] (note: this matches every capitalized word, not only names)
    text = re.sub(r'\b[A-Z][a-z]+\b', '[NAME]', text)
    
    # Replace addresses with [ADDRESS]
    text = re.sub(r'\b\d+\s\w+\s\w+\b', '[ADDRESS]', text)
    
    # Add more patterns and replacements for other sensitive information if needed
    
    return text

# Example usage
data = "John Doe's email is john.doe@example.com and his phone number is +1 (123) 456-7890."
deidentified_data = deidentify_text(data)
print(deidentified_data)
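
To illustrate how broad the name pattern in the quoted script is, here is a minimal sketch that applies only that one regular expression to a made-up sentence. The sample text and the constant name `NAME_PATTERN` are assumptions for illustration and are not part of the quoted post.

```python
import re

# Same pattern the quoted script uses for names: any word that starts
# with an uppercase letter followed by one or more lowercase letters.
NAME_PATTERN = re.compile(r'\b[A-Z][a-z]+\b')

# Hypothetical sample sentence, not taken from the thread.
sample = "Please call John Doe on Monday about the Berlin invoice."

masked = NAME_PATTERN.sub('[NAME]', sample)
print(masked)
# [NAME] call [NAME] [NAME] on [NAME] about the [NAME] invoice.
```

The sentence opener, the weekday and the city are masked along with the actual name, so the pattern on its own cannot tell names apart from other capitalized words.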

Jeff Moden SSC Guru Points: 1004704 · Answer 2

Heh... "Just because you can do something in Python, doesn't mean you should". 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Thomas Franz Hall of Fame Points: 3973 · Answer 3

The script will be *SLOW* because it uses a scalar function, which is called Row-By-Agonizing-Row, millions of times on a real database.

UPDATE tbl SET name = CAST(NEWID() AS VARCHAR(36))

-- Alternative to keep the length
UPDATE tbl SET name = LEFT(CAST(NEWID() AS VARCHAR(36)), LEN(name))

would do the same task (anonymizing) much faster, except that it uses GUIDs instead of some random chars.
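
For readers following the Python side of the thread, here is a rough sketch of the same length-preserving idea: take a random GUID and keep only as many leading characters as the original value has. It is an illustration only, not a substitute for the set-based UPDATE, and `guid_placeholder` is a hypothetical helper name.

```python
import uuid


def guid_placeholder(value: str) -> str:
    """Return the leading characters of a random GUID, as many as the input has (at most 36)."""
    guid = str(uuid.uuid4())  # 36 characters, hyphens included
    return guid[:len(value)]


# Hypothetical sample values; the last one is longer than a GUID string.
names = ["Bob", "Monalisa", "A" * 50]
for name in names:
    placeholder = guid_placeholder(name)
    print(len(name), len(placeholder), placeholder)
```

A GUID string has only 36 characters, so any value longer than that gets a 36-character placeholder back, which is also what LEFT(..., LEN(name)) would return in the SQL version.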

PS: Don't reinvent the wheel, especially if you are not a professional wheel engineer.

This reply was modified 2 years, 9 months ago by Thomas Franz.

God is real, unless declared integer.

Thomas Franz Hall of Fame Points: 3973 · Answer 4

I just ran a small performance test (SQL Server 2022 Developer Edition):

SET STATISTICS TIME, IO ON;
DROP TABLE IF EXISTS #tmp1
DROP TABLE IF EXISTS #tmp2

SELECT TOP 2000000 o.name, LEFT(CAST(NEWID() AS VARCHAR(36)), LEN(o.name)) AS anonym
  INTO #tmp1
  FROM sys.objects AS o
 CROSS JOIN sys.objects AS o2
 CROSS JOIN sys.objects AS o3

SELECT TOP 2000000 o.name, dbo.scramble(o.name) AS anonym
  INTO #tmp2
  FROM sys.objects AS o
 CROSS JOIN sys.objects AS o2
 CROSS JOIN sys.objects AS o3

The first statement (with NEWID) finished in 1.39 seconds, while your solution took 5:29 minutes.

And now imagine that you have not just one column but several, and have to multiply this time by the number of columns. First names must be anonymized too: there may be many Bobs and Johns out there, but there are far more rare first names such as Sorlina, Nigita or Monalisa (I once met a couple who gave their daughter this name, not to mention the names some celebrities such as Elon Musk use).
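
To put the two timings side by side, here is a small back-of-the-envelope calculation. It assumes the reported figures mean 1.39 seconds and 5 minutes 29 seconds for 2,000,000 rows, and that the cost grows linearly with the number of columns, which is only a rough approximation. The constant and function names are made up for this sketch.

```python
ROWS = 2_000_000                # TOP 2000000 in the test queries
NEWID_SECONDS = 1.39            # assumed reading of the reported NEWID timing
SCALAR_SECONDS = 5 * 60 + 29    # assumed reading of "5:29 minutes" (329 s)

# How many times slower the scalar function was in this single test.
speedup = SCALAR_SECONDS / NEWID_SECONDS


def estimated_seconds(seconds_per_column: float, columns: int) -> float:
    """Rough estimate, assuming the cost simply multiplies per column."""
    return seconds_per_column * columns


print(f"scalar function is about {speedup:.0f}x slower")
print(f"per row: {SCALAR_SECONDS / ROWS * 1e6:.1f} microseconds vs "
      f"{NEWID_SECONDS / ROWS * 1e6:.3f} microseconds")

for columns in (1, 3, 5):
    scalar = estimated_seconds(SCALAR_SECONDS, columns)
    newid = estimated_seconds(NEWID_SECONDS, columns)
    print(f"{columns} column(s): scalar ~{scalar / 60:.1f} min, NEWID ~{newid:.2f} s")
```

Under these assumptions, five columns would take roughly 27 minutes with the scalar function against about 7 seconds with NEWID, for the same 2,000,000 rows.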

This reply was modified 2 years, 9 months ago by Thomas Franz.
