May 19, 2023 at 8:26 pm
Comments posted to this topic are about the item Simple script to de-identify data
____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
“libera tute vulgaris ex”
May 20, 2023 at 3:48 am
This was removed by the editor as SPAM
May 20, 2023 at 3:48 am
This was removed by the editor as SPAM
May 20, 2023 at 10:21 am
quoting just to keep original text - reporting original due to spam signature links
simple script in Python that can be used to de-identify data by replacing sensitive information with placeholders:
import re
def deidentify_text(text):
# Replace email addresses with
text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)
# Replace phone numbers with [PHONE]
text = re.sub(r'\b(\+\d{1,2}\s?)?(\()?(\d{3})(?(2)\))[-.\s]?(\d{3})[-.\s]?(\d{4})\b', '[PHONE]', text)
# Replace names with [NAME]
text = re.sub(r'\b[A-Z][a-z]+\b', '[NAME]', text)
# Replace addresses with [ADDRESS]
text = re.sub(r'\b\d+\s\w+\s\w+\b', '[ADDRESS]', text)
# Add more patterns and replacements for other sensitive information if needed
return text
# Example usage
data = "John Doe's email is john.doe@example.com and his phone number is +1 (123) 456-7890."
deidentified_data = deidentify_text(data)
print(deidentified_data)
May 21, 2023 at 6:53 pm
Heh... "Just because you can do something in Python, doesn't mean you should". 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
May 24, 2023 at 12:10 pm
The script will be *SLOW*, because it uses a scalar function and will be called Row-By-Agonizing-Row, multiple million times on a real database.
UPDATE tbl SET name = CAST(NEW_ID() AS VARCHAR(36))
-- Alternative to keep the length
UPDATE tbl SET name = LEFT(CAST(NEW_ID() AS VARCHAR(36)), LEN(name))
would do the same task (anonymizing) much faster, except that it uses UNIDs instead of some random chars.
PS: Don't reinvent the Wheel. Even not if you are not a professionall wheel engineer.
God is real, unless declared integer.
May 24, 2023 at 12:58 pm
just did a small performance test (SQL 2022 Dev.):
SET STATISTICS TIME, IO ON;
DROP TABLE IF EXISTS #tmp1
DROP TABLE IF EXISTS #tmp2
SELECT TOP 2000000 o.name, LEFT(CAST(NEWID() AS VARCHAR(36)), LEN(o.name)) AS anonym
INTO #tmp1
FROM sys.objects AS o
CROSS JOIN sys.objects AS o2
CROSS JOIN sys.objects AS o3
SELECT TOP 2000000 o.name, dbo.scramble(o.name) AS anonym
INTO #tmp2
FROM sys.objects AS o
CROSS JOIN sys.objects AS o2
CROSS JOIN sys.objects AS o3
first statement (with NEWID) was done in 1,39 seconds, while your solution took 5:29 minutes.
And now imaging, that you do not have only one column, but multiple colums and havt to multiply this time by the number of columns (first name must be anonymized too, because there may be many Bobs and Johns out there, but there are much more unique first names as Sorlina, Nigita or Monalisa ( I met once a couple who gave their daugther this name, not to mention the names some promis as Elon Musk uses) out there
God is real, unless declared integer.
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply