Simple script to de-identify data

  • Comments posted to this topic are about the item Simple script to de-identify data

    ____________________________________________
    Space, the final frontier? not any more...
    All limits henceforth are self-imposed.
    “libera tute vulgaris ex”

  • This was removed by the editor as SPAM

  • This was removed by the editor as SPAM

  • quoting just to keep original text - reporting original due to spam signature links

    dewo123 wrote:

    simple script in Python that can be used to de-identify data by replacing sensitive information with placeholders:

    import re

    def deidentify_text(text):
    # Replace email addresses with
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)

    # Replace phone numbers with [PHONE]
    text = re.sub(r'\b(\+\d{1,2}\s?)?(\()?(\d{3})(?(2)\))[-.\s]?(\d{3})[-.\s]?(\d{4})\b', '[PHONE]', text)

    # Replace names with [NAME]
    text = re.sub(r'\b[A-Z][a-z]+\b', '[NAME]', text)

    # Replace addresses with [ADDRESS]
    text = re.sub(r'\b\d+\s\w+\s\w+\b', '[ADDRESS]', text)

    # Add more patterns and replacements for other sensitive information if needed

    return text

    # Example usage
    data = "John Doe's email is john.doe@example.com and his phone number is +1 (123) 456-7890."
    deidentified_data = deidentify_text(data)
    print(deidentified_data)

  • Heh... "Just because you can do something in Python, doesn't mean you should". 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The script will be *SLOW*, because it uses a scalar function and will be called Row-By-Agonizing-Row, multiple million times on a real database.

    UPDATE tbl SET name = CAST(NEW_ID() AS VARCHAR(36))

    -- Alternative to keep the length
    UPDATE tbl SET name = LEFT(CAST(NEW_ID() AS VARCHAR(36)), LEN(name))

    would do the same task (anonymizing) much faster, except that it uses UNIDs instead of some random chars.

    PS: Don't reinvent the Wheel. Even not if you are not a professionall wheel engineer.

    • This reply was modified 1 year, 6 months ago by  Thomas Franz.

    God is real, unless declared integer.

  • just did a small performance test (SQL 2022 Dev.):

    SET STATISTICS TIME, IO ON;
    DROP TABLE IF EXISTS #tmp1
    DROP TABLE IF EXISTS #tmp2

    SELECT TOP 2000000 o.name, LEFT(CAST(NEWID() AS VARCHAR(36)), LEN(o.name)) AS anonym
    INTO #tmp1
    FROM sys.objects AS o
    CROSS JOIN sys.objects AS o2
    CROSS JOIN sys.objects AS o3

    SELECT TOP 2000000 o.name, dbo.scramble(o.name) AS anonym
    INTO #tmp2
    FROM sys.objects AS o
    CROSS JOIN sys.objects AS o2
    CROSS JOIN sys.objects AS o3

    first statement (with NEWID) was done in 1,39 seconds, while your solution took 5:29 minutes.

    And now imaging, that you do not have only one column, but multiple colums and havt to multiply this time by the number of columns (first name must be anonymized too, because there may be many Bobs and Johns out there, but there are much more unique first names as Sorlina, Nigita or Monalisa ( I met once a couple who gave their daugther this name, not to mention the names some promis as Elon Musk uses) out there

    • This reply was modified 1 year, 6 months ago by  Thomas Franz.

    God is real, unless declared integer.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply