Anonymous Research

Steve Jones - SSC Editor

SSC Guru

Points: 741243
February 11, 2012 at 2:41 pm

#150827

Comments posted to this topic are about the item Anonymous Research

steve.casey Old Hand Points: 311 · Answer 1

Hi Steve

This isn't about the content of this particular editorial or article, but about the way that the content is sometimes displayed.

My screen is displaying a width of 1680 pixels. The article at

http://www.sqlservercentral.com/blogs/cleveland-dba/2012/02/09/ryo-maintenance-plan-enhancement-request/

requires 1450 pixels horizontally to be read without left-right scrolling... that's awfully tiring!

Keep up the good work 🙂

Thanks, Steve

Eric M Russell SSC Guru Points: 125623 · Answer 2

You might not think this is a big deal, but as more data is gathered by companies and used for secondary purposes, like analysis, it becomes more likely to be inappropriately released. Is a log on a server more secure, or copies of multiple logs on analysts' laptops? I'd think the former, or at least I'd hope the former. If that's true, then we should really be anonymizing data on a regular basis once it leaves hardened server machines.

We need to think hard about whether routinely extracting bulk datasets containing personally identifying elements from a production transactional database is necessary at all.

For example, the BI or operational reporting team may need access to transaction-level records in order to do things like report on the number of enrollments by service category for last month. To do that, they need each distinct member to be identified by a unique key. However, that key doesn't need to be the member's name, phone, SSN, etc. It could be a surrogate integer-based key, and really that's what the BI team would prefer anyhow.
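A minimal sketch of the surrogate key idea, in Python rather than T-SQL for brevity. The field names (`name`, `phone`, `ssn`, `service_category`, `member_key`) and the sample rows are made up for illustration; the point is only that the key map stays on the secured server while the extract carries an integer in place of the PII.

```python
from itertools import count


def assign_surrogate_keys(rows, pii_fields=("name", "phone", "ssn")):
    """Swap PII columns for an integer surrogate key.

    Returns (extract, key_map). The extract is what would leave the
    server; key_map is assumed to stay behind on the secured side.
    """
    next_key = count(1)
    key_map = {}  # PII tuple -> surrogate key
    extract = []
    for row in rows:
        identity = tuple(row[field] for field in pii_fields)
        if identity not in key_map:
            key_map[identity] = next(next_key)
        # Copy every non-PII column, then attach the surrogate key.
        safe_row = {k: v for k, v in row.items() if k not in pii_fields}
        safe_row["member_key"] = key_map[identity]
        extract.append(safe_row)
    return extract, key_map


# Hypothetical enrollment rows; none of this is real data.
enrollments = [
    {"name": "Ann Lee", "phone": "555-0100", "ssn": "000-00-0001",
     "service_category": "dental"},
    {"name": "Bob Roy", "phone": "555-0101", "ssn": "000-00-0002",
     "service_category": "vision"},
    {"name": "Ann Lee", "phone": "555-0100", "ssn": "000-00-0001",
     "service_category": "vision"},
]

extract, key_map = assign_surrogate_keys(enrollments)
# Distinct members can still be counted without any PII in the extract.
distinct_members = len({row["member_key"] for row in extract})
```

The reporting side can still count distinct members per service category from `extract`, which is all the enrollment report above needs.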

Ideally, all these secondary data consumers should be working from something like an OLAP cube or a data mart that contains slices of data elements and subsets of rows that executive management has decided meet the requirements for their reporting needs.

An OLAP cube or PowerPivot spreadsheet on an executive's laptop may contain sensitive financial numbers that a company wouldn't want released to the public, but it shouldn't contain transactional level records about individual customers.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

TravisDBA SSCoach Points: 15780 · Answer 3

Steve,

It's tricky. Even if all I had to go on were ZIP code, birthdate and sex columns, I could still probably find out who you are. But on the other hand, if the data were scrubbed totally clean of all possible personal identifiers, then its usefulness for anything meaningful would drop sharply. Data can either be perfectly anonymous or useful, but it cannot be both. :-D
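One rough way to put a number on that trade-off is to count how many records share each combination of the identifying columns (the idea behind k-anonymity). This is only a sketch: the column names and sample rows below are invented, and a group size of 1 means at least one record is unique on those columns.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same values
    for the given columns. 1 means someone is uniquely identified."""
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values())


# Hypothetical records; not real people.
rows = [
    {"zip": "44101", "birthdate": "1975-03-02", "sex": "M"},
    {"zip": "44101", "birthdate": "1975-03-02", "sex": "F"},
    {"zip": "44102", "birthdate": "1980-07-19", "sex": "F"},
    {"zip": "44102", "birthdate": "1962-11-30", "sex": "M"},
]

full_detail = smallest_group_size(rows, ("zip", "birthdate", "sex"))
zip_only = smallest_group_size(rows, ("zip",))
```

With all three columns every sample row stands alone, while ZIP code alone puts each person in a group of two. The fewer and coarser the columns, the larger the groups, and the less each row says about one person.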

"Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"

Eric M Russell SSC Guru Points: 125623 · Answer 4

TravisDBA (2/13/2012)
Steve,
It's tricky. Even if all I had available were ZIP code, birthdate and sex columns, I could still probably identify who you are. But on the other hand, If the data was scrubbed totally clean of all possible personal identifiers, then its ability to use it for anything meaningful vastly depreciates. Data can either be perfectly anonymous or useful, but it cannot be both.:-D

Most analytical reporting is not interested in an individual's actual birthday but rather their age, or specifically their age group (e.g., 18-25, 26-35, etc.). If age_group were provided rather than DOB, that would be a huge leap toward making the records anonymous. Also, ZIP codes can be further rolled up into marketing or demographic regions. For example, each of these rows identifies only a group of individuals:

member_id  age_group  sex  region_code
567432     18-25      F    32
568891     18-25      M    28
568893     36-45      M    32
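The roll-up could be sketched roughly like this. The age bands follow the ones mentioned above, but the ZIP-prefix-to-region lookup, the input column names (`dob`, `zip`) and the sample row are assumptions made up for the example.

```python
from datetime import date

# Bands named in the post; anything outside them falls into "other".
AGE_BANDS = [(18, 25), (26, 35), (36, 45)]

# Hypothetical lookup from 3-digit ZIP prefix to a region code.
ZIP_PREFIX_TO_REGION = {"441": 32, "432": 28}


def age_group(dob, today):
    """Return the age band label for a date of birth as of `today`."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    age = today.year - dob.year - (0 if had_birthday else 1)
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"


def generalize(row, today):
    """Replace DOB and ZIP code with age_group and region_code."""
    return {
        "member_id": row["member_id"],
        "age_group": age_group(row["dob"], today),
        "sex": row["sex"],
        "region_code": ZIP_PREFIX_TO_REGION.get(row["zip"][:3]),
    }


# Made-up source row, evaluated as of the date of this thread.
source_row = {"member_id": 567432, "dob": date(1990, 6, 15),
              "sex": "F", "zip": "44101"}
generalized = generalize(source_row, today=date(2012, 2, 13))
```

A ZIP prefix missing from the lookup yields `None` here; a real extract would need a rule for that case, for instance a catch-all region.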

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

TravisDBA SSCoach Points: 15780 · Answer 5

Eric M Russell (2/13/2012)
TravisDBA (2/13/2012)
Steve,
It's tricky. Even if all I had available were ZIP code, birthdate and sex columns, I could still probably identify who you are. But on the other hand, If the data was scrubbed totally clean of all possible personal identifiers, then its ability to use it for anything meaningful vastly depreciates. Data can either be perfectly anonymous or useful, but it cannot be both.:-D
Most analytical reporting is not interested in an individual's actual birthday but rather their age, or specifically their age group (ex: 18-25, 26 - 35, etc.). If age_group were provided rather than DOB, then that would be a huge leap in making the records anonymous. Also, zip codes can be further rolled up into marketing or demographic regions. For example, this identifies only a group of individuals:
member_id age_group sex region_code
567432 18-25 F 32
568891 18-25 M 28
568893 36-45 M 32

For demographic summary reports, this is fine, but it's not much use for anything else. :-D

"Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"

Steve Jones - SSC Editor SSC Guru Points: 741243 · Answer 6

TravisDBA (2/13/2012)
Steve,
It's tricky. Even if all I had to go on were ZIP code, birthdate and sex columns, I could still probably find out who you are. But on the other hand, If the data was scrubbed totally clean of all possible personal identifiers, then the ability to use it for anything meaningful vastly depreciates. Data can either be perfectly anonymous or useful, but it cannot be both.:-D

Perhaps. That's why I'd like to see people working on this problem. I think that someone smarter than I could come up with ways to better anonymize data and still extract useful patterns.

Eric M Russell SSC Guru Points: 125623 · Answer 7

TravisDBA (2/13/2012)
Eric M Russell (2/13/2012)
TravisDBA (2/13/2012)
Steve,
It's tricky. Even if all I had available were ZIP code, birthdate and sex columns, I could still probably identify who you are. But on the other hand, If the data was scrubbed totally clean of all possible personal identifiers, then its ability to use it for anything meaningful vastly depreciates. Data can either be perfectly anonymous or useful, but it cannot be both.:-D
Most analytical reporting is not interested in an individual's actual birthday but rather their age, or specifically their age group (ex: 18-25, 26 - 35, etc.). If age_group were provided rather than DOB, then that would be a huge leap in making the records anonymous. Also, zip codes can be further rolled up into marketing or demographic regions. For example, this identifies only a group of individuals:
member_id age_group sex region_code
567432 18-25 F 32
568891 18-25 M 28
568893 36-45 M 32
For demographic summary reports, this is fine, but not much use for anything else.:-D

The dataset still has one record per member; it's just that their DOB and ZIP code have been replaced with coarser, bucketed codes. For example, when I subscribe to a magazine, they ask for my age and income group; that's what the marketing department cares about, not my DOB and actual salary.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

TravisDBA SSCoach Points: 15780 · Answer 8

Eric M Russell (2/13/2012)
TravisDBA (2/13/2012)
Eric M Russell (2/13/2012)
TravisDBA (2/13/2012)
Steve,
It's tricky. Even if all I had available were ZIP code, birthdate and sex columns, I could still probably identify who you are. But on the other hand, If the data was scrubbed totally clean of all possible personal identifiers, then its ability to use it for anything meaningful vastly depreciates. Data can either be perfectly anonymous or useful, but it cannot be both.:-D
Most analytical reporting is not interested in an individual's actual birthday but rather their age, or specifically their age group (ex: 18-25, 26 - 35, etc.). If age_group were provided rather than DOB, then that would be a huge leap in making the records anonymous. Also, zip codes can be further rolled up into marketing or demographic regions. For example, this identifies only a group of individuals:
member_id age_group sex region_code
567432 18-25 F 32
568891 18-25 M 28
568893 36-45 M 32
For demographic summary reports, this is fine, but not much use for anything else.:-D
The dataset still has one record per member, it's just that their DOB and Zip Code have been replaced with more discrete coding. For example, when I subscribe to a magazine, they ask for my age and income group; that's what the marketing department cares about, not my DOB and actual salary.

They don't need it. If they have your name and address in order to send the magazine to you, they can find out the rest. It's absolutely amazing what you can find out about someone nowadays with very little information to go on. I have done complete criminal and financial background checks on people on the Internet with very little information up front. 😀

"Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"

Revenant SSC-Forever Points: 42467 · Answer 9

You cannot hide on the Internet. As long as the other party gets your IP address, it has your postal code. (Unless you go through a proxy server in, say, Vanuatu, which very few people do.)

For me, the problem with personally identifiable info up to this point has been that I did not have enough of it to reliably identify the person when I was expected to. The number one example would be a pharmacy system for Eckerd's Drugs twenty years ago, when I had 18 Maria Martinezes in Austin born on the same day, two of them living in the same retirement home. And per TX laws, pharmacies were not allowed to ask for SSNs.

In my more recent experience, you will probably assign each person a GUID, separate the PII from the rest, put it into a separate database, and allow only a very limited number of users to access it, i.e., to put a person behind the GUID.
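As a rough illustration of that split (not any particular system's design), the sketch below uses two in-memory structures to stand in for the two databases. The field names, the `person_id` column and the sample record are hypothetical.

```python
import uuid


def split_pii(records, pii_fields):
    """Separate PII from the rest of each record, linked by a GUID.

    Returns (working_set, pii_store). pii_store stands in for the
    restricted database; working_set is what most users would see.
    """
    pii_store = {}
    working_set = []
    for record in records:
        person_id = str(uuid.uuid4())  # random GUID, carries no PII
        pii_store[person_id] = {f: record[f] for f in pii_fields}
        rest = {k: v for k, v in record.items() if k not in pii_fields}
        rest["person_id"] = person_id
        working_set.append(rest)
    return working_set, pii_store


# Made-up record for illustration only.
records = [
    {"name": "Pat Doe", "dob": "1950-01-01", "address": "1 Main St",
     "prescription": "drug A"},
]

working_set, pii_store = split_pii(records, ("name", "dob", "address"))

# Only a user with access to pii_store can put a person behind the GUID.
person = pii_store[working_set[0]["person_id"]]
```

In practice the access restriction would be enforced by database permissions on the PII store, which this sketch does not model.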