When is data anonymous? That is a very good question, and one that is increasingly relevant for my work.
Our datalabs at the University Library of Copenhagen (or whatever our current name is) are beginning to be a success. We have a steady increase in the number of students and researchers from the health sciences. And that triggers a recurring discussion.
Let me begin by noting that our users are very conscious of the issues around protecting sensitive information. They use encrypted hardware and secure connections to a degree I have only ever seen amongst people security-conscious enough to border on the tinfoil-hat crowd. But they are still a bit naive about anonymizing data.
I have no idea how to anonymize data. And the more I read about it, the less sure I am that it is actually possible. People smarter than me are probably able to figure out something. But I fear that this is a game rather like the DRM-games.
Yes, the studios can encrypt their Blu-ray discs. But they still need to be able to show the movie on a screen. The disc will have to be decrypted somehow. Otherwise it will just show static. In the same way, the data you are working with may be stripped of all identifying information. But there still needs to be information in it. Otherwise it is just useless.
So – I cannot advise our students on how to de-identify patients in a clinical study. But I can tell them horror stories. And I do. These are a few of them.
Netflix and IMDB
The classic story is the re-identification of Netflix users. Netflix has periodically released data on their users. Anonymized, of course: which movies a given user has watched, and how that user has rated them.
Another source of information on what movies a person has watched and rated is IMDB. And that information is not so secret. Let us assume that an unknown person has watched ten obscure movies on Netflix, and given the first five a high rating and the others a low one. And that a known person on IMDB has rated the same five obscure movies high, and the other five low. Intuition would suggest that those two persons are the same. Is that a problem?
If you live in an area where being gay is a problem, you might not have a problem with people knowing that you have rated obscure but innocent movies on IMDB. But the Netflix data, if linked to you, would reveal that you have also watched Another Gay Movie, Philadelphia and Milk. That might be a problem. I don’t think “Salo” is on Netflix. And I’m not necessarily that embarrassed to admit that I have watched it. But I would probably not want people to know that I have watched it ten times (if I had; it’s horrifying). Here’s a paper on the case.
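To make the intuition concrete, here is a minimal sketch in Python of that kind of linking. Everything in it is made up for illustration: the user ids, the names, the titles, the ratings and the `min_matches` threshold are my assumptions, and the matching rule is far cruder than what the researchers actually used.

```python
# Hypothetical toy data: every id, name, title and rating is invented.
anonymous_netflix = {
    "user_4711": {"Obscure A": 5, "Obscure B": 5, "Obscure C": 1, "Sensitive X": 4},
    "user_4712": {"Obscure A": 1, "Obscure B": 2, "Obscure D": 5},
}
public_imdb = {
    "Jane Doe": {"Obscure A": 5, "Obscure B": 5, "Obscure C": 1},
    "John Roe": {"Obscure A": 2, "Obscure D": 5},
}


def agreement(private, public):
    """Count the movies that both records gave exactly the same rating."""
    return sum(1 for title, score in public.items() if private.get(title) == score)


def link(anonymous, public, min_matches=3):
    """Link each public profile to its best-matching anonymous record.

    A link is only reported if at least `min_matches` ratings agree
    (an arbitrary threshold chosen for this sketch).
    """
    links = {}
    for name, ratings in public.items():
        best = max(anonymous, key=lambda uid: agreement(anonymous[uid], ratings))
        if agreement(anonymous[best], ratings) >= min_matches:
            links[name] = best
    return links


links = link(anonymous_netflix, public_imdb)

# What the "anonymous" record reveals beyond the public profile.
revealed = {
    name: sorted(set(anonymous_netflix[uid]) - set(public_imdb[name]))
    for name, uid in links.items()
}
print(links)
print(revealed)
```

The point of the sketch is the last step: once a public profile is tied to an anonymous record, every other title in that record is tied to the name as well.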
A lot of demographic data is released to the public. We want people to know if living in a certain area causes cancer. And we want the underlying data out there, because there is just too much data to analyze, so if we could crowdsource that part of the process, it would be nice. So we anonymize the data, but leave in the postcodes. That might be a problem.
The Danish postcode 1301 corresponds to the street “Landgreven”. According to www.krak.dk, 17 persons have an address there. There might be a few more: the site only registers people with a phone number, and leaves out people with an unlisted number. But let us assume that there are only those 17 persons. 8 of them are women. So if we have health data on medical procedures – broken down by postcode and gender – we might be able to say that one of 8 named women had an abortion. Not that there is much stigma associated with that in Denmark, or at least there shouldn’t be. But it is still something you would probably like to keep to yourself.
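A simple check before releasing such data is to count how many people share each combination of postcode and gender, and to flag the small groups. The sketch below does that in Python. The 17 residents and 8 women in 1301 come from the example above; the postcode "9999", its population and the threshold `k=20` are invented for illustration.

```python
from collections import Counter

# Hypothetical register of (postcode, gender) pairs. Postcode 1301 follows
# the example in the text; "9999" and its numbers are invented.
population = (
    [("1301", "F")] * 8
    + [("1301", "M")] * 9
    + [("9999", "F")] * 5000
    + [("9999", "M")] * 4800
)


def small_groups(records, k):
    """Return the (postcode, gender) groups that have fewer than k members."""
    sizes = Counter(records)
    return {group: n for group, n in sizes.items() if n < k}


# Groups this small should make you nervous about publishing per-group data.
risky = small_groups(population, k=20)
print(risky)
```

Passing such a check does not make the data anonymous. It only tells you which groups are obviously too small for this particular combination of columns.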
Twitter, Flickr and graphs
Some people like to be anonymous on Twitter. Looking at the name-calling, flamewars and general pouring of crap over people you disagree with on Twitter, it is surprising that more people are not trying to be anonymous there. But some people have legitimate reasons to try to be anonymous. Whistleblowers, human rights activists etc.
Social media are characterized by graphs. Not pie charts and such, but networks. Each person is a node, and each connection between nodes, e.g. following someone, is an edge. The network defined by nodes and edges is called a graph. Two researchers, Narayanan and Shmatikov, have made an interesting study, “De-anonymizing social networks”. Take a lot of persons that have accounts on both Twitter and Flickr. Anonymise the Twitter accounts. One third of those Twitter accounts can be linked to the Flickr account of the same person. In spite of the anonymisation.
How? Well, the graph describing who you follow and who follows you on Twitter will share characteristics with the corresponding graph on Flickr. And those graphs are pretty unique. Read more here.
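Here is a toy sketch in Python of how matching on graph structure can work. It is not the algorithm from the paper, only a much simplified illustration of the general idea of starting from a few known "seed" pairs and spreading outwards. The two small graphs, the names and the function `propagate` are all invented for this example.

```python
# Hypothetical toy graphs as undirected adjacency sets.
twitter = {  # anonymised ids
    "t1": {"t2", "t3", "t4"},
    "t2": {"t1", "t3"},
    "t3": {"t1", "t2", "t4"},
    "t4": {"t1", "t3"},
}
flickr = {  # known identities
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"alice", "carol"},
}


def propagate(anon, known, seeds):
    """Extend a few known pairs (seeds) to the rest of the graph.

    An unmatched anonymous node is matched to the known node that shares
    the most already-matched neighbours, but only if that winner is unique.
    """
    mapping = dict(seeds)
    changed = True
    while changed:
        changed = False
        for node in sorted(anon):
            if node in mapping:
                continue
            mapped_neighbours = {mapping[n] for n in anon[node] if n in mapping}
            taken = set(mapping.values())
            scores = {
                cand: len(mapped_neighbours & known[cand])
                for cand in known
                if cand not in taken
            }
            if not scores:
                continue
            ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
            best, best_score = ranked[0]
            runner_up = ranked[1][1] if len(ranked) > 1 else -1
            if best_score > 0 and best_score > runner_up:
                mapping[node] = best
                changed = True
    return mapping


# Assume two accounts have already been identified by other means.
mapping = propagate(twitter, flickr, seeds={"t1": "alice", "t2": "bob"})
print(mapping)
```

In this toy case two known pairs are enough to pin down the remaining two accounts, using nothing but who is connected to whom.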