Last week, a Facebook dataset was released by a group of researchers (Amanda L. Traud, Peter J. Mucha, Mason A. Porter) in connection with their paper studying the role of user attributes (gender, class year, major, high school, and residence) on social network formation at various colleges and universities. The dataset, referred to by the researchers as the “Facebook 100”, consists of the complete set of users from the Facebook networks at 100 American schools, and all of the in-network “friendship” links between those users as they existed at a single moment of time in September 2005.
The research paper indicates that the Facebook data was provided to the researchers “in anonymized form by Adam D’Angelo of Facebook.” (D’Angelo was then Facebook’s CTO, and left Facebook in 2008.) Curious as to what precisely was included in the data release, and what steps towards anonymization were taken, I downloaded the data (200 MB zip file) on the morning of February 11.
The data files are separated by institution, and in total include, by my estimation, about 1.2 million user accounts. The content of each institution’s file is described as containing the following:
Each of the school .mat files has an A matrix (sparse) and a “local_info” variable, one row per node: ID, a student/faculty status flag, gender, major, second major/minor (if applicable), dorm/house, year, and high school.
Thus, the datasets include limited demographic information that was posted by users on their individual Facebook pages. The identities of users’ dorms and high schools were obscured by numerical identifiers, but to my surprise, the dataset included each user’s unique Facebook ID number. As a result, while user names and extended profile information were kept out of the data release, a simple query against Facebook’s databases would yield considerable identifiable information for each record. In short, the suggestion that the data has been “anonymized” is seriously flawed.
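To make the concern concrete, here is a minimal sketch of why a retained unique ID undoes the rest of the anonymization. Everything in it is hypothetical: the records, the ID values, the names, and the `directory` lookup are invented for illustration and do not come from the actual dataset or from any Facebook interface. It only shows that any outside table keyed on the same ID can be joined to the released rows, and that the join has nothing to work with once the ID is gone.

```python
# Hypothetical "released" rows: coded attributes plus a retained unique ID.
released = [
    {"id": 1001, "gender": 1, "major": 10, "dorm": 5, "year": 2006, "high_school": 300},
    {"id": 1002, "gender": 2, "major": 11, "dorm": 5, "year": 2007, "high_school": 301},
]

# Hypothetical outside source that maps the same IDs to names.
directory = {1001: "Alice Example", 1002: "Bob Example"}


def link_records(rows, lookup):
    """Attach a name to every row whose ID appears in the lookup table."""
    linked = []
    for row in rows:
        name = lookup.get(row.get("id"))
        if name is not None:
            linked.append({"name": name, **row})
    return linked


def drop_ids(rows):
    """Return copies of the rows with the ID field removed."""
    return [{k: v for k, v in row.items() if k != "id"} for row in rows]


linked = link_records(released, directory)
unlinked = link_records(drop_ids(released), directory)

print(len(linked), "records linked with IDs present")
print(len(unlinked), "records linked after IDs are dropped")
```

Dropping the ID only removes this direct lookup. It does not by itself rule out other routes to re-identification, such as matching against other datasets where the subjects are already known.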
The consequences of this ease of re-identifying the dataset are numerous.
First, while only limited profile information is within the dataset, there is no indication that any consideration was given to users’ particular privacy settings. Based on the article, all user accounts from each of the 100 networks were provided to the researchers, and as long as the user provided the data to Facebook, it was turned over to the researchers. [Clarification: when I say “all user accounts” were provided, I do not mean full profile information was given to the researchers, just the particular data fields as described above.]
Second, even though the specific data exposure within the dataset is limited, the fact that users can be identified and linked to their in-network social map fosters additional threats to privacy. Previous research (here and here, for example) has shown how “anonymous” datasets can be largely re-identified when there is access to other large sets of data where the subjects are already known. The “Facebook 100” data, with the Facebook IDs intact to guide identification of users, might be useful in similar efforts.
To recap, the suggestion that the “Facebook 100” data has been “anonymized” is seriously flawed, and its release might be putting the information of 1.2 million Facebook users at risk.
Interestingly, a few hours after the initial release of the “Facebook 100” dataset, one of the researchers, Mason Porter, announced he was pulling the data due to an unspecified “bug”. Later that evening, the data was again made available with a message indicating that the data files were now fixed.
Again, I was curious, so I downloaded and examined the new dataset. The only change I could see was that now the Facebook ID was removed entirely from the data files, and the order of the records in each file was randomized.
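I do not know how the researchers actually produced the revised files, but the change I observed can be sketched as follows. This is a minimal illustration under my own assumptions: the ID sits in the first column of `local_info` (following the column order in the file description), the toy values are invented, a small dense array stands in for the sparse A matrix, and the function name `strip_ids_and_shuffle` is mine. The one detail worth noting is that if the rows of `local_info` are reordered, the same permutation has to be applied to both the rows and the columns of the adjacency matrix, or the attributes no longer line up with the friendship links.

```python
import numpy as np


def strip_ids_and_shuffle(adjacency, local_info, id_column=0, seed=None):
    """Drop the ID column and reorder the nodes with one shared permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(local_info.shape[0])

    # Remove the identifier column, then reorder the remaining rows.
    info_without_id = np.delete(local_info, id_column, axis=1)
    shuffled_info = info_without_id[perm]

    # Apply the same permutation to rows and columns of the adjacency matrix.
    shuffled_adjacency = adjacency[np.ix_(perm, perm)]
    return shuffled_adjacency, shuffled_info, perm


# Invented toy data: ID, status flag, gender, major, minor, dorm, year, high school.
local_info = np.array([
    [1001, 1, 1, 10, 0, 5, 2006, 300],
    [1002, 1, 2, 11, 0, 5, 2007, 301],
    [1003, 2, 1, 10, 12, 6, 2005, 300],
    [1004, 1, 2, 13, 0, 7, 2008, 302],
])
adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
])

new_adjacency, new_info, perm = strip_ids_and_shuffle(adjacency, local_info, seed=0)
print(new_info.shape)
```

Because the permutation is shared, the friendship graph is unchanged up to relabeling: the number of links and each node's degree are preserved, which is why the revised files remain just as useful for social network analysis.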
Thus, the “bug” must’ve been that the data was easily re-identifiable, and the “fix” was to take additional steps to anonymize the records. Someone joked on the announcement email list that the “bug” must have something to do with Facebook attorneys, but Porter’s message re-releasing the data joked that no lawyers were involved, and that they “really were fixing the data files!”
To me, however, the language used in these explanations was disingenuous. The data, as far as I could tell, had no bugs that prevented its usefulness for social network analysis. No, the problem with the data was that it contained each user’s unique Facebook ID, thus allowing easy identification. Porter should have been open and honest about why the data was pulled and what was done to correct the situation.
That said, there are still a number of open questions regarding this particular dataset:
To Facebook:
- What kind of internal processes, if any, did D’Angelo follow when releasing the data to these researchers? Was he authorized to do so?
- Was this kind of large data release routine? How many other similar releases have taken place?
To the research team:
- Was the data received from Facebook already obscured with numerical identifiers replacing student majors, minors, and high schools, or did you add those?
- UPDATE: I have received word from one of the researchers, Mason Porter, that the data sent to them by Facebook was indeed already obscured with numerical identifiers in the place of actual student major, minor, and high school information.
- Did your IRB review the data used for the research, and approve the subsequent data release?
- Was there any “bug” in the data, or was the attempt to gain greater anonymization of the data the sole reason to pull it from public access?
Obtaining answers to these questions can help us better understand the uniqueness of this situation, and put better processes and protections in place to prevent similar releases of data that is falsely believed to be sufficiently anonymized and respectful of users’ privacy expectations.
I hope Facebook and the researchers are willing to engage in a discussion, and I’ll report back on any communication, as allowed.
UPDATE (Feb 15, 6:00pm): I have been in contact with one of the researchers, Mason Porter, who confirmed that the data sent to them by Facebook was indeed already obscured with numerical identifiers in the place of actual student major, minor, and high school information. I’ve inserted this reply into the question above. I have also made a few minor changes to the main text, clarifying that the email messages reporting the “bug” in the data came from Mason alone, and should not be attributed to the entire research team.
UPDATE 2 (Feb 15, 6:10pm): The link to the full, revised dataset (http://people.maths.ox.ac.uk/~porterm/data/facebook100.zip) is no longer active.
UPDATE 3 (Feb 16, 9am): Added a clarification that when I say “all user accounts” were provided to the researchers, I do not mean full profile information was given, just the particular data fields as described above.
UPDATE 4 (Feb 16, 11am): Mason Porter, one of the authors, has posted an explanatory note on his blog indicating that he’s been in contact with the Facebook Data Team, and per their request, “I have taken down the data, and I will be working with them to eventually post a version of the data set with which both they and I are happy.”