On the "Anonymity" of the Facebook Dataset (Updated)

(Updated below with responses to comments by Jason Kaufman, one of the lead researchers on this project)

(Another update: I’m pretty sure the “anonymous, northeastern university” from which this dataset was derived is Harvard College. Details here.)

A group of researchers has released a dataset of Facebook profile information from a cohort of college students for research purposes, which I know a lot of people will find quite valuable. (Thanks to Fred Stutzman for bringing it to my attention.)

Here is the description from the Berkman Center’s announcement:

The dataset comprises machine-readable files of virtually all the information posted on approximately 1,700 FB profiles by an entire cohort of students at an anonymous, northeastern American university. Profiles were sampled at one-year intervals, beginning in 2006. This first wave covers first-year profiles, and three additional waves of data will be added over time, one for each year of the cohort’s college career.

Though friendships outside the cohort are not part of the data, this snapshot of an entire class over its four years in college, including supplementary information about where students lived on campus, makes it possible to pose diverse questions about the relationships between social networks, online and offline.

Access to the dataset requires the submission of a research statement (which I haven’t yet done), but the codebook is publicly available.

Of course, this sounds like an AOL-search-data-release-style privacy disaster waiting to happen. Recognizing this, the researchers detail some of the steps they’ve taken to try to protect the privacy of the subjects, including:

  • All identifying information was deleted or encoded immediately after the data were downloaded.
  • The roster of student names and identification numbers is maintained on a secure local server accessible only by the authors of this study. This roster will be destroyed immediately after the last wave of data is processed.
  • The complete set of cultural taste labels provides a kind of “cultural fingerprint” for many students, and so these labels will be released only after a substantial delay in order to ensure that students’ identities remain anonymous.
  • In order to access any part of the dataset, prospective users must read and electronically sign the user agreement reproduced below.

Let’s consider each one of these in order:

First, as the AOL debacle taught us, one might think “all identifying information” has been deleted, but seemingly anonymous bits of our data trail can often be pieced together, exposing clues to our identity. The fact that the dataset includes each subject’s gender, race, ethnicity, hometown state, and major makes it increasingly possible that individuals could be identified. For example, if the data reveals that student #746 is a white Bulgarian male from Montana majoring in East Asian Studies, there probably aren’t many people who fit that description. Identification might be unlikely, but the protection is hardly bullet-proof.
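To make the arithmetic concrete, here is a minimal sketch in Python (the column names and records are hypothetical stand-ins, not the codebook’s actual fields) showing how each additional quasi-identifier shrinks the pool of matching subjects:

    # A minimal sketch (hypothetical column names and records) of how
    # combining quasi-identifiers narrows the candidate pool.
    import pandas as pd

    # Stand-in rows; the real dataset's fields and coding differ.
    profiles = pd.DataFrame([
        {"id": 746, "gender": "M", "ethnicity": "Bulgarian", "state": "MT", "major": "East Asian Studies"},
        {"id": 112, "gender": "F", "ethnicity": "Bulgarian", "state": "MA", "major": "Economics"},
        {"id": 903, "gender": "M", "ethnicity": "Irish", "state": "MT", "major": "History"},
    ])

    # Each attribute is one more filter; a combination matched by only
    # one row ("k = 1" in k-anonymity terms) points at a single person.
    matches = profiles[
        (profiles.gender == "M")
        & (profiles.ethnicity == "Bulgarian")
        & (profiles.state == "MT")
        & (profiles.major == "East Asian Studies")
    ]
    print(len(matches))  # 1 -- this combination isolates student #746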

Second, the researchers take sensible precautions by keeping the master roster on a secure server and promising to destroy it once all the datasets have been released, in 2011. One hopes it remains secure until then.

Third, the researchers are right to recognize how a person’s unique set of cultural tastes could easily identify her. But merely instituting a “substantial delay” before releasing this personal data does little to mitigate the privacy fears; it only postpones them. Researchers routinely rely on datasets for years (some search engine studies are still using datasets from 1997!). Do they think that once a person graduates, she no longer might be harmed by her potential identification in such a dataset? Delaying the release of this data until 2011 is not only arbitrary, but much too short. A better tactic would be to gain the consent of the subjects before releasing the data, or simply not to release it at all, which would provide the fullest privacy protection for the subjects.

Fourth, requiring a user agreement and terms of use is a nice step. Clearly, the researchers understand the potential harms the dataset represents, since the agreement is full of requirements not to use the data to try to identify individuals (in fact, there are so many points on this that one fears the possibility might be all too real). Unfortunately, we’re all too familiar with clickwrap agreements: most users won’t bother to read the terms, and it is uncertain how they would be enforced.

All told, good steps are being taken to address the privacy of the subjects in the dataset, but more could be done. We’ll have to wait and see whether anyone in the data is identified.

But one more thing

Since I first saw the press release for this dataset, I’ve been bothered by the description of the data as “approximately 1,700 FB profiles by an entire cohort of students at an anonymous, northeastern American university.”

Right off the bat, the source university loses full anonymity since it is identified as being in the northeastern US. Further, according to the codebook, this is a private, co-ed institution, whose class of 2009 initially had 1640 students in it.

A quick search for schools reveals there are only 7 private, co-ed colleges in the New England states (CT, ME, MA, NH, RI, VT) with total undergraduate populations between 5,000 and 7,500 students (a likely range if there were 1,640 in the 2006 freshman class): Tufts University, Suffolk University, Yale University, University of Hartford, Quinnipiac University, Brown University, and Harvard College. (The total bumps up to about 18 if we include NY and NJ.)

Is one of these the source?

This might prove easy to discover, given the uniqueness of some of the subjects. Based on the codebook, the dataset includes only one self-identified Albanian, one Iranian, one Malaysian, one Nepali, and other solitary ethnicities. If we can isolate one of these people in the dataset and combine that with the subject’s gender, home state, and major, it probably wouldn’t be hard to discover who it is. (The same goes for the lone Folklore major, the lone Slavic Studies major, and so on.) Keying off these unique data elements provides a possible path to identifying the school, and potentially many more individuals in the dataset.
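Finding those singletons takes only a few lines of code. Continuing the hypothetical sketch above (again, the field names are stand-ins, not the codebook’s actual variables):

    # List every attribute value that occurs exactly once in the data;
    # each singleton is a potential "cultural fingerprint."
    import pandas as pd

    profiles = pd.DataFrame([
        {"id": 746, "ethnicity": "Bulgarian", "major": "East Asian Studies"},
        {"id": 112, "ethnicity": "Bulgarian", "major": "Economics"},
        {"id": 903, "ethnicity": "Irish", "major": "History"},
    ])

    for column in ("ethnicity", "major"):
        counts = profiles[column].value_counts()
        singletons = counts[counts == 1].index.tolist()
        print(column, "->", singletons)
    # ethnicity -> ['Irish']; every major in this toy frame is a singleton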

This would have been much harder if I didn’t know it was a private school in the northeastern US, or if they had taken a random sample of the cohort and not disclosed the actual number of students in it.

Again, time will tell if this gets cracked.


UPDATE: Jason Kaufman, the principal investigator for this research project, was kind enough to read through my concerns and post a thoughtful response in the comments. So did Alex Halavais. Please take a look.

I do feel the need to react to some of the arguments made by Kaufman and Halavais. They both seem to suggest that while the data might lead to the identification of some of the subjects, these Facebook users don’t have an expectation of (or a right to) privacy since they made this information public in the first place.

Kaufman remarks:

What might hackers want to do with this information, assuming they could crack the data and ’see’ these people’s Facebook info? Couldn’t they do this just as easily via Facebook itself? Our dataset contains almost no information that isn’t on Facebook. (Privacy filters obviously aren’t much of an obstacle to those who want to get around them.)

And Halavais notes:

The data is already there, this is merely (!) the collection of that data. Or to put it another way, AOL users presumed that no one was watching, but this is very different from Facebook users who are intending to share with someone (if not the researchers).

We see these kinds of arguments all the time: you have no expectation of privacy with public records, or if you’re on the public roads, you can’t expect privacy, or Facebook’s news feed simply made sharing the information you made public more efficient. All such notions are wrong: they ignore the contextual nature of privacy. Just making something known in one context – even a non-secret context – doesn’t mean “anything goes” in terms of the collection, storage, transmission, or use of that information.

So, let’s look at this Facebook dataset and the claims made above. I take issue with (at least) three points articulated by Kaufman and Halavais.

One, Kaufman’s mention of “hackers” and focus on what they might “do” with this information reveals a preoccupation with “harms” when it comes to privacy concerns. One doesn’t need to be a victim of hacking, or to suffer a tangible harm, for there to be concerns over the privacy of one’s personal information. Privacy is about dignity as much as about informational harm by some evil agent. As Halavais points out later in his comment, none of the subjects in this dataset consented to having their personal information used in a research study. Don’t they have a right to some control over their information?

This leads to the second point: just because users post information on Facebook doesn’t mean they intend for it to be scraped, aggregated, coded, dissected, and distributed. Creating a Facebook account and posting information on the social networking site is a decision made with the intent to engage in a social community: to connect with people, share ideas and thoughts, communicate, be human. Just because some of the profile information is publicly available (whether consciously made so by the user, or due to a failure to adjust the default privacy settings) doesn’t mean there are no expectations of privacy over the data. This is contextual integrity 101.

Third, both Kaufman and Halavais seem to suggest that the information was already easily available to anyone who cared to look for it. Examining some of the details in the codebook reveals this isn’t necessarily true, and the researchers certainly had much more efficient ways of gathering the information than an average Facebook user. First, the researchers received an official roster of each freshman at the college, along with their university e-mail addresses, which allowed them to easily and systematically search for each student in the cohort. Second, and more importantly, they appear to have used research assistants from that school to access and download the profile information. That means a student might have set her privacy settings so her profile was viewable only by other users within that school; because the RA was inside that network, he or she was able to view and download the data. Now that same data, originally meant only for those within the college, has been made available to the entire world, perhaps against the expressed wishes of the data subject.

Let me repeat that last point: some Facebook users might have restricted their accounts to only people from their own school. But since the researchers used RAs from that school to access the account information, that restricted data has been published outside of those boundaries.
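To see why this matters, here is a toy model (all names and structures hypothetical, not Facebook’s actual implementation) of network-scoped visibility. The point: the access check lives in the platform, and it does not travel with a copied-out dataset.

    # A toy model of Facebook-style network-scoped visibility, showing
    # how an in-network download escapes the access-control boundary.
    from dataclasses import dataclass

    @dataclass
    class Profile:
        owner: str
        network: str      # e.g. the student's college
        visibility: str   # "everyone" or "my_network"
        data: dict

    def can_view(profile: Profile, viewer_network: str) -> bool:
        # The check the platform enforces at view time.
        return profile.visibility == "everyone" or viewer_network == profile.network

    student = Profile("student_746", "CollegeX", "my_network",
                      {"major": "East Asian Studies"})

    # An RA inside the same network passes the check...
    assert can_view(student, viewer_network="CollegeX")
    assert not can_view(student, viewer_network="CollegeY")

    # ...but once the RA copies the data into a released dataset, the
    # check is gone: anyone who downloads the file sees it, regardless
    # of network.
    public_dataset = [student.data]  # the restriction does not travel with the data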

The researchers even seem to acknowledge this when they state:

In other words, a given student’s information should not be considered objectively “public” or “private” (or even “not on Facebook”)—it should be considered “public” or “private” (or “not on Facebook”) from the perspective of the particular RA that downloaded the given student’s data.

So, if the student’s information should not be considered “objectively public”, then why is it being treated as such in the dataset?

In total, claims that the data was public in the first place simply do not hold up to scrutiny.

Now, don’t get me wrong. I completely see the research value in having this data. But we must be more careful in how we release such personal information to the world, and we must be certain to understand how privacy is contextual, not just based on whether an RA can download a profile.
