On a not-entirely-unrelated note, do you remember when I wrote about the Netflix Prize a while back? Netflix released the movie ratings of roughly 500K "anonymous" subscribers to the public, in hopes that people could come up with a better recommendation algorithm (and win a million-dollar cash prize). I was amazed to read in Bruce Schneier's latest Crypto-Gram that this anonymous database can be cross-referenced against other known databases to reveal the identities of users. Basically, some researchers compared the "anonymous" Netflix database against public user ratings on IMDb and figured out who a sampling of these people were.
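To make the cross-referencing idea concrete, here is a toy sketch in Python. It is not the researchers' actual method (theirs was considerably more careful about scoring and noise); it just counts how many (title, rating) pairs an anonymous record shares with each public profile. The data, the `best_match` function, and the `min_overlap` threshold are all made up for illustration.

```python
# Toy linkage sketch: match an "anonymous" rating record to a public profile
# by counting shared (title, rating) pairs. All data here is hypothetical.

anonymous = {
    1001: {("Brazil", 5), ("Heat", 4), ("Clue", 2), ("Alien", 5)},
    1002: {("Heat", 3), ("Big", 4)},
}

public_profiles = {
    "alice_w": {("Brazil", 5), ("Heat", 4), ("Clue", 2)},
    "bob_k": {("Alien", 5), ("Heat", 3)},
}


def best_match(anon_ratings, profiles, min_overlap=3):
    """Return (name, overlap) for the profile sharing the most ratings,
    or None if no profile shares at least min_overlap of them."""
    best_name, best_overlap = None, 0
    for name, ratings in profiles.items():
        overlap = len(anon_ratings & ratings)  # set intersection
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    if best_overlap >= min_overlap:
        return best_name, best_overlap
    return None


for record_id, ratings in anonymous.items():
    print(record_id, best_match(ratings, public_profiles))
```

Even in this crude form, a handful of matching ratings is enough to single out one profile, and that is the whole trick: the "anonymous" record carries the rest of that person's viewing history along with it.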
Now, if you're saying "they figured out the identities of some movie buffs, big deal", well, you're wrong:
Someone with access to an anonymous dataset of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchants' telephone order database. Or Amazon's online book reviews could be the key to partially de-anonymizing a public database of credit card purchases, or a larger database of anonymous book reviews.
Google, with its database of users' internet searches, could easily de-anonymize a public database of internet purchases, or zero in on searches of medical terms to de-anonymize a public health database. Merchants who maintain detailed customer and purchase information could use their data to partially de-anonymize any large search engine's data, if it were released in an anonymized form. A data broker holding databases of several companies might be able to de-anonymize most of the records in those databases.
Want another scary fact about data collection? 87 percent of the United States population can be uniquely identified by just ZIP code, gender, and date of birth. Hopefully you think twice before forking over your Social Security number, but you'd give up this seemingly trivial data for just about anything without batting an eyelash...
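The ZIP code, gender, and date-of-birth point is easy to try on a toy dataset. The sketch below is only an illustration under made-up assumptions: the records, the field names, and the `unique_fraction` and `link` helpers are all hypothetical. It measures what fraction of "anonymized" rows have a combination of those three fields that nobody else shares, then joins the unique ones against a named list.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "02138", "gender": "F", "dob": "1961-07-14", "diagnosis": "asthma"},
    {"zip": "02138", "gender": "M", "dob": "1958-03-02", "diagnosis": "diabetes"},
    {"zip": "02139", "gender": "F", "dob": "1975-11-30", "diagnosis": "migraine"},
    {"zip": "02139", "gender": "F", "dob": "1975-11-30", "diagnosis": "flu"},
]

# Hypothetical public list that still has names attached.
named = [
    {"name": "Jane Doe", "zip": "02138", "gender": "F", "dob": "1961-07-14"},
]


def quasi_id(row):
    """The three 'harmless' fields, combined into one key."""
    return (row["zip"], row["gender"], row["dob"])


def unique_fraction(rows):
    """Fraction of rows whose (zip, gender, dob) combination appears once."""
    counts = Counter(quasi_id(r) for r in rows)
    unique = sum(1 for r in rows if counts[quasi_id(r)] == 1)
    return unique / len(rows)


def link(rows, named_rows):
    """Attach names to rows whose quasi-identifier is unique in the data."""
    counts = Counter(quasi_id(r) for r in rows)
    by_id = {quasi_id(r): r for r in rows if counts[quasi_id(r)] == 1}
    return {
        p["name"]: by_id[quasi_id(p)]["diagnosis"]
        for p in named_rows
        if quasi_id(p) in by_id
    }


print(unique_fraction(records))  # share of rows that stand alone
print(link(records, named))      # names re-attached to "anonymous" rows
```

Two of the four toy rows are unique on those three fields, and one of them lines up with a name in the second list. Scale the same join up to real data and you get the 87 percent figure.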