Tuesday, March 16, 2010

Advancing science vs. compromising privacy

Data mining often evokes the image of malicious organizations violating individuals' privacy. Three days ago, this tension was ratcheted up a notch (at least in my eyes): Netflix decided to cancel the second round of the famous Netflix Prize. The reason is apparent in the New York Times article "Netflix Cancels Contest After Concerns Are Raised About Privacy". Researchers from the University of Texas showed that the data disclosed by Netflix in the first contest could be used to identify users. One woman sued Netflix, the Federal Trade Commission got involved, and the rest is history.

What's different about this case is that the main beneficiary of the data made public by Netflix is the scientific data mining community. The Netflix Prize competition led to multiple worthy outcomes, including algorithmic development, insights about existing methods, cross-disciplinary collaborations (in fact, the winning team was a collaboration between computer scientists and statisticians), and collaborations between research groups (many competing teams joined forces to create more accurate ensemble predictions). There was actual excitement among data mining researchers! Canceling the sequel is perceived by many as an obstacle to innovation. Just read the comments on the cancellation posting on Netflix's blog.

After the initial disappointment and some griping, I started to "think positively": what would allow companies such as Netflix to share their data publicly? One can think of simple technical solutions such as an "opt out" (or "opt in") choice when you rate movies on Netflix, telling Netflix whether your ratings may be used in the contest. But clearly there are problems there, most obviously selection bias in the resulting sample (see the sketch below), and perhaps legal and technical complications as well.
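To make the bias concern concrete, here is a minimal sketch of my own (a hypothetical illustration, not anything Netflix has proposed): suppose heavier raters are also more willing to opt in. Then any data set built from opt-in users over-represents heavy raters, and a model trained on it sees a distorted picture of the full user base.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: user activity levels, and an assumed link between
# activity and willingness to opt in (both numbers are made up).
n_users = 100_000
ratings_per_user = rng.poisson(lam=20, size=n_users)       # activity level
opt_in_prob = 0.2 + 0.6 * (ratings_per_user > 30)          # assumed link
opted_in = rng.random(n_users) < opt_in_prob

print("mean ratings/user, all users:   ", ratings_per_user.mean())
print("mean ratings/user, opt-in users:", ratings_per_user[opted_in].mean())
```

Under these made-up numbers the opt-in sample looks noticeably more active than the population, which is exactly the kind of distortion a contest data set would inherit.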

But what about all that research on advanced data disclosure? Are there not ways to anonymize the data to a reasonable level of comfort? Many organizations (including the US Census Bureau) disclose data to the public while protecting privacy. My sense is that current data disclosure policies are aimed at disclosing data that will allow statistical inference, and hence the disclosed data are aggregated at some level, or else only relevant summary statistics are disclosed (for example, see A Data Disclosure Policy for Count Data Based on the COM-Poisson Distribution). Such data would not be useful for a predictive task where the algorithm should predict individual responses. Another popular masking method is data perturbation, where some noise is added to each data point in order to mask its actual value and avoid identification. The noise addition is intended not to affect statistical inference, but it's a good question how perturbation affects individual-level prediction.
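As a toy illustration of that trade-off (a sketch under my own assumptions, not any official masking scheme), adding zero-mean noise to each rating leaves aggregate estimates such as the mean essentially intact, while every individual record is now off by roughly the noise standard deviation, which is precisely what hurts a task that must predict individual responses:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ratings on a 1-5 scale; sigma is an assumed noise level.
ratings = rng.integers(1, 6, size=50_000).astype(float)
sigma = 1.0
perturbed = ratings + rng.normal(0, sigma, size=ratings.shape)

# Aggregate inference: the overall mean is nearly unchanged by the noise.
print("true mean:     ", ratings.mean())
print("perturbed mean:", perturbed.mean())

# Individual-level view: each released value is off by about sigma on
# average, which matters when an algorithm must predict one user's rating.
rmse = np.sqrt(np.mean((perturbed - ratings) ** 2))
print("per-record error introduced by masking (RMSE):", rmse)
```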

It looks like the data mining community needs to come up with some data disclosure policies that support predictive analytics.