Tuesday, March 16, 2010

Advancing science vs. compromising privacy

Data mining often evokes associations with malicious organizations that violate individuals' privacy. Three days ago, this tension was taken up a notch (at least in my eyes): Netflix decided to cancel the second round of the famous Netflix Prize. The reason is apparent in the New York Times article "Netflix Cancels Contest After Concerns Are Raised About Privacy": researchers from the University of Texas showed that the data disclosed by Netflix in the first contest could be used to identify users. One woman sued Netflix, the Federal Trade Commission got involved, and the rest is history.

What's different about this case is that the main beneficiary of the data made public by Netflix is the scientific data mining community. The Netflix Prize competition led to multiple worthy outcomes, including algorithmic development, insights about existing methods, cross-disciplinary collaborations (in fact, the winning team was a collaboration between computer scientists and statisticians), and collaborations between research groups (many competing teams joined forces to create more accurate ensemble predictions). There was real excitement among data mining researchers! Canceling the sequel is perceived by many as an obstacle to innovation. Just read the comments on the cancellation posting on Netflix's blog.

After the initial disappointment and some griping, I started to "think positively": what would allow companies such as Netflix to share their data publicly? One can think of simple technical solutions, such as an "opt out" (or "opt in") choice when you rate movies on Netflix, telling Netflix whether your ratings may be used in a contest. But that approach clearly has problems, such as selection bias, and perhaps legal and technical hurdles as well.

But what about all that research on advanced data disclosure? Are there not ways to anonymize the data to a reasonable level of comfort? Many organizations (including the US Census Bureau) disclose data to the public while protecting privacy. My sense is that current data disclosure policies are aimed at releasing data that support statistical inference, and hence the disclosed data are aggregated at some level, or else only relevant summary statistics are released (for example, see A Data Disclosure Policy for Count Data Based on the COM-Poisson Distribution). Such data are not useful for a predictive task, where the algorithm must predict individual responses. Another popular masking method is data perturbation, where some noise is added to each data point in order to mask its actual value and avoid identification. The noise is added so as not to affect statistical inference, but it is an open question how perturbation affects individual-level prediction.
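To make the perturbation idea concrete, here is a minimal sketch in Python (hypothetical ratings data and an arbitrary noise scale, not any organization's actual disclosure procedure). It shows how adding noise can leave an aggregate statistic nearly intact while making each released record a poor guess of the individual's true value:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-5 star ratings for 10,000 users (illustrative only)
ratings = rng.integers(1, 6, size=10_000).astype(float)

# Perturb each rating with Gaussian noise; the noise scale is an arbitrary choice
noise_sd = 1.0
perturbed = ratings + rng.normal(0, noise_sd, size=ratings.shape)

# Aggregate statistics survive the perturbation fairly well...
print(f"true mean: {ratings.mean():.3f}   perturbed mean: {perturbed.mean():.3f}")

# ...but each released value is only a noisy stand-in for the individual's true rating
print(f"median absolute error per record: {np.median(np.abs(perturbed - ratings)):.2f}")

The very property that protects individuals here (a large per-record error) is what makes such data less useful for the Netflix-style task of predicting individual responses.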

It looks like the data mining community needs to come up with some data disclosure policies that support predictive analytics.

5 comments:

Marcin Wojnarski said...

The solution may be to keep the data undisclosed on the server and ask participants to submit the actual learning algorithm (code) instead of just its predictions. The algorithm is then trained and tested on the server, so participants don't need to have the training data revealed to them. We successfully used such an online scoring system in a discovery challenge on genetic data analysis, affiliated with the RSCTC data mining conference. The system is available on the TunedIT platform and can be used for other competitions as well.
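To illustrate the "submit code, not predictions" pattern (a hypothetical sketch only, not TunedIT's actual API), the organizer defines an interface that participants implement, and only the server, which holds the data, ever calls it:

# Hypothetical contract: the participant submits this class; the organizer
# runs it server-side against data the participant never sees.
# (All names below are illustrative, not TunedIT's real API.)

class MySubmission:
    def fit(self, train_X, train_y):
        # participant's learning code runs here, on the server
        self.mean = sum(train_y) / len(train_y)

    def predict(self, test_X):
        # return one prediction per test record
        return [self.mean] * len(test_X)


def evaluate(submission_cls, train_X, train_y, test_X, test_y):
    """Server-side harness: train, predict, and return only a score."""
    model = submission_cls()
    model.fit(train_X, train_y)
    preds = model.predict(test_X)
    rmse = (sum((p - y) ** 2 for p, y in zip(preds, test_y)) / len(test_y)) ** 0.5
    return rmse  # the participant sees the score, never the data

The participant only ever receives the returned score, so the confidential records never leave the server.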

Galit Shmueli said...

Thanks Marcin - that's an interesting solution indeed! And TunedIT looks like a terrific platform.

I agree that you don't need the raw data on your computer, only the ability to operate on it. My only concern is that you often need to explore the training data in order to build suitable algorithms. Does this solution allow users to plot/explore the server data by sending queries such as "create a histogram of a random sample" or "what is the percentage of missing values"? That would take care of the ability to explore, but then you could create combinations of graphical queries that might again compromise confidentiality (say, if you can identify an outlier across multiple graphs).

Marcin Wojnarski said...

Galit, a small portion of the training data may still be revealed publicly for the purpose of calculating statistics of this kind. Because this portion can be small, organizers may take special measures to ensure that no sensitive data are revealed, or that the records relate only to a handful (10-100 instead of thousands or millions) of users who have explicitly agreed to this.

As to querying server data, we currently have a solution that's mainly intended for debugging but can also be used to query the training data and calculate some statistics. Namely, during evaluation the algorithm may raise an exception with some information encoded inside. This exception is logged on the server and is visible to the author of the algorithm. There's a limit on the length of such a stack trace, as well as on the maximum number of stack traces that can be generated in a given period of time (across different solutions submitted by the same participant). These are security measures that prevent users from extracting too much information, such as the data themselves.
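To sketch what such limits might look like (purely illustrative Python, not TunedIT's actual implementation): the server logs each exception message only after truncating it and checking a per-participant rate limit.

import time

MAX_MESSAGE_LEN = 200        # cap on how much information one exception can carry
MAX_MESSAGES_PER_HOUR = 5    # cap on how often a participant can emit them

_debug_log = {}  # participant_id -> list of (timestamp, truncated message)

def record_debug_exception(participant_id, exc):
    """Server-side handler: store a truncated message if under the rate limit."""
    now = time.time()
    recent = [t for t, _ in _debug_log.get(participant_id, []) if now - t < 3600]
    if len(recent) >= MAX_MESSAGES_PER_HOUR:
        return  # quota exhausted for this hour; drop the message
    message = str(exc)[:MAX_MESSAGE_LEN]  # truncation limits leakage per message
    _debug_log.setdefault(participant_id, []).append((now, message))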

Tim said...

My approach to this question is from the marketing side of the table. I certainly understand the privacy concerns that some people raise about companies releasing data with PII (personally identifiable information). Within the CRM marketing space, the most skilled agencies have gathered or purchased thousands of data points for most households in the US, linked together by some data point that can definitively identify you (phone #, SSN, address, etc.). I've spent some time working for this kind of marketing agency, and they used tight security to safeguard the data (for the most part). In the end, being able to accumulate data in this way allows companies to achieve higher direct-mail response rates and to target the best segments for their products and services.

So it helps the companies to allow the aggregation of this data, but does it help the consumer?

My opinion is that, over time, it will benefit the consumer as well. In our era of over-saturation with media messages, wouldn't it be great if the only ads you saw on TV, or the only junk mail you received, were for things targeted directly to you, things you were very likely to buy or respond to? But without gathering and utilizing PII in our predictive data models, there is no way to measure the success of a marketing effort, and therefore no way to improve the targeting over time. I'd be willing to give up a little privacy in order to reduce my junk mail quantity. What about you?

Galit Shmueli said...

Tim - thanks for your comment. You write "I'd be willing to give up a little privacy in order to reduce my junk mail quantity", and I think most people would, if they trusted that their data were being used for their own benefit rather than the company's... Currently, however, I believe that companies almost always use data mining for their own financial benefit and at the expense of their customers' convenience, rather than to improve the quality of their customers' lives.

To my knowledge, there is a small but growing effort to employ data mining for nobler purposes. Examples are Data Mining to Detect Domestic Abuse and Enhancing care services quality of nursing homes using data mining.

But what about companies? Since they hold so much data about us, I believe that there is a huge potential for them to employ data mining for increasing the quality of life of individuals, families and communities. Although it is common to measure everything in $, companies must also think about issues of trust, image, and long-term impact. At a minimum, the effect of data mining usage should be honestly assessed in terms of metrics reflecting these issues. Ideally, the metrics should be factored into the data mining itself. I'd call it "Data Mining for GNH" (Gross National Happiness).