
Sunday, February 04, 2018

Data Ethics Regulation: Two key updates in 2018

This year, two important new regulations will impact research with human subjects: the EU's General Data Protection Regulation (GDPR), which kicks in May 2018, and the updated US Common Rule, called the Final Rule, which has been in effect since January 2018. Both changes relate to protecting individuals' private information and will affect researchers using behavioral data in terms of data collection, access, and use; applications for ethics committee (IRB) approvals/exemptions; collaborations within the same country/region and beyond; and collaborations with industry.
Both the GDPR and the Final Rule try to modernize what today constitutes "private data" and data subjects' rights, and to balance these protections against the "free flow of information between EU countries" (GDPR) or, in the case of the Final Rule, against lowering the burden on low-risk research. However, the GDPR's approach leans much more strongly toward protecting private data.
Here are a few points to note about the GDPR and the Final Rule:

  1. "Personal data" (GDPR) or "private information" (final rule) is very broadly defined and includes data on physical, physiological or behavioral characteristics of a person "which allow or confirm the unique identification of that natural person".
  2. The GDPR affects any organization within the EU as well as "external organizations that are trading within the EU". It applies to personal data on any person, not just EU citizens/residents.
  3. The GDPR distinguishes between the "data controller" (the entity who has the data, in the eyes of the data subjects, e.g. a hospital) and the "data processor" (the entity who operates on the data). Both entities are bound by, and liable under, the GDPR.
  4. The GDPR distinguishes between "data processing" (any operation related to the data, including storage, structuring, record deletion, and transfer) and "profiling" (automated processing of personal data to "evaluate personal aspects relating to a natural person").
  5. The Final Rule now offers an option of relying on broad consent obtained for future research as an alternative to seeking IRB approval to waive the consent requirement.
  6. Domestic collaborations within the US now require a single institutional review board (IRB) approval (for the portion of the research that takes place within the US) - effective 2021.
The Final Rule tries to lower the burden for low-risk research. One attempt is new "exemption" categories for secondary research use of identifiable private information (i.e., re-using identifiable information collected for some other "primary" or "initial" activity) when:
  • The identifiable private information is publicly available;
  • The information is recorded by the investigator in such a way that the identity of subjects cannot readily be ascertained, and the investigator does not contact subjects or try to re-identify subjects; 
  • The secondary research activity is regulated under HIPAA; or
  • The secondary research activity is conducted by or on behalf of a federal entity and involves the use of federally generated non-research information provided that the original collection was subject to specific federal privacy protections and continues to be protected.
This approach to secondary data, and specifically to observational data from public sources, contrasts with the GDPR, which states that the new regulations also apply when processing historical data for "historical research purposes". Metcalf (2018) criticized the above Final Rule exemption because "these criteria for exclusion focus on the status of the dataset (e.g., is it public? does it already exist?), not the content of the dataset nor what will be done with the dataset, which are more accurate criteria for determining the risk profile of the proposed research".
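To make the second exemption criterion above (recording information so that subjects' identities cannot readily be ascertained) a bit more concrete, here is a minimal, hypothetical Python sketch of one common approach: replacing direct identifiers with salted hashes. The field names and values are invented for illustration, and hashing alone does not remove re-identification risk from the remaining quasi-identifiers.

    # Hypothetical sketch (not prescribed by the Final Rule): record data with a
    # salted hash in place of the direct identifier, keeping the salt separate
    # (or destroying it) so identities cannot readily be ascertained.
    import hashlib
    import secrets

    salt = secrets.token_bytes(16)  # held by a custodian, or discarded for stronger protection

    def pseudonym(identifier: str) -> str:
        # Salted SHA-256; not reversible without the salt, though the remaining
        # fields may still allow re-identification if they are distinctive enough.
        return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()[:12]

    record = {"email": "subject@example.com", "steps": 8432, "heart_rate": 61}
    deidentified = {"pid": pseudonym(record.pop("email")), **record}
    print(deidentified)  # e.g. {'pid': '3f2a...', 'steps': 8432, 'heart_rate': 61}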

Monday, October 24, 2016

Experimenting with quantified self: two months hooked up to a fitness band

It's one thing to collect and analyze behavioral big data (BBD) and another to understand what it means to be the subject of that data. To really understand. Yes, we're all aware that our social network accounts and IoT devices share our private information with large and small companies and other organizations. And although we complain about our privacy, we are forgiving about sharing it, most likely because we really appreciate the benefits.

So, I decided to check out my data sharing in a way that I cannot ignore: I started wearing a fitness band. I bought one of the simplest bands available - the Mi Band Pulse from Xiaomi - this is not a smart watch but "simply" a fitness band that counts steps, measures heart rate, and tracks sleep. Its cheap price means that it is likely to spread to many users. Hence, BBD at speed on a diverse population.

I had two questions in mind:
  1. Can I limit the usage so that I only get the benefits I really need/want while avoiding generating and/or sharing data I want to keep private?
  2. What do I not know about the data generated and shared?
I tried to be as "private" as possible: never turning on location, turning on Bluetooth to sync my data with my phone only once a day for a few minutes, turning on notifications only for incoming phone calls and nothing else (no SMS, no third-party app notifications), only using the Mi Fit app (not linking to other apps like Google Fit), and only "befriending" one family member who bought the same product.

While my experiment is not as daring as a physician testing a drug they developed on themselves, I did learn a few important lessons. Here is what I discovered:
  • Despite intentionally turning off my phone's Bluetooth right after syncing the band's data every day (which means the device should not be able to connect to my phone, or the cloud), the device and my phone did continue having secret "conversations". I discovered this when my band vibrated on incoming phone calls while Bluetooth was turned off. This happened on multiple occasions. I am not sure what part of the technology chain to blame - my Samsung phone? the Android operating system? the Mi Band? It doesn't really matter. The point is: I thought the band was disconnected from my phone, but it was not.
  • Every time I opened the Mi Fit app on my phone, it requested access to location, which I had to decline every time. Quite irritating, and I am guessing many users just turn it on to avoid the irritation.
  • My family member who was my "friend" through the Mi Fit app could see all my data: number of steps, heart rate measurements (whenever I asked the band to measure heart rate), weight (which I entered manually), and even... my sleep hours. Wait: my family member now knew exactly what time I went to sleep and what time I got up every morning. That's something I didn't consider when we decided to become "friends". Plus, Xiaomi has the information about our "relationship" including "pokes" that we could send each other (causing a vibration).
  • The sleep counter claims to measure "deep sleep". It was showing that even if I slept 8 hours and felt wonderfully refreshed, my deep sleep was only 1 hour. At first I was alarmed. With a bit of searching I learned that it doesn't mean much. But the alarming effect made me realize that even this simple device is not for everyone. Definitely not for over-worried people.
After two months of wearing the band 24/7, what have I learned about myself that I didn't already know? That my "deep sleep" is only one hour per night (whatever that means). And that the vibration of the band when conquering the 8,000-steps-a-day goal feels good. Addictively good, despite being meaningless. All the other data was really no news (I walk a lot on days when I teach, my weight is quite stable, my heart rate is reasonable, and I go to sleep too late).

To answer my two questions:
  1. Can I limit the usage so that I gain only the benefits I need? Not really. Some of it is beyond my control, and in some cases I am not aware of what I am sharing.
  2. What do I not know? Quite a bit. I don't know what the device is really measuring (e.g., "deep sleep"), how it is sharing it (when my Bluetooth is off), and what the company does with it.
Is my "behavioral" data useful? To me probably not. To Xiaomi probably yes. One small grain of millet (小米 = XiaoMi) is not much, but a whole sack is valuable.

So what now? I move to the Mi Band 2. Its main new feature is an hourly nudge when you have been sitting for too long.

Tuesday, April 17, 2012

Google Scholar -- you're not alone; Microsoft Academic Search coming up in searches

In searching for a few colleagues' webpages I noticed a new URL popping up in the search results. It included either the domain academic.microsoft.com or the IP address 65.54.113.26. I got curious and checked it out, discovering Microsoft Academic Search (Beta) -- a neat presentation of an author's research publications and collaborations. In addition to the usual list of publications, there are nice visualizations of publications and citations over time, a network chart of co-authors and citations, and even an Erdos Number graph. The genealogy graph states that it is based on data mining and so "might not be perfect".



All this is cool and helpful. But there is one issue that really bothers me: who owns my academic profile?


I checked my "own" Microsoft Academic Search page. Microsoft's software tried to guess my details (affiliation, homepage, papers, etc.) and was correct on some details but wrong on others. Correcting the details required me to open a Windows Live ID account. I had managed to avoid opening such an account until now (I am not a fan of endless accounts) and would have continued to avoid it, had I not been forced to do so: Microsoft created an academic profile page for me, without my consent, with wrong details. Guessing that this page would soon come up in user searches, I felt compelled to correct the inaccurate details.

The next step was even more disturbing: once I logged in with my verified Windows Live ID, I tried to correct my affiliation and homepage and to add a photo. However, I received the message that the affiliation (Indian School of Business) is not recognized (!) and that Microsoft will have to review all my edits before applying them.

So who "owns" my academic identity? Since obviously Microsoft is crawling university websites to create these pages, it would have been more appropriate to find the authors' academic email addresses and email them directly to notify them of the page (with an "opt out" option!) and allow them to make any corrections without Microsoft's moderation.

Tuesday, March 16, 2010

Advancing science vs. compromising privacy

Data mining often brings up the association of malicious organizations that violate individuals' privacy. Three days ago, this tension was brought up a notch (at least in my eyes): Netflix decided to cancel the second round of the famous Netflix Prize. The reason is apparent in the New York Times article "Netflix Cancels Contest After Concerns Are Raised About Privacy". Researchers from the University of Texas have shown that the data disclosed by Netflix in the first contest could be used to identify users. One woman sued Netflix. The Federal Trade Commission got involved, and the rest is history.

What's different about this case is that the main beneficiary of the data made public by Netflix is the scientific data mining community. The Netflix Prize competition led to multiple worthy outcomes, including algorithmic development, insights about existing methods, cross-disciplinary collaborations (in fact, the winning team was a collaboration between computer scientists and statisticians), and collaborations between research groups (many competing teams joined forces to create more accurate ensemble predictions). There was actual excitement among data mining researchers! Canceling the sequel is perceived by many as an obstacle to innovation. Just read the comments on the cancellation posting on Netflix's blog.

After the first feeling of disappointment and some griping, I started to "think positively": what are ways that would allow companies such as Netflix to share their data publicly? One can think of simple technical solutions, such as an "opt out" (or "opt in") choice when you rate movies on Netflix that would tell Netflix whether it can use your ratings in the contest. But clearly there are issues here, such as selection bias, and perhaps legal and technical hurdles as well.
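Here is a minimal simulation of that bias worry, under made-up assumptions (no real Netflix data or mechanism): if willingness to opt in is correlated with rating behavior, the released sample no longer represents the full population.

    # Hypothetical opt-in release: consent probability depends on the rating itself,
    # so the disclosed subset is not a random sample of all ratings.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 50_000
    ratings = rng.integers(1, 6, size=n).astype(float)       # 1-5 stars, simulated

    p_opt_in = 0.2 + 0.1 * (ratings - 1)                     # 20% for 1-star ... 60% for 5-star
    released = ratings[rng.random(n) < p_opt_in]

    print("full-population mean rating:", round(ratings.mean(), 3))
    print("released (opt-in) mean rating:", round(released.mean(), 3))  # noticeably higher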

But what about all that research on advanced data disclosure? Are there not ways to anonymize the data to a reasonable level of comfort? Many organizations (including the US Census Bureau) disclose data to the public while protecting privacy. My sense is that current data disclosure policies are aimed at disclosing data that will allow statistical inference, and hence the disclosed data are aggregated at some level, or else only relevant summary statistics are disclosed (for example, see A Data Disclosure Policy for Count Data Based on the COM-Poisson Distribution). Such data would not be useful for a predictive task where the algorithm should predict individual responses. Another popular masking method is data perturbation, where some noise is added to each data point in order to mask its actual value and avoid identification. The noise addition is intended not to affect statistical inference, but it's a good question how perturbation affects individual-level prediction.
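As a rough illustration of the perturbation point (a sketch with invented numbers, not an actual disclosure policy): adding zero-mean Laplace noise to each record barely moves aggregate statistics, yet each individual released value can be far from the true one, which is exactly what hurts record-level prediction.

    # Perturb individual ratings with Laplace noise and compare aggregate vs.
    # individual-level fidelity. All values and the noise scale are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    ratings = rng.integers(1, 6, size=10_000).astype(float)   # hypothetical 1-5 star ratings

    scale = 1.0                                                # larger scale = more masking, less utility
    released = ratings + rng.laplace(loc=0.0, scale=scale, size=ratings.shape)

    print("true mean:", round(ratings.mean(), 3),
          "| released mean:", round(released.mean(), 3))       # aggregates nearly identical
    print("mean |per-record error|:", round(np.abs(released - ratings).mean(), 3))  # roughly the noise scale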

It looks like the data mining community needs to come up with some data disclosure policies that support predictive analytics.

Thursday, March 06, 2008

Mining voters

While the presidential candidates are still doing their dances, it's interesting to see how they use data mining to improve their standing: the candidates apparently use companies that mine their voter databases in order to "micro-target" voters via ads and the like. See this blog posting on The New Republic -- courtesy of former student Igor Nakshin. Note also the comment about the existence of various such companies that cater to the different candidates.

It would be interesting to test the impact of this "mining" on actual candidate voting and to compare the different tools. But how can this be done in an objective manner without the companies actually sharing their data? That would fall in the area of "privacy-preserving data mining".

Thursday, September 06, 2007

Data mining = Evil?

Some get a chill when they hear "data mining" because they associate it with "big brother". Well, here's one more major incident that sheds darkness on smart algorithms: The Department of Homeland Security declared the end of a data mining program called ADVISE (Analysis, Dissemination, Visualization, Insight and Semantic Enhancement). Why? Because it turns out that they were testing it for two years on live data on real people "without meeting privacy requirements" (Yahoo! News: DHS ends criticized data-mining program).

There is nothing wrong or evil about data mining. It's like any other tool: you can use it or abuse it. Issues of privacy and confidentiality in data usage have always been there and will continue to be a major concern as more and more of our private data gets stored in commercial, government, and other databases.

Many students in my data mining class use data from their workplace for their term project. The projects almost always turn out to be insightful and useful beyond the class exercise. But we do always make sure to obtain permission, de-identify, and protect and restrict access to the data as needed. Good practice is the key to keeping "data mining" a positive term!