
Monday, October 24, 2016

Experimenting with quantified self: two months hooked up to a fitness band

It's one thing to collect and analyze behavioral big data (BBD) and another to understand what it means to be the subject of that data. To really understand. Yes, we're all aware that our social network accounts and IoT devices share our private information with companies large and small, and with other organizations. And although we complain about our privacy, we are quite forgiving about sharing our data, most likely because we really appreciate the benefits.

So, I decided to experience my data sharing in a way that I cannot ignore: I started wearing a fitness band. I bought one of the simplest bands available - the Mi Band Pulse from Xiaomi. This is not a smart watch but "simply" a fitness band that counts steps, measures heart rate, and tracks sleep. Its low price means it is likely to spread to many users. Hence, BBD at scale on a diverse population.

I had two questions in mind:
  1. Can I limit the usage so that I only get the benefits I really need/want while avoiding generating and/or sharing data I want to keep private?
  2. What do I not know about the data generated and shared?
I tried to be as "private" as possible: never turning on location, turning on Bluetooth to sync my data with my phone only once a day for a few minutes, turning on notifications only for incoming phone calls (no SMS, no third-party app notifications), using only the Mi Fit app (not linking to other apps like Google Fit), and "befriending" only one family member who bought the same product.

While my experiment is not as daring as a physician testing a drug they developed on themselves, I did learn a few important lessons. Here is what I discovered:
  • Despite intentionally turning off my phone's Bluetooth right after syncing the band's data every day (which means the device should not be able to connect to my phone or the cloud), the device and my phone continued having secret "conversations". I discovered this when my band vibrated on incoming phone calls while Bluetooth was turned off. This happened on multiple occasions. I am not sure what part of the technology chain to blame - my Samsung phone? the Android operating system? the Mi Band? It doesn't really matter. The point is: I thought the band was disconnected from my phone, but it was not.
  • Every time I opened the Mi Fit app on my phone, it requested access to my location, which I had to decline. Quite irritating, and I am guessing many users just turn it on to avoid the irritation.
  • My family member who was my "friend" through the Mi Fit app could see all my data: number of steps, heart rate measurements (whenever I asked the band to measure heart rate), weight (which I entered manually), and even... my sleep hours. Wait: my family member now knew exactly what time I went to sleep and what time I got up every morning. That's something I didn't consider when we decided to become "friends". Plus, Xiaomi has information about our "relationship", including the "pokes" we could send each other (each causing a vibration).
  • The sleep counter claims to measure "deep sleep". It showed that even when I slept 8 hours and felt wonderfully refreshed, my deep sleep was only 1 hour. At first I was alarmed. With a bit of searching I learned that it doesn't mean much. But the alarming effect made me realize that even this simple device is not for everyone. Definitely not for over-worried people.
After two months of wearing the band 24/7, what have I learned about myself that I didn't already know? That my "deep sleep" is only one hour per night (whatever that means). And that the band's vibration when I conquer the 8,000-steps-per-day goal feels good. Addictively good, despite being meaningless. All the other data was really no news (I walk a lot on days when I teach, my weight is quite stable, my heart rate is reasonable, and I go to sleep too late).

To answer my two questions:
  1. Can I limit the usage to gain only the benefits I need? Not really. Some of it is beyond my control, and in some cases I am not aware of what I am sharing.
  2. What do I not know? Quite a bit. I don't know what the device is really measuring (e.g., "deep sleep"), how it is sharing the data (when my Bluetooth is off), and what the company does with it.
Is my "behavioral" data useful? To me probably not. To Xiaomi probably yes. One small grain of millet (小米 = XiaoMi) is not much, but a whole sack is valuable.

So what now? I am moving on to the Mi Band 2. Its main new feature is an hourly nudge when you have been sitting for too long.

Monday, October 17, 2011

Early detection of what?

The interest in using pre-diagnostic data for the early detection of disease outbreaks has evolved in interesting ways over the last 10 years. In the early 2000s, I was involved in an effort to explore the potential of non-traditional data sources, such as over-the-counter pharmacy sales and web searches on medical websites, which might give earlier signs of a disease outbreak than confirmed diagnostic data (lab tests, doctor diagnoses, etc.). The pre-diagnostic data sources that we looked at were not only expected to show an earlier footprint of an outbreak than traditional diagnostic data; they were also collected at higher frequency (typically daily) compared to the weekly or even less frequent diagnostic data, and were made available with much less lag time. The general conclusion was that there indeed was potential for improving detection time using such data (and to that end we investigated and developed adequate data analytic methods). Evaluation was based on simulating outbreak footprints, which is a challenge in itself (what does a flu outbreak look like in pharmacy sales?), and on examining past data with known outbreaks (where there is often no consensus on the outbreak start date) -- for papers on these issues see here.

A few years ago, Google came out with Google Flu Trends, which monitors web searches for flu-related terms, under the assumption that people who are experiencing flu-like symptoms (or their relatives, friends, etc.) would search for related terms on the web. Google compared its performance to the weekly diagnostic data from the Centers for Disease Control and Prevention (CDC). In a joint paper by Google and CDC researchers, they claimed:
we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. (also published in Nature)
[Figure: Google Flu Trends estimate (blue) vs. CDC data (orange). From google.org/flutrends/about/how.html]

What can you do if you have an early alert of a disease outbreak? The information can be used for stockpiling medicines, vaccination plans, raising public awareness, preparing hospitals, and more. Now comes the interesting part: recently, there has been criticism of the Google Flu Trends claims, saying that "while Google Flu Trends is highly correlated with rates of [influenza-like illness], it has a lower correlation with actual influenza tests positive". In other words, Google detects not a flu outbreak, but rather a perception of flu. Does this mean that Google Flu Trends is useless? Absolutely not. It just means that the goal and the analysis results must be aligned more carefully. As the Popular Mechanics blog writes:
Google Flu Trends might, however, provide some unique advantages precisely because it is broad and behavior-based. It could help keep track of public fears over an epidemic.
Aligning the question of interest with the data (and analysis method) is related to what Ron Kenett and I call "Information Quality", or "the potential of a dataset to answer a question of interest using a given data analysis method". In the early disease detection problem, the lesson is that diagnostic and pre-diagnostic data should not just be considered two different data sets (monitored perhaps with different statistical methods), but they also differ fundamentally in terms of the questions they can answer.

Monday, September 19, 2011

Statistical considerations and psychological effects in clinical trials

I find it illuminating to read statistics "bibles" in various fields: they not only open my eyes to different domains, but also present the statistical approach and methods somewhat differently, considering unique domain-specific issues that cause "hmmm" moments.

The 4th edition of Fundamentals of Clinical Trials, whose authors combine extensive practical experience at the NIH and in academia, is full of hmmm moments. In one, the authors mention an important issue related to sampling that I have not encountered in other fields. In clinical trials, the gold standard is to allocate participants randomly, with equal probabilities, to either an intervention or a non-intervention (baseline) group. In other words, half the participants receive the intervention and the other half do not (the non-intervention can be a placebo, the traditional treatment, etc.). The authors advocate a 50:50 ratio because "equal allocation is the most powerful design". While there are reasons to change the ratio in favor of either the intervention or the baseline group, equal allocation appears to have an important additional psychological advantage over unequal allocation in clinical trials:
Unequal allocation may indicate to the participants and to their personal physicians that one intervention is preferred over the other (pp. 98-99)
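The "most powerful design" claim has a simple basis; here is the standard textbook argument, sketched by me rather than quoted from the book. For a two-group comparison of means with common variance sigma^2, the variance of the estimated treatment effect is

\[
\mathrm{Var}(\bar{Y}_1 - \bar{Y}_2) \;=\; \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \;\ge\; \frac{4\sigma^2}{N}
\quad \text{for fixed } N = n_1 + n_2,
\]

with equality exactly when n_1 = n_2 = N/2. A smaller variance of the estimated effect means higher power to detect a given effect size, so for a fixed total sample size the 50:50 split is optimal.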
Knowledge of the sample design by the participants and/or the physicians also affects how randomization is carried out. It becomes a game between the designers and the participants and staff, where the two sides have opposing interests: to blur vs. to uncover the group assignments before they are made. This gaming requires devising special randomization methods (which, in turn, require data analysis that takes the randomization mechanism into account).

For example, to assure an equal number of participants in each of the two groups when participants enter sequentially, "block randomization" can be used. For instance, to assign 4 people to one of two groups A or B, consider all the balanced arrangements (AABB, ABAB, ABBA, etc.), choose one sequence at random, and assign participants accordingly. The catch is that if the staff know that the block size is 4 and know the first three allocations, they automatically know the fourth allocation and can introduce bias by using this knowledge to select every fourth participant.
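Here is a minimal sketch of block randomization (my own illustration, not code from the book). It also makes the weakness above visible: within each block, once the first block_size - 1 assignments are known, the last one is forced.

import itertools
import random

def block_randomization(n_participants, block_size=4, groups=("A", "B")):
    """Assign participants sequentially using randomly chosen balanced blocks."""
    half = block_size // 2
    # All balanced arrangements of one block, e.g. for block_size=4:
    # AABB, ABAB, ABBA, BABA, BAAB, BBAA.
    balanced_blocks = list(set(itertools.permutations(groups[0] * half + groups[1] * half)))
    assignments = []
    while len(assignments) < n_participants:
        assignments.extend(random.choice(balanced_blocks))  # one random block at a time
    return assignments[:n_participants]

print(block_randomization(10))  # e.g. ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A', 'A', 'B']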

Where else does such a psychological effect play a role in determining sampling ratios? In applications where participants and other stakeholders have no knowledge of the sampling scheme, this is obviously a non-issue. For example, when Amazon or Yahoo! present different information to different users, the users have no idea about the sample design, and maybe not even that they are in an experiment. But how is the randomization achieved? Unless the randomization process is fully automated and not susceptible to reverse engineering, someone in the technical department might decide to favor friends by allocating them to the "better" group...
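One common way to automate such assignment (a hypothetical sketch, not any particular company's system) is to hash each user's ID together with a secret experiment key, so the allocation is deterministic and reproducible, yet hard to steer without knowing the key:

import hashlib

def assign_group(user_id, experiment_key, p_treatment=0.5):
    # Hash the user id together with a secret key: the same user always
    # lands in the same group, and staff cannot steer assignments
    # without knowing the key.
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # map the first 8 hex digits to [0, 1)
    return "treatment" if bucket < p_treatment else "control"

print(assign_group("user-12345", "experiment-42"))  # same user, same group, every time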

Thursday, September 15, 2011

Mining health-related data: How to benefit scientific research

[Image from KDnuggets.com]
While debates over privacy issues related to electronic health records are still ongoing, predictive analytics are beginning to be used with administrative health data (available to health insurance companies, a.k.a. "health provider networks"). One such venue is large data mining contests. Let me describe a few and then get to my point about their contribution to public health, medicine, and data mining research.

The latest and grandest is the ongoing $3 million prize contest by the Heritage Provider Network, which opened in 2010 and lasts 2 years. The contest's stated goal is to create "an algorithm that predicts how many days a patient will spend in a hospital in the next year". Participants get a dataset of de-identified medical records of 100,000 individuals, on which they can train their algorithms. An article on KDnuggets.com suggests that this competition's goal is "to spur development of new approaches in the analysis of health data and create new predictive algorithms."

The 2010 SAS Data Mining Shootout contest was also health-related. Unfortunately, the contest webpage is no longer available (the problem description and data were previously available here), and I couldn't find any information on the winning strategies. From an article on KDnuggets:

"analyzing the medical, demographic, and behavioral data of 50,788 individuals, some of whom had diabetes. The task was to determine the economic benefit of reducing the Body Mass Indices (BMIs) of a selected number of individuals by 10% and to determine the cost savings that would accrue to the Federal Government's Medicare and Medicaid programs, as well as to the economy as a whole"
The 2009 INFORMS data mining contest, co-organized by IBM Research and Health Care Intelligence, focused on "health care quality". Strangely enough, this contest website is also gone. A brief description by the organizer (Claudia Perlich) is given on KDnuggets.com, stating the two goals:
  1. modeling of a patient transfer guideline for patients with a severe medical condition, from a community hospital setting to a tertiary hospital provider, and
  2. assessment of the severity/risk of death of a patient's condition.
What about presentations or reports from the winners? I had a hard time finding any (here is a deck of slides by a group competing in the 2011 SAS Shootout, also health-related). But photos of winners holding awards and checks abound.

If these health-related data mining competitions are to promote research and solutions in these fields, the contest webpages with the problem description and data, as well as presentations and reports by the winners, should remain publicly available (as they do for the annual KDD Cup competitions by the ACM). Posting only the names and photos of the winners makes data mining competitions look more like a consulting job, where the data provider is interested in solving one particular problem for its own (financial or other) benefit. There is definitely scope for a data mining group or organization to collect all this information while it is live and post it on one central website.