BzST | Business Analytics, Statistics, Teaching: data source

Monday, October 17, 2011

Early detection of what?

The interest in using pre-diagnostic data for the early detection of disease outbreaks has evolved in interesting ways in the last 10 years. In the early 2000s, I was involved in an effort to explore the potential of non-traditional data sources, such as over-the-counter pharmacy sales and web searches on medical websites, which might give earlier signs of a disease outbreak than confirmed diagnostic data (lab tests, doctor diagnoses, etc.). The pre-diagnostic data sources that we looked at were not only expected to have an earlier footprint of the outbreak compared to traditional diagnostic data, but they were also collected at higher frequency (typically daily) compared to the weekly or even less frequent diagnostic data, and were made available with much less lag time. The general conclusion was that there was indeed potential for improving the detection time using such data (and for that we investigated and developed adequate data analytic methods). Evaluation was based on simulating outbreak footprints, which is a challenge in itself (what does a flu outbreak look like in pharmacy sales?), and on examining past data with known outbreaks (where there is often no consensus on the outbreak start date) -- for papers on these issues see here.

A few years ago, Google came out with Google Flu Trends, which monitors web searches for terms that are related to flu, with the underlying assumption that people (or their relatives/friends, etc.) who are experiencing flu-like symptoms would be searching for related terms on the web. Google compared their performance to the weekly diagnostic data by the Centers for Disease Control and Prevention (CDC). In a joint paper by Google and CDC researchers, they claimed:

we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. (also published in Nature)

Blue= Google flu estimate; Orange=CDC data. From google.org/flutrends/about/how.html

What can you do if you have an early alert of a disease outbreak? The information can be used for stockpiling medicines, vaccination plans, providing public awareness, preparing hospitals, and more. Now comes the interesting part: recently, there has been criticism of the Google Flu Trends claims, saying that "while Google Flu Trends is highly correlated with rates of [Influenza-like illness], it has a lower correlation with actual influenza tests positive". In other words, Google detects not a flu outbreak, but rather a perception of flu. Does this mean that Google Flu Trends is useless? Absolutely not. It just means that the goal and the analysis results must be aligned more carefully. As the Popular Mechanics blog writes:

Google Flu Trends might, however, provide some unique advantages precisely because it is broad and behavior-based. It could help keep track of public fears over an epidemic

Aligning the question of interest with the data (and analysis method) is related to what Ron Kenett and I call "Information Quality", or "the potential of a dataset to answer a question of interest using a given data analysis method". In the early disease detection problem, the lesson is that diagnostic and pre-diagnostic data should not just be considered two different data sets (monitored perhaps with different statistical methods), but they also differ fundamentally in terms of the questions they can answer.
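To make the correlation comparison in the criticism above concrete, here is a minimal Python sketch that correlates one search-based series against two different reference series. The weekly numbers are made up purely for illustration (they are not Google or CDC data), and the names `search_estimate`, `ili_rate` and `positive_tests` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Made-up weekly values, for illustration only (NOT real Google or CDC data).
search_estimate = [1.0, 1.2, 1.8, 2.9, 4.1, 3.6, 2.2, 1.4]  # search-based estimate
ili_rate = [1.1, 1.3, 1.7, 2.7, 4.0, 3.8, 2.4, 1.5]         # influenza-like illness visits
positive_tests = [0.5, 0.4, 0.6, 2.0, 1.2, 3.5, 3.0, 1.0]   # lab-confirmed influenza

r_ili = pearson(search_estimate, ili_rate)
r_lab = pearson(search_estimate, positive_tests)

print(f"correlation with ILI rate:       {r_ili:.2f}")
print(f"correlation with positive tests: {r_lab:.2f}")
```

The same series can track one target closely and another only loosely, which is why the question being asked has to be fixed before judging the data source.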

Monday, June 20, 2011

Got Data?!

The American Statistical Association's store used to sell cool T-shirts with the old-time beggar-statistician question "Got Data?" Today it is much easier to find data, thanks to the Internet. Dozens of student teams taking my data mining course have been able to find data from various sources on the Internet for their team projects. Yet, I often receive queries from colleagues in search of data for their students' projects. This is especially true for short courses, where students don't have sufficient time to search and gather data (which is highly educational in itself!).

One solution that I often offer is data from data mining competitions. KDD Cup is a classic, but there are lots of other data mining competitions that make huge amounts of real or realistic data available: past INFORMS Data Mining Contests (2008, 2009, 2010), ENBIS Challenges, and more. Here's one new competition to add to the list:

The European Network for Business and Industrial Statistics (ENBIS) announced the 2011 Challenge (in collaboration with SAS JMP). The title is "Maximising Click Through Rates on Banner Adverts: Predictive Modeling in the On Line World". It's a bit complicated to find the full problem description and data on the ENBIS website (you'll find yourself clicking through endless "more" buttons - hopefully these are not data collected for the challenge!), so I linked them up.

It's time for T-shirts saying "Got Data! Want Knowledge?"

Thursday, March 06, 2008

New data repository by UN

As more government and other agencies move "online", some actually make their data publicly available. Adi Gadwale, one of my dedicated ex-students, sent a note about a neat new data repository made publicly available by the UN called UNdata. You can read more about it in the UN News bulletin or go directly to the repository at http://data.un.org

The interface is definitely easy to navigate. Lots of time series for the different countries on many types of measurements. This is a good source of data that can be used to supplement other existing datasets (like one would use US census data to supplement demographic information).

Another interesting data repository is TRAC. Its mission is to obtain and provide all information that should be public under the Freedom of Information Act. It has data on many US agencies. Some data are free for download, but to get access to all the neat stuff you (or your institution) need a subscription.

Wednesday, March 07, 2007

Source for data

Adi Gadwale, a student in my 2004 MBA Data Mining class, still remembers my fetish with business data and data visualization. He just sent me a link to an IBM Research website called Many Eyes, which includes user-submitted datasets as well as Java-applet visualizations.

The datasets include quite a few "junk" datasets, lots with no description. But there are a few interesting ones: FDIC is a "scrubbed list of FDIC institutions removing inactive entities and stripping all columns apart from Assets, ROE, ROA, Offices (Branches), and State". It includes 8711 observations. Another is Absorption Coefficients of Common Materials - I can just see the clustering exercise! Or the 2006 Top 100 Video Games by Sales. There are social-network data, time series, and cross-sectional data. But again, it's like shopping at a second-hand store -- you really have to go through a lot of junk in order to find the treasures.

Happy hunting! (and thanks to Adi)