Thursday, April 23, 2009


In the process of planning the syllabus for my next PhD course on "Scientific Data-Collection", to be offered for the third time in Spring 2009, I have realized how fragmented the education of statisticians is, especially when considering the applied courses. The applied part of a typical degree in statistics will include a course in Design of Experiments, one on Surveys and Sampling, one on Computing with the latest statistical software (currently R), perhaps one on Quality Control, and of course the usual Regression, Multivariate Analysis and other modeling courses.

Because there is usually very little overlap between the courses (perhaps the terms "sigma" and "p-value" are repeated across them), or sometimes extreme overlap (we learned ANOVA both in a "Regression and ANOVA" course, and again in "Design of Experiments"), I initially compartmentalized the courses into separate conceptual entities. Each had its own terminology and point of view. It took me a while to even get the difference between the "probability" part and the "statistics" part in the "Intro to probability and statistics" course. I can be cynical and attribute the mishmash to the diverse backgrounds of the faculty, but I suppose it is more due to "historical reasons".

After a while, to make better sense of my overall profession, I was able to cluster the courses into the broader "statistical models", "math and prob background", "computing", etc.
But truthfully, this too is very unsatisfactory. It's a very limited view of the "statistical purpose" of the tools, with labels lifted from the textbooks used in each subject.

For my Scientific Data-Collection course, where students come from a wide range of business disciplines, I cover three main data collection methods (all within an Internet environment): Web collection (crawling, API, etc.), online surveys, and online/lab experiments. In a statistics curriculum you would never find such a combo. You won't even find a statistics textbook that covers all three topics. So why did we bind them? Because these are the main tools that researchers use today to gather data!

For each of the three topics we discuss how to design effective data collection schemes and tools. In addition to statistical considerations (guaranteeing that the collected data will be adequate for answering the research question of interest), and resource constraints (time, money, etc.), there are two additional aspects: ethical and technological. These are extremely important and are must-knows for any hands-on researcher.

Thinking of the non-statistical aspects of data collection has led me to a broader and more conceptual view of the statistics profession. I like David Hand's definition of statistics as a technology (rather than a science). It means that we should think about our different methods as technologies within a context. Rather than thinking of our knowledge as a toolkit (with a hammer, screwdriver, and a few other tools), we should generalize across the different methods in terms of their use by non-statisticians. How and when do psychologists use surveys? Experiments? Regression models? T-tests? [Or are they compartmentalizing those according to the courses that they studied from Statistics faculty?] How are chemical engineers collecting, analyzing, and evaluating their data?

Ethical considerations are rarely discussed in statistics courses, although they usually are discussed in "research methods" grad courses in the social sciences. Yet ethical considerations are closely related to the statistical design. Limitations on sample size can arise due to copyright law (web-crawling), patient safety (clinical trials), or non-response rates (surveys). Not to mention that every academic involved in human subjects research should be educated about Institutional Review Boards and the study approval process. Similarly, technological issues are closely related to sample size and the quality of the generated data. Servers that are down during a web crawl (or due to the web crawl!), email surveys that are not properly displayed in Firefox or are caught by spam filters, and overly-sophisticated technological experiments are all issues that statistics students should also be educated about.

I propose to re-design the statistics curriculum around two coherent themes: "Modeling and Data Analysis Technologies", and "Data Collection / Study Design Technologies". An intro course should present these two different components, their role, and their links so that students will have context.

And, of course, the "Modeling and Data Analysis" theme should be clearly broken down into "explaining", "predicting", and "describing".

Saturday, April 18, 2009

Collecting online data (for research)

In the new era of large amounts of publicly available data, an issue that is sometimes overlooked is ethical data collection. Whereas for experimental studies involving humans we have clear guidelines and an organizational process for assessing and approving data collection (in the US, via the IRB), collecting observational data is much more ambiguous. For instance, if I want to collect data on 50,000 book titles on Amazon, including their ratings, reviews, and cover images - is it ethical to collect this information by web crawling? A first thought might be "why not? the information is there and I am not taking anything from anyone". However, there are hidden costs and risks here that must be considered. First, in the above example, the web crawler will be mimicking manual browsing, thereby accessing Amazon's server. This is one cost to Amazon. Second, Amazon posts this information for buyers for the purpose of generating revenue. When one's intention is not to actually purchase, then it is a misuse of the public information. Finally, one must ask whether there is any risk to the data provider (for instance - maybe too heavy access can slow down the provider's server, thereby slowing down or even denying access to actual potential buyers).

When the goal of the data collection is research, then another factor to consider is the benefits of the research study to society, to scientific research or "general knowledge", and perhaps even to the company.

Good practice involves considering the costs, risks, and benefits to the data provider, designing your collection accordingly, and letting the data provider know about your intention. Careful consideration of actual sample size is therefore still important even in this new environment. An interesting paper by Allen, Burk, and Davis ("Academic Data Collection in Electronic Environments: Defining Acceptable Use of Internet Resources") discusses these issues and offers guidelines for "acceptable use" of internet resources.
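
To make the link between sample size and the load on the provider concrete, here is a minimal Python sketch of planning a crawl before running it. It is only an illustration under stated assumptions: the robots.txt content, the example.com URLs, the "research-crawler" user agent, and the plan_crawl helper are all hypothetical, and nothing is actually downloaded. It uses the standard library's urllib.robotparser to keep only the pages the provider permits and to estimate how long the crawl would take at the provider's requested delay.

```python
import urllib.robotparser

# Hypothetical robots.txt content (an assumption for illustration);
# a real crawl would read the data provider's own file.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def plan_crawl(urls, robots_txt, user_agent="research-crawler", default_delay=2.0):
    """Return the permitted URLs and the estimated crawl duration in seconds."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Keep only pages the provider allows this user agent to fetch.
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    # Use the provider's requested delay if one is given, else our own default.
    delay = parser.crawl_delay(user_agent) or default_delay
    return allowed, len(allowed) * delay


# Hypothetical sample of pages to collect.
urls = [
    "https://example.com/books/1",
    "https://example.com/books/2",
    "https://example.com/private/admin",
]
allowed, seconds = plan_crawl(urls, ROBOTS_TXT)
print(len(allowed), "pages allowed; roughly", seconds, "seconds at the requested delay")
```

Scaling the same arithmetic to a sample of 50,000 pages at a five-second delay gives roughly 250,000 seconds, or close to three days of continuous access, which is one reason the sample size deserves careful thought.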

These days more and more companies (e.g., eBay and Amazon) are moving to "push" technology, where they make their data available for collection via API and RSS technologies. Obtaining data in this way avoids the ethical and legal considerations, but one is then limited to the data that the data source has chosen to provide. Moreover, the amount of data is usually limited. Hence, I believe that web crawling will continue to be used, but in combination with API and RSS the extent of crawling can be reduced.