Thursday, April 23, 2009


In the process of planning the syllabus for my next PhD course on "Scientific Data-Collection", to be offered for the third time in Spring 2009, I have realized how fragmented the education of statisticians is, especially when considering the applied courses. The applied part of a typical degree in statistics will include a course in Design of Experiments, one on Surveys and Sampling, one on Computing with the latest statistical software (currently R), perhaps one on Quality Control, and of course the usual Regression, Multivariate Analysis and other modeling courses.

Because there is usually very little overlap between the courses (perhaps the terms "sigma" and "p-value" are repeated across them), or sometimes extreme overlap (we learned ANOVA both in a "Regression and ANOVA" course, and again in "Design of Experiments"), I initially compartmentalized the courses conceptually into separate entities. Each had its own terminology and point of view. It took me a while to even get the difference between the "probability" part and the "statistics" part in the "Intro to probability and statistics" course. I can be cynical and attribute the mishmash to the diverse backgrounds of the faculty, but I suppose it is more due to "historical reasons".

After a while, to make better sense of my overall profession, I was able to cluster the courses into the broader "statistical models", "math and prob background", "computing", etc.
But truthfully, this too is very unsatisfactory. It's a very limited view of the "statistical purpose" of the tools, one that simply borrows its phrases from the textbooks used in each subject.

For my Scientific Data-Collection course, where students come from a wide range of business disciplines, I cover three main data collection methods (all within an Internet environment): Web collection (crawling, API, etc.), online surveys, and online/lab experiments. In a statistics curriculum you would never find such a combo. You won't even find a statistics textbook that covers all three topics. So why did we bind them? Because these are the main tools that researchers use today to gather data!

For each of the three topics we discuss how to design effective data collection schemes and tools. In addition to statistical considerations (guaranteeing that the collected data will be adequate for answering the research question of interest) and resource constraints (time, money, etc.), there are two further aspects: ethical and technological. These are extremely important and are must-knows for any hands-on researcher.

Thinking of the non-statistical aspects of data collection has led me to a broader and more conceptual view of the statistics profession. I like David Hand's definition of statistics as a technology (rather than a science). It means that we should think about our different methods as technologies within a context. Rather than thinking of our knowledge as a toolkit (with a hammer, screwdriver, and a few other tools), we should generalize across the different methods in terms of their use by non-statisticians. How and when do psychologists use surveys? Experiments? Regression models? t-tests? [Or are they compartmentalizing those according to the courses that they took from Statistics faculty?] How are chemical engineers collecting, analyzing, and evaluating their data?

Ethical considerations are rarely discussed in statistics courses, although they usually are discussed in "research methods" grad courses in the social sciences. Yet ethical considerations are very closely related to the statistical design. Limitations on sample size can arise due to copyright law (web crawling), patient safety (clinical trials), or non-response rates (surveys). Not to mention that every academic involved in human subjects research should be educated about Institutional Review Boards and the study approval process. Similarly, technological issues are closely related to sample size and the quality of the generated data. Servers that are down during a web crawl (or due to the web crawl!), email surveys that are not properly displayed in Firefox or that get caught by spam filters, and overly sophisticated technological experiments are all issues that statistics students should also be educated about.

I propose to re-design the statistics curriculum around two coherent themes: "Modeling and Data Analysis Technologies", and "Data Collection / Study Design Technologies". An intro course should present these two components, their roles, and the links between them so that students will have context.

And, of course, the "Modeling and Data Analysis" theme should be clearly broken down into "explaining", "predicting", and "describing".

1 comment:

Unknown said...

hey galit, lets synch up on this. i'm teaching a phd seminar in the spring too and was thinking of having an active data collection portion in it...ravi b