Friday, December 19, 2014

New curriculum design guidelines by American Statistical Association: Who will teach?

The American Statistical Association published new "Curriculum Guidelines for Undergraduate Programs in Statistical Science". This is the first update to the guidelines since 2000.
The executive summary lists the key points:
  1. Increased importance of data science
  2. Real applications
  3. More diverse models and approaches
  4. Ability to communicate
This set sounds right on target with what is expected of statisticians in industry (the authors of the report include prominent statisticians in industry). It highlights the current narrow focus of statistics programs as well as their lack of real-world usability. 

I found three notable mentions in the descriptions of the above points:
Point #1: "Students should be fluent in higher-level programming languages and facile with database systems."
Point #2: "Students require exposure to and practice with a variety of predictive and explanatory models in addition to methods for model-building and assessment."
Point #3: "Students need to be able to communicate complex statistical methods in basic terms to managers and other audiences and to visualize results in an accessible manner."
Agree! But are Statistics faculty qualified to teach these topics and skills? Since these capabilities are not built into most Statistics graduate programs, faculty in Statistics departments typically have not been exposed to these topics, nor to methods for teaching them (two different skills!). While one can delegate programming to computer science instructors, doing so creates a gap between the students' abilities and those of the Statistics faculty.

Point #2 talks about prediction and explanation - an extremely important distinction for both practicing and research statisticians. This distinction is still quite blurred in the Statistics community as well as in many other domains, and textbooks have still not caught up, leaving a gap in the needed teaching materials.

Point #3 is an interesting one: while data visualization is a key concept in Statistics, it is typically used in the context of Exploratory Data Analysis, where charts and summaries are used by the statistician to understand the data prior to analysis. Point #3 talks about a different use of visualization: communication between the statistician and the stakeholder. This requires an approach that goes beyond the classic classes on box plots, histograms, and computing percentiles.

To summarize: great suggestions for improving the undergrad curriculum. But successful implementation requires professional development for most faculty teaching in such programs.

Let me add my own key point, which is a critical issue underlying many data scandals and sagas: "Students need to understand what population their final cleaned sample generalizes to". The issue of generalization, not just in the sense of statistical inference, is at the heart of using data to come up with insights and decisions for new records and/or in new situations. After sampling, cleaning (!!), pre-processing, and analyzing the data, you often end up with results that are relevant to a very restricted population, which is far from what you initially intended.
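To make the cleaning point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `age_group` and `income` fields, the counts, and the helper names `group_shares` and `drop_incomplete` are my own assumptions, not anything from the report. It shows how simply dropping incomplete records can change who the "final cleaned sample" represents.

```python
from collections import Counter

def group_shares(records, key):
    """Return the share of records falling in each value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def drop_incomplete(records):
    """Listwise deletion: keep only records with no missing (None) fields."""
    return [r for r in records if all(v is not None for v in r.values())]

# Hypothetical raw sample: income is missing far more often for the "65+" group.
sample = (
    [{"age_group": "18-64", "income": 50_000}] * 45
    + [{"age_group": "18-64", "income": None}] * 5
    + [{"age_group": "65+", "income": 30_000}] * 15
    + [{"age_group": "65+", "income": None}] * 35
)

cleaned = drop_incomplete(sample)
before = group_shares(sample, "age_group")
after = group_shares(cleaned, "age_group")

for group in before:
    print(f"{group}: {before[group]:.0%} of raw sample, {after[group]:.0%} after cleaning")
```

In this toy setup the two age groups are equally represented in the raw sample, but after cleaning the 65+ group shrinks to a quarter of the records. Any result computed on the cleaned data then speaks mostly for the 18-64 group, even though nobody decided to restrict the study that way.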

As an aside: note the use of the term "Data Science" in the report - a term now claimed by statisticians, operations researchers, computer scientists and anyone trying to ride the new buzz. What does it mean here? The report reads (page 7):
Although a formal definition of data science is elusive, we concur with the StatsNSF committee statement that data science comprises the “science of planning for, acquisition, management, analysis of, and inference from data.”
Oops - what about non-inference uses such as prediction? And communication?