Last week I launched a new website for my textbook Practical Time Series Forecasting. The website offers resources such as the datasets used in the book, a news section that pushes posts to the book's Facebook page, information about the book and author, and, for instructors, online forms for requesting an evaluation copy and for requesting access to solutions.
I am already anticipating my colleagues' question "what platform did you use?". Well, I did not hire a web designer, nor did I spend three months putting the website together using HTML. Instead, I used Google Sites. This is a great solution for those who like to manage their book website on their own (whether you're self-publishing or not). Very readable, clean design, integration with other Google Apps components (such as forms), and as hack-proof as it gets. Not to mention easy to update and maintain, and free hosting.
Thanks to the tools and platforms offered by Google and Amazon, self-publishing is not only a realistic option for authors; it also allows a much closer connection between the author and the book's users -- instructors, students, and "independent" readers.
Monday, July 30, 2012
Wednesday, July 25, 2012
Explain/Predict in Epidemiology
Researchers in various fields have been sending me emails and reactions after reading my 2010 paper "To Explain or To Predict?". While I am aware of research methodology in a few areas, I'm learning in more detail about the scientific challenges caused by "predictive-less" areas.
In an effort to further disseminate this knowledge, I'll be posting these reactions on this blog (with the senders' approval, of course).

In a recent email, Stan Young, Assistant Director for Bioinformatics at NISS, commented about the explain/predict situation in epidemiology:

"I enjoyed reading your paper... I am interested in what I think is [epidemiologists'] lack of clarity on explain/predict. They seem to take the position that no matter how many tests they compute, that any p-value <0.05 is a strong indication of something real (=explain) and that everyone should follow their policies (=predict) when, given all their analysis problems, they at the very best should consider their claims as hypothesis generating."

In a talk by epidemiology Professor Uri Goldbourt, who was a discussant in a recent "Explain or Predict" panel, I learned that modeling in epidemiology is nearly entirely descriptive. Unlike explanatory modeling, there is little underlying causal theory. And there is no prediction or evaluation of predictive power going on. Modeling typically focuses on finding correlations between measurable variables in observational studies that generalize to the population (and hence the wide use of inference, and, unfortunately, a huge issue of multiple testing).

Predictive modeling has huge potential to advance research in epidemiology. Among many benefits (such as theory validation), it would bring the field closer to today's "personalized" environment: not only concentrating on "average patterns", but also generating personalized predictions for individuals.

I'd love to hear more from epidemiologists! Please feel free to post comments or to email me directly.
Tuesday, July 24, 2012
Linear regression for binary outcome: even better news
I recently attended the 8th World Congress in Probability and Statistics, where I heard an interesting talk by Andy Tsao. His talk "Naivity can be good: a theoretical study of naive regression" (Abstract #0586) was about Naive Regression: the application of linear regression to a categorical outcome, treating the outcome as numerical. He asserted that predictions from Naive Regression will be quite good. My last post was about the "goodness" of linear regression applied to a binary outcome in terms of the estimated coefficients -- that's what explanatory modeling is about. What Dr. Tsao alerted me to is that the predictions (or, more correctly, classifications) will be good too. In other words, it's useful for predictive modeling! In his words:
"This naivity is not blessed from current statistical or machine learning theory. However, surprisingly, it delivers good or satisfactory performances in many applications."

Note that to derive a classification from naive regression, you treat the prediction as the class probability (although it might be negative or >1), and apply a cutoff value as in any other classification method.

Dr. Tsao pointed me to the good old The Elements of Statistical Learning, which has a section called "Linear Regression of an Indicator Matrix". There are two interesting takeaways from Dr. Tsao's talk:
- Naive Regression and Linear Discriminant Analysis will have the same ROC curve, meaning that the ranking of predictions will be identical.
- If the two groups are of equal size (n1=n2), then Naive Regression and Discriminant Analysis are equivalent and therefore produce the same classifications.
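To make the cutoff idea concrete, here is a minimal simulation sketch (the simulated data, the 0.5 cutoff, and all variable names are my own choices for illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a binary outcome driven by two predictors
n = 1000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))   # true class probabilities
y = rng.binomial(1, p)                        # 0/1 outcome treated as numeric

# Naive Regression: ordinary least squares with y as the numerical response
X1 = np.column_stack([np.ones(n), X])         # add intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Fitted values act as (possibly out-of-range) class probabilities;
# apply a cutoff to turn them into classifications
scores = X1 @ beta
classes = (scores > 0.5).astype(int)

accuracy = (classes == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The fitted scores need not lie in [0, 1], but as the talk suggests, their ranking (and hence the classifications after a cutoff) can be perfectly sensible.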
Monday, May 28, 2012
Linear regression for a binary outcome: is it Kosher?
Regression models are the most popular tool for modeling the relationship between an outcome and a set of inputs. Models can be used for descriptive, causal-explanatory, and predictive goals (but in very different ways! see Shmueli 2010 for more).
The family of regression models includes two especially popular members: linear regression and logistic regression (with probit regression more popular than logistic in some research areas). Common knowledge, as taught in statistics courses, is: use linear regression for a continuous outcome and logistic regression for a binary or categorical outcome. But why not use linear regression for a binary outcome? The two common answers are: (1) linear regression can produce predictions that are not binary, and hence "nonsense", and (2) inference based on the linear regression coefficients will be incorrect.
I admit that I bought into these "truths" for a long time, until I learned never to take any "statistical truth" at face value. First, let us realize that problem #1 relates to prediction and #2 to description and causal explanation. In other words, if issue #1 can be "fixed" somehow, then I might consider linear regression for prediction even if the inference is wrong (who cares about inference if I am only interested in predicting individual observations?). Similarly, if there is a fix for issue #2, then I might consider linear regression as a kosher inference mechanism even if it produces "nonsense" predictions.
The 2009 paper Linear versus logistic regression when the dependent variable is a dichotomy by Prof. Ottar Hellevik from Oslo University de-mystifies some of these issues. First, he gives some tricks that help avoid predictions outside the [0,1] range. The author identifies a few factors that contribute to "nonsense predictions" by linear regression:
- interactions that are not accounted for in the regression
- non-linear relationships between a predictor and the outcome
The suggested remedy for these issues is including interaction terms for categorical variables, and if numerical predictors are involved, then bucket them into bins and include those as dummies + interactions. So, if the goal is predicting a binary outcome, linear regression can be modified and used.
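As an illustration of the binning remedy, here is a sketch on simulated data (the bin edges, variable names, and effect sizes are all made up for illustration). Because the dummies-plus-interactions design here is saturated, the fitted values are simply cell means of the 0/1 outcome, so they stay within [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

age = rng.uniform(20, 70, n)                  # numerical predictor
female = rng.binomial(1, 0.5, n)              # categorical predictor (dummy)
y = rng.binomial(1, 0.2 + 0.3 * (age > 45))   # binary outcome

# Bucket the numerical predictor into bins, encoded as dummies
# (one bin dropped as the baseline)
bins = np.digitize(age, [35, 50])             # 0: <35, 1: 35-50, 2: >=50
d1 = (bins == 1).astype(float)
d2 = (bins == 2).astype(float)

# Design matrix: intercept, bin dummies, the other dummy, and interactions
X = np.column_stack([np.ones(n), d1, d2, female, d1 * female, d2 * female])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values are piecewise-constant cell means of y
fitted = X @ beta
print(fitted.min(), fitted.max())
```

With a non-saturated design (e.g., many numerical predictors left unbinned), fitted values can still escape [0, 1], which is why Hellevik's tricks focus on dummies and interactions.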
Now to the inference issue. "The problem with a binary dependent variable is that the homoscedasticity assumption (similar variation on the dependent variable for units with different values on the independent variable) is not satisfied... This seems to be the main basis for the widely held opinion that linear regression is inappropriate with a binary dependent variable". Statistical theory tells us that violating the homoscedasticity assumption results in biased standard errors for the coefficients, and that the coefficients might not be the most precise in terms of variance. Yet, the coefficients themselves remain unbiased (meaning that with a sufficiently large sample they are "on target"). Hence, with a sufficiently large sample we need not worry! Precision is not an issue in very large samples, and hence the on-target coefficients are just what we need.
I will add that another concern is that the normality assumption is violated: the residuals from a regression model on a binary outcome will not look very bell-shaped... Again, with a sufficiently large sample, the distribution does not make much difference, since the standard errors are so small anyway.
[Chart from Hellevik (2009)]
Hellevik's paper pushes the envelope further in an attempt to explore "how small can you go" with your sample before getting into trouble. He uses simulated data and compares the results from logistic and linear regression for fairly small samples. He finds that the differences are minuscule.
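A simulation in this spirit (not Hellevik's actual setup; the sample size, effect size, and the hand-rolled Newton-Raphson logistic fit are my own choices) shows how close the fitted probabilities from the two models can be:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_logistic(X, y, steps=25):
    """Logistic regression via plain Newton-Raphson (no regularization)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        H = X.T @ (X * W[:, None])               # Hessian
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

n = 100                                           # fairly small sample
x = rng.uniform(-1, 1, n)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))         # binary outcome
X = np.column_stack([np.ones(n), x])

# Linear regression: fitted values approximate P(y=1)
b_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
p_lin = X @ b_lin

# Logistic regression: fitted probabilities
b_log = fit_logistic(X, y)
p_log = 1 / (1 + np.exp(-X @ b_log))

# Over a moderate predictor range, the two sets of fitted
# probabilities differ very little
print(np.abs(p_lin - p_log).max())
```

The closeness holds when the probabilities stay away from 0 and 1 (where the logistic curve is nearly linear); with extreme probabilities the gap grows, consistent with Hellevik's discussion.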
The bottom line: linear regression is kosher for prediction if you take a few steps to accommodate non-linear relationships (but of course it is not guaranteed to produce better predictions than logistic regression!). For inference, for a sufficiently large sample where standard errors are tiny anyway, it is fine to trust the coefficients, which are in any case unbiased.
Tuesday, May 22, 2012
Policy-changing results or artifacts of big data?
The New York Times article Big Study Links Good Teachers to Lasting Gain covers a research study coming out of Harvard and Columbia on "The Long-Term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood". The authors used sophisticated econometric models applied to data from a million students to conclude:
"We find that students assigned to higher VA [Value-Added] teachers are more successful in many dimensions. They are more likely to attend college, earn higher salaries, live in better neighborhoods, and save more for retirement. They are also less likely to have children as teenagers."

When I see social scientists using statistical methods in the Big Data realm I tend to get a little suspicious, since classic statistical inference behaves differently with large samples than with small samples (which are more typical in the social sciences). Let's take a careful look at some of the charts from this paper to figure out the leap from the data to the conclusions.

How much does a "value added" teacher contribute to a person's salary at age 28?

[Figure 1: dramatic slope? The largest difference is less than $1,000]

The slope in the chart (Figure 1) might look quite dramatic. And I can tell you that, statistically speaking, the slope is not zero (it is a "statistically significant" effect). Now look closely at the y-axis amounts. Note that the data fluctuate only by a very small annual amount (less than $1,000 per year)! The authors get around this embarrassing magnitude by looking at the "lifetime value" of a student: "On average, having such a [high value-added] teacher for one year raises a child's cumulative lifetime income by $50,000 (equivalent to $9,000 in present value at age 12 with a 5% interest rate)."

Here's another dramatic-looking chart: what happens to the average student test score as a "high value-added teacher" enters the school?

The improvement appears to be huge! But wait, what are those digits on the y-axis? The test score goes up by 0.03 points!

Reading through the slides or paper, you'll find various mentions of small p-values, which indicate statistical significance ("p<0.001" and similar notations). This by no means says anything about the practical significance or the magnitude of the effects.

If this were a minor study published in a remote journal, I would say "hey, there are lots of those now." But when a paper is covered by the New York Times and published in the serious National Bureau of Economic Research Working Paper series (admittedly, not a peer-reviewed journal), I am worried. I am very worried.

Unless I am missing something critical, I would only agree with one line in the executive summary: "We find that when a high VA teacher joins a school, test scores rise immediately in the grade taught by that teacher; when a high VA teacher leaves, test scores fall." But with one million records, that's not a very interesting question. The interesting question, which should drive policy, is: by how much?

Big Data is also becoming the realm of social science research. It is critical that researchers are aware of the dangers of applying small-sample statistical models and inference in this new era. Here is one place to start.
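The general point about huge samples is easy to reproduce: with a million records, even a practically negligible effect is wildly "statistically significant". A quick sketch (the 0.03-point effect size is made up for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 1_000_000
# Two groups whose true means differ by a practically negligible amount
treated = rng.normal(loc=0.03, scale=1.0, size=n)   # e.g., 0.03 test-score points
control = rng.normal(loc=0.00, scale=1.0, size=n)

# Two-sample z test for the difference in means
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
z = diff / se

print(f"effect estimate: {diff:.4f}")   # tiny in practical terms
print(f"z statistic: {z:.1f}")          # huge, so p << 0.001
```

The standard error shrinks like 1/sqrt(n), so at n = 1,000,000 almost any nonzero effect clears the significance bar; the p-value says nothing about whether 0.03 points matters.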
Tuesday, April 17, 2012
Google Scholar -- you're not alone; Microsoft Academic Search coming up in searches
In searching for a few colleagues' webpages I noticed a new URL popping up in the search results. It included either the domain academic.microsoft.com or the IP address 65.54.113.26. I got curious and checked it out to discover Microsoft Academic Search (Beta) -- a neat presentation of an author's research publications and collaborations. In addition to the usual list of publications, there are nice visualizations of publications and citations over time, a network chart of co-authors and citations, and even an Erdos Number graph. The genealogy graph notes that it is based on data mining and so "might not be perfect".
All this is cool and helpful. But there is one issue that really bothers me: who owns my academic profile?
I checked my "own" Microsoft Academic Search page. Microsoft's software tried to guess my details (affiliation, homepage, papers, etc.) and was correct on some details but wrong on others. Correcting the details required me to open a Windows Live ID account. I had avoided opening such an account until now (I am not a fan of endless accounts) and would have continued to avoid it, had I not been forced to do so: Microsoft created an academic profile page for me, without my consent, with wrong details. Guessing that this page will soon come up in user searches, I was compelled to correct the inaccurate details.
The next step was even more disturbing: once I logged in with my verified Windows Live ID, I tried to correct my affiliation and homepage and to add a photo. However, I received a message that my affiliation (Indian School of Business) is not recognized (!) and that Microsoft will have to review all my edits before applying them.
So who "owns" my academic identity? Since obviously Microsoft is crawling university websites to create these pages, it would have been more appropriate to find the authors' academic email addresses and email them directly to notify them of the page (with an "opt out" option!) and allow them to make any corrections without Microsoft's moderation.
Tuesday, April 03, 2012
New Google Consumer Surveys: revolutionizing academic data collection?
Surveys are a key data collection tool in several academic research areas. As opposed to experiments or field studies that yield observational data, surveys can give access to attitudes, reaching "inside the head" of people rather than observing their behavior.
Technological advances in survey tool development now offer "poor academics" sufficiently powerful online survey tools, such as surveymonkey.com and Google forms. Yet, obtaining access to a large pool of potential respondents from a particular population remains a challenge. Another challenge is getting fast responses -- how do you reach people quickly and get many of them to respond quickly?
We may now have a solution that is affordable for academic research: a few days ago Google announced a new service called "Google Consumer Surveys". Similar to AdSense, where Google places ads on publishers' websites (and pays the publishers a commission), with Consumer Surveys Google places a single-question survey (= a poll) on publishers' websites. The publishers require website users to complete the poll to get access to premium content.
[Google Consumer Surveys: How it works (from their website)]
The good:
- Very affordable: the charge is $0.10 per response (only $100 for the magic number of 1,000 responses), or $0.50 per response for an audience targeted by demographics or some other trait (more here).
- Fast: Google will likely post the polls on pages with high traffic.
- Google presents the results with attractive charts
- Getting IRB permission may be easier, given the stringent policies that Google mandates
The bad:
- You can only post one question at a time. For a longer survey, breaking it up into single questions means that the same person is not answering all the questions. Also, each additional question multiplies the total cost.
- Google does not supply the poll creator with the raw data. You only get aggregated data. You can choose the aggregation (inferred age, gender, urban density, geography, or income). This is likely to be a huge "bad" for researchers who need access to the raw data for more advanced analyses than those provided by Google.
- Currently Google offers this service only for websites in the US. To collect information from users visiting non-US websites, we will all have to continue holding our breath.
A curious anecdote: I filled in the support contact form to ask a few extra questions. I received speedy and helpful answers (within 24 hours), but they all landed in my Google Spam folder!