
Wednesday, September 19, 2012

Self-publishing to the rescue

The new Coursera course by Princeton Professor Mung Chiang was so popular that Amazon and the publisher ran out of copies of the textbook before the course even started (see "new website features" announcement; requires login). I experienced a stockout of my own textbook ("Data Mining for Business Intelligence") a couple of years ago, which caused grief and slight panic to both students and instructors.

With stockouts in mind, and recognizing the difficulty of obtaining textbooks outside of North America (unavailable, too expensive, or slow and costly shipping), I decided to take things into my own hands and self-publish a "Practical Analytics" series of textbooks. The series currently has three books, all available in both soft-cover and Kindle editions. I used CreateSpace.com, an Amazon company, to publish the soft-cover editions; its print-on-demand model greatly reduces the stockout problem. I used Amazon KDP to publish the Kindle editions, so there are definitely no stockouts there. Amazon makes the books available on its global websites, so they are reachable in many places worldwide (the books are also available on India's Flipkart). Finally, since I get to set the prices, I made sure to keep them affordable (for example, in India the e-books are even cheaper than in the USA).

How has this endeavor fared? More than 1,000 copies have been sold since March 2011, and several instructors have adopted the books for their courses. Judging from reader emails and ratings on Amazon, it looks like I'm on the right track.

To celebrate the power and joy of self-publishing as well as accessible and affordable knowledge, I am running a "free e-book promotion" next week. The following e-books will be available for free:

Both promotions will commence a little after midnight, Pacific Standard Time, and will last for 24 hours. To download each of the e-books, just go to the Amazon website during the promotion period and search for the title. You will then be able to download the book for free.

Enjoy, and feel free to share!

Saturday, October 01, 2011

Language and psychological state: explain or predict?

Quite a few of my social science colleagues think that predictive modeling is not a kosher tool for theory building. In our 2011 MISQ paper "Predictive Analytics in Information Systems Research" we argue that predictive modeling has a critical role to play not only in theory testing but also in theory building. How does it work? Here's an interesting example:

The new book The Secret Life of Pronouns by cognitive psychologist James Pennebaker is a fascinating read in many ways. The book describes how analysis of written language can be predictive of psychological state. In particular, the author describes an interesting text mining approach that analyzes text written by a person and creates a psychological profile of the writer. In the author's context, the approach is used to study the effect of writing on recovery from psychological trauma. You can get a taste of word analysis at AnalyzeWords.com, a website run by the author and his colleagues that analyzes the personality of a tweeter.

In the book, Pennebaker describes how the automated analysis of language has shed light on the probability that people who underwent psychological trauma will recuperate. For instance, people who used a moderate amount of negative language were more likely to improve than those who used too little or too much negative language. Or, people who tended to change perspectives in their writing over time (from "I" to "they" or "we") were more likely to improve.
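As a toy illustration of the kind of word counting involved (my own sketch; Pennebaker's LIWC tool uses far richer category dictionaries), one can compute a crude pronoun profile of a text:

```python
import re
from collections import Counter

# Toy pronoun counter (a stand-in for tools like LIWC/AnalyzeWords,
# which rely on much richer, validated word categories)
FIRST_SINGULAR = {"i", "me", "my", "mine", "myself"}
FIRST_PLURAL = {"we", "us", "our", "ours", "ourselves"}
THIRD_PLURAL = {"they", "them", "their", "theirs", "themselves"}

def pronoun_profile(text):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return {
        "I-words %": 100 * sum(counts[w] for w in FIRST_SINGULAR) / total,
        "we-words %": 100 * sum(counts[w] for w in FIRST_PLURAL) / total,
        "they-words %": 100 * sum(counts[w] for w in THIRD_PLURAL) / total,
    }

print(pronoun_profile("I felt alone, but then we talked and they helped us."))
```

Tracking how such a profile shifts across a person's writing over time is, in spirit, how the perspective-change finding above is operationalized.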

Now comes a key question. In the words of the author (p. 14): "Do words reflect a psychological state or do they cause it?" The statistical/data-mining text mining application is obviously a predictive tool built on correlations/associations. Yet, by examining when it predicts accurately and studying the reasons for the accurate (or inaccurate) predictions, the predictive tool can shed light on possible explanations, linking results to existing psychological theories and suggesting new ones. Then comes the step that "closes the circle," where predictive modeling is combined with explanatory modeling: to test the explanatory power of words on psychological state, the way to go is experiments. And indeed, the book describes several such experiments investigating the causal effect of words on psychological state, which seem to indicate that there is no causal relationship.

[Thanks to my text-mining-expert colleague Nitin Indurkhya for introducing me to the book!]

Monday, September 19, 2011

Statistical considerations and psychological effects in clinical trials

I find it illuminating to read the statistics "bibles" of various fields: they not only open my eyes to different domains, but also present the statistical approach and methods somewhat differently, addressing unique domain-specific issues that produce "hmmm" moments.

The 4th edition of Fundamentals of Clinical Trials, whose authors combine extensive practical experience at the NIH and in academia, is full of such moments. In one, the authors mention an important sampling issue that I have not encountered in other fields. In clinical trials, the gold standard is to allocate participants at random, with equal probabilities, to either an intervention group or a non-intervention (baseline) group. In other words, half the participants receive the intervention and the other half do not (the non-intervention can be a placebo, the traditional treatment, etc.). The authors advocate a 50:50 ratio because "equal allocation is the most powerful design." While there are reasons to shift the ratio in favor of either the intervention or the baseline group, equal allocation appears to have an important additional psychological advantage over unequal allocation in clinical trials:
Unequal allocation may indicate to the participants and to their personal physicians that one intervention is preferred over the other (pp. 98-99)
Knowledge of the sample design by the participants and/or the physicians also affects how randomization is carried out. It becomes a game between the designers on one side and the participants and staff on the other, with opposing interests: to blur vs. to uncover the group assignments before they are made. This gaming requires devising special randomization methods (which, in turn, require data analysis that takes the randomization mechanism into account).

For example, to ensure an equal number of participants in each of the two groups when participants enter sequentially, "block randomization" can be used. For instance, to assign four people to one of two groups A or B, consider all the possible balanced arrangements (AABB, ABAB, ABBA, etc.), choose one sequence at random, and assign participants accordingly (see the sketch below). The catch is that if the staff know that the block size is 4 and know the first three allocations, they automatically know the fourth allocation and can introduce bias by using this knowledge to select the fourth participant.
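Here is a minimal Python sketch of permuted-block randomization (my own illustration; the function name and defaults are hypothetical, not from the book):

```python
import random

def block_randomize(n_participants, block_size=4, groups=("A", "B")):
    """Assign participants via permuted-block randomization.

    Each block contains each group an equal number of times, in a randomly
    shuffled order, so group sizes stay balanced as participants arrive
    sequentially.
    """
    assignments = []
    while len(assignments) < n_participants:
        block = list(groups) * (block_size // len(groups))
        random.shuffle(block)  # pick one of the balanced arrangements at random
        assignments.extend(block)
    return assignments[:n_participants]

print(block_randomize(8))  # e.g., ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A']
```

Note that within each block the final assignment is fully determined by the earlier ones, which is exactly the predictability that staff could exploit; one standard countermeasure is to vary the block size itself at random.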

Where else does such a psychological effect play a role in determining sampling ratios? In applications where participants and other stakeholders have no knowledge of the sampling scheme, this is obviously a non-issue. For example, when Amazon or Yahoo! present different information to different users, the users have no idea about the sample design, and perhaps do not even know that they are in an experiment. But how is the randomization achieved? Unless the randomization process is fully automated and not susceptible to reverse engineering, someone in the technical department might decide to favor friends by allocating them to the "better" group...

Saturday, April 09, 2011

Visualizing time series: suppressing one pattern to enhance another pattern

Visualizing a time series is an essential step in exploring its behavior. Statisticians think of a time series as a combination of four components: trend, seasonality, level, and noise. All real-world series contain a level and noise, but not necessarily a trend and/or seasonality. It is important to determine whether trend and/or seasonality exist in a series in order to choose appropriate models and methods for descriptive or forecasting purposes. Hence, when looking at a time plot, typical questions include:
  • Is there a trend? If so, what type of function approximates it (linear, exponential, etc.)? Is the trend fixed throughout the period, or does it change over time?
  • Is there seasonal behavior? If so, is the seasonality additive or multiplicative? Does the seasonal behavior change over time?
Exploring such questions using time plots (line plots of the series over time) is enhanced by suppressing one type of pattern in order to better visualize other patterns. For example, suppressing seasonality can make a trend more visible; similarly, suppressing a trend can help reveal seasonal behavior. How do we suppress seasonality? Suppose that we have monthly data with apparent annual seasonality. To suppress the seasonality (also called seasonal adjustment), we can:
  1. Plot annual data (either annual averages or sums)
  2. Plot a moving average (an average over a window of 12 months centered around each particular month)
  3. Plot 12 separate series, one for each month (e.g., one series for January, another for February and so on)
  4. Fit a model that captures monthly seasonality (e.g., a regression model with 11 monthly dummies) and look at the residual series
An example is shown in the Figure. The top left panel shows the original series (monthly ridership on Amtrak trains). The bottom left panel shows a moving average line, suppressing the seasonality and revealing the trend. The top right panel shows a model that captures the seasonality. The bottom right panel shows the residuals from that model, again enhancing the trend (a code sketch of methods 2 and 4 follows below).
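A minimal Python sketch of methods 2 and 4, using a synthetic monthly series as a stand-in for the Amtrak data (the numbers are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic monthly series with a trend and annual seasonality
idx = pd.date_range("2000-01-01", periods=120, freq="MS")
rng = np.random.default_rng(0)
signal = 100 + 0.3 * np.arange(120) + 10 * np.sin(2 * np.pi * idx.month / 12)
df = pd.DataFrame({"ridership": signal + rng.normal(0, 2, 120),
                   "month": idx.month}, index=idx)

# Method 2: a centered 12-month moving average suppresses the seasonality
df["moving_avg"] = df["ridership"].rolling(window=12, center=True).mean()

# Method 4: regress on monthly dummies; the residual series is the
# seasonally adjusted series, making the trend easier to see
model = smf.ols("ridership ~ C(month)", data=df).fit()
df["residuals"] = model.resid

df[["ridership", "moving_avg", "residuals"]].plot(subplots=True)
```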

For further details and examples, see my recently published book Practical Time Series Forecasting: A Hands-On Guide (available in soft-cover and as an eBook).

Thursday, December 23, 2010

No correlation -> no causation?

I found an interesting variation on the "correlation does not imply causation" mantra in the book Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences by Cohen et al. (apparently one of the statistics bibles in the behavioral sciences). The quote (p. 7) reads:
Correlation does not prove causation; however, the absence of correlation implies the absence of the existence of a causal relationship
Let's let the first part rest in peace. At first glance, the second part seems logical: if you find no correlation, how can there be causation? After further pondering, however, I reached the conclusion that this logic is flawed: one might observe no correlation when in fact there exists underlying causation. The reason is that causality is typically discussed at the conceptual level, while correlation is computed at the measurable data level.

Where is Waldo?
Consider an example where causality is hypothesized at an unmeasurable, conceptual level, such as "higher creativity leads to more satisfaction in life." Computing the correlation between "creativity" and "satisfaction" requires operationalizing these concepts into measurable variables, that is, identifying measurable variables that adequately represent the underlying concepts. For example, answers to survey questions about satisfaction in life might operationalize "satisfaction," while a Rorschach test might be used to measure "creativity." This process of operationalization obviously does not yield perfect measures; moreover, data quality can be low enough to produce no correlation even when an underlying causal relationship exists.
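A quick simulation (my own illustration, not from the book) makes the point concrete: even a perfect causal link at the concept level can produce near-zero correlation once the operationalized measures are noisy enough.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
creativity = rng.normal(0, 1, n)   # latent concept
satisfaction = creativity          # a perfect causal link, by construction

# Operationalize both concepts with very noisy measurement instruments
measured_creativity = creativity + rng.normal(0, 5, n)
measured_satisfaction = satisfaction + rng.normal(0, 5, n)

print(np.corrcoef(creativity, satisfaction)[0, 1])                    # exactly 1
print(np.corrcoef(measured_creativity, measured_satisfaction)[0, 1])  # near 0
```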

In short, the absence of correlation can also imply that the underlying concepts are hard to measure, are inadequately measured, or that the quality of the measured data is too low (i.e., too noisy) for discovering a causal underlying relationship.

Monday, December 13, 2010

Discovering moderated relationships in the era of large samples

I am currently visiting the Indian School of Business (ISB) and enjoying their excellent library. As in my student days, I roam the bookshelves and discover books on topics that I know little, some, or a lot about. Reading and leafing through a variety of books, especially across different disciplines, provides serious food for thought.

As a statistician, I have the urge to see how statistics is taught and used in other disciplines. I discovered an interesting book from the psychology literature by Herman Aguinis called Regression Analysis for Categorical Moderators. "Moderators" in statistician language are "interactions." However, when social scientists talk about moderated relationships or moderator variables, there is no symmetry between the two variables that create the interaction. For example, if X1 = education level, X2 = gender, and Y = satisfaction at work, then including the moderator X1*X2 would follow a directional hypothesis such as "education level affects satisfaction at work differently for women and for men" (see the sketch below).
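In regression terms, the moderator enters as an interaction term. Here is a minimal sketch with simulated data (variable names follow the example above; the data are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: does education affect job satisfaction differently
# for women and men?
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "education": rng.integers(10, 22, n),   # years of schooling
    "gender": rng.choice(["F", "M"], n),
})
# Simulate a moderated relationship: the education slope differs by gender
slope = np.where(df["gender"] == "F", 0.5, 0.2)
df["satisfaction"] = 1 + slope * df["education"] + rng.normal(0, 2, n)

# The moderator enters as an interaction: education * C(gender)
model = smf.ols("satisfaction ~ education * C(gender)", data=df).fit()
print(model.summary().tables[1])  # the education:C(gender)[T.M] row tests moderation
```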

Now to the interesting point: Aguinis stresses the scientific importance of discovering moderated relationships and opens the book with the quote:
"If we want to know how well we are doing in the biological, psychological, and social sciences, an index that will serve us well is how far we have advanced in our understanding of the moderator variables of our field."        --Hall & Rosenthal, 1991
Discovering moderators is important for understanding the bounds of generalizability and for producing adequate policy recommendations. Yet it turns out that "moderator variables are difficult to detect even when the moderator test is the focal issue in a research study and a researcher has designed the study specifically with the moderator test in mind."

One main factor limiting the ability to detect moderated relationships (which tend to have small effects) is statistical power. Aguinis describes simulation studies showing this:
a small effect size was typically undetected when sample size was as large as 120, and ...unless a sample size of at least 120 was used, even ... medium and large moderating effects were, in general, also undetected.
This is bad news. But here is the good news: today, even researchers in the social sciences have access to much larger datasets, so n = 120 is clearly a thing of the past. Since the book came out in 2004, have there been large-sample studies of moderated relationships in the social sciences?
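To get a concrete feel for the power problem, here is a quick simulation sketch (my own back-of-the-envelope check, not Aguinis's study design), estimating the probability of detecting a small interaction effect at n = 120 versus a modern large sample:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def moderator_power(n, effect=0.2, n_sims=500, alpha=0.05, seed=0):
    """Estimate power to detect a categorical-moderator (interaction) effect."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(0, 1, n)      # continuous predictor
        g = rng.integers(0, 2, n)    # binary moderator (e.g., gender)
        y = x + effect * x * g + rng.normal(0, 1, n)
        data = pd.DataFrame({"y": y, "x": x, "g": g})
        fit = smf.ols("y ~ x * C(g)", data=data).fit()
        rejections += fit.pvalues["x:C(g)[T.1]"] < alpha
    return rejections / n_sims

print(moderator_power(n=120))    # small effect, n = 120: power well below 0.8
print(moderator_power(n=2000))   # same effect, large sample: much higher power
```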

I guess that's where searching electronic journals is the way to go...

Saturday, May 08, 2010

Short data mining videos

I just discovered a set of short videos (currently 35) on different data mining methods on the StatSoft website. They accompany StatSoft's neat free online book (I admit, I did end up buying the print copy) and show up at the top of various data mining topics in the online book. You can also subscribe to the video series.

Wednesday, September 05, 2007

Shaking up the statistics community

A new book is drawing emotional reactions from the normally calm statistics community (no pun intended): The Black Swan: The Impact of the Highly Improbable by Nassim Taleb uses blunt language to critique the field of statistics, statisticians, and users of statistics. I have not yet read the book, but given the many reviews and the coverage it has received, I am running to get a copy.

The widely read ASA journal The American Statistician devoted a special section to reviewing the book, and even obtained a (somewhat bland) response from the author. Four reputable statisticians (Robert Lund, Peter Westfall, Joseph Hilbe, and Aaron Brown) reviewed the book, some confronting its arguments and criticizing the author for making unscientific claims; a few reviews even include formulas and derivations. All four agree that this is an important read for statisticians and that it raises interesting points worth pondering.

The author's experiences come from the world of finance, where he worked for investment banks and a hedge fund, and finally made a fortune at his own hedge fund. His main claim (as I understand it from the reviews and coverage) is that analytics should focus more on the tails, the unusual, and not as much on the "average." That is true in many applications (e.g., in my own research in biosurveillance for early detection of disease outbreaks, and in anomaly detection as a whole). Before I make any other claims, though, I must rush to read the book!