Thursday, December 03, 2020

Machine learning algorithms surprises at deployment? (article on Medium)

Machine learning (ML) algorithms are being used to generate predictions in every corner of our decision-making life. Methods range from “simple” algorithms such as trees, forests, naive Bayes, linear and logistic regression models, and nearest-neighbor methods, through improvements such as boosting, bagging, regularization, and ensembling, to computationally-intensive, blackbox deep learning algorithms.

The new fashion of “apply deep learning to everything” has resulted in breakthroughs as well as in alarming disasters. Is this due to the volatility of deep learning algorithms? I argue this is because of the growing divorce between predictive algorithm developers, their deployment context, and their end users’ actions.

The full article is posted on Medium

Monday, December 10, 2018

Forecasting large collections of time series

With the recent launch of Amazon Forecast, I can no longer procrastinate writing about forecasting "at scale"!

Quantitative forecasting of time series has been used (and taught) for decades, with applications in many areas of business such as demand forecasting, sales forecasting, and financial forecasting. The types of methods taught in forecasting courses tends to be discipline-specific:

  • Statisticians love ARIMA (auto regressive integrated moving average) models, with multivariate versions such as Vector ARIMA, as well as state space models and non-parametric methods such as STL decompositions.
  • Econometricians and finance academics go one step further into ARIMA variations such as ARFIMA (f=fractional), ARCH (autoregressive conditional heteroskedasticity), GARCH (g=general), NAGARCH (n=nonlinear, a=asymmetric), and plenty more
  • Electrical engineers use spectral analysis (the equivalent of ARIMA in the frequency domain)
  • Machine learning researchers use neural nets and other algorithms
In practice, it is common to see 3 types of methods being used by companies for forecasting future values of a time series : exponential smoothing, linear regression, and sometimes ARIMA. 

Image from
Why the difference? Because the goal is different! Statistical models such as ARIMA and all its econ flavors are often used for parameter estimation or statistical inference. Those are descriptive goals  (e.g., "is this series a random walk?", "what is the volatility of the errors?"). The spectral approach by electrical engineers is often used for the descriptive goals of characterizing a series' frequencies (signal processing), or for anomaly detection. In contrast, the business applications are strictly predictive: they want forecasts of future values. The simplest methods in terms of ease-of-use, computation, software availability, and understanding, are linear regression models and exponential smoothing. And those methods provide sufficiently accurate forecasts in many applications - hence their popularity! 

ML algorithms are in line with a predictive goal, aimed solely at forecasting. ARIMA and state space models can also be used for forecasting (albeit using a different modeling process than for a descriptive goal). The reason ARIMA is commonly used in practice, in my opinion, is due to the availability of automated functions.

For cases with a small number of time series to forecast (a typical case in many businesses), it is usually worthwhile investing time in properly modeling and evaluating each series individually in order to arrive at the simplest solution that provides the required level of accuracy. Data scientists are sometimes over-eager to improve accuracy beyond what is practically needed, optimizing measures such as RMSE, while the actual impact is measured in a completely different way that depends on how those forecasts are used for decision making. For example, forecasting demand has completely different implications for over- vs. under-forecasting; Users might be more averse to certain directions or magnitudes of error.

But what to do when you must forecast a large collection of time series? Perhaps on a frequent basis? This is "big data" in the world of time series. Amazon predict shipping time for each shipment using different shipping methods to determine the best shipping method (optimized with other shipments taking place at the same/nearby time); Uber forecasts ETA for each trip; Google Trends generates forecasts for any keyword a user types in near-realtime. And... IoT applications call for forecasts for time series from each of their huge number of devices. These applications obviously cannot invest time and effort into building handmade solutions. In such cases, automated forecasting is a practical solution. A good "big data" forecasting solution should
  • be flexible to capture a wide range of time series patterns 
  • be computationally efficient and scalable
  • be adaptable to changes in patterns that occur over time
  • provide sufficient forecasting accuracy
In my course "Business Anlaytics Using Forecasting" at NTHU this year, teams have experienced trying to forecast hundreds of series from a company we're collaborating with. They used various approaches and tools. The excellent forecast package in R by Rob Hyndman's team includes automated functions for ARIMA (auto.arima), exponential smoothing (ets), and a single-layer neural net (nnetar). Facebook's prophet algorithm (and R package) runs a linear regression. Some of these methods are computationally heavier (e.g. ARIMA) so implementation matters. 

While everyone gets excited about complex methods, in time series so far evidence is that "simple is king": naive forecasts are often hard to beat! In the recent M4 forecasting contest (with 100,000 series), what seemed to work well were combinations (ensembles) of standard forecasting methods such as exponential smoothing and ARIMA combined using a machine learning method for the ensemble weights. Machine learning algorithms were far inferior. The secret sauce is ensembles.

Because simple methods often work well, it is well worth identifying which series really do require more than a naive forecast. How about segmenting the time series into groups? Methods that first fit models to each series and then cluster the estimates are one way to go (although can be too time consuming for some applications). The ABC-XYZ approach takes a different approach: it divides a large set of time series into 4 types, based on the difficulty of forecasting (easy/hard) and magnitude of values (high/low) that can be indicative of their importance.

Forecasting is experiencing a new "split personality" phase, of small-scale tailored forecasting applications that integrate domain knowledge vs. large-scale applications that rely on automated "mass-production" forecasting. My prediction is that these two types of problems will continue to survive and thrive, requiring different types of modeling and different skills by the modelers.

For more on forecasting methods, the process of forecasting, and evaluating forecasting solutions see Practical Time Series Forecasting: A Hands-On Guide and the accompanying YouTube videos

Sunday, February 04, 2018

Data Ethics Regulation: Two key updates in 2018

This year, two important new regulations will be impacting research with human subjects: the EU's General Data Protection Regulation (GDPR), which kicks in May 2018, and the USA's updated Common Rule, called the Final Rule, is in effect from Jan 2018. Both changes relate to protecting individuals' private information and will affect researchers using behavioral data in terms of data collection, access, use, applications for ethics committee (IRB) approvals/exemptions, collaborations within the same country/region and beyond, and collaborations with industry.
Both GDPR and the final rule try to modernize what today constitutes "private data" and data subjects' rights and balance it against "free flow of information between EU countries" (GDPR) or . However, the GDPR's approach is much more strongly in favor of protecting private data
Here are a few points to note about GDPR:

  1. "Personal data" (GDPR) or "private information" (final rule) is very broadly defined and includes data on physical, physiological or behavioral characteristics of a person "which allow or confirm the unique identification of that natural person".
  2. The GDPR affects any organization within the EU as well as "external organizations that are trading within the EU". It applies to personal data on any person, not just EU citizens/residents.
  3. The GDPR distinguishes between "data controller" (the entity who has the data, in the eyes of the data subjects, e.g. a hospital) and "data processor" (the entity who operates on the data). Both entities are bound and liable by GDPR.
  4. GDPR distinguishes between "data processing" (any operation related to the data including storage, structuring, record deletion, transfer) and "profiling" (automated processing of personal data to "evaluate personal aspects relating to a natural person". 
  5. The Final Rule now offers an option of relying on broad consent obtained for future research as an alternative to seeking IRB approval to waive the consent requirement.
  6. Domestic collaborations within the US now require a single institutional review board (IRB) approval (for the portion of the research that takes place within the US) - effective 2021.
The Final Rule tries to lower burden for low-risk research. One attempt is new "exemption" categories for secondary research use of identifiable private information (i.e. re-using
identifiable information collected for some other ‘‘primary’’ or ‘‘initial’’ activity) when: 
  • The identifiable private information is publicly available;
  • The information is recorded by the investigator in such a way that the identity of subjects cannot readily be ascertained, and the investigator does not contact subjects or try to re-identify subjects; 
  • The secondary research activity is regulated under HIPAA; or
  • The secondary research activity is conducted by or on behalf of a federal entity and involves the use of federally generated non-research information provided that the original collection was subject to specific federal privacy protections and continues to be protected.
This approach to secondary data, and specifically to observational data from public sources, seems in contrast to the GDPR approach that states that the new regulations also apply when processing historical data for "historical research purposes". Metcalf (2018) criticized the above Final Rule exemption because "these criteria for exclusion focus on the status of the dataset (e.g., is it public? does it already exist?), not the content of the dataset nor what will be done with the dataset, which are more accurate criteria for determining the risk profile of the proposed research".

Monday, December 25, 2017

Election polls: description vs. prediction

My papers To Explain or To Predict and Predictive Analytics in Information Systems Research contrast the process and uses of predictive modeling and causal-explanatory modeling. I briefly mentioned there a third type of modeling: descriptive. However, I haven't expanded on how descriptive modeling differs from the other two types (causal explanation and prediction). While descriptive and predictive modeling both share the reliance on correlations, whereas explanatory modeling relies on causality, the former two are in fact different. Descriptive modeling aims to give a parsimonious statistical representation of a distribution or relationship, whereas predictive modeling aims at generating values for new/future observations.

The recent paper Election Polls—A Survey, A Critique, and Proposals by Kenett, Pfeffermann & Steinberg gives a fantastic illustration of the difference between description and prediction: they explain the different goals of election surveys (such as those conducted by Gallup) as compared to survey-based predictive models such as those by Nate Silver's FiveThirtyEight:
"There is a subtle, but important, difference between reflecting current public sentiment and predicting the results of an election. Surveys [election polls] have focused largely on the former—in other words, on providing a current snapshot of voting preferences, even when asking about voting preference as if elections were carried out on the day of the survey. In that regard, high information quality (InfoQ) surveys are accurately describing current opinions of the electorate. However, the public perception is often focused on projecting the survey results forward in time to election day, which is eventually used to evaluate the performance of election surveys. Moreover, the public often focuses solely on whether the polls got the winner right and not on whether the predicted vote shares were close to the true results."
In other words, whereas the goal of election surveys is to capture the public perception at different points in time prior to the election, they are often judged by the public as failed because of low predictive power on elections day. The authors continue to say:
"Providing an accurate current picture and predicting the ultimate winner are not contradictory goals. As the election approaches, survey results are expected to increasingly point toward the eventual election outcome, and it is natural that the success or failure of the survey methodology and execution is judged by comparing the final polls and trends with the actual election results."
Descriptive models differ from predictive models in another sense that can lead to vastly different results: in a descriptive model for an event of interest we can use the past and the future relative to that event time. For example, to describe spikes in pre-Xmas shopping volume we can use data on pre- and post-Xmas days. In contrast, to predict pre-Xmas shopping volume we can only use information available prior to the pre-Xmas shopping period of interest.

As Kenett et al. (2017) write, description and prediction are not contradictory. They are different, yet the results of descriptive models can provide leads for strong predictors, and potentially for explanatory variables (which require further investigation using explanatory modeling).

The new interesting book Everybody Lies by Stephens-Davidowitz is a great example of descriptive modeling that uncovers correlations that might be used for prediction (or even for future explanatory work). The author uncovers behavioral search patterns on Google by examining keywords search volumes using Google Trends and AdWords. For the recent US elections, the author identifies a specific keyword search term that separates between areas of high performance for Clinton vs. Trump:
"Silver noticed that the areas where Trump performed best made for an odd map. Trump performed well in parts of the Northeast and industrial Midwest, as well as the South. He performed notably worse out West. Silver looked for variables to try to explain this map... Silver found that the single factor that best correlated with Donald Trump's support in the Republican primaries was... [areas] that made the most Google searches for ______."
[I am intentionally leaving the actual keyword blank because it is offensive.]
While finding correlations is a dangerous game that can lead to many false discoveries (two measurements can be correlated because they are both affected by something else, such as weather), careful descriptive modeling tightly coupled with domain expertise can be useful for exploratory research, which later should be tested using explanatory modeling.

While I love Stephens-Davidowitz' idea of using Google Trends to uncover behaviors/thoughts/feelings that are hidden otherwise (because what people say in surveys often diverges from what they really do/think/feel), a main question is who do these search results represent? (sampling bias). But that's a different topic altogether.

Monday, November 06, 2017

Statistical test for "no difference"

To most researchers and practitioners using statistical inference, the popular hypothesis testing universe consists of two hypotheses:
H0 is the null hypothesis of "zero effect"
H1 is the alternative hypothesis of "a non-zero effect"

The alternative hypothesis (H1) is typically what the researcher is trying to find: a different outcome for a treatment and control group in an experiment, a regression coefficient that is non-zero, etc. Recently, several independent colleagues have asked me if there's a statistical way to show that an effect is zero, or, that there's no difference between groups. Can we simply use the above setup? The answer is no. Can we simply reverse the hypotheses? Uh-uh, because the "equal" must be in H0.

Minitab has equivalence testing (from
Here's why: In the classic setup the hypotheses are stated about the population of interest, and we take a sample from that population to test them. In this non-symmetrical setup, H0 is assumed to be true unless otherwise proven by the result in the sample. Hypothesis testing has its roots in Karl Popper's falsifiability principle, where any claim about the existence of an effect cannot be made unless it is first shown that a situation of no effect is untenable. This is similar to a democratic justice system, where the defendant is assumed not-guilty unless proven so. The burden of proof lies on the researcher/data. That's why we reject H0 with sufficient evidence or not reject H0, if we don't have sufficient evidence against it. This setup is not designed to arrive at a conclusion that H0 is true.

In a 2013 Letter to the Editor of Journal of Sports Sciences, titled Testing the null hypothesis: the forgotten legacy of Karl Popper? Mick Wilkinson suggests that this setup is the opposite of what a researcher should be doing according to the scientific method, and in fact "Our work should remain driven by conjecture and attempted falsification such that it is always the null hypothesis that is tested. The write up of our studies should make it clear that we are indeed testing the null hypothesis and conforming to the established and accepted philosophical conventions of the scientific method." He therefore suggests the following sequence:

  1. null hypothesis tests are carried out to first establish that a population effect is in fact unlikely to be zero
  2. a confidence-interval based approach estimates what the magnitude of effect might plausibly be
  3. a probability associated with the likelihood of the population effect exceeding an apriori smallest meaningful effect is calculated

While this provides a relevant criticism of the hypothesis testing paradigm, it does not directly provide a test of equivalence! The good news is that equivalence testing is a popular such scenario in pharmacokinetics, arising, for example, when a pharmaceutical wants to show that their developed generic drug is equivalent to a brand drug. This is termed bioequivalence. In other words, H1 is "the drugs are equivalent". The approach used there is the following:

  1. set up an equivalence bound that determines the smallest clinically-meaningful effect size of interest
  2. calculate a confidence interval around the observed effect size (say, difference between the mean outcomes of the generic and brand drugs)
  3. if the confidence interval includes the equivalence bound, then the groups are equivalent; otherwise they are not equivalent
The Wikipedia article on Equivalence Test points out two additional interesting uses of equivalence testing:
  • Avoiding misinterpretation of large p-values in ordinary testing as evidence for H0: "Equivalence tests can be performed in addition to null-hypothesis significance tests. This might prevent common misinterpretations of p-values larger than the alpha level as support for the absence of a true effect."
  • The confidence interval used in equivalence testing can help distinguish between statistical significance and practical/clinical significance: if it includes/excludes the value 0, that indicates statistical insignificance/significance between the groups (or an effect), while if it includes/excludes the equivalence bound that indicates practical insignificance/significance. The four options are shown in the figure.
Statistical vs. practical significance (from
How will sample size affect equivalence testing? We know that in ordinary hypothesis testing a sufficiently large sample will lead to detecting practically insignificant effects by generating a very small p-value, which is bad news for those relying on classic hypothesis testing! My colleague Foster Provost from NYU once challenged me how I could trust a statistical method that breaks down with large samples - a poignant thought that eventually led to my co-authored paper Too Big To Fail: Large Samples and the p-value Problem  (Lin et al., ISR 2013). What about equivalence tests? In equivalence testing, a very large sample will behave properly: with more data we'll get narrower confidence intervals (more certainty). A practically insignificant difference will therefore generate a narrow confidence interval that excludes the equivalence boundary (=equivalence). In contrast, a practically significant difference will generate a narrow confidence interval that completely exceeds the equivalence boundary (non-equivalence).

Tuesday, September 05, 2017

My videos for “Business Analytics using Data Mining” now publicly available!

Five years ago, in 2012, I decided to experiment in improving my teaching by creating a flipped classroom (and semi-MOOC) for my course “Business Analytics Using Data Mining” (BADM) at the Indian School of Business. I initially designed the course at University of Maryland’s Smith School of Business in 2005 and taught it until 2010. When I joined ISB in 2011 I started teaching multiple sections of BADM (which was started by Ravi Bapna in 2006), and the course was fast growing in popularity. Repeating the same lectures in multiple course sections made me realize it was time for scale! I therefore created 30+ videos, covering various supervised methods (k-NN, linear and logistic regression, trees, naive Bayes, etc.) and unsupervised methods (principal components analysis, clustering, association rules), as well as important principles such as performance evaluation, the notion of a holdout set, and more.

I created the videos to support teaching with our textbook “Data Mining for Business Analytics” (the 3rd edition and a SAS JMP edition came out last year; R edition coming out this month!). The videos highlight the key points in different chapters, (hopefully) motivating the watcher to read more in the textbook, which also offers more examples. The videos’ order follows my course teaching, but the topics are mostly independent.

The videos were a big hit in the ISB courses. Since moving to Taiwan, I've created and offered a similar flipped BADM course at National Tsing Hua University, and the videos are also part of the Predictive Analytics series. I’ve since added a few more topics (e.g., neural nets and discriminant analysis).

The audience for the videos (and my courses and textbooks) is non-technical folks who need to understand the logic and uses of data mining, at the managerial level. The videos are therefore about problem solving, and hence the "Business Analytics" in the title. They are different from the many excellent machine learning videos and MOOCs in focus and in technical level -- a basic statistics course that covers linear regression and some business experience should be sufficient for understanding the videos.
For 5 years, and until last week, the videos were only available to past and current students. However, the word spread and many colleagues, instructors, and students have asked me for access. After 5 years, and in celebration of the first R edition of our textbook Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, I decided to make it happen. All 30+ videos are now publicly available on my BADM YouTube playlist.

Currently the videos cater only to those who understand English. I opened the option for community-contributed captions, in the hope that folks will contribute captions in different languages to help make the knowledge propagate further.

This new playlist complements a similar set of videos, on "Business Analytics Using Forecasting" (for time series), that I created at NTHU and made public last year, as part of a MOOC offered on FutureLearn with the next round opening in October.

Finally, I’ll share that I shot these videos while I was living in Bhutan. They are all homemade -- I tried to filter out barking noises and to time the recording when ceremonies were not held close to our home. If you’re interested in how I made the materials and what lessons I learned for flipping my first course, check out my 2012 post.

Tuesday, March 14, 2017

Data mining algorithms: how many dummies?

There's lots of posts on "k-NN for Dummies". This one is about "Dummies for k-NN"

Categorical predictor variables are very common. Those who've taken a Statistics course covering linear (or logistic) regression, know the procedure to include a categorical predictor into a regression model requires the following steps:

  1. Convert the categorical variable that has m categories, into m binary dummy variables
  2. Include only m-1 of the dummy variables as predictors in the regression model (the dropped out category is called the reference category)
For example, if we have X={red, yellow, green}, in step 1 we create three dummies:
D_red = 1 if the value is 'red' and 0 otherwise
D_yellow = 1 if the value is 'yellow' and 0 otherwise
D_green = 1 if the value is 'green' and 0 otherwise

In the regression model we might have: Y = b0 + b1 D_red + b2 D_yellow + error
[Note: mathematically, it does not matter which dummy you drop out: the regression coefficients b1, b2
now compare against the left-out category].

When you move to data mining algorithms such as k-NN or trees, the procedure is different: we include all m dummies as predictors when m>2, but in the case m=2, we use a single dummy. Dropping a dummy (when m>2) will distort the distance measure, leading to incorrect distances.
Here's an example, based on X = {red, yellow, green}:

Case 1: m=3 (use 3 dummies)

Here are 3 records, their category (color), and their dummy values on (D_red, D_yellow, D_green):

The distance between each pair of records (in terms of color) should be identical, since all three records are different from each other. Suppose we use Euclidean distance. The distance between each pair of records will be equal to 2. For example:

Distance(#1, #2) = (1-0)^2 + (0-1)^2 + (0-0)^2 = 2.

If we drop one dummy, then the three distances will no longer be identical! For example, if we drop D_green:
Distance(#1, #2) = 1 + 1 = 2
Distance(#1, #3) = 1
Distance(#2, #3) = 1

Case 2: m=2 (use single dummy)

The above problem doesn't happen with m=2. Suppose we have only {red, green}, and use a single dummy. The distance between a pair of records will be 0 if the records are the same color, or 1 if they are different.
Why not use 2 dummies? If we use two dummies, we are doubling the weight of this variable but not adding any information. For example, comparing the red and green records using D_red and D_green would give Distance(#1, #3) = 1 + 1 = 2.

So we end up with distances of 0 or 2 instead of weights of 0 or 1.

Bottom line 

In data mining methods other than regression models (e.g., k-NN, trees, k-means clustering), we use m dummies for a categorical variable with m categories - this is called one-hot encoding. But if m=2 we use a single dummy.