Monday, December 10, 2018

Forecasting large collections of time series

With the recent launch of Amazon Forecast, I can no longer procrastinate writing about forecasting "at scale"!

Quantitative forecasting of time series has been used (and taught) for decades, with applications in many areas of business such as demand forecasting, sales forecasting, and financial forecasting. The types of methods taught in forecasting courses tend to be discipline-specific:

  • Statisticians love ARIMA (autoregressive integrated moving average) models, with multivariate versions such as Vector ARIMA, as well as state space models and non-parametric methods such as STL decomposition.
  • Econometricians and finance academics go one step further into ARIMA variations such as ARFIMA (f=fractional), ARCH (autoregressive conditional heteroskedasticity), GARCH (g=generalized), NAGARCH (n=nonlinear, a=asymmetric), and plenty more.
  • Electrical engineers use spectral analysis (the equivalent of ARIMA in the frequency domain)
  • Machine learning researchers use neural nets and other algorithms
In practice, it is common to see three types of methods used by companies for forecasting future values of a time series: exponential smoothing, linear regression, and sometimes ARIMA.
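To make this concrete, here is a minimal sketch (not any company's actual implementation) applying all three methods to R's built-in AirPassengers series with the forecast package:

    library(forecast)

    y <- AirPassengers                                       # monthly airline passengers, 1949-1960
    fc_ets   <- forecast(ets(y), h = 12)                     # exponential smoothing (ETS state space)
    fc_lm    <- forecast(tslm(y ~ trend + season), h = 12)   # linear regression with trend and seasonality
    fc_arima <- forecast(auto.arima(y), h = 12)              # automated ARIMA
    accuracy(fc_ets)                                         # training-set accuracy measures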

[Image from https://itnext.io]
Why the difference? Because the goal is different! Statistical models such as ARIMA and all its econometric flavors are often used for parameter estimation or statistical inference. Those are descriptive goals (e.g., "is this series a random walk?", "what is the volatility of the errors?"). The spectral approach of electrical engineers is often used for the descriptive goal of characterizing a series' frequencies (signal processing), or for anomaly detection. In contrast, the business applications are strictly predictive: they want forecasts of future values. The simplest methods in terms of ease of use, computation, software availability, and understanding are linear regression models and exponential smoothing. And those methods provide sufficiently accurate forecasts in many applications - hence their popularity!

ML algorithms are in line with a predictive goal, aimed solely at forecasting. ARIMA and state space models can also be used for forecasting (albeit with a different modeling process than for a descriptive goal). The reason ARIMA is commonly used in practice is, in my opinion, the availability of automated functions.

For cases with a small number of time series to forecast (a typical case in many businesses), it is usually worthwhile investing time in properly modeling and evaluating each series individually in order to arrive at the simplest solution that provides the required level of accuracy. Data scientists are sometimes over-eager to improve accuracy beyond what is practically needed, optimizing measures such as RMSE, while the actual impact is measured in a completely different way that depends on how the forecasts are used for decision making. For example, demand forecasting has completely different implications for over- vs. under-forecasting; users might be more averse to certain directions or magnitudes of error.
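As a hypothetical illustration, suppose under-forecasting demand (lost sales) costs three times as much per unit as over-forecasting (excess inventory); the 3:1 ratio is made up for the sake of the example. Two forecasts with identical RMSE can then have very different business costs:

    # Hypothetical asymmetric cost: under-forecasting assumed 3x as costly per unit
    asymmetric_cost <- function(actual, forecast, under_cost = 3, over_cost = 1) {
      err <- actual - forecast
      sum(ifelse(err > 0, under_cost * err, over_cost * (-err)))
    }

    actual <- c(100, 120, 110)
    asymmetric_cost(actual, c(95, 115, 105))   # under-forecasts by 5 each: cost = 45
    asymmetric_cost(actual, c(105, 125, 115))  # over-forecasts by 5 each:  cost = 15 (same RMSE)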

But what to do when you must forecast a large collection of time series? Perhaps on a frequent basis? This is "big data" in the world of time series. Amazon predicts the shipping time of each shipment under different shipping methods to determine the best one (optimized jointly with other shipments taking place at the same or nearby time); Uber forecasts the ETA of each trip; Google Trends generates forecasts for any keyword a user types in, in near-real time. And IoT applications call for forecasts for the time series coming from each of a huge number of devices. These applications obviously cannot invest time and effort in building handmade solutions for every series. In such cases, automated forecasting is a practical solution. A good "big data" forecasting solution should
  • be flexible to capture a wide range of time series patterns 
  • be computationally efficient and scalable
  • be adaptable to changes in patterns that occur over time
  • provide sufficient forecasting accuracy
In my course "Business Analytics Using Forecasting" at NTHU this year, teams experienced forecasting hundreds of series from a company we are collaborating with, using various approaches and tools. The excellent forecast package in R by Rob Hyndman's team includes automated functions for ARIMA (auto.arima), exponential smoothing (ets), and a single-layer neural net (nnetar). Facebook's prophet algorithm (and R package) essentially fits a regression with trend and seasonality terms. Some of these methods are computationally heavier (e.g., auto.arima), so implementation matters.
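A minimal sketch of how such automated functions can be looped over a collection of series; series_list here is a hypothetical list of ts objects, not the course data:

    library(forecast)

    forecast_one <- function(y, h = 12) {
      list(
        ets   = forecast(ets(y), h = h),          # automated exponential smoothing
        arima = forecast(auto.arima(y), h = h),   # automated ARIMA
        nnet  = forecast(nnetar(y), h = h)        # single-layer neural net
      )
    }

    all_forecasts <- lapply(series_list, forecast_one)   # consider parallel::mclapply for large collections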

While everyone gets excited about complex methods, the evidence in time series forecasting so far is that "simple is king": naive forecasts are often hard to beat! In the recent M4 forecasting contest (with 100,000 series), what seemed to work well were combinations (ensembles) of standard forecasting methods such as exponential smoothing and ARIMA, with the ensemble weights determined by a machine learning method. Pure machine learning algorithms were far inferior. The secret sauce is ensembles.
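A minimal sketch of the ensemble idea, using equal weights rather than the learned weights of the M4 winners:

    library(forecast)

    h <- 12
    fc_ets   <- forecast(ets(AirPassengers), h = h)
    fc_arima <- forecast(auto.arima(AirPassengers), h = h)
    ensemble <- (fc_ets$mean + fc_arima$mean) / 2    # simple average of the two point forecasts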

Because simple methods often work well, it is well worth identifying which series really do require more than a naive forecast. How about segmenting the time series into groups? Methods that first fit a model to each series and then cluster the estimates are one way to go (although this can be too time consuming for some applications). The ABC-XYZ approach is different: it divides a large set of time series into four types, based on the difficulty of forecasting (easy/hard) and the magnitude of values (high/low), which can be indicative of a series' importance.
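A simplified sketch of such a segmentation, using the mean level as a proxy for magnitude and the coefficient of variation as a crude proxy for forecasting difficulty; both proxies and the median splits are illustrative choices, and series_list is again a hypothetical list of ts objects:

    vol <- sapply(series_list, mean)                          # magnitude proxy
    cv  <- sapply(series_list, function(y) sd(y) / mean(y))   # difficulty proxy (higher = harder)

    segments <- data.frame(
      magnitude  = ifelse(vol >= median(vol), "high", "low"),
      difficulty = ifelse(cv  >= median(cv),  "hard", "easy")
    )
    table(segments$magnitude, segments$difficulty)            # series counts per segment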

Forecasting is experiencing a new "split personality" phase: small-scale tailored forecasting applications that integrate domain knowledge vs. large-scale applications that rely on automated "mass-production" forecasting. My prediction is that both types of problems will continue to survive and thrive, requiring different types of modeling and different skills from the modelers.

For more on forecasting methods, the process of forecasting, and evaluating forecasting solutions, see Practical Time Series Forecasting: A Hands-On Guide and the accompanying YouTube videos.

Sunday, February 04, 2018

Data Ethics Regulation: Two key updates in 2018

This year, two important new regulations will impact research with human subjects: the EU's General Data Protection Regulation (GDPR), which takes effect in May 2018, and the USA's updated Common Rule, called the Final Rule, which is in effect from January 2018. Both changes relate to protecting individuals' private information and will affect researchers using behavioral data in terms of data collection, access, and use, applications for ethics committee (IRB) approvals/exemptions, collaborations within the same country/region and beyond, and collaborations with industry.
Both the GDPR and the Final Rule try to modernize what today constitutes "private data" and data subjects' rights, and to balance those protections against other goals, such as the "free flow of information between EU countries" in the GDPR's case. However, the GDPR's approach is much more strongly in favor of protecting private data.
Here are a few points to note about the GDPR and the Final Rule:

  1. "Personal data" (GDPR) or "private information" (final rule) is very broadly defined and includes data on physical, physiological or behavioral characteristics of a person "which allow or confirm the unique identification of that natural person".
  2. The GDPR affects any organization within the EU as well as "external organizations that are trading within the EU". It applies to personal data on any person, not just EU citizens/residents.
  3. The GDPR distinguishes between the "data controller" (the entity who has the data, in the eyes of the data subjects, e.g., a hospital) and the "data processor" (the entity who operates on the data). Both entities are bound by, and liable under, the GDPR.
  4. GDPR distinguishes between "data processing" (any operation related to the data, including storage, structuring, record deletion, and transfer) and "profiling" (automated processing of personal data to "evaluate personal aspects relating to a natural person").
  5. The Final Rule now offers an option of relying on broad consent obtained for future research as an alternative to seeking IRB approval to waive the consent requirement.
  6. Domestic collaborations within the US now require a single institutional review board (IRB) approval (for the portion of the research that takes place within the US) - effective 2021.
The Final Rule tries to lower the burden for low-risk research. One attempt is new "exemption" categories for secondary research use of identifiable private information (i.e., re-using identifiable information collected for some other "primary" or "initial" activity) when:
  • The identifiable private information is publicly available;
  • The information is recorded by the investigator in such a way that the identity of subjects cannot readily be ascertained, and the investigator does not contact subjects or try to re-identify subjects; 
  • The secondary research activity is regulated under HIPAA; or
  • The secondary research activity is conducted by or on behalf of a federal entity and involves the use of federally generated non-research information provided that the original collection was subject to specific federal privacy protections and continues to be protected.
This approach to secondary data, and specifically to observational data from public sources, seems to be in contrast with the GDPR, which states that the new regulations also apply when processing historical data for "historical research purposes". Metcalf (2018) criticized the above Final Rule exemption because "these criteria for exclusion focus on the status of the dataset (e.g., is it public? does it already exist?), not the content of the dataset nor what will be done with the dataset, which are more accurate criteria for determining the risk profile of the proposed research".