
Monday, December 10, 2018

Forecasting large collections of time series

With the recent launch of Amazon Forecast, I can no longer procrastinate writing about forecasting "at scale"!

Quantitative forecasting of time series has been used (and taught) for decades, with applications in many areas of business such as demand forecasting, sales forecasting, and financial forecasting. The types of methods taught in forecasting courses tend to be discipline-specific:

  • Statisticians love ARIMA (autoregressive integrated moving average) models, with multivariate versions such as Vector ARIMA, as well as state space models and non-parametric methods such as STL decompositions.
  • Econometricians and finance academics go one step further into ARIMA variations such as ARFIMA (f=fractional), ARCH (autoregressive conditional heteroskedasticity), GARCH (g=general), NAGARCH (n=nonlinear, a=asymmetric), and plenty more.
  • Electrical engineers use spectral analysis (the equivalent of ARIMA in the frequency domain)
  • Machine learning researchers use neural nets and other algorithms
In practice, it is common to see three types of methods used by companies for forecasting future values of a time series: exponential smoothing, linear regression, and sometimes ARIMA.
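To make the comparison concrete, here is a minimal sketch of all three methods applied to a single monthly series, using R's forecast package (AirPassengers is a monthly series that ships with base R; any monthly ts object would do):

```r
# Minimal sketch: the three workhorse methods on one monthly series
library(forecast)

y <- AirPassengers                  # monthly series, 1949-1960
h <- 12                             # forecast horizon: one year ahead

fc_ets   <- forecast(ets(y), h = h)                    # exponential smoothing
fc_lm    <- forecast(tslm(y ~ trend + season), h = h)  # linear regression with trend + seasonal dummies
fc_arima <- forecast(auto.arima(y), h = h)             # automated ARIMA

# Compare the point forecasts side by side
cbind(ETS = fc_ets$mean, Regression = fc_lm$mean, ARIMA = fc_arima$mean)
```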

Why the difference? Because the goal is different! Statistical models such as ARIMA and all its econ flavors are often used for parameter estimation or statistical inference. Those are descriptive goals (e.g., "is this series a random walk?", "what is the volatility of the errors?"). The spectral approach of electrical engineers is often used for the descriptive goals of characterizing a series' frequencies (signal processing), or for anomaly detection. In contrast, the business applications are strictly predictive: they want forecasts of future values. The simplest methods in terms of ease-of-use, computation, software availability, and understanding are linear regression models and exponential smoothing. And those methods provide sufficiently accurate forecasts in many applications - hence their popularity!

Machine learning algorithms are aligned with a predictive goal, aimed solely at forecasting. ARIMA and state space models can also be used for forecasting (albeit with a different modeling process than for a descriptive goal). The reason ARIMA is commonly used in practice is, in my opinion, the availability of automated functions.

For cases with a small number of time series to forecast (a typical case in many businesses), it is usually worthwhile investing time in properly modeling and evaluating each series individually in order to arrive at the simplest solution that provides the required level of accuracy. Data scientists are sometimes over-eager to improve accuracy beyond what is practically needed, optimizing measures such as RMSE, while the actual impact is measured in a completely different way that depends on how the forecasts are used for decision making. For example, forecasting demand has completely different implications for over- vs. under-forecasting; users might be more averse to certain directions or magnitudes of error.

But what to do when you must forecast a large collection of time series? Perhaps on a frequent basis? This is "big data" in the world of time series. Amazon predicts shipping times for each shipment under different shipping methods in order to determine the best method (optimized jointly with other shipments taking place at the same or nearby times); Uber forecasts the ETA of each trip; Google Trends generates forecasts for any keyword a user types in, in near real time. And IoT applications call for forecasts for the time series from each of a huge number of devices. These applications obviously cannot invest time and effort in building handmade solutions. In such cases, automated forecasting is a practical solution. A good "big data" forecasting solution should
  • be flexible to capture a wide range of time series patterns 
  • be computationally efficient and scalable
  • be adaptable to changes in patterns that occur over time
  • provide sufficient forecasting accuracy
In my course "Business Anlaytics Using Forecasting" at NTHU this year, teams have experienced trying to forecast hundreds of series from a company we're collaborating with. They used various approaches and tools. The excellent forecast package in R by Rob Hyndman's team includes automated functions for ARIMA (auto.arima), exponential smoothing (ets), and a single-layer neural net (nnetar). Facebook's prophet algorithm (and R package) runs a linear regression. Some of these methods are computationally heavier (e.g. ARIMA) so implementation matters. 

While everyone gets excited about complex methods, in time series forecasting the evidence so far is that "simple is king": naive forecasts are often hard to beat! In the recent M4 forecasting contest (with 100,000 series), what seemed to work well were combinations (ensembles) of standard forecasting methods, such as exponential smoothing and ARIMA, combined using a machine learning method to set the ensemble weights. Pure machine learning algorithms were far inferior. The secret sauce is ensembles.
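Even without learned weights, a simple average of two standard methods often helps. A toy sketch with equal weights (the M4 winners learned the weights with an ML model, which is beyond this sketch):

```r
# Simple ensemble: average the point forecasts of two standard methods
library(forecast)

y <- AirPassengers
h <- 12
fc1 <- forecast(ets(y), h = h)$mean         # exponential smoothing forecast
fc2 <- forecast(auto.arima(y), h = h)$mean  # ARIMA forecast
ensemble <- (fc1 + fc2) / 2                 # equal-weight combination
```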

Because simple methods often work well, it is well worth identifying which series really do require more than a naive forecast. How about segmenting the time series into groups? Methods that first fit models to each series and then cluster the estimates are one way to go (although they can be too time consuming for some applications). The ABC-XYZ approach instead divides a large set of time series into four types, based on the difficulty of forecasting (easy/hard) and the magnitude of values (high/low), which can be indicative of their importance.
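One rough way to operationalize such a segmentation (my own illustration, not a standard implementation) is to proxy forecasting difficulty by each series' coefficient of variation and importance by its total volume:

```r
# Rough ABC-XYZ-style segmentation: volume (high/low) x forecastability (easy/hard)
# Assumes series_list is a list of positive-valued numeric series
segment_series <- function(series_list) {
  volume <- sapply(series_list, sum)                       # proxy for importance
  cv     <- sapply(series_list, function(y) sd(y) / mean(y))  # proxy for difficulty
  data.frame(
    volume_class   = ifelse(volume >= median(volume), "high", "low"),
    forecast_class = ifelse(cv <= median(cv), "easy", "hard")
  )
}
```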

Forecasting is experiencing a new "split personality" phase, of small-scale tailored forecasting applications that integrate domain knowledge vs. large-scale applications that rely on automated "mass-production" forecasting. My prediction is that these two types of problems will continue to survive and thrive, requiring different types of modeling and different skills by the modelers.

For more on forecasting methods, the process of forecasting, and evaluating forecasting solutions see Practical Time Series Forecasting: A Hands-On Guide and the accompanying YouTube videos.

Wednesday, March 07, 2012

Forecasting + Analytics = ?

Quantitative forecasting is an age-old discipline, highly useful across different functions of an organization: from forecasting sales and workforce demand to economic forecasting and inventory planning.

Business schools have offered courses with titles such as "Time Series Forecasting", "Forecasting Time Series Data", and "Business Forecasting", more specialized courses such as "Demand Planning and Sales Forecasting", and even graduate programs with titles such as "Business and Economic Forecasting". Simple "Forecasting" is also popular. Such courses are offered at the undergraduate, graduate, and even executive-education levels. All these might convey the importance and usefulness of forecasting, but they are far from conveying the coolness of forecasting.

I've been struggling to find a better term for the courses that I teach on-ground and online, as well as for my recent book (with the boring name Practical Time Series Forecasting). The name needed to convey that we're talking about forecasting, particularly about quantitative data-driven forecasting, plus the coolness factor. Today I discovered it! Prof Refik Soyer from GWU's School of Business will be offering a course called "Forecasting for Analytics". A quick Google search did not find any results with this particular phrase -- so the credit goes directly to Refik. I also like "Forecasting Analytics", which links it to its close cousins "Predictive Analytics" and "Visual Analytics", all members of the Business Analytics family.


Saturday, April 09, 2011

Visualizing time series: suppressing one pattern to enhance another pattern

Visualizing a time series is an essential step in exploring its behavior. Statisticians think of a time series as a combination of four components: trend, seasonality, level and noise. All real-world series contain a level and noise, but not necessarily a trend and/or seasonality. It is important to determine whether trend and/or seasonality exist in a series in order to choose appropriate models and methods for descriptive or forecasting purposes. Hence, looking at a time plot, typical questions include:
  • is there a trend? if so, what type of function can approximate it? (linear, exponential, etc.) is the trend fixed throughout the period or does it change over time? 
  • is there seasonal behavior? if so, is seasonality additive or multiplicative? does seasonal behavior change over time?
Exploring such questions using time plots (line plots of the series over time) is enhanced by suppressing one type of pattern to better visualize other patterns. For example, suppressing seasonality can make a trend more visible. Similarly, suppressing a trend can help us see seasonal behavior. How do we suppress seasonality? Suppose that we have monthly data with apparent annual seasonality. To suppress the seasonality (also called seasonal adjustment), we can
  1. Plot annual data (either annual averages or sums)
  2. Plot a moving average (an average over a window of 12 months centered around each particular month)
  3. Plot 12 separate series, one for each month (e.g., one series for January, another for February and so on)
  4. Fit a model that captures monthly seasonality (e.g., a regression model with 11 monthly dummies) and look at the residual series
An example is shown in the figure. The top left panel shows the original series (monthly ridership on Amtrak trains). The bottom left panel adds a moving average line, suppressing seasonality and revealing the trend. The top right panel shows a model that captures the seasonality. The bottom right panel shows the residuals from that model, again highlighting the trend.
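Options 2 and 4 from the list above are easy to reproduce in R. A short sketch with the forecast package, using the monthly AirPassengers series that ships with R:

```r
library(forecast)

y <- AirPassengers

# Option 2: a centered 12-month moving average suppresses monthly seasonality
trend_line <- ma(y, order = 12, centre = TRUE)

# Option 4: regression with monthly dummies; its residuals are seasonally adjusted
fit <- tslm(y ~ season)
deseasonalized <- residuals(fit)

plot(y, col = "grey")
lines(trend_line, lwd = 2)   # smoothed trend, seasonality suppressed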

For further details and examples, see my recently published book Practical Time Series Forecasting: A Hands-On Guide (available in soft-cover and as an eBook).

Saturday, September 04, 2010

Forecasting stock prices? The new INFORMS competition

The 2010 INFORMS Data Mining Contest is underway. This time the goal is to predict 5-minute stock prices. That's right - forecasting stock prices! In my view, the meta-contest is going to be the most interesting part. By meta-contest I mean looking beyond the winning result (what method, what prediction accuracy) and examining the distribution of prediction accuracies across all the contestants, how the winner is chosen, and most importantly, how the winning result will be interpreted in terms of conclusions about the predictability of stocks.

Why is a stock prediction competition interesting? Because according to the Efficient Market Hypothesis (EMH), stocks and other traded assets are random walks (no autocorrelation between consecutive price changes). In other words, they are unpredictable. Even if there is a low level of autocorrelation, the bid-offer spread and transaction costs make stock predictions worthless. I've been fascinated with how quickly and drastically the Wikipedia page on the Efficient Market Hypothesis has changed in recent years (see the page history). The proponents of the EMH seem to be competing with its opponents in revising the page. As of today, the opponents are ahead in terms of editing the page -- perhaps the recent crisis is giving them an advantage.

The contest's evaluation page explains that the goal is to forecast whether the stock price will increase or decrease in the next time period. Entries will then be evaluated in terms of the average AUC (area under the ROC curve). Defining the problem as a binary prediction problem and using the AUC to evaluate the results introduces an additional challenge: the average AUC has various flaws as a measure of predictive accuracy. In a recent article in the journal Machine Learning, the well-known statistician Prof David Hand shows that, in addition to other deficiencies, "...the AUC uses different misclassification cost distributions for different classifiers."

In any case, among the many participants in the competition there is going to be a winner. And that winner will have the highest prediction accuracy for that stock, at least in the sense of average AUC. No uncertainty about that. But will that mean that the winning method is the magic bullet for traders? Most likely not. Or, at least, I would not be convinced until I saw the method consistently outperform a random walk across a large number of stocks and different time periods. For one, I would want to see the distribution of results of the entire set of participants and compare it to a naive classifier to evaluate how "lucky" the winner was.

The competition page reads: "The results of this contest could have a big impact on the finance industry." I find that quite scary, given the limited scope of the data, the evaluation metric, and the focus on the top results rather than the entire distribution.

Thursday, February 28, 2008

Forecasting with econometric models

Here's another interesting example where explanatory and predictive tasks create different models: econometric models. These are essentially regression models of the form:

Y(t) = beta0 + beta1 Y(t-1) + beta2 X(t) + beta3 X(t-1) + beta4 Z(t-1) + noise

An example would be forecasting Y(t)= consumer spending at time t, where the input variables can be consumer spending in previous time periods and/or other information that is available at time t or earlier.

In economics, when Y(t) is the state of the economy at time t, there is a distinction between three types of variables (aka "indicators"): leading, coincident, and lagging. Leading indicators are those that change before the economy changes (e.g., the stock market); coincident indicators change during the period when the economy changes (e.g., GDP); and lagging indicators change after the economy changes (e.g., unemployment) -- see about.com.

This distinction is especially revealing when we consider the difference between building an econometric model for the purpose of explaining vs. forecasting. For explaining, you can have both leading and coincident variables as inputs. However, if the purpose is forecasting, the inclusion of coincident variables requires one to forecast them before they can be used to forecast Y(t). An alternative is to lag those variables and include them only in leading-indicator format.
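A minimal sketch of the forecasting version in R: only lagged (leading-indicator) inputs enter the model, so no coincident variable needs to be forecast first. The variables y, x, and z below are simulated placeholders, not real economic data:

```r
# Simulate placeholder series for illustration
set.seed(1)
n <- 120
x <- rnorm(n); z <- rnorm(n)
y <- numeric(n)
for (t in 2:n) y[t] <- 0.5 * y[t - 1] + 0.3 * x[t - 1] + 0.2 * z[t - 1] + rnorm(1)

# Regression with only lagged inputs: usable for forecasting Y(t) one step ahead
d <- data.frame(
  y      = y[2:n],        # Y(t)
  y_lag1 = y[1:(n - 1)],  # Y(t-1)
  x_lag1 = x[1:(n - 1)],  # X(t-1), lagged so it acts as a leading indicator
  z_lag1 = z[1:(n - 1)]   # Z(t-1)
)
fit <- lm(y ~ y_lag1 + x_lag1 + z_lag1, data = d)
```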

I found a neat example of a leading indicator on thefreedictionary.com: The "Leading Lipstick Indicator" is based on the theory that a consumer turns to less-expensive indulgences, such as lipstick, when she (or he) feels less than confident about the future. Therefore, lipstick sales tend to increase during times of economic uncertainty or a recession. This term was coined by Leonard Lauder (chairman of Estee Lauder), who consistently found that during tough economic times his lipstick sales went up. Believe it or not, the indicator has been quite a reliable signal of consumer attitudes over the years. For example, in the months following the Sept 11 terrorist attacks, lipstick sales doubled.

Tuesday, February 26, 2008

Data mining competition season

Those who've been following my postings probably recall "competition season" when all of a sudden there are multiple new interesting datasets out there, each framing a business problem that requires the combination of data mining and creativity.

Two such competitions are the SAS Data Mining Shootout and the 2008 Neural Forecasting Competition. The SAS problem concerns revenue management for an airline that wants to improve its customer satisfaction. The NN5 competition is about forecasting cash withdrawals from ATMs.

Here are the similarities between the two competitions: they both provide real data and reasonably real business problems. Now to a more interesting similarity: they both have time series forecasting tasks. From a recent survey on the popularity of types of data mining techniques, it appears that time series are becoming more and more prominent. They also both require registration in order to get access to the data (I didn't compare their terms of use, but that's another interesting comparison), and welcome any type of modeling. Finally, they are both tied to a conference, where competitors can present their results and methods.

What would be really nice is if, as in KDD, the winners' papers were published online and made publicly available.

Wednesday, January 16, 2008

Cycle plots for time series

In his most recent newsletter, Stephen Few from PerceptualEdge presents a short and interesting article on Cycle Plots (by Naomi Robbins). These are plots for visualizing time series that enhance both the cyclical and trend components of a series. Cycle plots were invented by Cleveland, Dunn, and Terpenning in 1978, are easy to interpret, and seem quite useful, yet I have not seen them integrated into any visualization tool. The closest implementation I've seen (aside from creating them yourself or using one of the macros suggested in the article) is Spotfire DXP's hierarchies. A hierarchy enables one to define some time scales embedded within other time scales, such as "day within week within month within year". One can then plot the time series at any level of the hierarchy, thereby supporting the visualization of trends and cycles at different time scales.
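If you use R, base R's monthplot() function comes close to a cycle plot for monthly data: it draws one mini-series per month, with a horizontal line at each month's mean. A quick sketch on a built-in series:

```r
# Cycle-plot-style view of a monthly series: one panel per month,
# horizontal reference line at each month's mean
monthplot(AirPassengers, ylab = "passengers")
```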

Wednesday, July 11, 2007

The Riverplot: Visualizing distributions over time

The boxplot is one of the neatest visualizations for examining the distribution of values, or for comparing distributions. It is more compact than a histogram in that it presents only the median, the two quartiles, the range of the data, and outliers. It also requires less user input than a histogram (where the user usually has to determine the number of bins). I view the boxplot and histogram as complements, and examining both is good practice.

But how can you visualize a distribution of values over time? Well, a series of boxplots often does the trick. But if the frequency is very high (e.g., ticker data) and the time scale of interest can be considered continuous, then an alternative is the River Plot. This is a visualization that we developed together with our colleagues at the Human Computer Interaction Lab on campus. It is essentially a "continuous boxplot" that displays the median and quartiles (and potentially the range or other statistics). It is suitable when you have multiple time series that can be considered replicates (e.g., bids in multiple eBay auctions for an iPhone). We implemented it in the interactive time series visualization tool called Time Searcher, which allows the user to visualize and interactively explore a large set of time series with attributes.
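This is not the Time Searcher implementation, just a rough imitation of the idea in base R, assuming the replicate series sit in the columns of a matrix:

```r
# "Continuous boxplot" sketch: median and quartile bands over time
set.seed(1)
m <- replicate(50, cumsum(rnorm(100)))   # 50 replicate series, 100 time points

# Quartiles at each time point: rows of q are the 25%, 50%, 75% quantiles
q <- apply(m, 1, quantile, probs = c(0.25, 0.5, 0.75))

matplot(t(q), type = "l", lty = c(2, 1, 2), col = "black",
        xlab = "time", ylab = "value")   # median solid, quartiles dashed
```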

Time Searcher is a powerful tool that allows the user to search for patterns, filter, and also forecast an ongoing time series from its past and a historic database of similar time series. But then the Starbucks effect of too many choices kicks in. Together with our colleague Paolo Buono from Università di Bari, Italy, we added the feature of "simultaneous previews": the user can choose multiple parameter settings and view the resulting forecasts simultaneously. This was presented at the most recent InfoVis conference (Similarity-Based Forecasting with Simultaneous Previews: A River Plot Interface for Time Series Forecasting).

Tuesday, March 06, 2007

Accuracy measures

There is a host of metrics for evaluating predictive performance. They are all based on aggregating the forecast errors in some form. The two most famous metrics are RMSE (Root-mean-squared-error) and MAPE (Mean-Absolute-Percentage-Error). In an earlier posting (Feb-23-2006) I disclosed a secret deciphering method for computing these metrics.

Although these two have been the most popular in software, competitions, and published papers, they have their shortcomings. One serious flaw of the MAPE is that zero counts make it infinite (because of the division by zero). One solution is to leave the zero counts out of the computation, but then these counts and their prediction errors must be reported separately.

I found a very good survey paper on the various metrics, which lists many different metrics and their advantages and weaknesses. The paper, Another look at measures of forecast accuracy (International Journal of Forecasting, 2006), by Hyndman and Koehler, concludes that the best metric to use is the Mean Absolute Scaled Error, which has the mean acronym MASE.
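For concreteness, here are rough R sketches of the three metrics discussed (my own helper functions, not from a package), for actuals y, forecasts f, and a training series train_y:

```r
rmse <- function(y, f) sqrt(mean((y - f)^2))

mape <- function(y, f) {
  nz <- y != 0   # drop zero actuals to avoid division by zero;
                 # report those cases and their errors separately
  100 * mean(abs((y[nz] - f[nz]) / y[nz]))
}

# MASE (Hyndman & Koehler, 2006): MAE scaled by the in-sample MAE of
# the naive lag-1 forecast on the training series
mase <- function(y, f, train_y) {
  mean(abs(y - f)) / mean(abs(diff(train_y)))
}
```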

Thursday, March 01, 2007

Lots of real time series data!

I love data-mining or statistics competitions - they always provide great real data! However, the big difference between a gold mine and "just some data" is whether the data description and context are complete. This reflects, in my opinion, the difference between "data mining for the purpose of data mining" vs. "data mining for business analytics" (or any other field of interest, such as engineering or biology).

Last year, the BICUP2006 posted an interesting dataset on bus ridership in Santiago de Chile. Although there was a reasonable description of the data (number of passengers at a bus station at 15-minute intervals), there was no information on the actual context of the problem. The goal of the competition was to accurately forecast 3 days beyond the given data. Although this has its challenges, the main question is whether a method that accurately predicts these 3 days would be useful to the Santiago Transportation Bureau, or anyone else outside the competition. For instance, the training data included 3 weeks, with a pronounced weekday/weekend effect. However, the prediction set included only 3 weekdays. A method that accurately predicts weekdays might suffer on weekends. It is therefore imperative to include the final goal of the analysis. Will this forecaster be used to assist in bus scheduling on weekdays only? during rush hours only? How accurate do the forecasts need to be for practical use? Maybe a really simple model predicts accurately enough for the purpose at hand.

Another such instance is the upcoming NN3 Forecasting Competition (part of the 2007 International Symposium on Forecasting). The dataset includes 111 time series, varying in length (about 40-140 time points). However, I have not found any description of either the data or the context. In reality we would always know at least the time frequency: are these measurements every second? minute? day? month? year? This information is obviously important for determining factors like seasonality and which methods are appropriate.
To download the data and see a few examples, you will need to register your email.

An example of a gold mine is the T-competition, which concentrates on forecasting transportation data. In addition to the large number of series (ranging in length and at various frequencies from daily to yearly), there is a solid description of what each series is, and the actual dates of measurement. They even include a set of seasonal indexes for each series. The data come from an array of transportation measurements in both Europe and North America.