
Monday, December 10, 2018

Forecasting large collections of time series

With the recent launch of Amazon Forecast, I can no longer procrastinate writing about forecasting "at scale"!

Quantitative forecasting of time series has been used (and taught) for decades, with applications in many areas of business such as demand forecasting, sales forecasting, and financial forecasting. The types of methods taught in forecasting courses tend to be discipline-specific:

  • Statisticians love ARIMA (auto regressive integrated moving average) models, with multivariate versions such as Vector ARIMA, as well as state space models and non-parametric methods such as STL decompositions.
  • Econometricians and finance academics go one step further into ARIMA variations such as ARFIMA (f=fractional), ARCH (autoregressive conditional heteroskedasticity), GARCH (g=generalized), NAGARCH (n=nonlinear, a=asymmetric), and plenty more.
  • Electrical engineers use spectral analysis (the equivalent of ARIMA in the frequency domain)
  • Machine learning researchers use neural nets and other algorithms
In practice, it is common to see three types of methods being used by companies for forecasting future values of a time series: exponential smoothing, linear regression, and sometimes ARIMA.
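To make the first of these concrete, here is a minimal sketch of simple exponential smoothing, which forecasts via an exponentially weighted average of past values. The series and the smoothing parameter `alpha` are illustrative choices of mine; in practice `alpha` would be estimated from the data.

```python
def ses_forecast(series, alpha=0.2):
    """One-step-ahead simple exponential smoothing forecast."""
    level = series[0]                              # initialize the level at the first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level    # weighted average of new value and old level
    return level                                   # SES produces a flat forecast for all horizons

demand = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
print(ses_forecast(demand, alpha=0.2))
```

A larger `alpha` puts more weight on recent observations, making the forecast more responsive but also noisier.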

Image from https://itnext.io
Why the difference? Because the goal is different! Statistical models such as ARIMA and all its econometric flavors are often used for parameter estimation or statistical inference. Those are descriptive goals (e.g., "is this series a random walk?", "what is the volatility of the errors?"). The spectral approach of electrical engineers is often used for the descriptive goal of characterizing a series' frequencies (signal processing), or for anomaly detection. In contrast, the business applications are strictly predictive: they want forecasts of future values. The simplest methods in terms of ease of use, computation, software availability, and understanding are linear regression models and exponential smoothing. And those methods provide sufficiently accurate forecasts in many applications - hence their popularity!

ML algorithms are in line with a predictive goal, aimed solely at forecasting. ARIMA and state space models can also be used for forecasting (albeit using a different modeling process than for a descriptive goal). The reason ARIMA is commonly used in practice, in my opinion, is due to the availability of automated functions.

For cases with a small number of time series to forecast (a typical case in many businesses), it is usually worthwhile investing time in properly modeling and evaluating each series individually in order to arrive at the simplest solution that provides the required level of accuracy. Data scientists are sometimes over-eager to improve accuracy beyond what is practically needed, optimizing measures such as RMSE, while the actual impact is measured in a completely different way that depends on how those forecasts are used for decision making. For example, forecasting demand has completely different implications for over- vs. under-forecasting; users might be more averse to certain directions or magnitudes of error.
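The point about asymmetric error costs can be illustrated with a toy cost function (my own example, not from any specific company): under-forecasting demand causes stock-outs, assumed here to cost five times as much per unit as the holding cost of over-forecasting. The unit costs are hypothetical.

```python
def asymmetric_cost(actual, forecast, under_cost=5.0, over_cost=1.0):
    """Total cost of errors when under- and over-forecasting differ in impact."""
    total = 0.0
    for a, f in zip(actual, forecast):
        err = a - f
        # positive error = under-forecast (costly stock-out); negative = over-forecast (holding cost)
        total += under_cost * err if err > 0 else over_cost * (-err)
    return total

actual   = [100, 120, 90]
forecast = [ 95, 125, 90]
print(asymmetric_cost(actual, forecast))  # 5*5 (under) + 1*5 (over) + 0 = 30.0
```

A model tuned to minimize RMSE would treat the two 5-unit errors identically, while this cost function penalizes the under-forecast five times as heavily.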

But what to do when you must forecast a large collection of time series? Perhaps on a frequent basis? This is "big data" in the world of time series. Amazon predicts shipping times for each shipment under different shipping methods to determine the best one (optimized jointly with other shipments taking place at the same or nearby times); Uber forecasts ETA for each trip; Google Trends generates forecasts for any keyword a user types, in near real time. And IoT applications call for forecasts for time series from each of their huge number of devices. These applications obviously cannot invest time and effort into building hand-crafted solutions. In such cases, automated forecasting is a practical solution. A good "big data" forecasting solution should:
  • be flexible to capture a wide range of time series patterns 
  • be computationally efficient and scalable
  • be adaptable to changes in patterns that occur over time
  • provide sufficient forecasting accuracy
In my course "Business Analytics Using Forecasting" at NTHU this year, teams experienced trying to forecast hundreds of series from a company we're collaborating with. They used various approaches and tools. The excellent forecast package in R by Rob Hyndman's team includes automated functions for ARIMA (auto.arima), exponential smoothing (ets), and a single-layer neural net (nnetar). Facebook's prophet algorithm (and R package) runs a linear regression. Some of these methods are computationally heavier (e.g., ARIMA), so implementation matters.

While everyone gets excited about complex methods, the evidence in time series forecasting so far is that "simple is king": naive forecasts are often hard to beat! In the recent M4 forecasting contest (with 100,000 series), what seemed to work well were combinations (ensembles) of standard forecasting methods such as exponential smoothing and ARIMA, with the ensemble weights learned by a machine learning method. Pure machine learning algorithms were far inferior. The secret sauce is ensembles.
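The ensemble idea can be sketched in a few lines. Here the "ensemble" is just an equal-weight average of two simple forecasters (a naive forecast and a moving-average forecast); the M4 winners learned unequal weights with a machine learning model, which is beyond this sketch. The series and window size are illustrative.

```python
def naive_forecast(series):
    return series[-1]                          # forecast = last observed value

def mean_forecast(series, window=4):
    recent = series[-window:]
    return sum(recent) / len(recent)           # forecast = average of recent values

def ensemble_forecast(series):
    members = [naive_forecast(series), mean_forecast(series)]
    return sum(members) / len(members)         # equal-weight combination of member forecasts

sales = [20, 22, 25, 24, 27, 30]
print(ensemble_forecast(sales))                # average of 30 and 26.5
```

Even this crude combination often beats either member alone, because the members' errors partially cancel.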

Because simple methods often work well, it is well worth identifying which series really do require more than a naive forecast. How about segmenting the time series into groups? Methods that first fit models to each series and then cluster the estimates are one way to go (although this can be too time-consuming for some applications). The ABC-XYZ approach takes a different route: it divides a large set of time series into four types, based on the difficulty of forecasting (easy/hard) and the magnitude of values (high/low), which can be indicative of their importance.
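A rough sketch of this kind of segmentation (my own simplification, not the canonical ABC-XYZ procedure): classify each series by magnitude via total volume, and by forecastability via the coefficient of variation (CV). The thresholds are hypothetical and would be tuned per application.

```python
import statistics

def segment(series, volume_cut=500, cv_cut=0.25):
    """Label a series by magnitude (A=high, C=low) and stability (X=smooth, Z=erratic)."""
    volume = sum(series)
    cv = statistics.pstdev(series) / statistics.mean(series)  # relative variability
    size = "A" if volume >= volume_cut else "C"
    stability = "X" if cv <= cv_cut else "Z"
    return size + stability

steady_big    = [100, 102, 98, 101, 99, 100]   # high volume, low CV
erratic_small = [3, 0, 15, 1, 0, 9]            # low volume, high CV
print(segment(steady_big), segment(erratic_small))
```

The "AX"-type series (high volume, smooth) justify more modeling effort, while "CZ"-type series (low volume, erratic) may be best served by a naive or very simple forecast.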

Forecasting is experiencing a new "split personality" phase, of small-scale tailored forecasting applications that integrate domain knowledge vs. large-scale applications that rely on automated "mass-production" forecasting. My prediction is that these two types of problems will continue to survive and thrive, requiring different types of modeling and different skills by the modelers.

For more on forecasting methods, the process of forecasting, and evaluating forecasting solutions, see Practical Time Series Forecasting: A Hands-On Guide and the accompanying YouTube videos.

Tuesday, September 05, 2017

My videos for “Business Analytics using Data Mining” now publicly available!

Five years ago, in 2012, I decided to experiment in improving my teaching by creating a flipped classroom (and semi-MOOC) for my course “Business Analytics Using Data Mining” (BADM) at the Indian School of Business. I initially designed the course at University of Maryland’s Smith School of Business in 2005 and taught it until 2010. When I joined ISB in 2011 I started teaching multiple sections of BADM (which was started by Ravi Bapna in 2006), and the course was fast growing in popularity. Repeating the same lectures in multiple course sections made me realize it was time for scale! I therefore created 30+ videos, covering various supervised methods (k-NN, linear and logistic regression, trees, naive Bayes, etc.) and unsupervised methods (principal components analysis, clustering, association rules), as well as important principles such as performance evaluation, the notion of a holdout set, and more.

I created the videos to support teaching with our textbook “Data Mining for Business Analytics” (the 3rd edition and a SAS JMP edition came out last year; R edition coming out this month!). The videos highlight the key points in different chapters, (hopefully) motivating the watcher to read more in the textbook, which also offers more examples. The videos’ order follows my course teaching, but the topics are mostly independent.

The videos were a big hit in the ISB courses. Since moving to Taiwan, I've created and offered a similar flipped BADM course at National Tsing Hua University, and the videos are also part of the Statistics.com Predictive Analytics series. I’ve since added a few more topics (e.g., neural nets and discriminant analysis).

The audience for the videos (and my courses and textbooks) is non-technical folks who need to understand the logic and uses of data mining, at the managerial level. The videos are therefore about problem solving, and hence the "Business Analytics" in the title. They are different from the many excellent machine learning videos and MOOCs in focus and in technical level -- a basic statistics course that covers linear regression and some business experience should be sufficient for understanding the videos.
For 5 years, and until last week, the videos were only available to past and current students. However, the word spread and many colleagues, instructors, and students have asked me for access. After 5 years, and in celebration of the first R edition of our textbook Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, I decided to make it happen. All 30+ videos are now publicly available on my BADM YouTube playlist.


Currently the videos cater only to those who understand English. I opened the option for community-contributed captions, in the hope that folks will contribute captions in different languages to help make the knowledge propagate further.

This new playlist complements a similar set of videos, on "Business Analytics Using Forecasting" (for time series), that I created at NTHU and made public last year, as part of a MOOC offered on FutureLearn with the next round opening in October.

Finally, I’ll share that I shot these videos while I was living in Bhutan. They are all homemade -- I tried to filter out barking noises and to time the recording when ceremonies were not held close to our home. If you’re interested in how I made the materials and what lessons I learned for flipping my first course, check out my 2012 post.

Thursday, August 15, 2013

Designing a Business Analytics program, Part 3: Structure

This post continues two earlier posts (Part 1: Intro and Part 2: Content) on Designing a Business Analytics (BA) program. This part focuses on the structure of a BA program, and especially course structure.

In the program that I designed, each of the 16 courses combines on-ground sessions with online components. Importantly, the opening and closing of a course should be on-ground.

The hybrid online/on-ground design is intended to accommodate participants who cannot take long periods of time off to attend campus. Yet, even in a residential program, a hybrid structure can be more effective if it is properly implemented. The reason is that a hybrid model is more similar to the real-world functioning of an analyst. At the start and end of a project, close communication is needed with the domain experts and stakeholders to assure that everyone is clear about the goals and the implications. In between these touch points, the analytics group works "offline" (building models, evaluating, testing, going back and forth) while communicating among the group and from time to time with the domain people.

A hybrid "sandwich" BA program can be set up to mimic this process:
  • The on-ground sessions at the start and end of each course help set the stage and expectations, build communication channels between the instructor and participants as well as among participants; at the close of a course, participants present their work and receive peer and instructor feedback.
  • The online components guide participants (and teams of participants) through the skill development and knowledge acquisition that the course aims at. Working through a live project, participants can acquire the needed knowledge (1) via lecture videos, textbook readings, case studies and articles, software tutorials and more, (2) via self-assessment and small deliverables that build up needed proficiency, and (3) via a live online discussion board where participants are required to ask, answer, discuss and share experiences, challenges and discoveries. If designing and implementing the online component is beyond the realm of the institution, it is possible to integrate existing successful online courses, such as those offered on Statistics.com or on Coursera, edX and other established online course providers.
For example, in a Predictive Analytics course, a major component is a team project with real data, solving a potentially real problem. The on-ground sessions would focus on translating a business problem into an analytics problem and setting the expectations and stage for the process the teams will be going through. Teams would submit proposals and discuss with the instructor to assure feasibility and determine the way forward. The online components would include short lecture videos, textbook reading, short individual assignments to master software and technique, and a vibrant online discussion board with topics at different technical and business levels (this is similar to my semi-MOOC course Business Analytics Using Data Mining). In the closing on-ground sessions, teams present their work to the entire group and discuss challenges and insights; each team might meet with the instructor to receive feedback and do a second round of improvement. Finally, an integrative session would provide closure and linkage to other courses.

Designing a Business Analytics program, Part 2: Content

This post follows Part 1: Intro of Designing a Business Analytics program. In this post, I focus on the content to be covered in the program, in the form of courses and projects.

The following design is based on my research of many programs, on discussions with faculty in various analytics areas, with analysts and managers at different levels, and on feedback from many past MBA students who have taken my analytics courses over the years (data mining, forecasting, visualization, statistics, etc.) and are now managing data at a broad range of companies and organizations.

Content
Dealing with data, whether little or mountains of it, and being able to tackle an array of business challenges and opportunities, requires a broad and diverse set of tools and approaches. Everything from data access and management to modeling, assessment, and deployment requires a skill set that derives from the fields of statistics, computer science, operations research, and more. In addition, one needs integrative and "big picture" thinking and effective communication skills. Here is a list of 16 courses, divided into four sets, that attempts to achieve such a skill set (by no means is this the only set - would love to hear comments):

Set I
  1. Analytic Thinking (what is a model? what is the role of a model? data in context and data-domain integration)
  2. Data Visualization (data exploration, interactive visualization, charts and dashboards, data presentation and effective communication, use of BI tools)
  3. Statistical Analysis 1: Estimation and inference (observational studies and experiments; estimating population means, proportions, and more; testing hypotheses regarding population numbers; using programming and menu-driven software)
  4. Statistical Analysis 2: Regression models (linear, logistic, ANOVA)
Set II
  1. Data Management 1: Database design and implementation, data warehousing
  2. Forecasting Analytics: Exploring and modeling time series
  3. Data Management 2: Big Data (Hadoop-MapReduce and more)
  4. Operations 1: Simulation (principles of simulation; Monte Carlo and Discrete Event simulation)
Set III
  1. Operations 2: Optimization (optimization techniques, sensitivity analysis, and more)
  2. Statistical Analysis 3: Advanced statistical models (censoring and truncation, modeling count data, handling missing values, design of experiments (A/B testing and beyond))
  3. Data Collection (Web data collection, online surveys, experiments)
  4. Data Mining 1: Supervised Learning - Predictive Analytics (predictive algorithms, evaluating predictive power, using software)
Set IV
  1. Data Mining 2: Unsupervised Learning (dimension reduction, clustering, association rules, recommender systems)
  2. Contemporary Analytics 1 (choose between: text mining, network analytics, social analytics, customer analytics, web analytics, risk analytics)
  3. Contemporary Analytics 2 (from the list above)
  4. Integrative Thinking (BA in different fields, choosing and integrating tools and analytic approaches into an effective solution)
The courses are divided into sets of four, where courses in each set can be offered in parallel. The order should take into account coverage of other courses and natural linkages.

Lastly: two industry team projects that require integrating skills from multiple courses should give participants the opportunity to interface with industry, test their skills in a more realistic setting, and gain initial experience and confidence to move forward on their own.

Continue to Part 3: Structure

Designing a Business Analytics program, Part 1: Intro

I have been receiving many inquiries about programs in "Business Analytics" (BA), online and offline, in the US and outside the US. The few programs that are already out there (see an earlier post) are relatively new, so it is difficult to assess their success in producing data-savvy analysts.

Rather than concentrate on the uncertainty, let me share my view and experience regarding the skill set that such programs should provide. To be practical, I will share the program that I designed for the Indian School of Business one-year certificate program in BA(*), in terms of content and structure. Both reflect the needed skills and knowledge that I believe make a valuable data analyst in a company, as well as a powerful consultant.

The program was designed for participants who have a few years of business experience and are planning to manage the data crunchers, but must acquire a solid knowledge of the crunchers' toolkit, and especially how it can be used effectively to tackle business goals, challenges and opportunities.

Business Analytics experts have a broad skill set
One important note: Although some universities and business schools are tempted to rename an existing operations or statistics program as a BA (or "Big Data" or "Data Science", etc.) program, this will by no means supply the required diversity of skills. A program in BA should not look like a statistics program. It also should not look like a program in operations research. The key is therefore a combination of courses from different areas (statistics and operations among them), which usually requires experts from across campus. In a recent post by visualization expert Nathan Yau, he comments on the need to know more than just visualization to be successful in the field ("It still surprises me how little statistics visualization people know... Look at job listings though, and most employers list it in the required skill set, so it's a big plus for you hiring-wise.")

The next two posts describe the content and structure of the program.

Continue to Part 2: Content

(*) The final program structure and content at ISB were modified by the program administrator to accommodate constraints and shortages.

Tuesday, January 15, 2013

What does "business analytics" mean in academia?

What exactly does this term mean?
In the recent ISIS conference, I organized and moderated a panel called "Business Analytics and Big Data: How it affects Business School Research and Teaching". The goal was to tackle the ambiguity in the terms "Business Analytics" and "Big Data" in the context of business school research and teaching. I opened with a few points:

  1. Some research b-schools are posting job ads for tenure-track faculty in "Business Analytics" (e.g., University of Maryland; Google "professor business analytics position" for plenty more). What does this mean? What is supposed to be the background of these candidates, and where are they supposed to publish to get promoted? ("The Journal of Business Analytics"?)
  2. A recent special issue of the top scholarly journal Management Science was devoted to "Business Analytics". What types of submissions fall under this label? What types do not?
  3. Many new "Business Analytics" programs have been springing up in business schools worldwide. What is new about their offerings? 

Panelists Anitesh, Ram and David - photo courtesy of Ravi Bapna
The panelists were a mix of academics (Prof Anitesh Barua from UT Austin and Prof Ram Chellappa from Emory University) and industry (Dr. David Hardoon, SAS Singapore). The audience was also a mixed crowd of academics, mostly from MIS departments (in business schools), and industry experts from companies such as IBM and Deloitte.

The discussion took various twists and turns with heavy audience discussion. Here are several issues that emerged from the discussion:

  • Is there any meaning to BA in academia or is it just the use of analytics (=data tools) within a business context? Some industry folks said that BA is only meaningful within a business context, not research wise.
  • Is BA just a fancier name for statisticians in a business school or does it convey a different type of statistician? (similar to the adoption of "operations management" (OM) by many operation research (OR) academics)
  • The academics on the panel made the point that BA has been changing the flavor of research in terms of adding a discovery/exploratory dimension that does not typically exist in social science and IS research. Rather than only theorize-then-test-with-data, data are now explored in further detail using tools such as visualization and micro-level models. The main concern, however, was that it is still very difficult to publish such research in top journals.
  • With respect to "what constitutes a BA research article", Prof. Ravi Bapna said "it's difficult to specify what papers are BA, but it is easy to spot what is not BA".
  • While machine learning and data mining have been around for quite some time, and the methods have not really changed, their application within a business context has become more popular due to friendlier software and stronger computing power. These practices are therefore now an important core in MBA and other business programs.
  • One type of b-school program that seems to lag behind on the BA front is the PhD program. Are we equipping our PhD students with abilities to deal with and take advantage of large datasets for developing theory? Are PhD programs revising their curriculum to include big data technologies and machine learning capabilities as required core courses?
Some participants claimed that BA is just another buzzword that will go away after some time, so we need not worry about defining it or demystifying it. After all, the software vendors coin such terms, create a buzz, and eventually the buzz moves on. Whether this will be the case with BA or with Big Data is yet to be seen. In the meanwhile, we should ponder whether we are really doing something new in our research, and if so, pinpoint what exactly it is and how to formulate it as requirements for a new era of researchers.

Thursday, October 04, 2012

Flipping and virtualizing learning

Adopting new technology for teaching has been one of my passions, and luckily my students have been understanding even during glitches or choices that turn out to be ineffective (such as the mobile/Internet voting technology that I wrote about last year). My goal has been to use technology to make my courses more interactive: I use clickers for in-class polling (to start discussions and assess understanding, not for grading!); last year, after realizing that my students were constantly on Facebook, I finally opened a Facebook account and ran a closed FB group for out-of-class discussions; in my online courses on statistics.com I created interactive lessons (slides with media, quizzes, etc.) using Udutu.com. On the pedagogical side, I have tried to focus on hands-on learning: team projects in place of exams, in-class presentations, and homework that gets your hands dirty.

But all these were just baby steps, preparing me for the big leap. In the last month, I have been immersed in a complete transformation of one of my on-ground courses: The new approach is a combination of a new technology and a recent pedagogical movement. The pedagogical side is called 'flipping the classroom', where class time is not spent on one-directional lecturing but rather on discussions and other interactive activities. The technological leap is the move towards a Massive Open Online Course (MOOC) – but in my case a "moderate open online course". As a first step, the course will be open only to the community of the Indian School of Business (students, alumni, faculty and staff). The long term plan is to open it up globally.

The course Business Analytics using Data Mining is opening in less than two weeks. I've been working round-the-clock creating content for the online and on-ground components, figuring out the right technologies that can support all the requirements, and collaborating with colleagues at CrowdANALYTIX and at Hansa Cequity to integrate large local datasets and a platform for running data mining contests into the course.

Here are the ingredients that I found essential:
  • You need strong support from the university! Luckily, ISB is a place that embraces innovation and is willing to evaluate cutting-edge teaching approaches.
  • A platform that is easy for a (somewhat tech-savvy) instructor to design, to upload materials to, to update, to interact with participants on, and in general, to run. If you are a control freak like me, the last thing you want is to need to ask someone else to upload, edit, or change things. After researching many possibilities, I decided to use the Google platform. Not the new Google Course Builder platform (who has time for programming in JavaScript?), but rather a unique combination of Google Sites, Google Drive, Google Groups, YouTube embedding, etc. The key is Google Sites, which is an incredibly versatile tool (and free! thanks Google!). Another advantage of Google Sites is that you have the solid backbone of Google behind you. If your university uses Google Apps for Education, all the better (we hope to move there soon...)
  • It is definitely worthwhile investing in good video editing software. This was a painful experience: after starting with one software package that was causing grief, I switched to Camtasia Studio, and very quickly purchased a license. It is an incredibly powerful yet simple-to-use tool for recording video+screen+audio and then editing (cutting out coughs, for instance).
  • Hardware for lecture videos: Use a good webcam that also has a good mic. I learned that audio quality is the biggest reason for people to stop watching a video. Getting the Thimphu street dogs to stop barking is always a challenge. If you're in a power-outage prone area, make sure to get a back-up battery (UPS).
  • Have several people go over the course platform to make sure that all the links work, the videos stream, etc. Also, get someone to assist with participants' technical queries. There are always those who need hand-holding.
The way the course will work at ISB is that the ISB community can join the online component (lecture videos, guided reading, online forum, contests). Registered students will also attend on-ground meetings that will focus on discussions, project-based learning, and other interactive activities. 

We opened registration to the community today and there are already more than 200 registrants. I guess everyone is curious! Whether the transformation will be a huge success or will die out quietly is yet to be seen. But for sure, there will be insights and learning for all of us.


Tuesday, August 07, 2012

The mad rush: Masters in Analytics programs

The recent trend among mainstream business schools is opening a graduate program or a concentration in Business Analytics (BA). Googling "MS Business Analytics" reveals lots of big players offering such programs.

These programs are intended (aside from making money) to bridge the knowledge gap between the "data or IT team" and the business experts. Graduates should be able to lead analytics teams in companies, identifying opportunities where analytics can add value, understanding pitfalls, being able to figure out the needed human and technical resources, and most importantly -- communicating analytics with top management. Unlike "marketing analytics" or other domain-specific programs, Business Analytics programs are "tools" oriented.

As a professor of statistics, I feel a combination of excitement and pain. The word Analytics is clearly more attractive than Statistics. But it is also broader in two senses. First, it combines methods and tools from a wider set of disciplines: statistics, operations research, artificial intelligence, computer science. Second, although technical skills are required to some degree, the focus is on the big picture and how the tools fit into the business process. In other words, it's about Business Analytics.

I am excited about the trend of BA programs because finally they are able to force disciplines such as statistics into considering the large picture and fitting in both in terms of research and teaching. Research is clearly better guided by real problems. The top research journals are beginning to catch up: Management Science has an upcoming special issue on Business Analytics. As for teaching, it is exciting to teach students who are thirsty for analytics. The challenge is for instructors with PhDs in statistics, operations, computer science or other disciplines to repackage the technical knowledge into a communicable, interesting and useful curriculum. Formulas or algorithms, as beautiful as they might appear to us, are only tolerated when their beauty is clearly translated into meaningful and useful knowledge. Considering the business context requires a good deal of attention and often modifying our own modus operandi (we've all been brainwashed by our research discipline).

But then, there's the painful part of the missed opportunity for statisticians to participate as major players (or is it envy?). The statistics community seems to be going through this cycle of "hey, how did we get left behind?". This happened with data mining, and is now happening with data analytics. The great majority of Statistics programs continuously fail to be the leaders of the non-statistics world. Examining the current BA trend, I see that

  1. Statisticians are typically not the leaders of these programs. 
  2. Business schools that lack statistics faculty (and that's typical) either hire non-research statisticians as adjunct faculty to teach statistics and data mining courses, or else these courses are taught by faculty from other areas such as information systems and operations.
  3. "Data Analytics" or "Analytics" degrees are still not offered by mainstream Statistics departments. For example, North Carolina State U has an Institute for Advanced Analytics that offers an MS in Analytics degree. Yet, this does not appear to be linked to the Statistics Department's programs. Carnegie Mellon's Heinz Business College offers a Master degree with concentration in BI and BA, yet the Statistics department offers a Masters in Statistical Practice.
My greatest hope is that a new type of "analytics" research faculty member evolves. The new breed, while having deep knowledge in one field, will also possess more diverse knowledge and openness to other analytics fields (statistical modeling, data mining, operations research methods, computing, human-computer visualization principles). At the same time, for analytics research to flourish, the new breed academic must have a foot in a particular domain, any domain, be it in the social sciences, humanities, engineering, life sciences, or other. I can only imagine the exciting collaboration among such groups of academics, as well as the value that they bring to research, teaching and knowledge dissemination to other fields.

Wednesday, March 07, 2012

Forecasting + Analytics = ?

Quantitative forecasting is an age-old discipline, highly useful across different functions of an organization: from forecasting sales and workforce demand to economic forecasting and inventory planning.

Business schools have offered courses with titles such as "Time Series Forecasting", "Forecasting Time Series Data", and "Business Forecasting", more specialized courses such as "Demand Planning and Sales Forecasting", and even graduate programs titled "Business and Economic Forecasting". A simple "Forecasting" is also popular. Such courses are offered at the undergraduate, graduate, and even executive-education levels. All these might convey the importance and usefulness of forecasting, but they are far from conveying the coolness of forecasting.

I've been struggling to find a better term for the courses that I teach on-ground and online, as well as for my recent book (with the boring name Practical Time Series Forecasting). The name needed to convey that we're talking about forecasting, particularly about quantitative data-driven forecasting, plus the coolness factor. Today I discovered it! Prof Refik Soyer from GWU's School of Business will be offering a course called "Forecasting for Analytics". A quick Google search did not find any results with this particular phrase -- so the credit goes directly to Refik. I also like "Forecasting Analytics", which links it to its close cousins "Predictive Analytics" and "Visual Analytics", all members of the Business Analytics family.


Monday, February 20, 2012

Explain or predict: simulation

Some time ago, when I presented the "explain or predict" work, my colleague Avi Gal asked where simulation falls. Simulation is a key method in operations research, as well as in statistics. A related question arose in my mind when thinking of Scott Nestler's distinction between descriptive/predictive/prescriptive analytics. Scott defines prescriptive analytics as "what should happen in the future? (optimization, simulation)".

So where does simulation fall? Does it fall in a completely different goal category, or can it be part of the explain/predict/describe framework?

My opinion is that simulation, like other data analytics techniques, does not define a goal in itself but is rather a tool for achieving one of the explain/predict/describe goals. When the purpose is to test causal hypotheses, simulation can be used to study what would happen if the causal effect were true, by simulating data under the "causally-true" hypothesis and comparing them to data from "causally-false" scenarios. In predictive and forecasting tasks, where the purpose is to predict new or future data, simulation can be used to generate predictions. It can also be used to evaluate the robustness of predictions under different scenarios (that would have been very useful in recent years' economic forecasts!). In descriptive tasks, where the purpose is to approximate data and quantify relationships, simulation can be used to check the sensitivity of the quantified effects to various model assumptions.
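To make the predictive use of simulation concrete, here is a minimal sketch (not from the original post) of Monte Carlo forecasting: assuming the series behaves like a random walk with Gaussian steps, we simulate many future paths and read a point forecast and an empirical 95% interval off the endpoint distribution. The function name and all parameter values are illustrative.

```python
import random
import statistics

def simulate_forecast_paths(last_value, sigma, horizon, n_paths, seed=0):
    """Monte Carlo paths for a random-walk series: each path adds `horizon`
    Gaussian steps to the last observed value, so the spread of the
    endpoints approximates the forecast uncertainty at that horizon."""
    rng = random.Random(seed)
    endpoints = []
    for _ in range(n_paths):
        value = last_value
        for _ in range(horizon):
            value += rng.gauss(0, sigma)
        endpoints.append(value)
    return endpoints

ends = sorted(simulate_forecast_paths(last_value=100.0, sigma=2.0,
                                      horizon=12, n_paths=5000))
point_forecast = statistics.mean(ends)       # near the last observed value
interval = (ends[int(0.025 * len(ends))],    # empirical 95% band
            ends[int(0.975 * len(ends))])
```

Rerunning with different step distributions (fatter tails, drift) is exactly the kind of robustness check described above.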

On a related note, Scott challenged me on a post from two years ago where I stated that the term data mining used by operations research (OR) does not really mean data mining. I still hold that view, although I believe that the terminology has now changed: INFORMS now uses the term Analytics in place of data mining. This term is indeed a much better choice, as it is an umbrella term covering a variety of data analytics methods, including data mining, statistical models and OR methods. David Hardoon, Principal Analytics at SAS Singapore, has shown me several terrific applications that combine methods from these different toolkits. As in many cases, combining methods from different disciplines is often the best way to add value.

Tuesday, December 20, 2011

Trading and predictive analytics

I attended today's class in the course Trading Strategies and Systems offered by Prof Vasant Dhar from NYU Stern School of Business. Luckily, Vasant is offering the elective course here at the Indian School of Business, so no need for transatlantic travel.

The topic of this class was the use of news in trading. I won't disclose any trade secrets (you'll have to attend the class for that), but here's my point: trading is a striking example of the distinction between explanation and prediction. Trading techniques are generally based on correlations and on "blackbox" predictive models such as neural nets. In particular, text mining and sentiment analysis are used to extract information from (often unstructured) news articles for the purpose of prediction.
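As a toy illustration of the sentiment-extraction idea (not how any actual trading system works), a lexicon-based scorer counts positive minus negative words in a headline; the word lists here are made up for the example, and real systems learn such signals from data rather than hand-coding them.

```python
# Hypothetical word lists -- a deliberately simplified stand-in for the
# machine-learned text-mining pipelines used in practice.
POSITIVE = {"beat", "growth", "record", "upgrade", "surge"}
NEGATIVE = {"miss", "loss", "downgrade", "lawsuit", "recall"}

def sentiment_score(headline):
    """Positive-word count minus negative-word count for one headline."""
    words = [w.strip(".,!?").lower() for w in headline.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

sentiment_score("Earnings beat forecasts, shares surge.")   # positive score
sentiment_score("Profit miss triggers downgrade")           # negative score
```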

Vasant mentioned the practical advantage of a machine-learning approach over linguistics know-how for extracting useful content from text. This reminded me of a famous comment by Frederick Jelinek, a prominent Natural Language Processing researcher who passed away recently:
"Whenever I fire a linguist our system performance improves" (Jelinek, 1998)
This comment was based on Jelinek's experience at IBM Research, while working on computer speech recognition and machine translation.

Jelinek's comment did not make linguists happy. He later defended this claim in a paper entitled "Some of My Best Friends are Linguists" by commenting,
"We all hoped that linguists would provide us with needed help. We were never reluctant to include linguistic knowledge or intuition into our systems; if we didn't succeed it was because we didn't find an efficient way to include it."
Note: there are some disputes regarding the exact wording of the quote ("Anytime a linguist leaves the group the recognition rate goes up") and its timing -- see note #1 in the Wikipedia entry.

Tuesday, September 06, 2011

"Predict" or "Forecast"?

What is the difference between "prediction" and "forecasting"? I heard this being asked quite a few times lately. The Predictive Analytics World conference website has a Predictive Analytics Guide page with the following Q&A:

How is predictive analytics different from forecasting?
Predictive analytics is something else entirely, going beyond standard forecasting by producing a predictive score for each customer or other organizational element. In contrast, forecasting provides overall aggregate estimates, such as the total number of purchases next quarter. For example, forecasting might estimate the total number of ice cream cones to be purchased in a certain region, while predictive analytics tells you which individual customers are likely to buy an ice cream cone.
In a recent interview on "Data Analytics", Prof Ram Gopal asked me a similar question. I have a slightly different view of the difference: the term "forecasting" is used when the data form a time series and we are predicting the series into the future. Hence "business forecasts" and "weather forecasts". In contrast, "prediction" is the act of predicting in a cross-sectional setting, where the data are a snapshot in time (say, a one-time sample from a customer database). Here you use information on a sample of records to predict the value of other records (which can be a value that will be observed in the future). That's my personal distinction.



While forecasting has traditionally focused on providing "overall aggregate estimates", that has long changed, and forecasting methods are commonly used to provide individual estimates. Think again of weather forecasts: daily (and even minute-by-minute) forecasts are generated for many very specific geographical areas. Another example is SKU-level forecasting for inventory management, where stores and large companies routinely generate a separate forecast for every product they carry. These are not aggregate values, but individual-product forecasts.

"Old-fashioned" forecasting has indeed been around for a long time, and has been taught in statistics and operations research programs and courses. While some forecasting models require substantial statistical expertise (ARIMA, GARCH, and other acronyms), there is a terrific and powerful set of data-driven, computationally fast, automated methods that can be used for forecasting even at the individual product/service level. Forecasting, in my eyes, is definitely part of predictive analytics.
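One such fast, automated method is simple exponential smoothing, which can be run independently over thousands of products. A minimal sketch (the SKU names and demand numbers are made up for illustration):

```python
def ses_forecast(series, alpha=0.3):
    """Simple exponential smoothing: a weighted average that discounts
    older observations geometrically. The final smoothed level serves as
    the forecast for every future period."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

# Hypothetical per-SKU demand histories, smoothed independently --
# the pattern that scales to SKU-level forecasting.
demand = {"SKU-1": [12, 15, 14, 16, 18],
          "SKU-2": [120, 90, 110, 95, 105]}
forecasts = {sku: ses_forecast(hist) for sku, hist in demand.items()}
```

In practice one would also automate the choice of `alpha` (e.g., by minimizing one-step-ahead forecast errors), which is what off-the-shelf forecasting software does.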

Wednesday, July 27, 2011

Analytics: You want to be in Asia

Business Intelligence and Data Mining have become hot buzzwords in the West. Using Google Insights for Search to "see what the world is searching for" (see image below), we can see that the popularity of these two terms seems to have stabilized (if you expand the search to 2007 or earlier, you will see the earlier peak and also that Data Mining was hotter for a while). Click on the image to get to the actual result, with which you can interact directly. There are two very interesting insights from this search result:
  1. Looking at the "Regional Interest" for these terms, we see that the #1 country searching for these terms is India! Hong Kong and Singapore are also in the top 5. A surge of interest in Asia!
  2. Adding two similar terms that have the term Analytics, namely Business Analytics and Data Analytics, unveils a growing interest in Analytics (whereas the two non-analytics terms have stabilized after their peak).
What to make of this? First, it means Analytics is hot. Business Analytics and Data Analytics encompass methods for analyzing data that add value to a business or any other organization. Analytics includes a wide range of data analysis methods: visual analytics, descriptive and explanatory modeling, and predictive analytics; from statistical modeling to interactive visualization (like the one shown here!) to machine-learning algorithms and more. Companies and organizations are hungry for methods that can turn their huge and growing amounts of data into actionable knowledge. And the hunger is most pressing in Asia.

Tuesday, November 16, 2010

November Analytics magazine on BI

The November issue of INFORMS Analytics magazine has a bunch of interesting articles about business analytics and predictive analytics from a managerial point of view.