
Tuesday, September 05, 2017

My videos for “Business Analytics using Data Mining” now publicly available!

Five years ago, in 2012, I decided to experiment with improving my teaching by creating a flipped classroom (and semi-MOOC) for my course “Business Analytics Using Data Mining” (BADM) at the Indian School of Business. I initially designed the course at University of Maryland’s Smith School of Business in 2005 and taught it until 2010. When I joined ISB in 2011 I started teaching multiple sections of BADM (which was started by Ravi Bapna in 2006), and the course was quickly growing in popularity. Repeating the same lectures in multiple course sections made me realize it was time for scale! I therefore created 30+ videos, covering various supervised methods (k-NN, linear and logistic regression, trees, naive Bayes, etc.) and unsupervised methods (principal components analysis, clustering, association rules), as well as important principles such as performance evaluation, the notion of a holdout set, and more.

I created the videos to support teaching with our textbook “Data Mining for Business Analytics” (the 3rd edition and a SAS JMP edition came out last year; R edition coming out this month!). The videos highlight the key points in different chapters, (hopefully) motivating the viewer to read more in the textbook, which also offers more examples. The videos’ order follows my course teaching, but the topics are mostly independent.

The videos were a big hit in the ISB courses. Since moving to Taiwan, I've created and offered a similar flipped BADM course at National Tsing Hua University, and the videos are also part of the Statistics.com Predictive Analytics series. I’ve since added a few more topics (e.g., neural nets and discriminant analysis).

The audience for the videos (and my courses and textbooks) is non-technical folks who need to understand the logic and uses of data mining, at the managerial level. The videos are therefore about problem solving, and hence the "Business Analytics" in the title. They are different from the many excellent machine learning videos and MOOCs in focus and in technical level -- a basic statistics course that covers linear regression and some business experience should be sufficient for understanding the videos.
For five years, until last week, the videos were available only to past and current students. However, word spread, and many colleagues, instructors, and students have asked me for access. Now, in celebration of the first R edition of our textbook Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, I decided to make it happen. All 30+ videos are now publicly available on my BADM YouTube playlist.


Currently the videos cater only to those who understand English. I have opened the option for community-contributed captions, in the hope that folks will contribute captions in different languages to help the knowledge propagate further.

This new playlist complements a similar set of videos, on "Business Analytics Using Forecasting" (for time series), that I created at NTHU and made public last year, as part of a MOOC offered on FutureLearn with the next round opening in October.

Finally, I’ll share that I shot these videos while I was living in Bhutan. They are all homemade -- I tried to filter out barking noises and to time the recording when ceremonies were not held close to our home. If you’re interested in how I made the materials and what lessons I learned for flipping my first course, check out my 2012 post.

Tuesday, March 14, 2017

Data mining algorithms: how many dummies?

There are lots of posts on "k-NN for Dummies". This one is about "Dummies for k-NN".

Categorical predictor variables are very common. Those who've taken a Statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps:

  1. Convert the categorical variable that has m categories into m binary dummy variables
  2. Include only m-1 of the dummy variables as predictors in the regression model (the dropped category is called the reference category)
For example, if we have X={red, yellow, green}, in step 1 we create three dummies:
D_red = 1 if the value is 'red' and 0 otherwise
D_yellow = 1 if the value is 'yellow' and 0 otherwise
D_green = 1 if the value is 'green' and 0 otherwise

In the regression model we might have: Y = b0 + b1 D_red + b2 D_yellow + error
[Note: mathematically, it does not matter which dummy you drop; the regression coefficients b1 and b2 are then interpreted relative to the left-out category.]

When you move to data mining algorithms such as k-NN or trees, the procedure is different: we include all m dummies as predictors when m>2, but in the case m=2, we use a single dummy. Dropping a dummy (when m>2) will distort the distance measure, leading to incorrect distances.
Here's an example, based on X = {red, yellow, green}:

Case 1: m=3 (use 3 dummies)

Here are 3 records, their category (color), and their dummy values on (D_red, D_yellow, D_green):

  #1  red     (1, 0, 0)
  #2  yellow  (0, 1, 0)
  #3  green   (0, 0, 1)

The distance between each pair of records (in terms of color) should be identical, since all three records differ from each other. Suppose we use squared Euclidean distance. The distance between each pair of records is then equal to 2. For example:

Distance(#1, #2) = (1-0)^2 + (0-1)^2 + (0-0)^2 = 2.

If we drop one dummy, then the three distances will no longer be identical! For example, if we drop D_green:
Distance(#1, #2) = 1 + 1 = 2
Distance(#1, #3) = 1
Distance(#2, #3) = 1
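
Here is a minimal Python sketch (my own illustration, not from the original post) that reproduces both sets of distances, with the full one-hot coding and with D_green dropped:

```python
from itertools import combinations

# Dummy codings for the three records (red, yellow, green)
records = {
    1: {"full": (1, 0, 0), "dropped": (1, 0)},   # red:    (D_red, D_yellow, D_green)
    2: {"full": (0, 1, 0), "dropped": (0, 1)},   # yellow
    3: {"full": (0, 0, 1), "dropped": (0, 0)},   # green; "dropped" omits D_green
}

def sq_euclidean(a, b):
    """Squared Euclidean distance between two dummy vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

for coding in ("full", "dropped"):
    dists = {(i, j): sq_euclidean(records[i][coding], records[j][coding])
             for i, j in combinations(records, 2)}
    print(coding, dists)
# full:    {(1, 2): 2, (1, 3): 2, (2, 3): 2}  -- all pairs equally far apart
# dropped: {(1, 2): 2, (1, 3): 1, (2, 3): 1}  -- distances distorted
```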

Case 2: m=2 (use single dummy)

The above problem doesn't happen with m=2. Suppose we have only {red, green}, and use a single dummy. The distance between a pair of records will be 0 if the records are the same color, or 1 if they are different.
Why not use 2 dummies? If we use two dummies, we are doubling the weight of this variable but not adding any information. For example, comparing the red and green records using D_red and D_green would give Distance(#1, #3) = 1 + 1 = 2.

So we end up with distances of 0 or 2 instead of distances of 0 or 1.

Bottom line 

In data mining methods other than regression models (e.g., k-NN, trees, k-means clustering), we use m dummies for a categorical variable with m categories - this is called one-hot encoding. But if m=2 we use a single dummy.
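
For readers who work in Python, a small hedged sketch of the two encodings (pandas assumed; the data and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "yellow", "green", "red"]})

# Regression models: m-1 dummies (one category becomes the reference category)
reg_X = pd.get_dummies(df["color"], prefix="D", drop_first=True)

# k-NN, trees, k-means: all m dummies (one-hot encoding)
knn_X = pd.get_dummies(df["color"], prefix="D", drop_first=False)

print(list(reg_X.columns))   # ['D_red', 'D_yellow']  (green is the reference)
print(list(knn_X.columns))   # ['D_green', 'D_red', 'D_yellow']
```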

Wednesday, August 19, 2015

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I've had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods.

[Image from http://blog.excelmasterseries.com]
Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we must convert it into m binary variables called dummy variables, D1, D2,..., Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.). Then, we include m-1 of the dummy variables in the regression model. The major point is to exclude one of the m dummy variables to avoid redundancy. The excluded dummy's category is called the "reference category". Mathematically, it does not matter which dummy you exclude, although the resulting coefficients will be interpreted relative to the reference category, so if interpretation is important it is useful to choose as the reference category the one we most want to compare against.

In linear and logistic regression models, including all m variables will lead to perfect multicollinearity, which will typically cause failure of the estimation algorithm. Smarter software will identify the problem and drop one of the dummies for you. That is why every statistics book or course on regression will emphasize the need to drop one of the dummy variables.

Now comes the surprising part: when using categorical predictors in machine learning algorithms such as k-nearest neighbors (kNN) or classification and regression trees, we keep all m dummy variables. The reason is that in such algorithms we do not create linear combinations of all predictors. A tree, for instance, will choose a subset of the predictors. If we leave out one dummy and that category differs from the other categories in terms of the output of interest, the tree will not be able to detect it! Similarly, dropping a dummy in kNN means that the effect of belonging to that category is not incorporated into the distance measure.

The only case where dummy variable inclusion is treated equally across methods is for a two-category predictor, such as Gender. In that case a single dummy variable will suffice in regression, kNN, CART, or any other data mining method.
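
As a hedged illustration (my own sketch using scikit-learn and made-up data, not code from the post), here are the two conventions side by side:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

df = pd.DataFrame({
    "country": ["Japan", "USA", "India", "Japan", "USA", "India"],
    "y":       [10.0, 12.0, 9.0, 11.0, 13.0, 8.5],
})

# Regression: m-1 dummies, so one country serves as the reference category
X_reg = pd.get_dummies(df["country"], drop_first=True)
lm = LinearRegression().fit(X_reg, df["y"])

# kNN: all m dummies, so membership in every country affects the distances
X_knn = pd.get_dummies(df["country"], drop_first=False)
knn = KNeighborsRegressor(n_neighbors=2).fit(X_knn, df["y"])
```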

Thursday, March 06, 2014

The use of dummy variables in predictive algorithms

Anyone who has taken a course in statistics that covers linear regression has heard some version of the rule regarding pre-processing categorical predictors with more than two categories and the need to factor them into binary dummy/indicator variables:
"If a variable has k levels, you can create only k-1 indicators. You have to choose one of the k categories as a "baseline" and leave out its indicator." (from Business Statistics by Sharpe, De Veaux & Velleman)
Technically, one can easily create k dummy variables for k categories in any software. The reason for not including all k dummies as predictors in a linear regression is to avoid perfect multicollinearity, where an exact linear relationship exists between the k predictors. Perfect multicollinearity causes computational and interpretation challenges (see slide #6). This k-dummies issue is also called the Dummy Variable Trap.

While these guidelines are required for linear regression, which other predictive models require them? The k-1 dummy rule applies to models where all the predictors are considered together, as a linear combination. Therefore, in addition to linear regression models, the rule would apply to logistic regression models, discriminant analysis, and in some cases to neural networks.

What happens if we use k-1 dummies in other predictive models? 
The choice of the dropped dummy variable does not affect the results of regression models, but it can affect other methods. For instance, let's consider a classification/regression tree. In a tree, predictors are evaluated one-by-one, so omitting one of the k dummies can result in an inferior predictive model. For example, suppose we have 12 monthly dummies and that in reality only January is different from the other months (the outcome differs between January and the other months). Now we run a tree omitting the January dummy as an input, keeping the other 11 monthly dummies. The only way the tree can discover the January effect is by splitting on each of the other 11 dummies, one after another. This is much less efficient than a single split on the January dummy.
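
A hedged simulation sketch (my own illustration, not from the post): with the January dummy available a single split captures the January effect, but once it is dropped no single split can.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
month = rng.integers(1, 13, size=5000)
y = np.where(month == 1, 10.0, 0.0) + rng.normal(0, 1, size=5000)  # only January differs

X = pd.get_dummies(pd.Series(month), prefix="M")        # M_1 ... M_12

with_jan = DecisionTreeRegressor(max_depth=1).fit(X, y)
without_jan = DecisionTreeRegressor(max_depth=1).fit(X.drop(columns="M_1"), y)

print(with_jan.score(X, y))                             # high R^2: one split on M_1 isolates January
print(without_jan.score(X.drop(columns="M_1"), y))      # near zero: no single split can isolate January
```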

This post is inspired by a discussion in the recent Predictive Analytics 1 online course. This topic deserves more than a short post, yet I haven't seen a thorough discussion anywhere.

Thursday, November 28, 2013

Running a data mining contest on Kaggle

Following last year's success, I've decided once again to introduce a data mining contest in my Business Analytics using Data Mining course at the Indian School of Business. Last year I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle, which offers free competition hosting for university instructors, called Kaggle InClass.

Setting up a competition on Kaggle is not trivial, and I'd like to share some tips that I discovered, to help colleagues who want to do the same. Even if you successfully hosted a Kaggle contest a while ago, some things have changed (as I've discovered). With some assistance from the Kaggle support team, who are extremely helpful, I was able to decipher the process. So here goes:

Step #1: get your dataset into the right structure. Your initial dataset should include input and output columns for all records (assuming that the goal is to predict the outcome from the inputs). It should also include an ID column with running index numbers.

  • Save this as an Excel or CSV file. 
  • Split the records into two datasets: a training set and a test set. 
  • Keep the training and test datasets in separate CSV files. For the test set, remove the outcome column(s).
  • Kaggle will split the test set into private and public subsets and will score each of them separately. Results for the public records will appear on the leaderboard; only you will see the results for the private subset. If you want to assign the records to public/private yourself, create a column Usage in the test dataset and type Private or Public for each record. (A short sketch of this file preparation appears after this list.)
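
Here is a minimal sketch of the file preparation described in Step #1 (pandas assumed; the file names and the "ID"/"outcome" column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("full_dataset.csv")        # ID column, input columns, and an "outcome" column
df = df.sample(frac=1, random_state=1)      # shuffle before splitting

n_train = int(0.6 * len(df))
train, test = df.iloc[:n_train], df.iloc[n_train:]

train.to_csv("train.csv", index=False)                           # ID + inputs + outcome
test.drop(columns="outcome").to_csv("test.csv", index=False)     # ID + inputs only

# Solution file: IDs and true outcomes of the test records, plus an optional
# Usage column assigning each record to the Public or Private leaderboard
solution = test[["ID", "outcome"]].copy()
solution["Usage"] = ["Public" if i % 2 == 0 else "Private" for i in range(len(solution))]
solution.to_csv("solution.csv", index=False)
```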

Step #2: open a Kaggle InClass account and start a competition using the wizard. Filling in the Basic Details and Entry & Rules pages is straightforward.

Step #3: The tricky page is Your Data. Here you'll need to follow this sequence to get it working:

  1. Choose the evaluation metric to be used in the competition. Kaggle has a bunch of different metrics to choose from. In my two Kaggle contests, I actually wanted a metric that was not on the list, and voila! the support team was able to help by activating a metric that was not generally available for my competition. Last year I used a lift-type measure; this year it is an average-cost-per-observation metric for a binary classification task (a sketch of one such cost-based metric appears after this list). In short, if you don't find exactly what you're looking for, it is worth asking the folks at Kaggle.
  2. After the evaluation metric is set, upload a solution file (CSV format). This file should include only an ID column (with the IDs for all the records that participants should score), and the outcome column(s). If you include any other columns, you'll get error messages. The first row of your file should include the names of these columns.
  3. After you've uploaded a solution file, you'll be able to see whether the upload was successful. Aside from error messages, you can see your uploaded files: scroll to the bottom and you'll see the file that you've submitted (or, if you submitted multiple times, all the submitted files). If you selected a random public/private partition, the "derived solution" file will include an extra column with the labels "public" and "private". It's a good idea to download this file, so that you can later compare your results with the system's.
  4. After the solution file has been successfully uploaded and its columns mapped, you must upload a "sample submission file". This file is used to map the columns in the solution file to what needs to be measured by Kaggle. The file should include an ID column like that in the solution file, plus a column with the predictions. Nothing more, nothing less. Again, the first row should include the column names. You'll have an option to define rules about allowed values for these columns.
  5. After successfully submitting the sample submission file, you will be able to test the system by submitting (mock) solutions in the "submission playground". One good test is using the naive rule (in a classification task, submit all 0s or all 1s). Compare your result to the one on Kaggle to make sure everything is set up properly.
  6. Finally, in the "Additional data files" you upload the two data files: the training dataset (which includes the ID, input and output columns) and the test dataset (which includes the ID and input columns). It is also useful to upload a third file, which contains a sample valid submission. This will help participants see what their file should look like, and they can also try submitting this file to see how the system works. You can use the naive-rule submission file that you created earlier to test the system.
  7. That's it! The rest (Documentation, Preview and Overview) are quite straightforward. After you're done, you'll see a button "submit for review". You can also share the contest with another colleague prior to releasing it. Look for "Share this competition wizard with a coworker" on the Basic Details page.
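
As a hedged aside (this is not Kaggle's implementation, and the cost values are made up), an average-cost-per-observation metric for binary classification can be as simple as:

```python
import numpy as np

def average_cost(y_true, y_pred, cost_fp=1.0, cost_fn=5.0):
    """Average misclassification cost per observation (correct predictions cost 0)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    cost = (np.where((y_pred == 1) & (y_true == 0), cost_fp, 0.0)
            + np.where((y_pred == 0) & (y_true == 1), cost_fn, 0.0))
    return float(cost.mean())

# Naive-rule check (see item 5): score an all-0s submission against the solution file
y_true = [0, 0, 1, 0, 1, 1, 0]
print(average_cost(y_true, [0] * len(y_true)))   # 5 * (3/7) = 2.14
```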
If I've missed tips or tricks that others have used, please do share. My current competition, "predicting cab booking cancellation" (using real data from YourCabs in Bangalore) has just started, and it will be open not only to our students, but to the world. 
Submission deadline: Midnight Dec 22, 2013, India Standard Time. All welcome!

Tuesday, January 15, 2013

What does "business analytics" mean in academia?

But what exactly does this mean?
At the recent ISIS conference, I organized and moderated a panel called "Business Analytics and Big Data: How it affects Business School Research and Teaching". The goal was to tackle the ambiguity in the terms "Business Analytics" and "Big Data" in the context of business school research and teaching. I opened with a few points:

  1. Some research b-schools are posting job ads for tenure-track faculty in "Business Analytics" (e.g., University of Maryland; Google "professor business analytics position" for plenty more). What does this mean? What is supposed to be the background of these candidates, and where are they supposed to publish to get promoted? ("The Journal of Business Analytics"?)
  2. A recent special issue of the top scholarly journal Management Science was devoted to "Business Analytics". What types of submissions fall under this label? What types do not?
  3. Many new "Business Analytics" programs have been springing up in business schools worldwide. What is new about their offerings? 

Panelists Anitesh, Ram and David - photo courtesy of Ravi Bapna
The panelists were a mix of academics (Prof Anitesh Barua from UT Austin and Prof Ram Chellapah from Emory University) and industry (Dr. David Hardoon, SAS Singapore). The audience was also a mixed crowd of academics, mostly from MIS departments (in business schools), and industry experts from companies such as IBM and Deloitte.

The discussion took various twists and turns with heavy audience discussion. Here are several issues that emerged from the discussion:

  • Is there any meaning to BA in academia or is it just the use of analytics (=data tools) within a business context? Some industry folks said that BA is only meaningful within a business context, not research-wise.
  • Is BA just a fancier name for statisticians in a business school or does it convey a different type of statistician? (similar to the adoption of "operations management" (OM) by many operations research (OR) academics)
  • The academics on the panel made the point that BA has been changing the flavor of research in terms of adding a discovery/exploratory dimension that does not typically exist in social science and IS research. Rather than only theorize-then-test-with-data, data are now explored in further detail using tools such as visualization and micro-level models. The main concern, however, was that it is still very difficult to publish such research in top journals.
  • With respect to "what constitutes a BA research article", Prof. Ravi Bapna said "it's difficult to specify what papers are BA, but it is easy to spot what is not BA".
  • While machine learning and data mining have been around for quite some time, and the methods have not really changed, their application within a business context has become more popular due to friendlier software and stronger computing power. These new practices are therefore now an important core in MBA and other business programs. 
  • One type of b-school program that seems to lag behind on the BA front is the PhD program. Are we equipping our PhD students with abilities to deal with and take advantage of large datasets for developing theory? Are PhD programs revising their curriculum to include big data technologies and machine learning capabilities as required core courses?
Some participants claimed that BA is just another buzzword that will go away after some time. So we need not worry about defining it or demystifying it. After all, the software vendors coin such terms, create a buzz, and finally the buzz moves on. Whether this is the case with BA or with Big Data remains to be seen. In the meanwhile, we should ponder whether we are really doing something new in our research, and if so, pinpoint what exactly it is and how to formulate it as requirements for a new era of researchers.

Thursday, October 04, 2012

Flipping and virtualizing learning

Adopting new technology for teaching has been one of my passions, and luckily my students have been understanding even during glitches or choices that turn out to be ineffective (such as the mobile/Internet voting technology that I wrote about last year). My goal has been to use technology to make my courses more interactive: I use clickers for in-class polling (to start discussions and assess understanding, not for grading!); last year, after realizing that my students were constantly on Facebook, I finally opened a Facebook account and ran a closed FB group for out-of-class discussions; in my online courses on statistics.com I created interactive lessons (slides with media, quizzes, etc.) using Udutu.com. On the pedagogical side, I have tried to focus on hands-on learning: team projects took the place of exams, complemented by in-class presentations and homework that gets your hands dirty.

But all these were just baby steps, preparing me for the big leap. In the last month, I have been immersed in a complete transformation of one of my on-ground courses: The new approach is a combination of a new technology and a recent pedagogical movement. The pedagogical side is called 'flipping the classroom', where class time is not spent on one-directional lecturing but rather on discussions and other interactive activities. The technological leap is the move towards a Massive Open Online Course (MOOC) – but in my case a "moderate open online course". As a first step, the course will be open only to the community of the Indian School of Business (students, alumni, faculty and staff). The long term plan is to open it up globally.

The course Business Analytics using Data Mining is opening in less than two weeks. I've been working round-the-clock creating content for the online and on-ground components, figuring out the right technologies that can support all the requirements, and collaborating with colleagues at CrowdANALYTIX and at Hansa Cequity to integrate large local datasets and a platform for running data mining contests into the course.

Here are the ingredients that I found essential:
  • You need strong support from the university! Luckily, ISB is a place that embraces innovation and is willing to evaluate cutting-edge teaching approaches.
  • A platform that is easy for a (somewhat tech-savvy) instructor to design, to upload materials to, to update, to interact with participants on, and in general, to run. If you are a control freak like me, the last thing you want is to have to ask someone else to upload, edit, or change things. After researching many possibilities, I decided to use the Google platform. Not the new Google Course Builder platform (who has time for programming in JavaScript?), but rather a unique combination of Google Sites, Google Drive, Google Groups, YouTube embedding, etc. The key is Google Sites, which is an incredibly versatile tool (and free! thanks Google!). Another advantage of Google Sites is that you have the solid backbone of Google behind you. If your university uses Google Apps for Education, all the better (we hope to move there soon...)
  • It is definitely worthwhile to invest in good video editing software. This was a painful experience: after starting with one program that caused grief, I switched to Camtasia Studio and very quickly purchased a license. It is an incredibly powerful yet simple-to-use tool for recording video+screen+audio and then editing (cutting out coughs, for instance)
  • Hardware for lecture videos: Use a good webcam that also has a good mic. I learned that audio quality is the biggest reason for people to stop watching a video. Getting the Thimphu street dogs to stop barking is always a challenge. If you're in a power-outage prone area, make sure to get a back-up battery (UPS).
  • Have several people go over the course platform to make sure that all the links work, the videos stream, etc. Also, get someone to assist with participants' technical queries. There are always those who need hand-holding.
The way the course will work at ISB is that the ISB community can join the online component (lecture videos, guided reading, online forum, contests). Registered students will also attend on-ground meetings that will focus on discussions, project-based learning, and other interactive activities. 

We opened registration to the community today and there are already more than 200 registrants. I guess everyone is curious! Whether the transformation will be a huge success or will die out quietly is yet to be seen. But for sure, there will be insights and learning for all of us.


Tuesday, August 07, 2012

The mad rush: Masters in Analytics programs

The recent trend among mainstream business schools is opening a graduate program or a concentration in Business Analytics (BA). Googling "MS Business Analytics" reveals lots of big players (among many others) offering such programs.

These programs are intended (aside from making money) to bridge the knowledge gap between the "data or IT team" and the business experts. Graduates should be able to lead analytics teams in companies, identifying opportunities where analytics can add value, understanding pitfalls, being able to figure out the needed human and technical resources, and most importantly -- communicating analytics with top management. Unlike "marketing analytics" or other domain-specific programs, Business Analytics programs are "tools" oriented.

As a professor of statistics, I feel a combination of excitement and pain. The word Analytics is clearly more attractive than Statistics. But it is also broader in two senses. First, it combines methods and tools from a wider set of disciplines: statistics, operations research, artificial intelligence, computer science. Second, although technical skills are required to some degree, the focus is on the big picture and how the tools fit into the business process. In other words, it's about Business Analytics.

I am excited about the trend of BA programs because they finally force disciplines such as statistics to consider the big picture and to fit in, both in terms of research and teaching. Research is clearly better guided by real problems. The top research journals are beginning to catch up: Management Science has an upcoming special issue on Business Analytics. As for teaching, it is exciting to teach students who are thirsty for analytics. The challenge is for instructors with PhDs in statistics, operations, computer science or other disciplines to repackage the technical knowledge into a communicable, interesting and useful curriculum. Formulas or algorithms, as beautiful as they might appear to us, are only tolerated when their beauty is clearly translated into meaningful and useful knowledge. Considering the business context requires a good deal of attention and often means modifying our own modus operandi (we've all been brainwashed by our research discipline).

But then, there's the painful part of the missed opportunity for statisticians to participate as major players (or is it envy?). The statistics community seems to be going through this cycle of "hey, how did we get left behind?". This happened with data mining, and is now happening with data analytics. The great majority of Statistics programs continuously fail to be the leaders of the non-statistics world. Examining the current BA trend, I see that

  1. Statisticians are typically not the leaders of these programs. 
  2. Business schools that lack statistics faculty (and that's typical) either hire non-research statisticians as adjunct faculty to teach statistics and data mining courses, or have these courses taught by faculty from other areas such as information systems and operations.
  3. "Data Analytics" or "Analytics" degrees are still not offered by mainstream Statistics departments. For example, North Carolina State U has an Institute for Advanced Analytics that offers an MS in Analytics degree, yet this does not appear to be linked to the Statistics Department's programs. Carnegie Mellon's Heinz College offers a Master's degree with a concentration in BI and BA, yet the Statistics department offers a Master's in Statistical Practice.
My greatest hope is that a new type of "analytics" research faculty member evolves. The new breed, while having deep knowledge in one field, will also possess more diverse knowledge of, and openness to, other analytics fields (statistical modeling, data mining, operations research methods, computing, human-computer visualization principles). At the same time, for analytics research to flourish, the new-breed academic must have a foot in a particular domain, any domain, be it in the social sciences, humanities, engineering, life sciences, or other. I can only imagine the exciting collaboration among such groups of academics, as well as the value that they would bring to research, teaching and knowledge dissemination to other fields.

Tuesday, April 17, 2012

Google Scholar -- you're not alone; Microsoft Academic Search coming up in searches

In searching for a few colleagues' webpages I noticed a new URL popping up in the search results. It either included the prefix academic.microsoft.com or the IP address 65.54.113.26. I got curious and checked it out to discover Microsoft Academic Search (Beta) -- a neat presentation of the author's research publications and collaborations. In addition to the usual list of publications, there are nice visualizations of publications and citations over time, a network chart of co-authors and citations, and even an Erdos Number graph. The genealogy graph claims that it is based on data mining so "might not be perfect".



All this is cool and helpful. But there is one issue that really bothers me: who owns my academic profile?


I checked my "own" Microsoft Academic Search page. Microsoft's software tried to guess my details (affiliation, homepage, papers, etc.) and was correct on some details but wrong on others. Correcting the details required me to open a Windows Live ID account. I had avoided opening such an account until now (I am not a fan of endless accounts) and would have continued to avoid it, had I not been forced to do so: Microsoft created an academic profile page for me, without my consent, with wrong details. Guessing that this page will soon come up in user searches, I was compelled to correct the inaccurate details.

The next step was even more disturbing: once I logged in with my verified Windows Live ID, I tried to correct my affiliation and homepage and added a photo. However, I received the message that the affiliation (Indian School of Business) is not recognized (!) and that Microsoft will have to review all my edits before applying them.

So who "owns" my academic identity? Since obviously Microsoft is crawling university websites to create these pages, it would have been more appropriate to find the authors' academic email addresses and email them directly to notify them of the page (with an "opt out" option!) and allow them to make any corrections without Microsoft's moderation.

Sunday, March 11, 2012

Big Data: The Big Bad Wolf?

"Big Data" is a big buzzword. I bet that sentiment analysis of news coverage, blog posts and other social media sources would show a strong positive sentiment associated with Big Data. What exactly is big data depends on who you ask. Some people talk about lots of measurements (what I call "fat data"), others of huge numbers of records ("long data"), and some talk of both. How much is big? Again, depends who you ask.

As a statistician who has (luckily) strayed into data mining, I initially had the traditional knee-jerk reaction of "just get a good sample and get it over with", but later recognized that "fitting the data to the toolkit" (or, "to a hammer everything looks like a nail") straitjackets some great opportunities.

The LinkedIn group Advanced Business Analytics, Data Mining and Predictive Modeling reacted passionately to the question "What is the value of Big Data research vs. good samples?" posted by statistician and analytics veteran Michael Mout. Respondents have been mainly from industry - statisticians and data miners. I'd say that a sentiment analysis would come out mixed, but slightly negative at first ("at some level, big data is not necessarily a good thing"; "as statisticians, we need to point out the disadvantages of Big Data"). Over time, sentiment appears to be more positive, but not reaching anywhere close to the huge Big Data excitement in the media.

I created a Wordle of the text in the discussion until today (size represents frequency). It highlights the main advantages and concerns of Big Data. Let me elaborate:
  • Big data permit the detection of complex patterns (small effects, high order interactions, polynomials, inclusion of many features) that are invisible with small data sets
  • Big data allow studying rare phenomena, where a small percentage of records contain an event of interest (fraud, security)
  • Sampling is still highly useful with big data (see also blog post by Meta Brown); with the ability to take lots of smaller samples, we can evaluate model stability, validity and predictive performance
  • Statistical significance and p-values become meaningless when statistical models are fitted to very large samples; it is then practical significance that plays the key role (see the sketch after this list).
  • Big data support the use of algorithmic data mining methods that are good at feature selection. Of course, it is still necessary to use domain knowledge to avoid "garbage-in-garbage-out"
  • Such algorithms might be black boxes that do not help us understand the underlying relationship, but they are useful in practice for predicting new records accurately
  • Big data allow the use of many non-parametric methods (statistical and data mining algorithms) that make far fewer assumptions about the data (such as independence of observations)
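
To make the point about p-values concrete, here is a hedged sketch (my own illustration, not from the discussion): a practically negligible difference in means becomes "statistically significant" once the sample is huge.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
tiny_effect = 0.01            # a difference in means with no practical significance

for n in (100, 10_000, 10_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(tiny_effect, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    print(f"n = {n:>10,}   p-value = {p:.4g}")
# With n in the hundreds or thousands the difference looks "insignificant";
# with millions of records the p-value collapses toward zero anyway.
```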
Thanks to social media, we're able to tap into many brains that have experience, expertise and... some preconceptions. The data collected from such forums can help us researchers focus our efforts on the needed theoretical investigation of Big Data, and help move from sentiments to theoretically-backed and practically-useful knowledge.

Wednesday, March 07, 2012

Forecasting + Analytics = ?

Quantitative forecasting is an age-old discipline, highly useful across different functions of an organization: from  forecasting sales and workforce demand to economic forecasting and inventory planning.

Business schools have offered courses with titles such as "Time Series Forecasting", "Forecasting Time Series Data", and "Business Forecasting", more specialized courses such as "Demand Planning and Sales Forecasting", and even graduate programs with the title "Business and Economic Forecasting". Simple "Forecasting" is also popular. Such courses are offered at the undergraduate, graduate, and even executive-education levels. All these might convey the importance and usefulness of forecasting, but they are far from conveying the coolness of forecasting.

I've been struggling to find a better term for the courses that I teach on-ground and online, as well as for my recent book (with the boring name Practical Time Series Forecasting). The name needed to convey that we're talking about forecasting, particularly about quantitative data-driven forecasting, plus the coolness factor. Today I discovered it! Prof Refik Soyer from GWU's School of Business will be offering a course called "Forecasting for Analytics". A quick Google search did not find any results with this particular phrase -- so the credit goes directly to Refik. I also like "Forecasting Analytics", which links it to its close cousins "Predictive Analytics" and "Visual Analytics", all members of the Business Analytics family.


Monday, February 20, 2012

Explain or predict: simulation

Some time ago, when I presented the "explain or predict" work, my colleague Avi Gal asked where simulation falls. Simulation is a key method in operations research, as well as in statistics. A related question arose in my mind when thinking of Scott Nestler's distinction between descriptive/predictive/prescriptive analytics. Scott defines prescriptive analytics as "what should happen in the future? (optimization, simulation)".

So where does simulation fall? Does it fall in a completely different goal category, or can it be part of the explain/predict/describe framework?

My opinion is that simulation, like other data analytics techniques, does not define a goal in itself but is rather a tool for achieving one of the explain/predict/describe goals. When the purpose is to test causal hypotheses, simulation can be used to study what would happen if the causal effect were true, by simulating data under the "causally-true" hypothesis and comparing it to data from "causally-false" scenarios. In predictive and forecasting tasks, where the purpose is to predict new or future data, simulation can be used to generate predictions. It can also be used to evaluate the robustness of predictions under different scenarios (that would have been very useful in recent years' economic forecasts!). In descriptive tasks, where the purpose is to approximate data and quantify relationships, simulation can be used to check the sensitivity of the quantified effects to various model assumptions.
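
As a small hedged sketch of the first use (my own illustration, not from the original discussion), one can simulate data under "causally-true" and "causally-false" scenarios and compare what an analysis would estimate in each:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulated_slopes(true_effect, n=500, reps=1000):
    """Distribution of the estimated slope when the true causal effect is `true_effect`."""
    slopes = []
    for _ in range(reps):
        x = rng.normal(size=n)
        y = true_effect * x + rng.normal(size=n)
        slopes.append(np.polyfit(x, y, 1)[0])       # OLS slope estimate
    return np.array(slopes)

causally_true = simulated_slopes(true_effect=0.3)   # hypothesized effect is real
causally_false = simulated_slopes(true_effect=0.0)  # no causal effect
print(causally_true.mean().round(3), causally_false.mean().round(3))   # ~0.3 vs ~0.0
```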

On a related note, Scott challenged me on a post from two years ago where I stated that the term data mining used by operations research (OR) does not really mean data mining. I still hold that view, although I believe that the terminology has now changed: INFORMS now uses the term Analytics in place of data mining. This term is indeed a much better choice, as it is an umbrella term covering a variety of data analytics methods, including data mining, statistical models and OR methods. David Hardoon, Principal Analytics at SAS Singapore, has shown me several terrific applications that combine methods from these different toolkits. As in many cases, combining methods from different disciplines is often the best way to add value.

Tuesday, December 20, 2011

Trading and predictive analytics

I attended today's class in the course Trading Strategies and Systems offered by Prof Vasant Dhar from NYU Stern School of Business. Luckily, Vasant is offering the elective course here at the Indian School of Business, so no need for transatlantic travel.

The topic of this class was the use of news in trading. I won't disclose any trade secrets (you'll have to attend the class for that), but here's my point: Trading is a striking example of the distinction between explanation and prediction. Generally, techniques are based on correlations and on "blackbox" predictive models such as neural nets. In particular, text mining and sentiment analysis are used for extracting information from (often unstructured) news articles for the purpose of prediction.

Vasant mentioned the practical advantage of a machine-learning approach for extracting useful content from text over linguistics know-how. This reminded me of a famous comment by Frederick Jelinek, a prominent Natural Language Processing researcher who passed away recently:
"Whenever I fire a linguist our system performance improves" (Jelinek, 1998)
This comment was based on Jelinek's experience at IBM Research, while working on computer speech recognition and machine translation.

Jelinek's comment did not make linguists happy. He later defended this claim in a paper entitled "Some of My Best Friends are Linguists" by commenting,
"We all hoped that linguists would provide us with needed help. We were never reluctant to include linguistic knowledge or intuition into our systems; if we didn't succeed it was because we didn't find an efficient way to include it."
Note: there are some disputes regarding the exact wording of the quote ("Anytime a linguist leaves the group the recognition rate goes up") and its timing -- see note #1 in the Wikipedia entry.

Saturday, October 01, 2011

Language and psychological state: explain or predict?

Quite a few of my social science colleagues think that predictive modeling is not a kosher tool for theory building. In our 2011 MISQ paper "Predictive Analytics in Information Systems Research" we argue that predictive modeling has a critical role to play not only in theory testing but also in theory building. How does it work? Here's an interesting example:

The new book The Secret Life of Pronouns by the cognitive psychologist Pennebaker is a fascinating read in many ways. The book describes how analysis of written language can be predictive of psychological state. In particular, the author describes an interesting text mining approach that analyzes text written by a person and creates a psychological profile of the writer. In the author's context, the approach is used to study the effect of writing on recovery from psychological trauma. You can get a taste of word analysis on the AnalyzeWords.com website, run by the author and his colleagues, which analyzes the personality of a tweeter.

In the book, Pennebaker describes how the automated analysis of language has shed light on the probability that people who underwent psychological trauma will recuperate. For instance, people who used a moderate amount of negative language were more likely to improve than those who used too little or too much negative language. Or, people who tended to change perspectives in their writing over time (from "I" to "they" or "we") were more likely to improve.

Now comes a key question. In the words of the author (p.14): "Do words reflect a psychological state or do they cause it?". The statistical/data-mining text mining application is obviously a predictive tool that is built on correlations/associations. Yet, by examining when it predicts accurately and studying the reasons for the accurate (or inaccurate) predictions, the predictive tool can shed insightful light on possible explanations, linking results to existing psychological theories and giving ideas for new ones. Then comes "closing the circle", where the predictive modeling is combined with explanatory modeling. For testing the explanatory power of words on psychological state, the way to go is experiments. And indeed, the book describes several such experiments investigating the causal effect of words on psychological state, which seem to indicate that there is no causal relationship.

[Thanks to my text-mining-expert colleague Nitin Indurkhya for introducing me to the book!]

Thursday, September 15, 2011

Mining health-related data: How to benefit scientific research

Image from KDnuggets.com
While debates over privacy issues related to electronic health records are still ongoing, predictive analytics is beginning to be used with administrative health data (available to health insurance companies, aka "health provider networks"). One such venue is large data mining contests. Let me describe a few and then get to my point about their contribution to public health, medicine and data mining research.

The latest and grandest is the ongoing $3 million prize contest by Heritage Provider Network, which opened in 2010 and lasts 2 years. The contest's stated goal is to create "an algorithm that predicts how many days a patient will spend in a hospital in the next year". Participants get a dataset of de-identified medical records of 100,000 individuals, on which they can train their algorithms. The article on KDnuggets.com suggests that this competition's goal is "to spur development of new approaches in the analysis of health data and create new predictive algorithms."

The 2010 SAS Data Mining Shootout contest was also health-related. Unfortunately, the contest webpage is no longer available (the problem description and data were previously available here), and I couldn't find any information on the winning strategies. From an article on KDnuggets:

"analyzing the medical, demographic, and behavioral data of 50,788 individuals, some of whom had diabetes. The task was to determine the economic benefit of reducing the Body Mass Indices (BMIs) of a selected number of individuals by 10% and to determine the cost savings that would accrue to the Federal Government's Medicare and Medicaid programs, as well as to the economy as a whole"
In 2009, the INFORMS data mining contest was co-organized by IBM Research and Health Care Intelligence, and focused on "health care quality". Strangely enough, this contest website is also gone. A brief description by the organizer (Claudia Perlich) is given on KDnuggets.com, stating the two goals:
  1. modeling of a patient transfer guideline for patients with a severe medical condition, from a community hospital setting to a tertiary hospital provider, and
  2. assessment of the severity/risk of death of a patient's condition.
What about presentations/reports from the winners? I had a hard time finding any (here is a deck of slides by a group competing in the 2011 SAS Shootout, also health-related). But photos of winners holding awards and checks abound.

If these health-related data mining competitions are to promote research and solutions in these fields, the contest webpages with the problem descriptions and data, as well as presentations/reports by the winners, should remain publicly available (as they are for the annual KDD Cup competitions run by the ACM). Posting only names and photos of the winners makes data mining competitions look more like a consulting job, where the data provider is interested in solving one particular problem for its own (financial or other) benefit. There is definitely scope for a data mining group/organization to collect all this information while it is live and post it on one central website. 

Tuesday, September 06, 2011

"Predict" or "Forecast"?

What is the difference between "prediction" and "forecasting"? I heard this being asked quite a few times lately. The Predictive Analytics World conference website has a Predictive Analytics Guide page with the following Q&A:

How is predictive analytics different from forecasting?
Predictive analytics is something else entirely, going beyond standard forecasting by producing a predictive score for each customer or other organizational element. In contrast, forecasting provides overall aggregate estimates, such as the total number of purchases next quarter. For example, forecasting might estimate the total number of ice cream cones to be purchased in a certain region, while predictive analytics tells you which individual customers are likely to buy an ice cream cone.
In a recent interview on "Data Analytics", Prof Ram Gopal asked me a similar question. I have a slightly different view of the difference: the term "forecasting" is used when we have a time series and we predict the series into the future. Hence "business forecasts" and "weather forecasts". In contrast, "prediction" is the act of predicting in a cross-sectional setting, where the data are a snapshot in time (say, a one-time sample from a customer database). Here you use information on a sample of records to predict the value of other records (which can be a value that will be observed in the future). That's my personal distinction.



While forecasting has traditionally focused on providing "overall aggregate estimates", that has long changed, and methods of forecasting are commonly used to provide individual estimates. Think again of weather forecasts -- you can get forecasts for very specific areas. Moreover, daily (and even minute-by-minute) weather forecasts are generated for many different geographical areas. Another example is SKU-level forecasting for inventory management purposes. Stores and large companies often use forecasting to predict demand for every product they carry. These are not aggregate values, but individual-product forecasts.

"Old fashioned" forecasting has indeed been around for a long time, and has been taught in statistics and operations research programs and courses. While some forecasting models require a lot of statistical expertise (such as ARIMA, GARCH and other acronyms), there is a terrific and powerful set of data-driven, computationally fast, automated methods that can be used for forecasting even at the individual product/service level. Forecasting, in my eyes, is definitely part of predictive analytics.

Wednesday, August 17, 2011

Where computer science and business meet

Data mining is taught very differently at engineering schools and at business schools. At engineering schools, data mining is taught more technically, deciphering how different algorithms work. In business schools the focus is on how to use algorithms in a business context.

Business students with a computer science background can now enjoy both worlds: take a data mining course with a business focus, and supplement it with the free course materials from Stanford Engineering school's Machine Learning course (including videos of lectures and handouts by Prof Andrew Ng). There are a bunch of other courses with free materials as part of the Stanford Engineering Everywhere program.

Similarly, computer science students with a business background can take advantage of MIT's Sloan School of Management Open Courseware program, and in particular their Data Mining course (last offered in 2003 by Prof Nitin Patel). Unfortunately, there are no lecture videos, but you do have access to handouts.

And for instructors in either world, these are great resources!


Thursday, August 04, 2011

The potential of being good

Yesterday I happened to hear talks by two excellent speakers, both on major data mining applications in industry. One common theme was that both speakers gave compelling and easy to grasp examples of what data mining algorithms and statistics can do beyond human intelligence, and how the two relate.

The first talk, by IBM Global Services' Christer Johnson, was given at the 2011 INFORMS Conference on Business Analytics and Operations Research (see video). Christer Johnson described the idea behind Watson, the artificial intelligence computer system developed by IBM that beat two champions of the Jeopardy! quiz show. Two main points in the talk about the relationship between humans and data mining methods that I especially liked are:
  1. Data analytics methods are designed not only to give an answer, but also to evaluate how confident they are about that answer. In answering the Jeopardy! questions, the data mining approach tells you not only the most likely answer, but also how confident it is about that answer.
  2. Building trust in an analytics tool occurs when you see it make mistakes and learn from those mistakes.
The second talk, "The Art and Science of Matching Items to Users", was given by Deepak Agarwal, a Yahoo! principal research scientist and fellow statistician, and was webcast in ISB's seminar series. You can still catch it on Aug 10 at Yahoo!'s Big Thinker Series in Bangalore. The talk was about recommender systems and their use within Yahoo!. Among the various approaches used by Yahoo! to improve recommendations, Deepak described a main idea for improving the customization of news item displays on news.yahoo.com.

On the relation between human intelligence and automation, choosing which items to display on Yahoo! is a two-step process: first, human editors create a pool of potentially interesting news items, and then automated machine-learning algorithms choose which individual items to display from that pool.

Like Christer Johnson's point #1, Deepak illustrated the difference between "the answer" (what we statisticians call a point estimate) and "the potential of it being good" (what we call the confidence in the estimate, AKA variability) in a very cool way: Consider two news items, of which one will be displayed to a user. The first item was already shown to 100 users, and 2 users clicked on links from that page. The second was shown to 10,000 users, and 250 users clicked on links. Which news item should you show to maximize clicks? (yes, this is about ad revenues...) Although the first item has a lower click-through rate (2%), that rate is also less certain, in the sense that it is based on less data than item 2's. Hence, it is potentially good. He then took this one step further: Combine the two! "Exploit what is known to be good, explore what is potentially good".
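
A hedged sketch (my own illustration, not Yahoo!'s method) of "potentially good": the item with the lower observed click-through rate can still have the higher optimistic estimate, because it was shown to far fewer users.

```python
from scipy import stats

items = {"item 1": (2, 100), "item 2": (250, 10_000)}   # (clicks, impressions)

for name, (clicks, shows) in items.items():
    ctr = clicks / shows
    # Posterior for the click probability under a Jeffreys Beta(0.5, 0.5) prior
    upper = stats.beta(0.5 + clicks, 0.5 + shows - clicks).ppf(0.95)
    print(f"{name}: observed CTR = {ctr:.3f}, optimistic 95th percentile = {upper:.3f}")
# item 1: CTR 0.020, upper bound roughly 0.05 -- uncertain, hence "potentially good"
# item 2: CTR 0.025, upper bound roughly 0.028 -- known to be good
# "Exploit what is known to be good, explore what is potentially good."
```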

So what do we have here? Very practical and clear examples of why we care about variance, the weakness of point estimates, and expanding the notion of diversification to combining certain good results with uncertain not-that-good results.

Wednesday, July 27, 2011

Analytics: You want to be in Asia

Business Intelligence and Data Mining have become hot buzzwords in the West. Using Google Insights for Search to "see what the world is searching for" (see image below), we can see that the popularity of these two terms seems to have stabilized (if you expand the search to 2007 or earlier, you will see the earlier peak and also that Data Mining was hotter for a while). Click on the image to get to the actual result, with which you can interact directly. There are two very interesting insights from this search result:
  1. Looking at the "Regional Interest" for these terms, we see that the #1 country searching for these terms is India! Hong Kong and Singapore are also in the top 5. A surge of interest in Asia!
  2. Adding two similar terms that have the term Analytics, namely Business Analytics and Data Analytics, unveils a growing interest in Analytics (whereas the two non-analytics terms have stabilized after their peak).
What to make of this? First, it means Analytics is hot. Business Analytics and Data Analytics encompass methods for analyzing data that add value to a business or any other organization. Analytics includes a wide range of data analysis methods, from visual analytics to descriptive and explanatory modeling, and predictive analytics. From statistical modeling, to interactive visualization (like the one shown here!), to machine-learning algorithms and more. Companies and organizations are hungry for methods that can turn their huge and growing amounts of data into actionable knowledge. And the hunger is most pressing in Asia.
Click on the image to refresh the Google Insights for Search result (in a new window)

Monday, June 20, 2011

Got Data?!

The American Statistical Association's store used to sell cool T-shirts with the old-time beggar-statistician question "Got Data?" Today it is much easier to find data, thanks to the Internet. Dozens of student teams taking my data mining course have been able to find data from various sources on the Internet for their team projects. Yet, I often receive queries from colleagues in search of data for their students' projects. This is especially true for short courses, where students don't have sufficient time to search and gather data (which is highly educational in itself!).

One solution that I often offer is data from data mining competitions. KDD Cup is a classic, but there are lots of other data mining competitions that make huge amounts of real or realistic data available: past INFORMS Data Mining Contests (2008, 2009, 2010), ENBIS Challenges, and more. Here's one new competition to add to the list:

The European Network for Business and Industrial Statistics (ENBIS) announced the 2011 Challenge (in collaboration with SAS JMP). The title is "Maximising Click Through Rates on Banner Adverts: Predictive Modeling in the On Line World". It's a bit complicated to find the full problem description and data on the ENBIS website (you'll find yourself clicking through endless "more" buttons - hopefully these are not data collected for the challenge!), so I linked them up.

It's time for T-shirts saying "Got Data! Want Knowledge?"