Thursday, October 04, 2012

Flipping and virtualizing learning

Adopting new technology for teaching has been one of my passions, and luckily my students have been understanding even during glitches or choices that turned out to be ineffective (such as the mobile/Internet voting technology that I wrote about last year). My goal has been to use technology to make my courses more interactive: I use clickers for in-class polling (to start discussions and assess understanding, not for grading!); last year, after realizing that my students were constantly on Facebook, I finally opened a Facebook account and ran a closed FB group for out-of-class discussions; and in my online courses on statistics.com I created interactive lessons (slides with media, quizzes, etc.) using Udutu.com. On the pedagogical side, I have tried to focus on hands-on learning: team projects have replaced exams, alongside in-class presentations and homework that gets your hands dirty.

But all these were just baby steps, preparing me for the big leap. In the last month, I have been immersed in a complete transformation of one of my on-ground courses. The new approach is a combination of a new technology and a recent pedagogical movement. The pedagogical side is called 'flipping the classroom': class time is not spent on one-directional lecturing but rather on discussions and other interactive activities. The technological leap is the move towards a Massive Open Online Course (MOOC) – though in my case a "moderate open online course". As a first step, the course will be open only to the community of the Indian School of Business (students, alumni, faculty and staff). The long-term plan is to open it up globally.

The course Business Analytics using Data Mining is opening in less than two weeks. I've been working round-the-clock creating content for the online and on-ground components, figuring out the right technologies that can support all the requirements, and collaborating with colleagues at CrowdANALYTIX and at Hansa Cequity to integrate large local datasets and a platform for running data mining contests into the course.

Here are the ingredients that I found essential:
  • You need strong support from the university! Luckily, ISB is a place that embraces innovation and is willing to evaluate cutting-edge teaching approaches.
  • A platform that is easy for a (somewhat tech-savvy) instructor to design, upload materials to, update, interact with participants on, and, in general, run. If you are a control freak like me, the last thing you want is to have to ask someone else to upload, edit, or change things. After researching many possibilities, I decided to use the Google platform. Not the new Google Course Builder platform (who has time for programming in JavaScript?), but rather a unique combination of Google Sites, Google Drive, Google Groups, YouTube embedding, etc. The key is Google Sites, which is an incredibly versatile tool (and free! thanks Google!). Another advantage of Google Sites is that you have the solid backbone of Google behind you. If your university uses Google Apps for Education, all the better (we hope to move there soon...).
  • It is definitely worthwhile to invest in good video-editing software. This was a painful experience. After starting with one program that was causing grief, I switched to Camtasia Studio and very quickly purchased a license. It is an incredibly powerful yet simple-to-use tool for recording video + screen + audio and then editing (cutting out coughs, for instance).
  • Hardware for lecture videos: use a good webcam that also has a good mic. I learned that audio quality is the biggest reason for people to stop watching a video. Getting the Thimphu street dogs to stop barking is always a challenge. If you're in a power-outage-prone area, make sure to get a back-up battery (UPS).
  • Have several people go over the course platform to make sure that all the links work, the videos stream, etc. Also, get someone to assist with participants' technical queries. There are always those who need hand-holding.
The way the course will work at ISB is that the ISB community can join the online component (lecture videos, guided reading, online forum, contests), while registered students will also attend on-ground meetings that focus on discussions, project-based learning, and other interactive activities.

We opened registration to the community today and there are already more than 200 registrants. I guess everyone is curious! Whether the transformation will be a huge success or will die out quietly remains to be seen. But one thing is certain: there will be insights and learning for all of us.


Wednesday, September 19, 2012

Self-publishing to the rescue

The new Coursera course by Princeton Professor Mung Chiang was so popular that Amazon and the publisher ran out of copies of the textbook before the course even started (see "new website features" announcement; requires login). I experienced a stockout of my own textbook ("Data Mining for Business Intelligence") a couple of years ago, which caused grief and slight panic for both students and instructors.

With stockouts in mind, and recognizing the difficulty of obtaining textbooks outside of North America (unavailable, too expensive, or long/costly shipping), I decided to take things into my own hands and self-publish a "Practical Analytics" series of textbooks. Currently, the series has three books. All are available in soft-cover and Kindle editions. I used CreateSpace.com, an Amazon company, for publishing the soft-cover editions; its print-on-demand model greatly reduces the stockout problem. I used Amazon KDP for publishing the Kindle editions, so definitely no stockouts there. Amazon makes the books available on its global websites, so they are reachable in many places worldwide (India's Flipkart also carries the books). Finally, since I got to set the prices, I made sure to keep them affordable (for example, in India the e-books are even cheaper than in the USA).

How has this endeavor fared? Well, more than 1000 copies have been sold since March 2011. Several instructors have adopted the books for their courses. And from reader emails and ratings on Amazon, it looks like I'm on the right track.

To celebrate the power and joy of self-publishing as well as accessible and affordable knowledge, I am running a "free e-book promotion" next week. The following e-books will be available for free:

Both promotions will commence a little after midnight, Pacific Standard Time, and will last for 24 hours. To download each of the e-books, just go to the Amazon website during the promotion period and search for the title. You will then be able to download the book for free.

Enjoy, and feel free to share!

Saturday, September 01, 2012

Trees in pivot table terminology

Recently, non-data-mining colleagues have asked me to explain how Classification and Regression Trees work. While a detailed explanation with examples exists in my co-authored textbook Data Mining for Business Intelligence, I found that the following explanation worked well with people who are familiar with Excel's Pivot Tables:

Classification tree for predicting vulnerability to famine
Suppose the goal is to generate predictions for some variable, numerical or categorical, given a set of predictors. The idea behind trees is to create groups of records with similar profiles in terms of their predictors, and then average the outcome variable of interest to generate a prediction.

Here's an interesting example from the paper Identifying Indicators of Vulnerability to Famine and Chronic Food Insecurity by Yohannes and Webb, showing predictors of vulnerability to famine based on a survey of households. The image shows all the predictors identified by the tree, which appear below each circle. Each predictor is a binary variable, and you go right or left depending on its value. It is easiest to start reading from the top, with a household in mind.

Our goal is to generate groups of households with similar profiles, where profiles are the combination of answers to different survey questions. 
Using the language of pivot tables, our predictions will be in the Values field, and we can use the Row (or Column) Labels to break the records down by the predictors. What does the tree do? Here's a "pivot table" description:

  1. Drag the outcome of interest into the Values area
  2. Find the first predictor that best splits the profiles and drag it into the Row Label field*.
  3. Given the first predictor, find the next predictor to further split the profiles, and drag into the Row Label field** .
  4. Given the first two splits, find the next predictor to further split the profiles (could also be one of the earlier variables) and drag into the Row Label field***
  5. Continue this process until some stopping criterion (to avoid over-fitting) is reached
You might imagine the final result as a really crowded Pivot Table, with multiple predictors in the Row Label fields. This is indeed quite close, except for a few slight differences:

* Each time a predictor is dragged into the Row or Column Labels fields, it is converted into a binary variable, creating only two classes. For example, 
  • Gender would not change (Female/Male)
  • Country could be turned into "India/Other". 
  • noncereal yield was discretized into "Above/Below 4.7".

** After a predictor is dragged, the next predictor is actually dragged only into one of the two splits of the first predictor. In our example, after dragging noncereal yield (Above/Below 4.7), the predictor oxen owned (Above/Below 1.5) only applies to noncereal yield Below 4.7.

*** We also note that a tree can "drag" a predictor more than once into the Row Labels fields. For example, TLU/capita appears twice in the tree, so theoretically in the pivot table we'd drag TLU/capita after oxen owned and again after crop diversity.

So where is the "intelligence" of a tree over an elaborate pivot table? First, it automatically determines which predictor is the best one to use at each stage. Second, it automatically determines the value on which to split. Third, it knows when to stop, to avoid over-fitting the data. In a pivot table, the user would have to determine which predictors to include, their order, and the critical values to split on. And finally, this complex behind-the-scenes process is summarized in an easily interpretable tree chart.
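For readers who like to see the mechanics in code, here is a minimal sketch of fitting a classification tree in Python with scikit-learn. The household-survey data and column names below are invented for illustration (they are not the Yohannes and Webb variables); the point is only that the algorithm picks the split variables, split values, and stopping point on its own.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical household-survey data: a few predictors and a binary outcome
df = pd.DataFrame({
    "noncereal_yield": [3.1, 5.2, 4.9, 2.0, 6.3, 1.8, 5.5, 2.9],
    "oxen_owned":      [0,   2,   1,   0,   3,   1,   2,   0],
    "tlu_per_capita":  [0.4, 1.2, 0.9, 0.2, 1.5, 0.3, 1.1, 0.5],
    "vulnerable":      [1,   0,   0,   1,   0,   1,   0,   1],
})
X, y = df.drop(columns="vulnerable"), df["vulnerable"]

# The tree automatically picks the splitting predictor and the cutoff value at
# each step; a stopping rule (here max_depth) guards against over-fitting.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```

The printed text version of the tree reads exactly like the "drag, split, drag again" description above, one binary split per line.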

Tuesday, August 07, 2012

The mad rush: Masters in Analytics programs

The recent trend among mainstream business schools is opening a graduate program or a concentration in Business Analytics (BA). Googling "MS Business Analytics" reveals lots of big players offering such programs. A few examples (among many others) are:

These programs are intended (aside from making money) to bridge the knowledge gap between the "data or IT team" and the business experts. Graduates should be able to lead analytics teams in companies: identifying opportunities where analytics can add value, understanding pitfalls, figuring out the needed human and technical resources, and, most importantly, communicating analytics to top management. Unlike "marketing analytics" or other domain-specific programs, Business Analytics programs are "tools" oriented.

As a professor of statistics, I feel a combination of excitement and pain. The word Analytics is clearly more attractive than Statistics. But it is also broader in two senses. First, it combines methods and tools from a wider set of disciplines: statistics, operations research, artificial intelligence, computer science. Second, although technical skills are required to some degree, the focus is on the big picture and how the tools fit into the business process. In other words, it's about Business Analytics.

I am excited about the trend of BA programs because they are finally able to force disciplines such as statistics to consider the big picture and to fit in, both in terms of research and teaching. Research is clearly better guided by real problems. The top research journals are beginning to catch up: Management Science has an upcoming special issue on Business Analytics. As for teaching, it is exciting to teach students who are thirsty for analytics. The challenge is for instructors with PhDs in statistics, operations, computer science or other disciplines to repackage the technical knowledge into a communicable, interesting and useful curriculum. Formulas and algorithms, as beautiful as they might appear to us, are only tolerated when their beauty is clearly translated into meaningful and useful knowledge. Considering the business context requires a good deal of attention and often means modifying our own modus operandi (we've all been brainwashed by our research discipline).

But then, there's the painful part of the missed opportunity for statisticians to participate as major players (or is it envy?). The statistics community seems to be going through this cycle of "hey, how did we get left behind?" again. It happened with data mining, and is now happening with data analytics. The great majority of Statistics programs consistently fail to lead the non-statistics world. Examining the current BA trend, I see that

  1. Statisticians are typically not the leaders of these programs. 
  2. Business schools that lack statistics faculty (and that's typical) either hire non-research statisticians as adjunct faculty to teach statistics and data mining courses, or else these courses are taught by faculty from other areas such as information systems and operations.
  3. "Data Analytics" or "Analytics" degrees are still not offered by mainstream Statistics departments. For example, North Carolina State U has an Institute for Advanced Analytics that offers an MS in Analytics degree. Yet, this does not appear to be linked to the Statistics Department's programs. Carnegie Mellon's Heinz Business College offers a Master degree with concentration in BI and BA, yet the Statistics department offers a Masters in Statistical Practice.
My greatest hope is that a new type of "analytics" research faculty member evolves. The new breed, while having deep knowledge in one field, will also possess more diverse knowledge of, and openness to, other analytics fields (statistical modeling, data mining, operations research methods, computing, human-computer visualization principles). At the same time, for analytics research to flourish, the new-breed academic must have a foot in a particular domain, any domain, be it in the social sciences, humanities, engineering, life sciences, or other. I can only imagine the exciting collaboration among such groups of academics, as well as the value that they would bring to research, teaching and knowledge dissemination to other fields.

Monday, July 30, 2012

Launched new book website for Practical Forecasting book

Last week I launched a new website for my textbook Practical Time Series Forecasting. The website offers resources such as the datasets used in the book, a news block that pushes posts to the book's Facebook page, information about the book and author, and, for instructors, an online form for requesting an evaluation copy and another for requesting access to solutions.

I am already anticipating my colleagues' question "what platform did you use?". Well, I did not hire a web designer, nor did I spend three months putting the website together using HTML. Instead, I used Google Sites. This is a great solution for those who like to manage their book website on their own (whether you're self-publishing or not). Very readable, clean design, integration with other Google Apps components (such as forms), and as hack-proof as it gets. Not to mention easy to update and maintain, and free hosting.

Thanks to the tools and platforms offered by Google and Amazon, self-publishing is not only a realistic option for authors; it also allows a much closer connection between the author and the book's users -- instructors, students and "independent" readers.


Wednesday, July 25, 2012

Explain/Predict in Epidemiology

Researchers in various fields have been sending me emails and reactions after reading my 2010 paper "To Explain or To Predict?". While I am familiar with research methodology in a few areas, I'm learning in more detail about the scientific challenges in "prediction-less" areas.

In an effort to further disseminate this knowledge, I'll be posting these reactions in this blog (with the senders' approval, of course).

In a recent email, Stan Young, Assistant Director for Bioinformatics at NISS, commented about the explain/predict situation in epidemiology:
"I enjoyed reading your paper... I am interested in what I think is [epidemiologists] lack of clarity on explain/predict. They seem to take the position that no matter how many tests they compute, that any p-value <0.05 is a strong indication of something real (=explain) and that everyone should follow their policies (=predict) when, given all their analysis problems, they at the very best should consider their claims as hypothesis generating."
In a talk by epidemiology Professor Uri Goldbourt, who was a discussant in a recent "Explain or Predict" panel, I learned that modeling in epidemiology is nearly entirely descriptive. Unlike explanatory modeling, there is little underlying causal theory. And there is no prediction or evaluation of predictive power going on. Modeling typically focuses on finding correlations between measurable variables in observational studies that generalize to the population (and hence the wide use of inference, and unfortunately, a huge issue of multiple testing).

Predictive modeling has huge potential to advance research in epidemiology. Among its many benefits (such as theory validation), it would bring the field closer to today's "personalized" environment: not only concentrating on "average patterns", but also generating personalized predictions for individuals.

I'd love to hear more from epidemiologists! Please feel free to post comments or to email me directly.

Tuesday, July 24, 2012

Linear regression for binary outcome: even better news

I recently attended the 8th World Congress in Probability and Statistics, where I heard an interesting talk by Andy Tsao. His talk "Naivity can be good: a theoretical study of naive regression" (Abstract #0586) was about Naive Regression: the application of linear regression to a categorical outcome, treating the outcome as numerical. He asserted that predictions from Naive Regression will be quite good. My last post was about the "goodness" of a linear regression applied to a binary outcome in terms of the estimated coefficients. That's what explanatory modeling is about. What Dr. Tsao alerted me to is that the predictions (or, more correctly, classifications) will be good too. In other words, it's useful for predictive modeling! In his words:
"This naivity is not blessed from current statistical or machine learning theory. However, surprisingly, it delivers good or satisfactory performances in many applications."
Note that to derive a classification from naive regression, you treat the prediction as the class probability (although it might be negative or >1), and apply a cutoff value as in any other classification method.


Dr. Tsao pointed me to the good old The Elements of Statistical Learning, which has a section called Linear Regression of an Indicator Matrix. There are two interesting takeaways from Dr. Tsao's talk (illustrated in the sketch below):
  1. Naive Regression and Linear Discriminant Analysis will have the same ROC curve, meaning that the ranking of predictions will be identical.
  2. If the two groups are of equal size (n1=n2), then Naive Regression and Discriminant Analysis are equivalent and therefore produce the same classifications.
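Here is a small sketch on simulated data (not from Dr. Tsao's talk) illustrating both takeaways with scikit-learn; the balanced classes and the 0.5 cutoff are assumptions made purely for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

# Simulated two-class data with (roughly) equal group sizes
X, y = make_classification(n_samples=1000, n_features=5, random_state=1)

# Naive regression: treat the 0/1 outcome as numerical
naive = LinearRegression().fit(X, y)
scores_naive = naive.predict(X)          # "probabilities" (may fall outside [0,1])

lda = LinearDiscriminantAnalysis().fit(X, y)
scores_lda = lda.predict_proba(X)[:, 1]

# Takeaway 1: identical ranking of cases, hence identical ROC/AUC
print(roc_auc_score(y, scores_naive), roc_auc_score(y, scores_lda))

# Classification from naive regression: apply a cutoff to the scores
classes_naive = (scores_naive >= 0.5).astype(int)
# Takeaway 2: with equal group sizes, the classifications should essentially agree
print(np.mean(classes_naive == lda.predict(X)))
```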

Monday, May 28, 2012

Linear regression for a binary outcome: is it Kosher?

Regression models are the most popular tool for modeling the relationship between an outcome and a set of inputs. Models can be used for descriptive, causal-explanatory, and predictive goals (but in very different ways! see Shmueli 2010 for more).

The family of regression models includes two especially popular members: linear regression and logistic regression (with probit regression more popular than logistic in some research areas). Common knowledge, as taught in statistics courses, is: use linear regression for a continuous outcome and logistic regression for a binary or categorical outcome. But why not use linear regression for a binary outcome? The two common answers are: (1) linear regression can produce predictions that are not binary, and hence "nonsense", and (2) inference based on the linear regression coefficients will be incorrect.

I admit that I bought into these "truths" for a long time, until I learned never to take any "statistical truth" at face value. First, let us realize that problem #1 relates to prediction and #2 to description and causal explanation. In other words, if issue #1 can be "fixed" somehow, then I might consider linear regression for prediction even if the inference is wrong (who cares about inference if I am only interested in predicting individual observations?). Similarly, if there is a fix for issue #2, then I might consider linear regression as a kosher inference mechanism even if it produces "nonsense" predictions.

The 2009 paper Linear versus logistic regression when the dependent variable is a dichotomy by Prof. Ottar Hellevik of the University of Oslo demystifies some of these issues. First, he gives some tricks that help avoid predictions outside the [0,1] range. The author identifies a few factors that contribute to "nonsense predictions" by linear regression:

  • interactions that are not accounted for in the regression
  • non-linear relationships between a predictor and the outcome
The suggested remedy for these issues is to include interaction terms for categorical variables and, if numerical predictors are involved, to bucket them into bins and include those as dummies plus interactions (a minimal sketch follows below). So, if the goal is predicting a binary outcome, linear regression can be modified and used.
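Here is one way the binning-plus-interactions trick might look in Python using statsmodels' formula interface. The dataset and variable names (age, region, purchased) are made up for illustration and are not from Hellevik's paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "region": rng.choice(["north", "south"], size=n),
})
# Hypothetical binary outcome whose probability rises with age
df["purchased"] = (rng.random(n) < 0.2 + 0.006 * (df["age"] - 18)).astype(int)

# Bucket the numerical predictor into bins and treat the bins as a categorical variable
df["age_bin"] = pd.cut(df["age"], bins=[17, 30, 45, 70], labels=["18-30", "31-45", "46-70"])

# Linear probability model: bin dummies plus their interaction with region
lpm = smf.ols("purchased ~ C(age_bin) * C(region)", data=df).fit()
print(lpm.fittedvalues.min(), lpm.fittedvalues.max())
```

Because the right-hand side contains only categorical dummies and their interactions, the model is saturated in those factors and the fitted values are simply cell proportions, which is why they cannot stray outside [0,1].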

Now to the inference issue. "The problem with a binary dependent variable is that the homoscedasticity assumption (similar variation on the dependent variable for units with different values on the independent variable) is not satisfied... This seems to be the main basis for the widely held opinion that linear regression is inappropriate with a binary dependent variable". Statistical theory tells us that violating the homoscedasticity assumption results in biased standard errors for the coefficients, and that the coefficient estimates might not be the most precise in terms of variance. Yet the coefficient estimates themselves remain unbiased, meaning they are "on target" on average. Hence, with a sufficiently large sample we need not worry! Precision is not an issue in very large samples, and hence the on-target coefficients are just what we need.
I will add that another concern is that the normality assumption is violated: the residuals from a regression model on a binary outcome will not look very bell-shaped... Again, with a sufficiently large sample, the distribution does not make much difference, since the standard errors are so small anyway.

Chart from Hellevik (2009)
Hellevik's paper pushes the envelope further in an attempt to explore "how small can you go" with your sample before getting into trouble. He uses simulated data and compares the results from logistic and linear regression for fairly small samples. He finds that the differences are minuscule.

The bottom line: linear regression is kosher for prediction if you take a few steps to accommodate non-linear relationships (but of course it is not guaranteed to produce better predictions than logistic regression!). For inference, for a sufficiently large sample where standard errors are tiny anyway, it is fine to trust the coefficients, which are in any case unbiased.

Tuesday, May 22, 2012

Policy-changing results or artifacts of big data?

The New York Times article Big Study Links Good Teachers to Lasting Gain covers a research study coming out of Harvard and Columbia on "The Long-Term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood". The authors used sophisticated econometric models applied to data from a million students to conclude:
"We find that students assigned to higher VA [Value-Added] teachers are more successful in many dimensions. They are more likely to attend college, earn higher salaries, live in better neighborhoods, and save more for retirement. They are also less likely to have children as teenagers."
When I see social scientists using statistical methods in the Big Data realm I tend to get a little suspicious, since classic statistical inference behaves differently with large samples than with small samples (which are more typical in the social sciences). Let's take a careful look at some of the charts from this paper to figure out the leap from the data to the conclusions.


How much does a "value added" teacher contribute to a person's salary at age 28?


Figure 1: dramatic slope? largest difference is less than $1,000
The slope in the chart (Figure 1) might look quite dramatic. And I can tell you that, statistically speaking, the slope is not zero (it is a "statistically significant" effect). Now look closely at the y-axis amounts: the data fluctuate only by a very small annual amount (less than $1,000 per year)! The authors get around this embarrassing magnitude by looking at the "lifetime value" of a student ("On average, having such a [high value-added] teacher for one year raises a child's cumulative lifetime income by $50,000 (equivalent to $9,000 in present value at age 12 with a 5% interest rate).").


Here's another dramatic looking chart:


What happens to the average student test score as a "high value-added" teacher enters the school?


The improvement appears to be huge! But wait, what are those digits on the y-axis? The test score goes up by 0.03 points!

Reading through the slides or paper, you'll find various mentions of small p-values, which indicate statistical significance ("p<0.001" and similar notations). But this says nothing about the practical significance or the magnitude of the effects.
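To see why, here is a toy simulation (not the study's data): with a million observations, an effect of negligible practical size still comes out "highly significant".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1_000_000
x = rng.normal(size=n)                  # e.g., a standardized "teacher quality" measure
y = 0.005 * x + rng.normal(size=n)      # true effect: tiny relative to the noise

slope, intercept, r, p, se = stats.linregress(x, y)
print(f"slope = {slope:.4f}, p-value = {p:.1e}, R^2 = {r**2:.6f}")
# With a million records the p-value comes out far below 0.001,
# even though the effect explains a negligible share of the variation.
```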

If this were a minor study published in a remote journal, I would say "hey, there are lots of those now." But when a paper is covered by the New York Times and published in the serious National Bureau of Economic Research Working Paper series (admittedly, not a peer-reviewed journal), then I am worried. I am very worried.

Unless I am missing something critical, I would only agree with one line in the executive summary: "We find that when a high VA teacher joins a school, test scores rise immediately in the grade taught by that teacher; when a high VA teacher leaves, test scores fall." But with one million records, that's not a very interesting question. The interesting question, which should drive policy, is: by how much?

Big Data is reaching the realm of social science research as well. It is critical that researchers be aware of the dangers of applying small-sample statistical models and inference in this new era. Here is one place to start.

Tuesday, April 17, 2012

Google Scholar -- you're not alone; Microsoft Academic Search coming up in searches

In searching for a few colleagues' webpages I noticed a new URL popping up in the search results. It either included the prefix academic.microsoft.com or the IP address 65.54.113.26. I got curious and checked it out to discover Microsoft Academic Search (Beta) -- a neat presentation of an author's research publications and collaborations. In addition to the usual list of publications, there are nice visualizations of publications and citations over time, a network chart of co-authors and citations, and even an Erdos Number graph. The genealogy graph claims that it is based on data mining so "might not be perfect".



All this is cool and helpful. But there is one issue that really bothers me: who owns my academic profile?


I checked my "own" Microsoft Academic Search page. Microsoft's software tried to guess my details (affiliation, homepage, papers, etc.) and was correct on some details but wrong on others. To correct the details required me to open a Windows Live ID account. I was able to avoid opening such an account until now (I am not a fan of endless accounts) and would have continued to avoid it, had I not been forced to do so: Microsoft created an academic profile page for me, without my consent, with wrong details. Guessing that this page will soon come up in user searches, I was compelled to correct the inaccurate details.

The next step was even more disturbing: once I logged in with my verified Windows Live ID, I tried to correct my affiliation and homepage and to add a photo. However, I received the message that the affiliation (Indian School of Business) is not recognized (!) and that Microsoft will have to review all my edits before applying them.

So who "owns" my academic identity? Since obviously Microsoft is crawling university websites to create these pages, it would have been more appropriate to find the authors' academic email addresses and email them directly to notify them of the page (with an "opt out" option!) and allow them to make any corrections without Microsoft's moderation.

Tuesday, April 03, 2012

New Google Consumer Surveys: revolutionizing academic data collection?

Surveys are a key data collection tool in several academic research areas. As opposed to experiments or field studies that yield observational data, surveys can give access to attitudes, reaching "inside the head" of people rather than observing their behavior.

Technological advances in survey tools now offer "poor academics" sufficiently powerful online survey platforms, such as surveymonkey.com and Google Forms. Yet, obtaining access to a large pool of potential respondents from a particular population remains a challenge. Another challenge is speed -- how do you reach people quickly and get many of them to respond quickly?

We may now have a solution that is affordable for academic research: a few days ago Google announced a new service called "Google Consumer Surveys". Similar to AdSense, where Google places ads on publishers' websites (and pays the publishers a commission), with Consumer Surveys Google places a single-question survey (=poll) on publishers' websites. The publishers require website users to complete the poll to get access to premium content.

Google Consumer Surveys: How it works (from their website)

The good:

  • Very affordable: the charge for each response is $0.10 (=only $100 for the magic number of 1,000 responses). Or, for an audience targeted by demographics or some trait, it is $0.50 per response (more here).
  • Fast: Google will likely post the polls on pages with high traffic.
  • Google presents the results with attractive charts
  • Getting IRB permission may be easier, given the stringent policies that Google mandates
The bad:
  • You can only post one question at a time. For a longer survey, breaking it up into single questions means that the same person is not answering all the questions. Also, each additional question adds its own per-response cost.
  • Google does not supply the poll creator with the raw data. You only get aggregated data. You can choose the aggregation (inferred age, gender, urban density, geography, or income). This is likely to be a huge "bad" for researchers who need access to the raw data for more advanced analyses than those provided by Google. 
  • Currently Google only offers this service for websites in the US. To collect information from users visiting non-US websites, we will all have to continue holding our breath.
A curious anecdote: I filled in the support contact form to ask a few extra questions. I received speedy and helpful answers (within 24 hours), but they all landed in my Google Spam folder!

Monday, April 02, 2012

The world is flat? Only for US students

Learning and teaching have become a global endeavor, with lots of online resources and technologies. Contests are an effective way to engage a diverse community from around the world. In the past I have written several posts about contests and competitions in data mining, statistics and more. And now about a new one.

Tableau is a US-based company that sells a cool data visualization tool (there's a free version too). The company has recently seen huge growth with lots of new adopters in industry and academia. Their "Tableau for teaching" (TfT) program is intended to assist instructors and teachers by providing software and resources for data visualization courses. The program is promoted as global "Tableau for Teaching Around the World" (see the interactive dashboard at the bottom of this post). As part of this program, a student contest was recently launched where students are provided with real data and are challenged to produce good visualizations that tell compelling stories. The data are from Lesotho, Africa (given by the NGO CARE) and the prizes are handsome. I was almost getting excited about this contest (non-US data, visualization, nice prizes for students) when I read the draconian contest eligibility rules:
ELIGIBILITY: The Tableau Student Data Challenge Contest (“The Awards,” “Contest” or “Promotion”) is offered and open only to legal residents of the 50 United States and the District of Columbia (“United States”) who at time of entry (a) are the legal age of majority in their state of residence; (b) physically reside in the United States; (c) are enrolled as a college or university accredited in the United States; and (d) are not an Ineligible Person
I was deeply disappointed. Not only does the contest exclude non-US students (even branches of US universities outside the US are excluded!), but more disturbing is the fact that only US residents can win a prize for telling a story about the lives of people in Lesotho. Condescending? Wouldn't local Lesotho students (or at least students in the region) be the most knowledgeable about the meaning of the data? Wouldn't they be the ones most qualified to tell the story of the Lesotho people that emerges from the data? Wouldn't they be the first to identify surprising patterns, exceptions, and even wrong data?

While one country "telling the story" of another country is common at the political level, there is no reason that open-minded private visualization-software companies should endorse the same behavior. If the problem of awarding cash prizes to non-US citizens is tax-related, I am sure there are creative solutions, such as free software licenses, that would allow prizes to go to any enthusiastic and talented student of visualization around the world. In short, I call on Tableau to change the rules and follow CARE's motto "Defending Dignity".


Tuesday, March 13, 2012

Data liberation via visualization

"Data democratization" movements try to make data, and especially government-held data, publicly available and accessible. A growing number of technological efforts are devoted to such efforts and especially the accessibility part. One such effort is by data visualization companies. A recent trend is to offer a free version (or at least free for some period) that is based on sharing your visualization and/or data to the Web. The "and/or" here is important, because in some cases you cannot share your data, but you would like to share the visualizations with the world. This is what I call "data liberation via visualization". This is the case with proprietary data, and often even if I'd love to make data publicly available, I am not allowed to do so by binding contracts.

As part of a "data liberation via visualization" initiative, I went in search of a good free solution for disseminating interactive visualization dashboards while protecting the actual data. Two main free viz players in the market are TIBCO Spotfire Silver (free 1-year license for the Personal version) and Tableau Public (free). Both allow *only* public posting of your visualizations (if you want to save visualizations privately you must get the paid versions). That's fine. However, public posting of visualizations with these tools comes with a download button that makes your data public as well.

I then tried MicroStrategy Cloud Personal (free Beta version), which does allow public (and private!) posting of visualizations and does not provide a download button. Of course, in order to make visualizations public, the data must sit on a server that can be reached from the visualization. All the free public-posting tools keep your data on the company's servers, so you must trust the company to protect the confidentiality and safety of your data. MicroStrategy uses a technology where the company itself cannot download your data (your Excel sheet is converted to in-memory cubes that are stored on the server). Unfortunately, the tool lacks the ability to create dashboards with multiple charts (combining multiple charts into a fully-linked interactive view).

Speaking of features, Tableau Public is the only one with the full-fledged functionality of its paid cousins. Spotfire Silver Personal is stripped of highly useful charts such as scatterplots and boxplots. MicroStrategy Cloud Personal lacks multi-view dashboards and for now accepts only Excel files as input.

Sunday, March 11, 2012

Big Data: The Big Bad Wolf?

"Big Data" is a big buzzword. I bet that sentiment analysis of news coverage, blog posts and other social media sources would show a strong positive sentiment associated with Big Data. What exactly is big data depends on who you ask. Some people talk about lots of measurements (what I call "fat data"), others of huge numbers of records ("long data"), and some talk of both. How much is big? Again, depends who you ask.

As a statistician who has (luckily) strayed into data mining, I initially had the traditional knee-jerk reaction of "just get a good sample and get it over with", but later recognized that "fitting the data to the toolkit" (or, "to a hammer everything looks like a nail") strait-jackets some great opportunities.

The LinkedIn group Advanced Business Analytics, Data Mining and Predictive Modeling reacted passionately to the question "What is the value of Big Data research vs. good samples?" posted by statistician and analytics veteran Michael Mout. Respondents have been mainly from industry - statisticians and data miners. I'd say that a sentiment analysis would come out mixed, and slightly negative at first ("at some level, big data is not necessarily a good thing"; "as statisticians, we need to point out the disadvantages of Big Data"). Over time, sentiment appears to become more positive, but it does not reach anywhere close to the huge Big Data excitement in the media.

I created a Wordle of the discussion text up to today (size represents frequency). It highlights the main advantages of, and concerns about, Big Data. Let me elaborate:
  • Big data permit the detection of complex patterns (small effects, high order interactions, polynomials, inclusion of many features) that are invisible with small data sets
  • Big data allow studying rare phenomena, where a small percentage of records contain an event of interest (fraud, security)
  • Sampling is still highly useful with big data (see also the blog post by Meta Brown); with the ability to take lots of smaller samples, we can evaluate model stability, validity and predictive performance (see the sketch after this list)
  • Statistical significance and p-values become meaningless when statistical models are fitted to very large samples. It is then practical significance that plays the key role.
  • Big data support the use of algorithmic data mining methods that are good at feature selection. Of course, it is still necessary to use domain knowledge to avoid "garbage-in-garbage-out"
  • Such algorithms might be black boxes that do not help in understanding the underlying relationship, but they are useful in practice for predicting new records accurately
  • Big data allow the use of many non-parametric methods (statistical and data mining algorithms) that make far fewer assumptions about the data (such as independence of observations)
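Here is a minimal sketch of the repeated-subsampling idea mentioned in the list above, on simulated data; the logistic model, subsample size, and number of repetitions are arbitrary choices made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
N = 1_000_000
X_big = rng.normal(size=(N, 3))
p = 1 / (1 + np.exp(-(0.5 * X_big[:, 0] - 0.3 * X_big[:, 1])))
y_big = rng.binomial(1, p)

coefs, aucs = [], []
for seed in range(20):                               # 20 subsamples of 5,000 records
    idx = rng.choice(N, size=5000, replace=False)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_big[idx], y_big[idx], test_size=0.3, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    coefs.append(model.coef_.ravel())
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Spread across subsamples indicates model stability and predictive consistency
print("coefficient spread (std) across subsamples:", np.std(coefs, axis=0).round(3))
print("holdout AUC: mean %.3f, std %.3f" % (np.mean(aucs), np.std(aucs)))
```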
Thanks to social media, we're able to tap into many brains that have experience, expertise and... some preconceptions. The data collected from such forums can help us researchers focus our efforts on the needed theoretical investigation of Big Data, helping move from sentiments to theoretically-backed and practically-useful knowledge.

Wednesday, March 07, 2012

Forecasting + Analytics = ?

Quantitative forecasting is an age-old discipline, highly useful across different functions of an organization: from forecasting sales and workforce demand to economic forecasting and inventory planning.

Business schools have offered courses with titles such as "Time Series Forecasting", "Forecasting Time Series Data", "Business Forecasting", more specialized courses such as "Demand Planning and Sales Forecasting", and even graduate programs titled "Business and Economic Forecasting". Simple "Forecasting" is also popular. Such courses are offered at the undergraduate, graduate and even executive-education levels. All these might convey the importance and usefulness of forecasting, but they are far from conveying its coolness.

I've been struggling to find a better term for the courses that I teach on-ground and online, as well as for my recent book (with the boring name Practical Time Series Forecasting). The name needed to convey that we're talking about forecasting, particularly about quantitative data-driven forecasting, plus the coolness factor. Today I discovered it! Prof Refik Soyer from GWU's School of Business will be offering a course called "Forecasting for Analytics". A quick Google search did not find any results with this particular phrase -- so the credit goes directly to Refik. I also like "Forecasting Analytics", which links it to its close cousins "Predictive Analytics" and "Visual Analytics", all members of the Business Analytics family.


Monday, February 20, 2012

Explain or predict: simulation

Some time ago, when I presented the "explain or predict" work, my colleague Avi Gal asked where simulation falls. Simulation is a key method in operations research, as well as in statistics. A related question arose in my mind when thinking of Scott Nestler's distinction between descriptive/predictive/prescriptive analytics. Scott defines prescriptive analytics as "what should happen in the future? (optimization, simulation)".

So where does simulation fall? Does it fall in a completely different goal category, or can it be part of the explain/predict/describe framework?

My opinion is that simulation, like other data analytics techniques, does not define a goal in itself but is rather a tool for achieving one of the explain/predict/describe goals. When the purpose is to test causal hypotheses, simulation can be used to study what would happen if the causal effect were true, by simulating data under the "causally-true" hypothesis and comparing them to data from "causally-false" scenarios (a small sketch follows below). In predictive and forecasting tasks, where the purpose is to predict new or future data, simulation can be used to generate predictions. It can also be used to evaluate the robustness of predictions under different scenarios (that would have been very useful in recent years' economic forecasts!). In descriptive tasks, where the purpose is to approximate data and quantify relationships, simulation can be used to check the sensitivity of the quantified effects to various model assumptions.
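As a toy illustration of the first use, here is a sketch that simulates data under a "causally-true" world and a "causally-false" (null) world and compares a simple statistic across the two; the effect size and design are entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(123)

def simulate_mean_diffs(effect, n=200, reps=1000):
    """Simulate 'reps' studies with a given treatment effect; return the mean differences."""
    diffs = []
    for _ in range(reps):
        treated = rng.normal(loc=effect, scale=1.0, size=n)
        control = rng.normal(loc=0.0, scale=1.0, size=n)
        diffs.append(treated.mean() - control.mean())
    return np.array(diffs)

diffs_true = simulate_mean_diffs(effect=0.3)   # "causally-true" world
diffs_null = simulate_mean_diffs(effect=0.0)   # "causally-false" world

# Compare the data generated under the causal hypothesis to the null world:
# how often would the typical "true-world" difference arise by chance alone?
typical_true_diff = diffs_true.mean()
print("typical difference under the causal hypothesis:", round(typical_true_diff, 3))
print("fraction of null simulations at least that large:",
      np.mean(diffs_null >= typical_true_diff))
```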

On a related note, Scott challenged me on a post from two years ago where I stated that the term data mining used by operations research (OR) does not really mean data mining. I still hold that view, although I believe that the terminology has now changed: INFORMS now uses the term Analytics in place of data mining. This term is indeed a much better choice, as it is an umbrella term covering a variety of data analytics methods, including data mining, statistical models and OR methods. David Hardoon, Principal Analytics at SAS Singapore, has shown me several terrific applications that combine methods from these different toolkits. As in many cases, combining methods from different disciplines is often the best way to add value.