Monday, April 02, 2012

The world is flat? Only for US students

Learning and teaching have become global endeavors, with many online resources and technologies. Contests are an effective way to engage a diverse community from around the world. In the past I have written several posts about contests and competitions in data mining, statistics and more. Now there is a new one.

Tableau is a US-based company that sells a cool data visualization tool (there's a free version too). The company has recently seen huge growth, with lots of new adopters in industry and academia. Its "Tableau for Teaching" (TfT) program is intended to assist instructors by providing software and resources for data visualization courses. The program is promoted globally as "Tableau for Teaching Around the World" (see the interactive dashboard at the bottom of this post). As part of this program, a student contest was recently launched in which students are given real data and challenged to produce good visualizations that tell compelling stories. The data are from Lesotho, Africa (provided by the NGO CARE) and the prizes are handsome. I was getting excited about this contest (non-US data, visualization, nice prizes for students) when I read the draconian contest eligibility rules:
ELIGIBILITY: The Tableau Student Data Challenge Contest (“The Awards,” “Contest” or “Promotion”) is offered and open only to legal residents of the 50 United States and the District of Columbia (“United States”) who at time of entry (a) are the legal age of majority in their state of residence; (b) physically reside in the United States; (c) are enrolled as a college or university accredited in the United States; and (d) are not an Ineligible Person
I was deeply disappointed. Not only does the contest exclude non-US students (even branches of US universities outside of the US are excluded!), but more disturbing is the fact that only US residents can win a prize for telling a story about the lives of people in Lesotho. Condescending? Wouldn't local Lesotho students (or at least students in the region) be the most knowledgeable about the meaning of the data? Wouldn't they be the ones most qualified to tell the story of the Lesotho people that emerges from the data? Wouldn't they be the first to identify surprising patterns, exceptions, and even wrong data?

While one country "telling the story" of another country is common at the political level, there is no reason that open-minded private visualization software companies should endorse the same behavior. If awarding cash prizes to non-US citizens poses a tax problem, I am sure there are creative alternatives, such as free software licenses, that could be offered to any enthusiastic and talented student of visualization around the world. In short, I call on Tableau to change the rules and follow CARE's motto, "Defending Dignity".


Tuesday, March 13, 2012

Data liberation via visualization

"Data democratization" movements try to make data, and especially government-held data, publicly available and accessible. A growing number of technological initiatives are devoted to this goal, and especially to the accessibility part. One such initiative comes from data visualization companies. A recent trend is to offer a free version (or at least free for some period) that is based on sharing your visualization and/or data on the Web. The "and/or" here is important, because in some cases you cannot share your data but would still like to share the visualizations with the world. This is what I call "data liberation via visualization". It applies to proprietary data: often, even if I'd love to make data publicly available, binding contracts do not allow me to do so.

As part of a "data liberation via visualization" initiative, I went in search of a good free solution for disseminating interactive visualization dashboards while protecting the actual data. Two of the main free viz players in the market are TIBCO Spotfire Silver (free one-year Personal version) and Tableau Public (free). Both allow *only* public posting of your visualizations (to save visualizations privately you must buy the paid versions). That's fine. However, public posting of visualizations with these tools comes with a download button that makes your data public as well.

I then tried MicroStrategy Cloud Personal (free Beta version), which does allow public (and private!) posting of visualizations and does not provide a download button. Of course, in order to make visualizations public, the data must sit on a server that can be reached from the visualization. All the free public-posting tools keep your data on the company's servers, so you must trust the company to protect the confidentiality and safety of your data. MicroStrategy uses a technology where the company itself cannot download your data (your Excel sheet is converted to in-memory cubes that are stored on the server). Unfortunately, the tool lacks the ability to create dashboards with multiple charts (combining multiple charts into a fully-linked interactive view).

Speaking of features, Tableau Public is the only one with the full-fledged functionality of its paid cousins. Spotfire Silver Personal is stripped of highly useful charts such as scatterplots and boxplots. MicroStrategy Cloud Personal lacks multi-view dashboards and, for now, accepts only Excel files as input.

Sunday, March 11, 2012

Big Data: The Big Bad Wolf?

"Big Data" is a big buzzword. I bet that sentiment analysis of news coverage, blog posts and other social media sources would show a strong positive sentiment associated with Big Data. What exactly big data is depends on who you ask. Some people talk about lots of measurements (what I call "fat data"), others about huge numbers of records ("long data"), and some about both. How much is big? Again, it depends on who you ask.

As a statistician who has (luckily) strayed into data mining, I initially had the traditional knee-jerk reaction of "just get a good sample and get it over with", but later recognized that "fitting the data to the toolkit" (or, "to a hammer everything looks like a nail") straitjackets some great opportunities.

The LinkedIn group Advanced Business Analytics, Data Mining and Predictive Modeling reacted passionately to the question "What is the value of Big Data research vs. good samples?" posted by statistician and analytics veteran Michael Mout. Respondents have been mainly from industry - statisticians and data miners. I'd say the sentiment analysis would come out mixed, but slightly negative at first ("at some level, big data is not necessarily a good thing"; "as statisticians, we need to point out the disadvantages of Big Data"). Over time, sentiment appears more positive, but nowhere near the huge Big Data excitement in the media.

I created a Wordle of the discussion text to date (word size represents frequency). It highlights the main advantages of and concerns about Big Data. Let me elaborate:
  • Big data permit the detection of complex patterns (small effects, high order interactions, polynomials, inclusion of many features) that are invisible with small data sets
  • Big data allow studying rare phenomena, where a small percentage of records contain an event of interest (fraud, security)
  • Sampling is still highly useful with big data (see also blog post by Meta Brown); with the ability to take lots of smaller samples, we can evaluate model stability, validity and predictive performance
  • Statistical significance and p-values become meaningless when statistical models are fitted to very large samples. It is then practical significance that plays the key role.
  • Big data support the use of algorithmic data mining methods that are good at feature selection. Of course, it is still necessary to use domain knowledge to avoid "garbage-in-garbage-out"
  • Such algorithms might be black-boxes that do not help understand the underlying relationship, but are useful in practice for predicting new records accurately
  • Big data allow the use of many non-parametric methods (statistical and data mining algorithms) that make far fewer assumptions about the data (such as independent observations)
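The point about statistical versus practical significance is easy to demonstrate. Here is a minimal sketch (a hypothetical illustration with made-up numbers, using only Python's standard library) that simulates two groups whose true means differ by a practically negligible 0.01 standard deviations; with a million records per group, the difference is nevertheless highly statistically significant:

```python
import math
import random
import statistics

random.seed(0)

# Two groups whose true means differ by a practically negligible 0.01 sd
n = 1_000_000
a = [random.gauss(0.00, 1.0) for _ in range(n)]
b = [random.gauss(0.01, 1.0) for _ in range(n)]

diff = statistics.fmean(b) - statistics.fmean(a)
se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
z = diff / se
# Two-sided p-value from the normal approximation
p = math.erfc(abs(z) / math.sqrt(2))

print(f"effect = {diff:.4f}, z = {z:.2f}, p = {p:.2e}")
```

With samples this large, nearly any nonzero effect yields a tiny p-value, which is why the emphasis shifts to whether the effect size matters in practice.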
Thanks to social media, we're able to tap into many brains that have experience, expertise and... some preconceptions. The data collected from such forums can help us researchers focus our efforts on the needed theoretical investigation of Big Data, to help move from sentiments to theoretically-backed and practically-useful knowledge.

Wednesday, March 07, 2012

Forecasting + Analytics = ?

Quantitative forecasting is an age-old discipline, highly useful across different functions of an organization: from forecasting sales and workforce demand to economic forecasting and inventory planning.

Business schools have offered courses with titles such as "Time Series Forecasting", "Forecasting Time Series Data", "Business Forecasting", more specialized courses such as "Demand Planning and Sales Forecasting", and even graduate programs titled "Business and Economic Forecasting". Plain "Forecasting" is also popular. Such courses are offered at the undergraduate, graduate and even executive-education levels. All these titles might convey the importance and usefulness of forecasting, but they are far from conveying its coolness.

I've been struggling to find a better term for the courses that I teach on-ground and online, as well as for my recent book (with the boring name Practical Time Series Forecasting). The name needed to convey that we're talking about forecasting, particularly about quantitative data-driven forecasting, plus the coolness factor. Today I discovered it! Prof Refik Soyer from GWU's School of Business will be offering a course called "Forecasting for Analytics". A quick Google search did not find any results with this particular phrase -- so the credit goes directly to Refik. I also like "Forecasting Analytics", which links it to its close cousins "Predictive Analytics" and "Visual Analytics", all members of the Business Analytics family.


Monday, February 20, 2012

Explain or predict: simulation

Some time ago, when I presented the "explain or predict" work, my colleague Avi Gal asked where simulation falls. Simulation is a key method in operations research, as well as in statistics. A related question arose in my mind when thinking of Scott Nestler's distinction between descriptive/predictive/prescriptive analytics. Scott defines prescriptive analytics as "what should happen in the future? (optimization, simulation)".

So where does simulation fall? Does it fall in a completely different goal category, or can it be part of the explain/predict/describe framework?

My opinion is that simulation, like other data analytics techniques, does not define a goal in itself but is rather a tool for achieving one of the explain/predict/describe goals. When the purpose is to test causal hypotheses, simulation can be used to study what would happen if the causal effect were true, by simulating data under the "causally-true" hypothesis and comparing it to data from "causally-false" scenarios. In predictive and forecasting tasks, where the purpose is to predict new or future data, simulation can be used to generate predictions. It can also be used to evaluate the robustness of predictions under different scenarios (that would have been very useful in recent years' economic forecasts!). In descriptive tasks, where the purpose is to approximate data and quantify relationships, simulation can be used to check the sensitivity of the quantified effects to various model assumptions.
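To make the "what-if" use of simulation concrete, here is a minimal sketch (hypothetical effect sizes, Python standard library only) that generates data under a "causally-true" hypothesis -- a treatment effect of 0.5 -- and under the "causally-false" null, and compares the estimated effects:

```python
import random
import statistics

random.seed(1)

def simulate(effect, n=5000):
    """Simulate a two-group experiment: treated units get a mean shift of `effect`."""
    treated = [effect + random.gauss(0, 1) for _ in range(n)]
    control = [random.gauss(0, 1) for _ in range(n)]
    return statistics.fmean(treated) - statistics.fmean(control)

# What-if comparison: data generated under the "causally-true" hypothesis
# (effect = 0.5) versus the "causally-false" one (effect = 0)
true_world = simulate(effect=0.5)
null_world = simulate(effect=0.0)
print(f"estimated effect if causal: {true_world:.2f}, if not: {null_world:.2f}")
```

Comparing the two simulated worlds against the actually observed data is one way to probe a causal hypothesis; repeating the simulation under varied assumptions probes robustness.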

On a related note, Scott challenged me on a post from two years ago where I stated that the term data mining as used in operations research (OR) does not really mean data mining. I still hold that view, although the terminology has since changed: INFORMS now uses the term Analytics in place of data mining. This term is indeed a much better choice, as it is an umbrella covering a variety of data analytics methods, including data mining, statistical models and OR methods. David Hardoon, Principal Analytics at SAS Singapore, has shown me several terrific applications that combine methods from these different toolkits. As is often the case, combining methods from different disciplines is the best way to add value.

Tuesday, December 20, 2011

Trading and predictive analytics

I attended today's class in the course Trading Strategies and Systems offered by Prof Vasant Dhar from NYU Stern School of Business. Luckily, Vasant is offering the elective course here at the Indian School of Business, so no need for transatlantic travel.

The topic of this class was the use of news in trading. I won't disclose any trade secrets (you'll have to attend the class for that), but here's my point: trading is a striking example of the distinction between explanation and prediction. Trading techniques are generally based on correlations and on "blackbox" predictive models such as neural nets. In particular, text mining and sentiment analysis are used to extract information from (often unstructured) news articles for the purpose of prediction.
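As an illustration of the general idea (not of any actual trading system), here is a toy bag-of-words sentiment scorer for headlines. The word lists are invented for this example; a real system would learn such weights from labeled data rather than rely on a hand-built lexicon:

```python
from collections import Counter

# Toy lexicon -- invented for illustration; a real system learns these weights
POSITIVE = {"beats", "growth", "record", "surge", "upgrade"}
NEGATIVE = {"misses", "loss", "probe", "downgrade", "recall"}

def sentiment_score(headline: str) -> int:
    """Crude bag-of-words sentiment: +1 per positive word, -1 per negative word."""
    words = Counter(headline.lower().split())
    return sum(words[w] for w in POSITIVE) - sum(words[w] for w in NEGATIVE)

print(sentiment_score("Acme beats estimates and upgrade expected"))  # 2
print(sentiment_score("Regulator opens probe into Acme loss"))       # -2
```

The score says nothing about *why* news moves prices; it is purely a predictive feature, which is exactly the explain/predict distinction at work.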

Vasant mentioned the practical advantage of a machine-learning approach for extracting useful content from text over linguistics know-how. This reminded me of a famous comment by Frederick Jelinek, a prominent Natural Language Processing researcher who passed away recently:
"Whenever I fire a linguist our system performance improves" (Jelinek, 1998)
This comment was based on Jelinek's experience at IBM Research, while working on computer speech recognition and machine translation.

Jelinek's comment did not make linguists happy. He later defended this claim in a paper entitled "Some of My Best Friends are Linguists" by commenting,
"We all hoped that linguists would provide us with needed help. We were never reluctant to include linguistic knowledge or intuition into our systems; if we didn't succeed it was because we didn't find an efficient way to include it."
Note: there are some disputes regarding the exact wording of the quote ("Anytime a linguist leaves the group the recognition rate goes up") and its timing -- see note #1 in the Wikipedia entry.

Wednesday, December 07, 2011

Polleverywhere.com -- how it worked out

Following up on my earlier post about the use of polleverywhere.com for polling in class, here is a summary of my experience using it in a data mining elective course @ ISB (38 students, after four sessions):
  • Creating polls: After a few tries and with some very helpful tips from a PE representative, I was able to create polls and embed them into my PowerPoint slides. This is relatively easy and user-friendly. One feature currently missing in PE, which I use a lot, is the inclusion of a figure on the poll slide (for example, a snippet of some software output). Although you can paste the image onto the PPT slide, it takes a bit of testing to place it so that it does not overlap the poll. Also, if you need to run the poll in a browser instead of the PPT (see below), the image won't be there...
  • Operation in class: PE requires a good Internet connection for the instructor and for all users voting from laptops or other wireless devices. Although wireless is generally operational in the classroom I used, I did encounter a few occasions when it was flaky, which is very disruptive (the poll does not load; students cannot respond). Secondly, I found that voting takes much longer with mobiles/laptops than with clickers. What would have taken 30 seconds with clickers can take several minutes with PE voting.
  • Student adoption: During the first session students were curious and quickly figured out how to vote. Students could either vote using a browser (I created the page pollev.com/profgalit where live polls would show up) or those lacking Internet access used their mobiles to tweet via SMS (Airtel free SMS to 53000; other carriers SMS to Bangalore number 09243000111 via smstweet.in). As the sessions progressed, the number of voters started dropping drastically. I suspected that this might be a result of my changing the settings to allow only registered users to vote. So I switched back to "anyone can vote", yet the voting percentage remained very low.
I have never graded voting; rather, I use it as a fun active-learning tool. With clickers the response rate was typically around 80-90%, while with PE it is currently below 50%. Given our occasional Internet challenges, the longer voting time, and especially the low response rate, I will be going back to clickers for now.

I foresee that PE would work nicely in a setting such as a one-time talk at a large conference or a one-day workshop for execs. I will also mention the excellent and timely support from PE. And, of course, the low price!