Monday, August 29, 2011

Active learning: going mobile in India

I've been using "clickers" since 2002 in all my courses. Clickers are polling devices that students use during class to answer multiple-choice questions that I include in my slides. They encourage students to participate (even the shy ones), they give the teacher immediate feedback about students' knowledge, and they are a great ice-breaker for generating interesting discussions. Of course, clickers are also fun. Most students love this active learning technology (statistically speaking, around 90% love it and 10% don't).

Clicker technology has greatly evolved since 2002. Back then, my students would watch me (in astonishment) climbing on chairs before class to place receivers above the blackboard, to allow their infra-red, line-of-sight clickers (the size of TV remotes) to reach the receivers. The receivers were the size of a large matchbox. Slowly the clickers and receivers started shrinking in size and weight...


A few years later came the slick credit-card-size radio-frequency (RF) clickers that did not require line-of-sight. My receiver shrank to the size of an obese USB stick.

I still love clickers, but am finding their price (hardware and software) unreasonable for education purposes. The high prices ($40/clicker in the USA) are also applicable in India, as I've discovered (a quote of over $4,000 for a set of 75 clickers and a receiver raised my eyebrows to my hairline). In addition, now that everyone carries around this gadget called a mobile phone, why burden my students with yet-more-hardware?


This brought me to research using mobiles for polling. I discovered www.polleverywhere.com, which offers a facility for creating polls via their website, then embedding the polls into slides (PowerPoint etc.). Students can respond with their mobile phones by sending an SMS, tweeting, or using the Internet. I am especially interested in the mobile option, to avoid needing a wireless Internet connection, smartphones, or laptops in class.


So, how does this work in India?
The bad news: While in the USA and Canada the SMS option is cheap (local number), polleverywhere.com does not have a local number for India (you must text an Australian number).

The good news: Twitter! Students with Bharti Airtel plans can tweet to respond to a poll (that is, send an SMS to a local number in India). I just tested this from Bhutan, and tweeting works beautifully.

The even-better news: Those using other Indian carriers can still tweet using the cool workaround provided by www.smstweet.in. This allows tweeting to a number in Bangalore.

The cost? A fraction of the clicker price to the university (around $700/year for 200 students using the system in parallel) and only local SMS cost to the students. How well will this system work in practice? I am planning to try it out in my upcoming course Business Intelligence Using Data Mining @ ISB, and will post about my experience.

Wednesday, August 17, 2011

Where computer science and business meet

Data mining is taught very differently at engineering schools and at business schools. At engineering schools, data mining is taught more technically, deciphering how different algorithms work. In business schools the focus is on how to use algorithms in a business context.

Business students with a computer science background can now enjoy both worlds: take a data mining course with a business focus, and supplement it with the free course materials from Stanford Engineering school's Machine Learning course (including videos of lectures and handouts by Prof Andrew Ng). There are a bunch of other courses with free materials as part of the Stanford Engineering Everywhere program.

Similarly, computer science students with a business background can take advantage of MIT's Sloan School of Management Open Courseware program, and in particular their Data Mining course (last offered in 2003 by Prof Nitin Patel). Unfortunately, there are no lecture videos, but you do have access to handouts.

And for instructors in either world, these are great resources!


Thursday, August 04, 2011

The potential of being good

Yesterday I happened to hear talks by two excellent speakers, both on major data mining applications in industry. One common theme was that both speakers gave compelling and easy-to-grasp examples of what data mining algorithms and statistics can do beyond human intelligence, and how the two relate.

The first talk, by Christer Johnson of IBM Global Services, was given at the 2011 INFORMS Conference on Business Analytics and Operations Research (see video). Christer Johnson described the idea behind Watson, the artificial intelligence computer system developed by IBM that beat two champions of the Jeopardy quiz show. Two main points in the talk about the relationship between humans and data mining methods that I especially liked are:
  1. Data analytics methods are designed not only to give an answer, but also to evaluate how confident they are about the answer. In answering the Jeopardy questions, the data mining approach tells you not only what is the most likely answer, but also how confident you are about that answer.
  2. Building trust in an analytics tool occurs when you see it make mistakes and learn from those mistakes.
The second talk, "The Art and Science of Matching Items to Users", was given by Deepak Agarwal, a Yahoo! principal research scientist and fellow statistician, and was webcast at ISB's seminar series. You can still catch it on Aug 10 at Yahoo!'s Big Thinker Series in Bangalore. The talk was about recommender systems and their use within Yahoo!. Among various approaches used by Yahoo! to improve recommendations, Deepak described a main idea for improving the customization of news item displays on news.yahoo.com.

On the relation between human intelligence and automation, the process of choosing which items to display on Yahoo! is a two-step process, where first human editors create a pool of potential interesting news items, and then automated machine-learning algorithms choose which individual items to display from that pool.

Like Christer Johnson's point #1, Deepak illustrated the difference between "the answer" (what we statisticians call a point estimate) and "the potential of it being good" (what we call the confidence in the estimate, AKA variability) in a very cool way: Consider two news items of which one will be displayed to a user. The first item was already shown to 100 users and 2 users clicked on links from that page. The second was shown to 10,000 users and 250 users clicked on links. Which news item should you show to maximize clicks? (yes, this is about ad revenues...) Although the first item has a lower click-through-rate (2% vs. 2.5%), it is also less certain, in the sense that it is based on far less data than item 2. Hence, it is potentially good. He then took this one step further: Combine the two! "Exploit what is known to be good, explore what is potentially good".
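
To make the two-item example concrete, here is a minimal sketch of the point estimate versus its uncertainty. It is not the method Deepak described or anything Yahoo! uses; it simply computes each item's click-through rate and a 95% Wilson score interval for it. The function name wilson_interval and the choice of a Wilson interval are my own assumptions for illustration.

```python
import math

def wilson_interval(clicks, views, z=1.96):
    """Wilson score interval for a click-through rate (illustrative choice)."""
    p = clicks / views                      # point estimate (the "answer")
    denom = 1 + z**2 / views
    center = (p + z**2 / (2 * views)) / denom
    half = z * math.sqrt(p * (1 - p) / views + z**2 / (4 * views**2)) / denom
    return center - half, center + half

# Counts taken from the example in the talk: (clicks, users shown)
items = {"item 1": (2, 100), "item 2": (250, 10_000)}

for name, (clicks, views) in items.items():
    low, high = wilson_interval(clicks, views)
    print(f"{name}: CTR = {clicks / views:.3f}, "
          f"95% interval = ({low:.3f}, {high:.3f})")
```

Item 1 has the lower point estimate, but its interval is much wider and its upper limit lies above item 2's. That upper limit is one way to quantify "the potential of it being good".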

So what do we have here? Very practical and clear examples of why we care about variance, the weakness of point estimates, and expanding the notion of diversification to combining certain good results with uncertain not-that-good results.

Wednesday, July 27, 2011

Analytics: You want to be in Asia

Business Intelligence and Data Mining have become hot buzzwords in the West. Using Google Insights for Search to "see what the world is searching for" (see image below), we can see that the popularity of these two terms seems to have stabilized (if you expand the search to 2007 or earlier, you will see the earlier peak and also that Data Mining was hotter for a while). Click on the image to get to the actual result, with which you can interact directly. There are two very interesting insights from this search result:
  1. Looking at the "Regional Interest" for these terms, we see that the #1 country searching for these terms is India! Hong Kong and Singapore are also in the top 5. A surge of interest in Asia!
  2. Adding two similar terms that have the term Analytics, namely Business Analytics and Data Analytics, unveils a growing interest in Analytics (whereas the two non-analytics terms have stabilized after their peak).
What to make of this? First, it means Analytics is hot. Business Analytics and Data Analytics encompass methods for analyzing data that add value to a business or any other organization. Analytics includes a wide range of data analysis methods, from visual analytics to descriptive and explanatory modeling, and predictive analytics. From statistical modeling, to interactive visualization (like the one shown here!), to machine-learning algorithms and more. Companies and organizations are hungry for methods that can turn their huge and growing amounts of data into actionable knowledge. And the hunger is most pressing in Asia.

Thursday, July 14, 2011

Designing an experiment on a spatial network: To Explain or To Predict?

Image from http://www.slews.de
Spatial data are inherently important in environmental applications. An example is collecting data from air or water quality sensors. Such data collection mechanisms introduce dependence in the collected data due to their spatial proximity/distance. This dependence must be taken into account not only in the data analysis stage (and there is a good statistical literature on spatial data analysis methods), but also in the design of experiments stage. Two examples of design questions are where to locate the sensors and how many sensors are needed.

Where does explain vs. predict come into the picture? An interesting 2006 article by Dale Zimmerman called "Optimal network design for spatial prediction, covariance parameter estimation, and empirical prediction" tells the following story:
"...criteria for network design that emphasize the utility of the network for prediction (kriging) of unobserved responses assuming known spatial covariance parameters are contrasted with criteria that emphasize the estimation of the covariance parameters themselves. It is shown, via a series of related examples, that these two main design objectives are largely antithetical and thus lead to quite different “optimal” designs" 
(Here is the freely available technical report).

Monday, June 20, 2011

Got Data?!

The American Statistical Association's store used to sell cool T-shirts with the old-time beggar-statistician question "Got Data?" Today it is much easier to find data, thanks to the Internet. Dozens of student teams taking my data mining course have been able to find data from various sources on the Internet for their team projects. Yet, I often receive queries from colleagues in search of data for their students' projects. This is especially true for short courses, where students don't have sufficient time to search and gather data (which is highly educational in itself!).

One solution that I often offer is data from data mining competitions. KDD Cup is a classic, but there are lots of other data mining competitions that make huge amounts of real or realistic data available: past INFORMS Data Mining Contests (2008, 2009, 2010), ENBIS Challenges, and more. Here's one new competition to add to the list:

The European Network for Business and Industrial Statistics (ENBIS) announced the 2011 Challenge (in collaboration with SAS JMP). The title is "Maximising Click Through Rates on Banner Adverts: Predictive Modeling in the On Line World". It's a bit complicated to find the full problem description and data on the ENBIS website (you'll find yourself clicking-through endless "more" buttons - hopefully these are not data collected for the challenge!), so I linked them up.

It's time for T-shirts saying "Got Data! Want Knowledge?"

Friday, June 17, 2011

Scatter plots for large samples

While huge datasets have become ubiquitous in fields such as genomics, large datasets are now also beginning to infiltrate research in the social sciences. Data from eCommerce sites, online dating sites, etc. are now collected as part of research in information systems, marketing and related fields. We can now find social science research papers with hundreds of thousands of observations and more.

A common type of research question in such studies is about the relationship between two variables. For example, how does the final price of an online auction relate to the seller's feedback rating? A classic exploratory tool for examining such questions (before delving into formal data analysis) is the scatter plot. In small sample studies, scatter plots are used for exploring relationships and detecting outliers.

Image from http://prsdstudio.com/ 
With large samples, however, the scatter plot runs into a few problems. With lots of observations, there is likely to be too much overlap between markers on the scatter plot, even to the point of insufficient pixels to display all the points.

Here are some large-sample strategies to make scatter plots useful:

  1. Aggregation: display groups of observations in a certain area on the plot as a single marker. Size or color can denote the number of aggregated observations.
  2. Small-multiples: split the data into multiple scatter plots by breaking down the data into (meaningful) subsets. Breaking down the data by geographical location is one example. Make sure to use the same axis scales on all plots - this will be done automatically if your software allows "trellising".
  3. Sample: draw smaller random samples from the large dataset and plot them in multiple scatter plots (again, keep the axis scales identical on all plots).
  4. Zoom-in: examine particular areas of the scatter plot by zooming in.
Finally, with large datasets it is useful to consider charts that are based on aggregation such as histograms and box plots. For more on visualization, see the Visualization chapter in Data Mining for Business Intelligence.
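
For readers who want to try the aggregation and sampling strategies, here is a minimal sketch using only the Python standard library. The data are synthetic, and the function names (aggregate_to_grid, draw_samples) and cell sizes are hypothetical choices of mine; a plotting package would then draw one marker per grid cell, or one scatter plot per sample.

```python
import random
from collections import Counter

def aggregate_to_grid(points, cell_width, cell_height):
    """Aggregation: count the observations that fall in each grid cell.
    Each cell can then be drawn as one marker, sized or colored by count."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_width), int(y // cell_height))
        counts[cell] += 1
    return counts

def draw_samples(points, sample_size, n_samples, seed=0):
    """Sampling: draw several small random samples, one per scatter plot."""
    rng = random.Random(seed)
    return [rng.sample(points, sample_size) for _ in range(n_samples)]

# Synthetic stand-in for a large dataset (hypothetical values, not real data)
rng = random.Random(42)
points = [(rng.uniform(0, 100), rng.gauss(50, 15)) for _ in range(100_000)]

grid = aggregate_to_grid(points, cell_width=10, cell_height=10)
samples = draw_samples(points, sample_size=1_000, n_samples=4)

print(f"{len(points)} points reduced to {len(grid)} aggregated markers")
print(f"{len(samples)} samples of {len(samples[0])} points each")
```

When plotting the samples side by side, remember to fix the axis limits from the full dataset so that all plots share the same scales.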