Thursday, April 08, 2010

Mathematics, statistics, and machine learning

Today's Wall Street Journal featured an article called "New Hiring Formula Values Math Pros," about how Bay Area companies
"are in hot pursuit of a particular kind of employee: those with experience in statistics and other data-manipulation techniques."
Let's take a look at a few more quotes from the article:
"Being a math geek has never been cooler"
"[companies] need more workers with stronger backgrounds in statistics and a related field called machine learning, which involves writing algorithms that get smarter over time by looking for patterns in large data sets"
This article, like many others, conflates very different fields of expertise. Try the following equation on any statistician or data miner in your vicinity:

math = statistics = data manipulation = machine learning

Most likely, you will be showered with an angry explanation of how these are different. One of the comments on the WSJ article was: "I wish there wasn't such a tendency to think that 'machine learning' and 'applied statistics' are the same thing. There are plenty of other ways to apply statistics than to use the techniques of machine learning."

Although there are many views on the subject, here's my take on the basic differences:

  1. Mathematics is the farthest from the other three. Although math is one of the foundations of these fields (as it is of physics), its goal is very different from the data-driven goals of statistics and machine learning. Most mathematicians never touch real data in their lives. I will make one caveat: while I'm talking about applied statistics, some theoretical statisticians might be close to mathematicians. I once sat on a PhD committee with theoretical statisticians. I suggested that the student justify the need for the complicated methodological development she had worked on, when a simple solution could do a decent job. Her advisor stormed in, saying "here we like to prove theorems!"
  2. Applied statistics is a sophisticated set of technologies for quantifying uncertainty from data in real-life situations, for the purpose of theory testing, description, or prediction. Applied statisticians start from a real problem with real data, and apply (and develop) methods for solving the problem.
  3. Data Manipulation -- this is a very strange term. Statisticians might use this term to refer to initial data cleaning, or other initial data operations such as applying transformations or removing outliers. Data miners might use this term to describe the creation of new variables from existing ones, or the conversion of some variables into a new format. I have no idea what a mathematician would think of this term ("what do you mean by data?").
  4. Machine learning, also often termed data mining, is a sophisticated set of technologies that are usually aimed at predicting new data, by automatically learning patterns from existing data. Lots of existing data. The field uses a variety of algorithms and methods from statistics and artificial intelligence. Although statistical methods (such as linear and logistic regression) are commonly used in machine learning, they are most often used in a purely predictive way. Many of the artificial intelligence-based algorithms are "black-box" in that they provide excellent results (e.g., accurate predictions) but do not rely on formal causal models and do not shed direct light on the causal relationship between the inputs and the outputs.
Sometimes computer scientists use the term "data mining" to refer to data warehousing and data querying. That is completely different from #4.
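
To make the two senses of "data manipulation" in #3 a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the records are invented, the names (RAW, REFERENCE_YEAR, drop_outliers, log_income, derive_features) are hypothetical, and the median-absolute-deviation cutoff is just one of many possible outlier rules, not a recommendation.

```python
import math
import statistics

# Hypothetical records: (customer id, annual income, signup year). All values invented.
RAW = [
    ("a", 42_000, 2004),
    ("b", 51_000, 2007),
    ("c", 38_000, 2009),
    ("d", 47_000, 2001),
    ("e", 45_000, 2006),
    ("f", 2_500_000, 2008),  # an extreme income
]

REFERENCE_YEAR = 2010  # assumed "today", used only to compute tenure


def drop_outliers(rows, k=5.0):
    """Statistician's sense: drop rows whose income lies more than k median
    absolute deviations from the median (an illustrative rule, one of many)."""
    incomes = [income for _, income, _ in rows]
    center = statistics.median(incomes)
    mad = statistics.median(abs(x - center) for x in incomes)
    return [row for row in rows if abs(row[1] - center) <= k * mad]


def log_income(rows):
    """Statistician's sense: transform a skewed variable (log base 10)."""
    return [(cid, math.log10(income), year) for cid, income, year in rows]


def derive_features(rows):
    """Data miner's sense: create new variables and convert formats."""
    median_income = statistics.median(income for _, income, _ in rows)
    return [
        {
            "id": cid,
            # New variable computed from an existing one.
            "tenure_years": REFERENCE_YEAR - year,
            # Numeric variable converted into a categorical format.
            "income_band": "high" if income > median_income else "low",
        }
        for cid, income, year in rows
    ]


cleaned = drop_outliers(RAW)
logged = log_income(cleaned)
features = derive_features(cleaned)
```

The first two functions are what a statistician would likely call data manipulation (cleaning and transforming), while the third is closer to the data miner's usage (deriving new variables and changing formats). None of them involves fitting a model.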

And finally, there is one last "equivalence" that irritates me: data mining = operations research (OR). When OR colleagues use the term "data mining" they are usually talking about an optimization problem. They do not typically use machine learning or statistical methods. Or, if they do mean "data mining" as in #4 above, then it has nothing to do with OR. For example, see Professor Michael Trick's Operations Research Blog entry "Data Mining, Operations Research, and Predicting Murder." How is this related to OR?