Thursday, April 08, 2010

Mathematics, statistics, and machine learning

Today's Wall Street Journal featured an article called New Hiring Formula Values Math Pros, about how Bay Area companies
"are in hot pursuit of a particular kind of employee: those with experience in statistics and other data-manipulation techniques."
Let's take a look at a few more quotes from the article:
"Being a math geek has never been cooler"
"[companies] need more workers with stronger backgrounds in statistics and a related field called machine learning, which involves writing algorithms that get smarter over time by looking for patterns in large data sets"
This article, like many others, confuses very different fields of expertise. Try showing the following to any statistician or data miner in your vicinity:

math = statistics = data manipulation = machine learning

Most likely, you will be showered with an angry explanation of how these are different. One of the comments on the WSJ article was "I wish there wasn't such a tendency to think that 'machine learning' and 'applied statistics' are the same thing. There are plenty of other ways to apply statistics than to use the techniques of machine learning."

Although there are many views on the subject, here's my take on the basic differences:

  1. Mathematics is the farthest from the other three. Although math is one of the foundations of these fields (as it is of physics), its goal is very different from the data-driven goals of statistics and machine learning. Most mathematicians never touch real data in their lives. I will make one caveat: while I'm talking about applied statistics, some theoretical statisticians might be close to mathematicians. I once sat on a PhD committee with theoretical statisticians. I suggested that the student justify the need for the complicated methodological development that she had worked on, when a simple solution could do a decent job. Her advisor stormed in, saying "here we like to prove theorems!"
  2. Applied statistics is a sophisticated set of technologies to quantify uncertainty from data in real-life situations for the purpose of theory testing, description, or prediction. Applied statisticians start from a real problem with real data, and apply (and develop) methods for solving the problem.
  3. Data Manipulation -- this is a very strange term. Statisticians might use this term to refer to initial data cleaning, or other initial data operations such as taking transformations or removing outliers. Data miners might use this term to describe the creation of new variables from existing ones, or converting some variables into a new format. I have no idea what a mathematician would think of this term ("what do you mean by data?").
  4. Machine learning, also often termed data mining, is a sophisticated set of technologies that are usually aimed at predicting new data, by automatically learning patterns from existing data. Lots of existing data. The field uses a variety of algorithms and methods from statistics and artificial intelligence. Although statistical methods (such as linear and logistic regression) are commonly used in machine learning, they are most often used in a purely predictive way. Many of the artificial intelligence-based algorithms are "black-box" in that they provide excellent results (e.g., accurate predictions) but do not rely on formal causal models and do not shed direct light on the causal relationship between the inputs and the outputs.
Sometimes computer scientists use the term "data mining" to refer to data warehousing and data querying. That is completely different from #4.
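To make the two senses of "data manipulation" in #3 a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the income and customer values are made up, the function names are hypothetical, and the quartile-fence rule is just one common convention for flagging outliers, not the only one.

```python
import math
import statistics

# Hypothetical incomes; the last value is an obvious outlier.
incomes = [32_000, 35_000, 38_500, 39_000, 41_000, 43_000, 45_000, 1_250_000]

# Statistician's sense: initial operations such as transformations
# and outlier removal.
def log_transform(values):
    """Take natural logs to compress a right-skewed variable."""
    return [math.log(v) for v in values]

def remove_outliers(values, k=1.5):
    """Drop values outside the quartile fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]

# Data miner's sense: creating new variables from existing ones and
# converting a variable into a new format.
records = [
    {"total_spent": 120.0, "n_orders": 4, "segment": "retail"},
    {"total_spent": 900.0, "n_orders": 3, "segment": "wholesale"},
]

def derive_features(record):
    """Add a derived ratio and a 0/1 indicator for a text category."""
    return {
        **record,
        "avg_order_value": record["total_spent"] / record["n_orders"],
        "is_wholesale": 1 if record["segment"] == "wholesale" else 0,
    }

cleaned = remove_outliers(incomes)
logged = log_transform(cleaned)
enriched = [derive_features(r) for r in records]
```

In both senses the operations are preparation steps: none of them quantifies uncertainty or predicts anything by itself.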

And finally, there is one last "equivalence" that irritates me: data mining = operations research (OR). When OR colleagues use the term "data mining" they are usually talking about an optimization problem. They do not typically use machine learning or statistical methods. Or, if they do mean "data mining" as in #4 above, then it has nothing to do with OR. For example, see Professor Michael Trick's Operations Research Blog entry Data Mining, Operations Research, and Predicting Murder. How is this related to OR?


tmorito said...

This topic is really helpful for me in understanding "data mining" more deeply. I have had many chances to hear and read the term "data mining" in finance courses. Conceptually, it seems to be used there as a strong tool for proving theories and formulas. Interestingly, though, in the field of quantitative equity portfolio management, "data mining" is considered a bad, highly inappropriate practice. Data mining only uncovers many statistical relationships in historical data and helps pick the one that explains past stock returns most accurately. Hence, it is concluded that it has very little ability to predict future stock returns. It may be outside the scope of this topic, but I should learn when it is best to apply "data mining" skills in the real world, and I should know what "data mining" can really do. Therefore, this kind of definition, covering who defines "data mining" and how they define it, helps my awareness. Thanks.

Galit Shmueli said...

Thank you for your comment and the peek into "data mining" in finance. The term "data mining" is indeed often (erroneously) used to describe "data dredging", where you find random patterns in the data and claim that they are non-random. It is interesting to learn that this is the case in the field of quantitative equity portfolio management. Of course, data mining is not the same as data dredging. In fact, over-fitting is considered the biggest danger to predictive accuracy, and data mining methodology contains tools to assess and avoid it (e.g., evaluating performance on a holdout set and comparing the performance on the training and holdout sets).
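
For readers who want to see the difference between data dredging and a holdout check, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not a real trading study: the "returns" and the "signals" are pure random noise generated in the code, and the names dredge and correlation are hypothetical helpers made up for this example.

```python
import random
import statistics

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def dredge(n_train=30, n_holdout=30, n_signals=500, seed=0):
    """Pick the noise signal that best 'explains' past returns, then
    report its correlation on the training and holdout periods."""
    rng = random.Random(seed)
    n = n_train + n_holdout
    # Assumption: returns and signals are independent noise, so any
    # relationship found between them is random by construction.
    returns = [rng.gauss(0, 1) for _ in range(n)]
    signals = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_signals)]

    # Dredging step: keep whichever signal fits the past most closely.
    best = max(
        signals,
        key=lambda s: abs(correlation(s[:n_train], returns[:n_train])),
    )

    # Holdout step: evaluate the chosen signal on data it was not chosen on.
    train_r = correlation(best[:n_train], returns[:n_train])
    holdout_r = correlation(best[n_train:], returns[n_train:])
    return train_r, holdout_r

train_r, holdout_r = dredge()
print(f"training correlation: {train_r:+.2f}")
print(f"holdout correlation:  {holdout_r:+.2f}")
```

Because the winning signal was selected for its fit to the training period, its training correlation looks impressive even though it is noise. Averaged over repeated runs, the holdout correlation is much closer to zero, and that gap between training and holdout performance is the warning sign of over-fitting.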