Tuesday, February 23, 2010

Online data collection

Online data are a huge resource for research as well as for practice. Although it is often tempting to "scrape everything" using technologies like web crawling, it is extremely important to keep the goal of the analysis in mind. Are you trying to build a predictive model? A descriptive model? How will the model be used? Will it be deployed to new records?

Dean Tau from Co-soft recently posted an interesting and useful comment in the LinkedIn group Data Mining, Statistics, and Data Visualization. With his permission, I am reproducing his post:

What do you need to do before online data collection?

Data collection is the gathering of useful intelligence for making decisions such as product price determination. Nowadays, the vast and up-to-date information available online (websites, directories, B2B/B2C platforms, e-books, e-newspapers, yellow pages, official data, accessible databases) encourages more people to collect data from the Internet. Before data mining, you still need to be well prepared, as the ancient Chinese saying goes: “Preparedness ensures success, unpreparedness spells failure.”

  1. Why do you want to collect intelligence, that is, what is your objective? What will you do with this intelligence after collection? Writing a description of your project helps the data mining team better understand your aim. For example, an objective can be "I want to collect enough intelligence to determine a competitive price for my product."
  2. What type of information do you need to collect to support your final analysis or decision? For example, if you want to collect the prices of similar products, you also need the product specifications so that you compare like with like. External factors such as coupons, gifts, or tax also need to be considered for accuracy.
  3. Where? Whether to run general keyword searches or gather data from specific resources or databases depends on the nature of the project. E-commerce websites would be a great avenue for gathering prices and product specifications.
  4. Who? Will you collect the data using your own resources or outside resources? Outsourcing online research work to lower-wage countries with Internet access and a large English-educated workforce, such as China, would be an option for cutting costs. The people who are going to do the work need training and the necessary resources.
  5. How? Always keep the purpose of collecting the data in mind to improve the collection process. The methodology and process need to be defined to ensure accurate and reliable data. Decisions made on wrong data would result in serious problems.
I've summarized these 5 tips from my own and my clients' experiences, hoping to provide some insights for you. If you have any opinions or experience in online data gathering or outsourcing, please share with us or contact me directly.
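
To make tip 2 concrete, here is a minimal Python sketch of comparing collected prices only among offers with the same specification, after adjusting for coupons and tax. The `Offer` class, the seller names, the prices, and the rule that the coupon is subtracted before tax are all hypothetical assumptions for illustration; they are not part of the original post.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """One collected price record (hypothetical structure)."""
    seller: str
    spec: str               # product specification, used to match like with like
    list_price: float
    coupon: float = 0.0     # fixed discount, ASSUMED to apply before tax
    tax_rate: float = 0.0   # e.g. 0.08 for an 8% tax

    def effective_price(self) -> float:
        # What the buyer actually pays under the assumptions above.
        return round((self.list_price - self.coupon) * (1 + self.tax_rate), 2)


# Made-up records, as they might come back from a collection run.
offers = [
    Offer("shop-a", "16GB/512GB", 1000.00, coupon=50.00, tax_rate=0.08),
    Offer("shop-b", "16GB/512GB", 980.00, tax_rate=0.10),
    Offer("shop-c", "8GB/256GB", 700.00),
]

# Compare only offers for the same specification.
target_spec = "16GB/512GB"
comparable = [o for o in offers if o.spec == target_spec]
cheapest = min(comparable, key=Offer.effective_price)

for o in comparable:
    print(o.seller, o.list_price, o.effective_price())
print("cheapest after adjustments:", cheapest.seller)
```

In this made-up data, the seller with the higher list price ends up cheaper once the coupon and tax are included, which is why those fields are worth collecting alongside the price.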

Friday, February 12, 2010

Over-fitting analogies

To explain the danger of model over-fitting in prediction to data mining newcomers, I often use the following analogy:
Say you are at the tailor's, who will be sewing an expensive suit (or dress) for you. The tailor takes your measurements and asks whether you'd like the suit to fit you exactly, or whether there should be some "wiggle room". What would you choose?
The answer is, "it depends how you plan to use the suit". If you are getting married in a few days, then probably a close fit is desirable. In contrast, if you plan to wear the suit to work throughout the next few years, you'd most likely want some "wiggle room"... The latter case is similar to prediction, where you want to make sure to accommodate new records (your body's measurements during the next few years) that are not exactly identical to the current data. Hence, you want to avoid over-fitting. The wedding scenario is similar to models built for causal explanation, where you do want the model to fit the data well (back to the explanation vs. prediction distinction).
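
The "exact fit" versus "wiggle room" choice can be tried out in a few lines of Python. This is only a sketch under assumptions of my own: the data are simulated (y = 2x plus noise), and the model is a simple k-nearest-neighbors average, where k = 1 plays the role of the suit that fits today's measurements exactly and k = 10 leaves some wiggle room.

```python
import random


def knn_predict(train, x, k):
    """Average the y-values of the k training points whose x is closest to the query x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN predictions (built from `train`) on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


def make_data(n, rng):
    """Simulated records (an assumption for this sketch): y = 2x plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2 * x + rng.gauss(0, 1)))
    return data


rng = random.Random(0)            # fixed seed so the run is repeatable
train = make_data(50, rng)        # the measurements the tailor takes today
validation = make_data(50, rng)   # new records the model has never seen

# For each k, store (training error, validation error).
results = {k: (mse(train, train, k), mse(train, validation, k)) for k in (1, 10)}
for k, (train_err, val_err) in results.items():
    print(f"k={k:2d}  training MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

With k = 1 every training record is its own nearest neighbor, so the training error is zero: the model reproduces the training data. The validation error is the number that matters for prediction, and the smoother k = 10 model will typically do better there even though it fits the training data less closely.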

I just found some nice terminology, by Bruce Ratner (GenIQ.net), explaining the idea of over-fitting:
A model is built to represent a training data... not to reproduce a training data. [Otherwise], a visitor from the validation data will not feel at home. The visitor encounters an uncomfortable fit in the model because s/he probabilistically does not look like a typical data-point from the training data. Thus, the misfit visitor takes a poor prediction