Tuesday, February 23, 2010

Online data collection

Online data are a huge resource for research as well as for practice. Although it is often tempting to "scrape everything" using technologies like web crawling, it is extremely important to keep the goal of the analysis in mind. Are you trying to build a predictive model? A descriptive model? How will the model be used? Will it be deployed to new records?

Dean Tau from Co-soft recently posted an interesting and useful comment in the LinkedIn group Data Mining, Statistics, and Data Visualization. With his permission, I am reproducing his post:

What do you need to do before online data collection?

Data collection means gathering useful intelligence for making decisions, such as determining a product's price. Nowadays, the vast and up-to-date information available online (websites, directories, B2B/B2C platforms, e-books, e-newspapers, yellow pages, official data, accessible databases) encourages more people to collect data from the Internet. Before data mining, you still need to be well prepared; as the ancient Chinese saying goes, “Preparedness ensures success, unpreparedness spells failure.”

  1. Why? Why do you want to collect intelligence, and what is your objective? What will you do with this intelligence after collection? Writing a description of your project helps the data mining team understand your aim. For example, an objective can be "I want to collect enough intelligence to determine a competitive price for my product."
  2. What? What type of information do you need to collect to support your final analysis or decision? For instance, if you want to collect the prices of similar products, you also need to collect product specifications so that like is compared with like. External factors such as coupons, gifts, or tax also need to be considered for accuracy.
  3. Where? Whether to run general keyword searches or to gather data from specific resources or databases depends on the nature of the project. E-commerce websites are a great source for prices and product specifications.
  4. Who? Will you collect the data using your own resources or outside resources? Outsourcing online research work to lower-wage countries with good internet access and a large pool of English-educated personnel, such as China, is an option for cutting costs. The people who are going to do the work need training and the necessary resources.
  5. How? Always keep the purpose of collecting the data in mind in order to improve the collection process. The methodology and process need to be defined to ensure accurate and reliable data. Making decisions based on wrong data would result in serious problems.
I've summarized these 5 tips from my own and my clients' experiences, hoping to provide some insights for you. If you have any opinions or experience in online data gathering or outsourcing, please share them with us or contact me directly.
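
As a rough illustration of the "What?" and "How?" tips in the post, here is a minimal Python sketch of how collected prices might be recorded so that they are actually comparable. It is only a sketch under stated assumptions: the names `PriceRecord` and `comparable`, the `shop-*.example` sources, and all the numbers are hypothetical, and the way coupons and tax are applied (coupon subtracted first, tax applied afterwards) is an assumption, not something specified in Tau's post.

```python
from dataclasses import dataclass


@dataclass
class PriceRecord:
    """One collected price observation (hypothetical structure)."""
    source: str            # where the price was collected
    product_spec: str      # specification used to compare like with like
    list_price: float      # advertised price
    coupon: float = 0.0    # absolute discount; assumed to apply before tax
    tax_rate: float = 0.0  # e.g. 0.10 for 10% tax

    def effective_price(self) -> float:
        # Assumption: the coupon is subtracted first, then tax is added.
        return round((self.list_price - self.coupon) * (1 + self.tax_rate), 2)


def comparable(records, spec):
    """Keep only plausible records that match the requested specification."""
    return [
        r for r in records
        if r.product_spec == spec
        and r.list_price > 0
        and 0 <= r.coupon <= r.list_price
    ]


# Made-up observations, for illustration only.
records = [
    PriceRecord("shop-a.example", "16GB/black", 100.0, coupon=10.0, tax_rate=0.10),
    PriceRecord("shop-b.example", "16GB/black", 95.0, tax_rate=0.10),
    PriceRecord("shop-c.example", "32GB/black", 120.0),  # different spec
]

prices = {r.source: r.effective_price() for r in comparable(records, "16GB/black")}
print(prices)
```

In this made-up example the shop with the higher list price ends up cheaper once the coupon is taken into account, which is the kind of distortion the post warns about when external factors are left out.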


Unknown said...

Given the short time frames that many projects in the workplace tend to receive (i.e., "the deadline for this needed to be yesterday"), I agree that outsourcing research at a low cost seems like the best solution, at first. The risks, as the article points out, are non-trivial. It all depends on whether your objective is to get the work done or to do the work well. I would certainly hope that any company I am involved with would choose the latter.

Galit Shmueli said...

Thanks Amy. "Outsourcing research at low cost" does indeed sound flaky. The question is whether the low cost is due to low quality or due to economic disparities. In the latter category I'd include high-quality statistical consulting that is done in, say, India.

If the company does have the analytic expertise in house, but not the technical staffing and/or expertise to collect online data, then outsourcing the data collection task can be a practical solution. Although ideally the data collection and the analysis are done by closely connected entities, practical considerations (available funds, expertise, etc.) could lead to outsourcing one or the other. It would therefore be important to have a carefully drafted requirements document, which Tau's 5 tips can help in generating. It would also be important to have a direct and open communication channel between the in-house analysts and the data collectors. Most likely the "outsourced" data collectors will need more information, and/or the analysts will realize that they need further data.