Tuesday, March 13, 2012

Data liberation via visualization

"Data democratization" movements try to make data, and especially government-held data, publicly available and accessible. A growing number of technological efforts are devoted to this goal, and especially to the accessibility part. One such effort comes from data visualization companies. A recent trend is to offer a free version (or at least free for some period) that is based on sharing your visualization and/or data to the Web. The "and/or" here is important, because in some cases you cannot share your data, but you would like to share the visualizations with the world. This is what I call "data liberation via visualization". It applies to proprietary data: often, even if I'd love to make data publicly available, binding contracts do not allow me to do so.

As part of a "data liberation via visualization" initiative, I went in search of a good free solution for disseminating interactive visualization dashboards while protecting the actual data. The two main free viz players in the market are TIBCO Spotfire Silver (Personal version, free 1-year license) and Tableau Public (free). Both allow *only* public posting of your visualizations (if you want to save visualizations privately you must get the paid versions). That's fine. However, public posting of visualizations with these tools comes with a download button that makes your data public as well.

I then tried MicroStrategy Cloud Personal (free Beta version), which does allow public (and private!) posting of visualizations and does not provide a download button. Of course, in order to make visualizations public, the data must sit on a server that can be reached from the visualization. All the free public-posting tools keep your data on the company's servers, so you must trust the company to protect the confidentiality and safety of your data. MicroStrategy uses a technology where the company itself cannot download your data (your Excel sheet is converted to in-memory cubes that are stored on the server). Unfortunately, the tool lacks the ability to create dashboards with multiple charts (combining multiple charts into a fully-linked interactive view).

Speaking of features, Tableau Public is the only one that has full-fledged functionality like its paid cousins. Spotfire Silver Personal is stripped of highly useful charts such as scatterplots and boxplots. MicroStrategy Cloud Personal lacks multi-view dashboards and for now accepts only Excel files as input.

Sunday, March 11, 2012

Big Data: The Big Bad Wolf?

"Big Data" is a big buzzword. I bet that sentiment analysis of news coverage, blog posts and other social media sources would show a strong positive sentiment associated with Big Data. What exactly big data is depends on who you ask. Some people talk about lots of measurements (what I call "fat data"), others about huge numbers of records ("long data"), and some talk about both. How much is big? Again, it depends on who you ask.

As a statistician who's (luckily) strayed into data mining, I initially had the traditional knee-jerk reaction of "just get a good sample and get it over with", and later recognized that "fitting the data to the toolkit" (or, "to a hammer everything looks like a nail") is straitjacketing some great opportunities.

The LinkedIn group Advanced Business Analytics, Data Mining and Predictive Modeling reacted passionately to the question "What is the value of Big Data research vs. good samples?" posted by statistician and analytics veteran Michael Mout. Respondents have been mainly from industry: statisticians and data miners. I'd say that a sentiment analysis would come out mixed, but slightly negative at first ("at some level, big data is not necessarily a good thing"; "as statisticians, we need to point out the disadvantages of Big Data"). Over time, sentiment appears to become more positive, but it does not come anywhere close to the huge Big Data excitement in the media.

I created a Wordle of the text in the discussion until today (size represents frequency). It highlights the main advantages and concerns of Big Data. Let me elaborate:
  • Big data permit the detection of complex patterns (small effects, high order interactions, polynomials, inclusion of many features) that are invisible with small data sets
  • Big data allow studying rare phenomena, where a small percentage of records contain an event of interest (fraud, security)
  • Sampling is still highly useful with big data (see also blog post by Meta Brown); with the ability to take lots of smaller samples, we can evaluate model stability, validity and predictive performance
  • Statistical significance and p-values become meaningless when statistical models are fitted to very large samples, because even negligible effects come out as statistically significant. It is then practical significance that plays the key role.
  • Big data support the use of algorithmic data mining methods that are good at feature selection. Of course, it is still necessary to use domain knowledge to avoid "garbage-in-garbage-out"
  • Such algorithms might be black-boxes that do not help understand the underlying relationship, but are useful in practice for predicting new records accurately
  • Big data allow the use of many non-parametric methods (statistical and data mining algorithms) that make far fewer assumptions about data (such as independent observations)
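
To make the point about p-values concrete, here is a minimal sketch of my own (it is not from the LinkedIn discussion). It assumes a simple two-sample z-test with a known, equal standard deviation in both groups, and a hypothetical true difference of 0.01 standard deviations, which is negligible for most practical purposes. The effect size stays fixed; only the sample size grows.

```python
import math


def two_sample_z_pvalue(mean_diff, sigma, n_per_group):
    """Two-sided p-value for a two-sample z-test with known, equal sigma."""
    # Standard error of the difference between two independent group means.
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical numbers chosen only for illustration.
MEAN_DIFF = 0.01  # observed difference between group means
SIGMA = 1.0       # assumed known standard deviation in each group

for n in (100, 10_000, 1_000_000, 100_000_000):
    p = two_sample_z_pvalue(MEAN_DIFF, SIGMA, n)
    print(f"n per group = {n:>11,}  effect = {MEAN_DIFF / SIGMA:.2f} sd  p-value = {p:.3g}")
```

With a hundred records per group the difference is nowhere near significant; with a million per group the p-value is tiny, although the effect is exactly as small as before. Whether a 0.01 standard deviation difference matters is a question for domain knowledge, not for the p-value.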
Thanks to social media, we're able to tap into many brains that have experience, expertise and... some preconceptions. The data collected from such forums can help us researchers focus our efforts on the needed theoretical investigation of Big Data, and help us move from sentiments to theoretically-backed-and-practically-useful knowledge.

Wednesday, March 07, 2012

Forecasting + Analytics = ?

Quantitative forecasting is an age-old discipline, highly useful across different functions of an organization: from forecasting sales and workforce demand to economic forecasting and inventory planning.

Business schools have offered courses with titles such as "Time Series Forecasting", "Forecasting Time Series Data" and "Business Forecasting", more specialized courses such as "Demand Planning and Sales Forecasting", and even graduate programs with the title "Business and Economic Forecasting". Simple "Forecasting" is also popular. Such courses are offered at the undergraduate, graduate and even executive education levels. All these titles might convey the importance and usefulness of forecasting, but they are far from conveying the coolness of forecasting.

I've been struggling to find a better term for the courses that I teach on-ground and online, as well as for my recent book (with the boring name Practical Time Series Forecasting). The name needed to convey that we're talking about forecasting, particularly about quantitative data-driven forecasting, plus the coolness factor. Today I discovered it! Prof Refik Soyer from GWU's School of Business will be offering a course called "Forecasting for Analytics". A quick Google search did not find any results with this particular phrase -- so the credit goes directly to Refik. I also like "Forecasting Analytics", which links it to its close cousins "Predictive Analytics" and "Visual Analytics", all members of the Business Analytics family.