Showing posts with label data analytics. Show all posts

Tuesday, April 26, 2016

Statistical software should remove *** notation for statistical significance

Now that the emotional storm following the American Statistical Association's statement on p-values is slowing down (is it? was there even a storm outside of the statistics area?), let's think about a practical issue. One that greatly influences data analysis in most fields: statistical software. Statistical software influences which methods are used and how they are reported. Software companies thus affect entire disciplines and how they progress and communicate.
Star notation for p-value thresholds in statistical software

No matter whether your field uses SAS, SPSS (now IBM), STATA, or another statistical software package, you're likely to have seen the star notation (this isn't about hotel ratings). One star (*) means p-value<0.05, two stars (**) mean p-value<0.01, and three stars (***) mean p-value<0.001.

According to the ASA statement, the source of the problem is not p-values themselves but rather their discretization. The ASA recommends:

"P-values, when used, would be reported as values, rather than inequalities (p = .0168, rather than p < 0.05). Indeed, we envision there being better recognition that measurement of the strength of evidence really is continuous, rather than discrete."
This statement is a strong signal to the statistical software companies: continuing to use the star notation, even if your users are addicted to it, is in violation of the ASA recommendation. Will we be seeing any change soon?
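To make the contrast concrete, here is a minimal Python sketch of the two reporting styles. The function names `stars` and `exact` and the sample p-values are hypothetical, chosen only for illustration; this is not how any particular package implements its output.

```python
def stars(p):
    """Conventional star notation: collapses a p-value into threshold bins."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


def exact(p, digits=4):
    """Report the p-value itself, in the spirit of the ASA recommendation."""
    return f"p = {p:.{digits}f}"


# Hypothetical p-values on either side of the usual thresholds.
for p in (0.0499, 0.0168, 0.0101, 0.0501):
    print(f"{exact(p):12s} stars: {stars(p)!r}")
```

Under the star notation, 0.0499, 0.0168 and 0.0101 all receive the same single star, while 0.0501 receives none, even though it is nearly indistinguishable from 0.0499.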


Wednesday, August 19, 2015

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I've had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods.

From http://blog.excelmasterseries.com
Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we must factor it into m binary variables called dummy variables, D1, D2,..., Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.). Then, we include m-1 of the dummy variables in the regression model. The major point is to exclude one of the m dummy variables to avoid redundancy. The excluded dummy's category is called the "reference category". Mathematically, it does not matter which dummy you exclude, although the resulting coefficients will be interpreted relative to the reference category, so if interpretation is important it's useful to choose the reference category as the one we most want to compare against.

In linear and logistic regression models, including all m variables will lead to perfect multicollinearity, which will typically cause failure of the estimation algorithm. Smarter software will identify the problem and drop one of the dummies for you. That is why every statistics book or course on regression will emphasize the need to drop one of the dummy variables.
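The redundancy can be checked directly. The following is a small sketch in plain Python, assuming a made-up Country variable with three categories (the data, and the helper names `dummies` and `rank`, are hypothetical). It computes the rank of the design matrix with an intercept plus all m dummies, and again with one dummy dropped.

```python
from fractions import Fraction


def dummies(values, categories):
    """One 0/1 column per category, in the order given."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def rank(rows):
    """Matrix rank by Gaussian elimination, using exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        # Find a row at or below position r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


# Hypothetical data: m = 3 countries.
country = ["Japan", "USA", "India", "USA", "Japan", "India"]
cats = ["Japan", "USA", "India"]

full = [[1] + row for row in dummies(country, cats)]         # intercept + m dummies
reduced = [[1] + row for row in dummies(country, cats[1:])]  # intercept + m-1 dummies

print("all m dummies:", rank(full), "independent columns out of", len(full[0]))
print("m-1 dummies:  ", rank(reduced), "independent columns out of", len(reduced[0]))
```

With all m dummies the columns sum to the intercept column, so the matrix has one fewer independent column than it has columns. Dropping one dummy (Japan here, which becomes the reference category) restores full column rank.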

Now comes the surprising part: when using categorical predictors in machine learning algorithms such as k-nearest neighbors (kNN) or classification and regression trees, we keep all m dummy variables. The reason is that in such algorithms we do not create linear combinations of all predictors. A tree, for instance, will choose a subset of the predictors. If we leave out one dummy, then if that category differs from the other categories in terms of the output of interest, the tree will not be able to detect it! Similarly, dropping a dummy in kNN would not incorporate the effect of belonging to that category into the distance used.
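For kNN, one way to see the effect is to compare distances between categories under the two encodings. This sketch assumes Euclidean distance and a hypothetical three-category Country variable; the helper names `encode` and `dist` are made up for the illustration.

```python
import math


def encode(category, categories):
    """0/1 dummy vector for one record's category."""
    return [1 if category == c else 0 for c in categories]


def dist(a, b):
    """Euclidean distance between two dummy vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


cats = ["Japan", "USA", "India"]  # hypothetical categories
all_m = {c: encode(c, cats) for c in cats}
m_minus_1 = {c: encode(c, cats[1:]) for c in cats}  # Japan's dummy dropped

pairs = [("Japan", "USA"), ("Japan", "India"), ("USA", "India")]
for name, enc in (("all m dummies", all_m), ("m-1 dummies", m_minus_1)):
    print(name, {f"{a}-{b}": round(dist(enc[a], enc[b]), 3) for a, b in pairs})
```

With all m dummies, every pair of different categories is the same distance apart. With m-1 dummies, the dropped category sits closer to each of the others than they sit to one another, so the choice of which dummy to drop distorts the neighborhoods.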

The only case where dummy variable inclusion is treated equally across methods is for a two-category predictor, such as Gender. In that case a single dummy variable will suffice in regression, kNN, CART, or any other data mining method.

Saturday, February 07, 2015

Teaching spaces: "Analytics in a Studio"

My first semester at NTHU has been a great learning experience. I introduced and taught two new courses in our new Business Analytics concentration (data mining and forecasting). Both courses met once a week for a 3-hour session for a full semester (18 weeks). Although I've taught these courses in different forms, in different countries, and to different audiences, I had a special discovery this time. I discovered the critical role that the learning space plays in the quality of teaching and learning, specifically for a topic that combines technical, creative and communication skills.

"Case study" classroom
In my many years of experience as a student and later as a professor at multiple universities, I've experienced two types of spaces: a lecture hall and a "case study" classroom. While the latter is more conducive to in-class discussions, both spaces put the instructor (and his/her slides) in the front, separated from most of the students, and place the students in rows. In both cases the instructor is typically standing or moving around, while the students are immobile. Not being exposed to alternatives, I am ashamed to say that I never doubted this arrangement. Until this semester.

Like all discoveries, it started from a challenge: the classroom allocated for my courses was a wide room with two long rows, hardly any space for the instructor and no visibility of the slides for most of the students on the sides. My courses had 20-30 students each. My first attempt was to rearrange the tables to create a U-shape, so that students could see each other and the slides. In hindsight, I was trying to create more of a "case study" environment. After one session I realized it didn't work. The U was too long and narrow and there was a feeling of suffocation. And stagnancy. One of my classes was transferred to a case-type classroom. I was relieved. But for the other class there was no such classroom available. I examined a few different classrooms, but they were all lecture halls suitable for larger audiences.

Teams tackle a challenge using a whiteboard
And then, I discovered "the studio". Intended for design workshops, this was a room with no tables or chairs, with walls that are whiteboards plus double-sided whiteboards on wheels. In a corner was a stack of hard sponge blocks and a few low foldable square tables. There's a projector and a screen. I decided to take the plunge with the data mining course, since it is designed as a blended course where class time is devoted to discussions and hands-on assignments and experiences. [Before coming to class, students read and watch videos, take a short quiz, and contribute to an online discussion].

Here is how we used the space: At least half of each session engaged teams of students in a problem/question that they needed to tackle using a whiteboard. The challenges I came up with emerged from the interaction with the students - from the online discussion board, from discussing the online quizzes, and from confusion/difficulties in the pre-designed in-class assignments. After each team worked on their board, we all moved from board to board, the team explained their approach, and I highlighted something about each solution/attempt. This provided great learning for everyone, including myself, since different teams usually approached the problems in different ways. And they encountered different problems or insights.
Students give feedback on other teams' proposals

The setup was also conducive to team project feedback. After each team presented their proposal, the other teams provided them feedback by writing on their "wall" (whiteboard). This personal touch - rather than an email or discussion board - seems to make a difference in how the feedback is given and perceived.

Smartphones were often used to take photos of different boards - their own as well as others' boards.

Student demos software to others
During periods of the sessions where students needed to work on laptops, many chose to spread out on the floor - a more natural posture for many folks than sitting at a desk. Some used the sponges to place their laptops. A few used a square table where 4 people faced each other.

We also used the space to start class with a little stretching and yoga! The students liked the space. So did two colleagues (Prof. Rob Hyndman and Prof. Joao Moreira) who teach analytics courses at their universities and visited my courses. Some students complained at first about sitting on the hard floor, so I tried to make sure they didn't sit for long, or at least not passively. My own "old school" bias made me forget how it feels to be passively sitting.

Visitor Prof. Moreira experiences the studio
Although I could see the incredible advantages during the semester, I waited till its end to write this post. My perspective now is that teaching analytics in a studio is revolutionary. The space supports deeper learning, beneficial collaboration both within groups and across groups, better personalization of the teaching level by stronger communication between the instructor and students, and overall a high-energy and positive experience for everyone. One reason that makes "analytics in a studio" so powerful is the creativity aspect in data analytics. You use statistical and data mining foundations, but the actual problem-solving requires creativity and out-of-the-box thought.

From my experience, the requirements for "analytics in a studio" to work are:
  1. Students must come prepared to class with the needed technical basics (e.g., via reading/video watching/etc.) 
  2. The instructor must be flexible in terms of the specifics taught. I came into class focused on 2-3 main points students needed to learn, I had in-class assignments, and designed teams-on-whiteboards challenges on-the-fly. 
  3. The instructor is no longer physically in the center, but s/he must be an effective integrator, challenger, and guide of the directions taken. This allows students to unleash their abilities, but in a constructive way. It also helps avoid a feeling of "what did we learn?"
How does "analytics in a studio" scale to larger audiences? I am not sure. While class sizes of many Analytics programs are growing to meet the demand, top programs and educators should carefully consider the benefits of smaller class sizes in terms of learning and learning experience. And they should carefully choose their spaces.

Sunday, March 11, 2012

Big Data: The Big Bad Wolf?

"Big Data" is a big buzzword. I bet that sentiment analysis of news coverage, blog posts and other social media sources would show a strong positive sentiment associated with Big Data. What exactly is big data depends on who you ask. Some people talk about lots of measurements (what I call "fat data"), others of huge numbers of records ("long data"), and some talk of both. How much is big? Again, depends who you ask.

As a statistician who's (luckily) strayed into data mining, I initially had the traditional knee-jerk reaction of "just get a good sample and get it over with", and later recognized that "fitting the data to the toolkit" (or, "to a hammer everything looks like a nail") is straitjacketing some great opportunities.

The LinkedIn group Advanced Business Analytics, Data Mining and Predictive Modeling reacted passionately to the question "What is the value of Big Data research vs. good samples?" posted by statistician and analytics veteran Michael Mout. Respondents have been mainly from industry - statisticians and data miners. I'd say that the sentiment analysis would come out mixed, but slightly negative at first ("at some level, big data is not necessarily a good thing"; "as statisticians, we need to point out the disadvantages of Big Data"). Over time, sentiment appears to be more positive, but not reaching anywhere close to the huge Big Data excitement in the media.

I created a Wordle of the text in the discussion until today (size represents frequency). It highlights the main advantages and concerns of Big Data. Let me elaborate:
  • Big data permit the detection of complex patterns (small effects, high order interactions, polynomials, inclusion of many features) that are invisible with small data sets
  • Big data allow studying rare phenomena, where a small percentage of records contain an event of interest (fraud, security)
  • Sampling is still highly useful with big data (see also blog post by Meta Brown); with the ability to take lots of smaller samples, we can evaluate model stability, validity and predictive performance
  • Statistical significance and p-values become meaningless when statistical models are fitted to very large samples, since even tiny, practically negligible effects come out statistically significant. It is then practical significance that plays the key role.
  • Big data support the use of algorithmic data mining methods that are good at feature selection. Of course, it is still necessary to use domain knowledge to avoid "garbage-in-garbage-out"
  • Such algorithms might be black-boxes that do not help understand the underlying relationship, but are useful in practice for predicting new records accurately
  • Big data allow the use of many non-parametric methods (statistical and data mining algorithms) that make far fewer assumptions about data (such as independent observations)
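The point in the list above about p-values and very large samples can be illustrated with a small calculation. This is a rough sketch that assumes a one-sample z-test with known standard deviation; the function name `two_sided_p` and the effect size are hypothetical choices for illustration.

```python
import math


def two_sided_p(effect, sd, n):
    """Two-sided p-value for a one-sample z-test of mean = 0 (sd assumed known)."""
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# A hypothetical, practically negligible effect: 1% of a standard deviation.
effect, sd = 0.01, 1.0
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  p = {two_sided_p(effect, sd, n):.3g}")
```

The effect size is identical in every row; only the sample size changes. The p-value goes from far above 0.05 at n = 100 to far below 0.001 at n = 1,000,000, which is why practical significance has to carry the argument.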
Thanks to social media, we're able to tap into many brains that have experience, expertise and... some preconceptions. The data collected from such forums can help us researchers to focus our efforts on the needed theoretical investigation of Big Data, to help move from sentiments to theoretically-backed-and-practically-useful knowledge.

Wednesday, March 07, 2012

Forecasting + Analytics = ?

Quantitative forecasting is an age-old discipline, highly useful across different functions of an organization: from forecasting sales and workforce demand to economic forecasting and inventory planning.

Business schools have offered courses with titles such as "Time Series Forecasting", "Forecasting Time Series Data", "Business Forecasting", more specialized courses such as "Demand Planning and Sales Forecasting" or even graduate programs with the title "Business and Economic Forecasting". Simple "Forecasting" is also popular. Such courses are offered at the undergraduate, graduate and even executive education levels. All these might convey the importance and usefulness of forecasting, but they are far from conveying the coolness of forecasting.

I've been struggling to find a better term for the courses that I teach on-ground and online, as well as for my recent book (with the boring name Practical Time Series Forecasting). The name needed to convey that we're talking about forecasting, particularly about quantitative data-driven forecasting, plus the coolness factor. Today I discovered it! Prof. Refik Soyer from GWU's School of Business will be offering a course called "Forecasting for Analytics". A quick Google search did not find any results with this particular phrase -- so the credit goes directly to Refik. I also like "Forecasting Analytics", which links it to its close cousins "Predictive Analytics" and "Visual Analytics", all members of the Business Analytics family.


Monday, February 20, 2012

Explain or predict: simulation

Some time ago, when I presented the "explain or predict" work, my colleague Avi Gal asked where simulation falls. Simulation is a key method in operations research, as well as in statistics. A related question arose in my mind when thinking of Scott Nestler's distinction between descriptive/predictive/prescriptive analytics. Scott defines prescriptive analytics as "what should happen in the future? (optimization, simulation)".

So where does simulation fall? Does it fall in a completely different goal category, or can it be part of the explain/predict/describe framework?

My opinion is that simulation, like other data analytics techniques, does not define a goal in itself but is rather a tool to achieve one of the explain/predict/describe goals. When the purpose is to test causal hypotheses, simulation can be used to study what would happen if the causal effect were true, by simulating data from the "causally-true" hypothesis and comparing it to data from "causally-false" scenarios. In predictive and forecasting tasks, where the purpose is to predict new or future data, simulation can be used to generate predictions. It can also be used to evaluate the robustness of predictions under different scenarios (that would have been very useful in recent years' economic forecasts!). In descriptive tasks, where the purpose is to approximate data and quantify relationships, simulation can be used to check the sensitivity of the quantified effects to various model assumptions.
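As one illustration of the predictive use, here is a minimal sketch of simulation-based forecasting under two scenarios. It assumes a random walk with drift as the data-generating model, and all names and numbers (`simulate_final_values`, `summarize`, the drift, the volatility levels) are hypothetical choices for the example, not a recommended forecasting method.

```python
import random
import statistics


def simulate_final_values(last_value, drift, sigma, horizon, n_paths, seed=1):
    """Simulate n_paths futures of a random walk with drift; return each path's value at the horizon."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        y = last_value
        for _ in range(horizon):
            y += drift + rng.gauss(0, sigma)
        finals.append(y)
    return finals


def summarize(finals):
    """Point forecast (mean) and a 90% simulation interval (5th to 95th percentile)."""
    cuts = statistics.quantiles(finals, n=20)  # cut points at 5%, 10%, ..., 95%
    return statistics.mean(finals), cuts[0], cuts[-1]


# Two hypothetical scenarios that differ only in volatility.
scenarios = {"calm": 1.0, "volatile": 3.0}
for name, sigma in scenarios.items():
    finals = simulate_final_values(100.0, 0.5, sigma, horizon=12, n_paths=2000)
    mean, lo, hi = summarize(finals)
    print(f"{name:9s} forecast = {mean:6.1f}   90% interval = ({lo:6.1f}, {hi:6.1f})")
```

The point forecasts are similar across the two scenarios, but the interval is much wider under the volatile one. Comparing such intervals across scenarios is one simple way to gauge how robust a prediction is.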

On a related note, Scott challenged me on a post from two years ago where I stated that the term data mining used by operations research (OR) does not really mean data mining. I still hold that view, although I believe that the terminology has now changed: INFORMS now uses the term Analytics in place of data mining. This term is indeed a much better choice, as it is an umbrella term covering a variety of data analytics methods, including data mining, statistical models and OR methods. David Hardoon, Principal Analytics at SAS Singapore, has shown me several terrific applications that combine methods from these different toolkits. As in many cases, combining methods from different disciplines is the best way to add value.

Wednesday, July 27, 2011

Analytics: You want to be in Asia

Business Intelligence and Data Mining have become hot buzzwords in the West. Using Google Insights for Search to "see what the world is searching for" (see image below), we can see that the popularity of these two terms seems to have stabilized (if you expand the search to 2007 or earlier, you will see the earlier peak and also that Data Mining was hotter for a while). Click on the image to get to the actual result, with which you can interact directly. There are two very interesting insights from this search result:
  1. Looking at the "Regional Interest" for these terms, we see that the #1 country searching for these terms is India! Hong Kong and Singapore are also in the top 5. A surge of interest in Asia!
  2. Adding two similar terms that have the term Analytics, namely Business Analytics and Data Analytics, unveils a growing interest in Analytics (whereas the two non-analytics terms have stabilized after their peak).
What to make of this? First, it means Analytics is hot. Business Analytics and Data Analytics encompass methods for analyzing data that add value to a business or any other organization. Analytics includes a wide range of data analysis methods, from visual analytics to descriptive and explanatory modeling, and predictive analytics. From statistical modeling, to interactive visualization (like the one shown here!), to machine-learning algorithms and more. Companies and organizations are hungry for methods that can turn their huge and growing amounts of data into actionable knowledge. And the hunger is most pressing in Asia.