tag:blogger.com,1999:blog-218313842024-03-08T00:54:42.349+05:30BzST | Business Analytics, Statistics, TeachingA blog by Prof. Galit ShmueliGalit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.comBlogger187125tag:blogger.com,1999:blog-21831384.post-10319319593124179362020-12-03T15:05:00.004+05:302020-12-03T15:05:47.208+05:30Machine learning algorithms surprises at deployment? (article on Medium)Machine
learning (ML) algorithms are being used to generate predictions in
every corner of our decision-making life. Methods range from “simple”
algorithms such as trees, forests, naive Bayes, linear and logistic
regression models, and nearest-neighbor methods, through improvements
such as boosting, bagging, regularization, and ensembling, to
computationally-intensive, blackbox deep Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-45501924901800093972018-12-10T16:28:00.001+05:302018-12-10T17:51:25.644+05:30Forecasting large collections of time seriesWith the recent launch of Amazon Forecast, I can no longer procrastinate writing about forecasting "at scale"!
Quantitative forecasting of time series has been used (and taught) for decades, with applications in many areas of business such as demand forecasting, sales forecasting, and financial forecasting. The types of methods taught in forecasting courses tend to be discipline-specific:
Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-79228184494472971912018-02-04T19:42:00.000+05:302018-02-04T19:42:00.383+05:30Data Ethics Regulation: Two key updates in 2018This year, two important new regulations will be impacting research with human subjects: the EU's General Data Protection Regulation (GDPR), which takes effect in May 2018, and the USA's updated Common Rule, called the Final Rule, which is in effect from Jan 2018. Both changes relate to protecting individuals' private information and will affect researchers using behavioral data in terms of data collection, Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-91546815735184478962017-12-25T12:27:00.000+05:302017-12-25T16:42:23.491+05:30Election polls: description vs. predictionMy papers To Explain or To Predict and Predictive Analytics in Information Systems Research contrast the process and uses of predictive modeling and causal-explanatory modeling. I briefly mentioned there a third type of modeling: descriptive. However, I haven't expanded on how descriptive modeling differs from the other two types (causal explanation and prediction). While descriptive and Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-42659787125413036922017-11-06T18:19:00.000+05:302017-11-06T18:19:10.448+05:30Statistical test for "no difference"To most researchers and practitioners using statistical inference, the popular hypothesis testing universe consists of two hypotheses:
H0 is the null hypothesis of "zero effect"
H1 is the alternative hypothesis of "a non-zero effect"
The alternative hypothesis (H1) is typically what the researcher is trying to find: a different outcome for a treatment and control group in an experiment, a Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com1tag:blogger.com,1999:blog-21831384.post-3265830154957887362017-09-05T19:50:00.001+05:302017-09-05T19:50:23.297+05:30My videos for “Business Analytics using Data Mining” now publicly available!
Five years ago, in 2012, I decided to experiment in improving my teaching by creating a flipped classroom (and semi-MOOC) for my course “Business Analytics Using Data Mining” (BADM) at the Indian School of Business. I initially designed the course at University of Maryland’s Smith School of Business in 2005 and taught it until 2010. When I joined ISB in 2011 I started teaching multiple sections Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-2866794992184630202017-03-14T06:45:00.002+05:302017-03-14T06:45:41.567+05:30Data mining algorithms: how many dummies?There's lots of posts on "k-NN for Dummies". This one is about "Dummies for k-NN"
Categorical predictor variables are very common. Those who've taken a Statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps:
Convert the categorical variable that has m categories into m binary dummy Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-50111530593550710092016-12-22T13:48:00.001+05:302016-12-27T19:07:51.263+05:30Key challenges in online experiments: where are the statisticians?
Randomized experiments (or randomized controlled trials, RCT) are a powerful tool for testing causal relationships. Their main principle is random assignment, where subjects or items are assigned randomly to one of the experimental conditions. A classic example is a clinical trial with one or more treatment groups and a no-treatment (control) group, where individuals are assigned at random to Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-24600994397513093922016-10-24T14:32:00.000+05:302016-10-24T18:12:38.019+05:30Experimenting with quantified self: two months hooked up to a fitness bandIt's one thing to collect and analyze behavioral big data (BBD) and another to understand what it means to be the subject of that data. To really understand. Yes, we're all aware that our social network accounts and IoT devices share our private information with large and small companies and other organizations. And although we complain about our privacy, we are forgiving about sharing it, most Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-44417361146712435062016-04-26T12:36:00.001+05:302016-04-26T17:45:37.186+05:30Statistical software should remove *** notation for statistical significanceNow that the emotional storm following the American Statistical Association's statement on p-values is slowing down (is it? was there even a storm outside of the statistics area?), let's think about a practical issue. One that greatly influences data analysis in most fields: statistical software. Statistical software influences which methods are used and how they are reported. 
Software companies Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-9296642949260299132016-03-24T13:33:00.000+05:302016-03-25T05:37:07.719+05:30A non-traditional definition of Big Data: Big is RelativeI've noticed that in almost every talk or discussion that involves the term Big Data, one of the first slides by the presenter or the first questions to be asked by the audience is "what is Big Data?" The typical answer has to do with some digits, many V's, terms that end with "bytes", or statements about software or hardware capacity.
I beg to differ.
"Big" is relative. It is relative to a Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-22878602477752024802015-12-07T15:19:00.001+05:302015-12-07T15:19:50.824+05:30Predictive analytics in the long termTen years ago, micro-level prediction the way we know it today, was nearly absent in companies. MBAs learned about data analysis mostly in a requires statistics course, which covered mostly statistical inference and descriptive modeling. At the time, I myself was learning my way into the predictive world, and designed the first Data Mining course at University of Maryland's Smith School of Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-83698009583509839452015-08-19T06:40:00.000+05:302015-08-19T06:40:48.148+05:30Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors Recently I've had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods.
From http://blog.excelmasterseries.com
Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-29839021232105102452015-03-02T19:15:00.000+05:302015-03-02T19:15:01.123+05:30Psychology journal bans statistical inference; knocks down server
In its recent editorial, the journal Basic and Applied Social Psychology announced that it will no longer accept papers that use classical statistical inference. No more p-values, t-tests, or even... confidence intervals!
"prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about ‘‘significant’’ differences or Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-90556079061599257722015-02-07T11:56:00.000+05:302015-02-07T11:56:09.334+05:30Teaching spaces: "Analytics in a Studio"My first semester at NTHU has been a great learning experience. I introduced and taught two new courses in our new Business Analytics concentration (data mining and forecasting). Both courses met once a week for a 3-hour session for a full semester (18 weeks). Although I've taught these courses in different forms, in different countries, and to different audiences, I had a special Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-74115891506851477972014-12-19T10:39:00.001+05:302014-12-19T10:44:42.082+05:30New curriculum design guidelines by American Statistical Association: Who will teach?
The American Statistical Association published new "Curriculum Guidelines for Undergraduate Programs in Statistical Science". This is the first update to the guidelines since 2000.
The executive summary lists the key points:
Increased importance of data science
Real applications
More diverse models and approaches
Ability to communicate
This set sounds right on target with what is expected Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-77298416708509159152014-10-16T20:25:00.002+05:302014-10-16T20:31:00.952+05:30What's in a name? "Data" in Mandarin ChineseThe term "data", now popularly used in many languages, is not as innocent as it seems. The biggest controversy that I've been aware of is whether the English term "data" is singular or plural. The tone of an entire article would be different based on the author's decision.
In Hebrew, the word is in plural (Netunim, with the final "im" signifying plural), so no question arises.
Today I Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-38416568361133753002014-09-26T15:34:00.001+05:302014-09-26T20:36:03.322+05:30Humane and Socially Responsible Analytics: A new concentration at National Tsing Hua UniversityThis Fall, I'm introducing two new elective courses at NTHU's Institute of Service Science: Business Analytics using Data Mining and Business Analytics using Forecasting (if you're wondering about the difference, see an earlier post). The two new courses join three other elective courses to form the new concentration in Business Analytics. Courses in this concentration are aimed at getting Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-85256296534289939972014-09-19T14:22:00.000+05:302014-09-19T14:40:58.848+05:30India redefines "reciprocity"; Israeli professionals pay the priceAfter a few years of employment at the Indian School of Business (in 2010 as a visitor and later as a tenured SRITNE Chaired Professor of Data Analytics), the time has come for me to get a new Employment Visa. As an Israeli-American, I decided to apply for the visa using my Israeli passport. I was almost on my way to the Indian embassy when I discovered, to my horror, that the fee is over USD $Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-36815610332335132822014-04-02T21:15:00.001+05:302014-09-19T14:41:20.129+05:30Parallel coordinate plot in Tableau: a workaroundThe parallel coordinate plot is useful for visualizing multivariate data in a dis-aggregated way, where we have multiple numerical measurements for each record. A scatter plot displays two measurements for each record by using the two axes. A parallel coordinate plot can display many measurements for each record, by using many (parallel) axes - one for each measurement.
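To make the "one axis per measurement, one line per record" idea concrete, here is a minimal Python sketch that turns a few records into the polylines a parallel coordinate plot would draw. It is not the Tableau workaround from the post. The function name `parallel_coordinates`, the sample records, and the min-max scaling of each axis to [0, 1] are all assumptions made for illustration.

```python
def parallel_coordinates(records, measurements):
    """Return one polyline per record as a list of (axis_index, scaled_value) points."""
    # Assumption: min-max scale each measurement to [0, 1] so that axes
    # with different units can be drawn side by side.
    lo = {m: min(r[m] for r in records) for m in measurements}
    hi = {m: max(r[m] for r in records) for m in measurements}
    lines = []
    for r in records:
        line = []
        for axis, m in enumerate(measurements):
            span = hi[m] - lo[m]
            # A constant measurement has no spread; place it mid-axis.
            scaled = (r[m] - lo[m]) / span if span else 0.5
            line.append((axis, scaled))
        lines.append(line)
    return lines


# Hypothetical records with three numerical measurements each.
records = [
    {"price": 10.0, "weight": 2.0, "rating": 4.5},
    {"price": 30.0, "weight": 1.0, "rating": 3.0},
    {"price": 20.0, "weight": 3.0, "rating": 5.0},
]
lines = parallel_coordinates(records, ["price", "weight", "rating"])
for line in lines:
    print(line)
```

Each inner list is one record's path across the three parallel axes. Plotting those points and connecting them per record gives the dis-aggregated view described above.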
While not as popular as Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-27191873593141128562014-03-15T14:00:00.000+05:302014-03-15T14:15:11.958+05:30Can women be professors or doctors? Not according to Jet Airways
I am already used to the comical scene at airports in Asia, where a sign-holder with "Professor Galit Shmueli" sees us walk in his/her direction and right away rushes to my husband. Whether or not the stereotype is based on actual gender statistics of professors in Asia is a good question.
What I don't find amusing is when a corporate like Jet Airways, under the guise of "celebrating Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-40071663388884578782014-03-06T10:06:00.001+05:302014-03-06T10:06:20.679+05:30The use of dummy variables in predictive algorithms
Anyone who has taken a course in statistics that covers linear regression has heard some version of the rule regarding pre-processing categorical predictors with more than two categories and the need to factor them into binary dummy/indicator variables:
"If a variable has k levels, you can create only k-1 indicators. You have to choose one of the k categories as a "baseline" and leave out its Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-41115885022731353772013-11-28T01:12:00.001+05:302013-11-28T09:01:59.465+05:30Running a data mining contest on Kaggle
Following the success last year, I've decided once again to introduce a data mining contest in my Business Analytics using Data Mining course at the Indian School of Business. Last year, I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle. They offer free competition hosting for university instructors, called InClass Kaggle.
Setting up a competition on KaggleGalit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-73861941223935944642013-11-21T14:42:00.001+05:302013-11-21T14:54:14.355+05:30The Scientific Value of Testing Predictive Performance
This week's NY Times article Risk Calculator for Cholesterol Appears Flawed and CNN article Does calculator overstate heart attack risk? illustrate the power of evaluating the predictive performance of a model for purposes of validating the underlying theory.
The NYT article describes findings by two Harvard Medical School professors, Ridker and Cook, about extreme over-estimation ofGalit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0tag:blogger.com,1999:blog-21831384.post-66884030814810108572013-11-05T11:54:00.003+05:302013-11-06T17:15:02.978+05:30A Tale of Two (Business Analytics) CoursesI have been teaching two business analytics elective MBA-level courses at ISB. One is called "Business Analytics Using Data Mining" (BADM) and the other, "Forecasting Analytics" (FCAS). Although we share the syllabi for both courses, I often receive the following question, in this variant or the other:
What is the difference between the two courses?
The short answer is: BADM is focused on Galit Shmuelihttp://www.blogger.com/profile/06119270323184007583noreply@blogger.com0