Sunday, February 04, 2018

Data Ethics Regulation: Two key updates in 2018

This year, two important new regulations will impact research with human subjects: the EU's General Data Protection Regulation (GDPR), which takes effect in May 2018, and the USA's updated Common Rule, called the Final Rule, in effect since January 2018. Both changes relate to protecting individuals' private information and will affect researchers who use behavioral data in terms of data collection, access, and use; applications for ethics committee (IRB) approvals/exemptions; collaborations within the same country/region and beyond; and collaborations with industry.
Both the GDPR and the Final Rule try to modernize what today constitutes "private data" and data subjects' rights, and to balance these protections against other interests, such as the "free flow of information between EU countries" (GDPR). However, the GDPR's approach is much more strongly in favor of protecting private data.
Here are a few points to note about the GDPR and the Final Rule:

  1. "Personal data" (GDPR) or "private information" (final rule) is very broadly defined and includes data on physical, physiological or behavioral characteristics of a person "which allow or confirm the unique identification of that natural person".
  2. The GDPR affects any organization within the EU as well as "external organizations that are trading within the EU". It applies to personal data on any person, not just EU citizens/residents.
  3. The GDPR distinguishes between the "data controller" (the entity that holds the data, in the eyes of the data subjects, e.g. a hospital) and the "data processor" (the entity that operates on the data). Both entities are bound by, and liable under, the GDPR.
  4. The GDPR distinguishes between "data processing" (any operation related to the data, including storage, structuring, record deletion, and transfer) and "profiling" (automated processing of personal data to "evaluate personal aspects relating to a natural person").
  5. The Final Rule now offers an option of relying on broad consent obtained for future research as an alternative to seeking IRB approval to waive the consent requirement.
  6. Domestic collaborations within the US now require a single institutional review board (IRB) approval (for the portion of the research that takes place within the US) - effective 2021.
The Final Rule tries to lower the burden for low-risk research. One attempt is a set of new "exemption" categories for secondary research use of identifiable private information (i.e. re-using identifiable information collected for some other "primary" or "initial" activity) when:
  • The identifiable private information is publicly available;
  • The information is recorded by the investigator in such a way that the identity of subjects cannot readily be ascertained, and the investigator does not contact subjects or try to re-identify subjects; 
  • The secondary research activity is regulated under HIPAA; or
  • The secondary research activity is conducted by or on behalf of a federal entity and involves the use of federally generated non-research information provided that the original collection was subject to specific federal privacy protections and continues to be protected.
This approach to secondary data, and specifically to observational data from public sources, seems to contrast with the GDPR, which states that the new regulations also apply when processing historical data for "historical research purposes". Metcalf (2018) criticized the above Final Rule exemption because "these criteria for exclusion focus on the status of the dataset (e.g., is it public? does it already exist?), not the content of the dataset nor what will be done with the dataset, which are more accurate criteria for determining the risk profile of the proposed research".

Monday, December 25, 2017

Election polls: description vs. prediction

My papers To Explain or To Predict and Predictive Analytics in Information Systems Research contrast the process and uses of predictive modeling and causal-explanatory modeling. I briefly mentioned there a third type of modeling: descriptive. However, I haven't expanded on how descriptive modeling differs from the other two types (causal explanation and prediction). Descriptive and predictive modeling both rely on correlations, whereas explanatory modeling relies on causality; yet the former two differ: descriptive modeling aims to give a parsimonious statistical representation of a distribution or relationship, whereas predictive modeling aims to generate values for new/future observations.

The recent paper Election Polls—A Survey, A Critique, and Proposals by Kenett, Pfeffermann & Steinberg gives a fantastic illustration of the difference between description and prediction: the authors contrast the goals of election surveys (such as those conducted by Gallup) with those of survey-based predictive models, such as Nate Silver's FiveThirtyEight:
"There is a subtle, but important, difference between reflecting current public sentiment and predicting the results of an election. Surveys [election polls] have focused largely on the former—in other words, on providing a current snapshot of voting preferences, even when asking about voting preference as if elections were carried out on the day of the survey. In that regard, high information quality (InfoQ) surveys are accurately describing current opinions of the electorate. However, the public perception is often focused on projecting the survey results forward in time to election day, which is eventually used to evaluate the performance of election surveys. Moreover, the public often focuses solely on whether the polls got the winner right and not on whether the predicted vote shares were close to the true results."
In other words, whereas the goal of election surveys is to capture public sentiment at different points in time before the election, the public often judges them as failures because of low predictive power on election day. The authors continue:
"Providing an accurate current picture and predicting the ultimate winner are not contradictory goals. As the election approaches, survey results are expected to increasingly point toward the eventual election outcome, and it is natural that the success or failure of the survey methodology and execution is judged by comparing the final polls and trends with the actual election results."
Descriptive models differ from predictive models in another sense that can lead to vastly different results: in a descriptive model for an event of interest we can use the past and the future relative to that event time. For example, to describe spikes in pre-Xmas shopping volume we can use data on pre- and post-Xmas days. In contrast, to predict pre-Xmas shopping volume we can only use information available prior to the pre-Xmas shopping period of interest.
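To make the data-window difference concrete, here is a minimal pandas sketch (the dates and made-up volume numbers are purely illustrative):

```python
import numpy as np
import pandas as pd

# hypothetical daily shopping-volume series (random placeholder values)
days = pd.date_range("2016-11-01", "2017-01-31", freq="D")
volume = pd.Series(np.random.default_rng(0).poisson(100, len(days)), index=days)

# descriptive model of the pre-Xmas spike: free to use pre- AND post-Xmas days
descriptive_data = volume["2016-12-01":"2017-01-15"]

# predictive model of pre-Xmas volume: training data must end
# before the period we want to predict
training_data = volume[:"2016-11-30"]
```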

As Kenett et al. (2017) write, description and prediction are not contradictory. They are different, yet the results of descriptive models can provide leads for strong predictors, and potentially for explanatory variables (which require further investigation using explanatory modeling).

The interesting new book Everybody Lies by Stephens-Davidowitz is a great example of descriptive modeling that uncovers correlations which might be used for prediction (or even for future explanatory work). The author uncovers behavioral search patterns on Google by examining keyword search volumes using Google Trends and AdWords. For the recent US elections, the author identifies a specific keyword search term that separates areas of high performance for Clinton vs. Trump:
"Silver noticed that the areas where Trump performed best made for an odd map. Trump performed well in parts of the Northeast and industrial Midwest, as well as the South. He performed notably worse out West. Silver looked for variables to try to explain this map... Silver found that the single factor that best correlated with Donald Trump's support in the Republican primaries was... [areas] that made the most Google searches for ______."
[I am intentionally leaving the actual keyword blank because it is offensive.]
While finding correlations is a dangerous game that can lead to many false discoveries (two measurements can be correlated because both are affected by something else, such as weather), careful descriptive modeling, tightly coupled with domain expertise, can be useful for exploratory research, whose findings should later be tested using explanatory modeling.

While I love Stephens-Davidowitz's idea of using Google Trends to uncover behaviors/thoughts/feelings that are otherwise hidden (because what people say in surveys often diverges from what they really do/think/feel), a main question is whom these search results represent (sampling bias). But that's a different topic altogether.

Tuesday, March 14, 2017

Data mining algorithms: how many dummies?

There are lots of posts on "k-NN for Dummies". This one is about "Dummies for k-NN".

Categorical predictor variables are very common. Those who've taken a statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps:

  1. Convert the categorical variable that has m categories into m binary dummy variables
  2. Include only m-1 of the dummy variables as predictors in the regression model (the dropped category is called the reference category)
For example, if we have X={red, yellow, green}, in step 1 we create three dummies:
D_red = 1 if the value is 'red' and 0 otherwise
D_yellow = 1 if the value is 'yellow' and 0 otherwise
D_green = 1 if the value is 'green' and 0 otherwise

In the regression model we might have: Y = b0 + b1 D_red + b2 D_yellow + error
[Note: mathematically, it does not matter which dummy you drop: the regression coefficients b1 and b2 then measure comparisons against the left-out category.]
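As a quick sketch of this coding in Python (using pandas' get_dummies; the toy values are mine, purely illustrative):

```python
import pandas as pd

# toy data: a single categorical predictor
df = pd.DataFrame({"color": ["red", "yellow", "green", "red"]})

# m-1 dummies for regression: drop_first drops one category (here 'green',
# the first alphabetically), which becomes the reference category
X_reg = pd.get_dummies(df["color"], prefix="D", drop_first=True).astype(int)
print(X_reg)  # columns D_red and D_yellow; green rows are all zeros
```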

When you move to data mining algorithms such as k-NN or trees, the procedure is different: when m>2 we include all m dummies as predictors, but when m=2 we use a single dummy. Dropping a dummy (when m>2) distorts the distance measure, leading to incorrect distances.
Here's an example, based on X = {red, yellow, green}:

Case 1: m=3 (use 3 dummies)

Here are 3 records, their category (color), and their dummy values on (D_red, D_yellow, D_green):

Record  Color    D_red  D_yellow  D_green
#1      red      1      0         0
#2      yellow   0      1         0
#3      green    0      0         1
The distance between each pair of records (in terms of color) should be identical, since all three records differ from one another. Suppose we use squared Euclidean distance. The distance between each pair of records will then equal 2. For example:

Distance(#1, #2) = (1-0)^2 + (0-1)^2 + (0-0)^2 = 2.

If we drop one dummy, then the three distances will no longer be identical! For example, if we drop D_green:
Distance(#1, #2) = 1 + 1 = 2
Distance(#1, #3) = 1
Distance(#2, #3) = 1
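These computations are easy to verify with a few lines of NumPy (a minimal sketch; record #1 is red, #2 yellow, #3 green, as in the table above):

```python
import numpy as np

# dummy vectors (D_red, D_yellow, D_green) for records #1, #2, #3
full = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])

def sq_dist(a, b):
    # squared Euclidean distance
    return np.sum((a - b) ** 2)

# all m dummies: every pair is equally far apart
print(sq_dist(full[0], full[1]), sq_dist(full[0], full[2]),
      sq_dist(full[1], full[2]))  # 2 2 2

# dropping D_green (the last column) distorts the distances
reduced = full[:, :2]
print(sq_dist(reduced[0], reduced[1]), sq_dist(reduced[0], reduced[2]),
      sq_dist(reduced[1], reduced[2]))  # 2 1 1
```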

Case 2: m=2 (use single dummy)

The above problem doesn't happen with m=2. Suppose we have only {red, green}, and use a single dummy. The distance between a pair of records will be 0 if the records are the same color, or 1 if they are different.
Why not use 2 dummies? If we use two dummies, we double the weight of this variable without adding any information. For example, comparing the red and green records (#1 and #3) using both D_red and D_green would give Distance(#1, #3) = 1 + 1 = 2.

So we end up with distances of 0 or 2 instead of distances of 0 or 1.

Bottom line 

In data mining methods other than regression models (e.g., k-NN, trees, k-means clustering), we use m dummies for a categorical variable with m categories - this is called one-hot encoding. But if m=2 we use a single dummy.
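In pandas, for instance, the two conventions boil down to the drop_first flag of get_dummies (a sketch with made-up values):

```python
import pandas as pd

colors = pd.Series(["red", "yellow", "green"], name="color")

# k-NN, trees, k-means (m>2): all m dummies, i.e., one-hot encoding
print(pd.get_dummies(colors, prefix="D").astype(int))

# m=2: a single dummy is enough
binary = pd.Series(["red", "green", "red"], name="color")
print(pd.get_dummies(binary, prefix="D", drop_first=True).astype(int))  # one 0/1 column
```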

Thursday, October 16, 2014

What's in a name? "Data" in Mandarin Chinese

The term "data", now popularly used in many languages, is not as innocent as it seems. The biggest controversy that I've been aware of is whether the English term "data" is singular or plural. The tone of an entire article would be different based on the author's decision.

In Hebrew, the word is plural (Netunim, with the final "im" signifying plural), so no question arises.

Today I discovered another "data" duality, this time in Mandarin Chinese. In Taiwan, the term used is 資料 (Zīliào), while in Mainland China it is 數據 (Shùjù). Which one to use? What is the difference? I did a little research and tried a few popularity tests:

  1. Google Translate renders both Chinese terms as "data" in English. But translating "data" from English into Chinese yields 數據 (Shùjù), with the other term appearing as secondary. Here we also learn that 資料 (Zīliào) means "material".
  2. Chinese Wikipedia's main data article (embarrassingly poor) is for 數據 (Shùjù); the article for 資料 (Zīliào) redirects to it.
  3. A Google search of each term leads to a surprising difference in the number of hits:

[Screenshot: search results for the "data" term Zīliào]
[Screenshot: search results for the "data" term Shùjù]
I asked a few colleagues from different Chinese-speaking countries and learned further that 資料 (Zīliào) translates to "information". A Google Images search brings up images of "Information". This might also explain the double hit rate. A duality between data and information is especially interesting given the relationship between the two (and my related work on Information Quality with Ron Kenett).


So what about Big Data?  Here too there appear to be different possible terms, yet the most popular seems to be 大数据 (Dà shùjù), which also has a reasonably respectable Wikipedia article.

Thanks to my learned colleagues Chun-houh Chen (Academia Sinica), Khim Yong Goh (National University of Singapore), and Mingfeng Lin (University of Arizona) for their inputs.

Tuesday, May 14, 2013

An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

A while ago I wrote about the wonderful new age of real-data-made-available-for-academic-use through the growing number of data mining contests on platforms such as Kaggle and CrowdANALYTIX. Such data provide excellent examples for courses on data mining that help train the next generation of data scientists, business analysts, and other data-savvy graduates. 

A couple of years later, I am discovering the painful truth that many of these datasets are no longer available, most likely because the companies that shared the data pulled them out. This unexpected twist is extremely harmful to both academia and industry: instructors who design and teach data mining courses build course plans, assignments, and projects on the assumption that the data will remain available. And now we have to revise all our materials! What a waste of time, resources, and energy. Obviously, after the first time this happens, I think twice about whether to use contest data for teaching. I will not try to convince you of the waste of faculty time, the loss to students, or the loss to self-learners all over the globe. I'll wear my b-school professor hat and speak the ROI language: the companies lose. In the short term, all these students will not be competing in their contests. The bigger loss, however, is to the entire business sector in the longer term: I constantly receive urgent requests from businesses and large corporations for graduates trained in business analytics. "We are in desperate need of data-savvy managers! Can you help us find good data scientists?" Yet some of these same companies are pulling their data out of the hands of instructors and students.

I am perplexed as to why a company would pull their data away after the data was publicly available for a long time. It's not the fear of competitors, since the data were already publicly available. It's probably not the contest platform CEOs - they'd love the traffic. Then why? I smell lawyers...

One example is the beautiful Heritage Health contest that ran on Kaggle for 2 years. Now, there is no way to get the data, even if you were a registered contestant in the past (see screenshots). Yet, the healthcare sector has been begging for help in training "healthcare analytics" professionals. 


Data access is no longer available, although it was for two years.

I'd like to send out an appeal to Heritage Provider Network and other companies that have shared invaluable data through a publicly-available contest platform. Please consider making the data publicly available forever. We will then be able to integrate it into data analytics courses, textbooks, research, and projects, thereby providing industry with not only insights, but also properly trained students who've learned using real data and real problems.

This time, industry can contribute in a major way to reducing the academia-industry gap. Even if only for their own good.

Tuesday, March 13, 2012

Data liberation via visualization

"Data democratization" movements try to make data, and especially government-held data, publicly available and accessible. A growing number of technological efforts are devoted to such efforts and especially the accessibility part. One such effort is by data visualization companies. A recent trend is to offer a free version (or at least free for some period) that is based on sharing your visualization and/or data to the Web. The "and/or" here is important, because in some cases you cannot share your data, but you would like to share the visualizations with the world. This is what I call "data liberation via visualization". This is the case with proprietary data, and often even if I'd love to make data publicly available, I am not allowed to do so by binding contracts.

As part of a "data liberation via visualization" initiative, I went in search of a good free solution for disseminating interactive visualization dashboards while protecting the actual data. Two main free viz players in the market are TIBCO Spotfire Silver (free 1-year license Personal version), and Tableau Public (free). Both allow *only* public posting of your visualizations (if you want to save visualizations privately you must get the paid versions). That's fine. However, public posting of visualizations with these tools comes with a download button that make your data public as well.

I then tried MicroStrategy Cloud Personal (free beta version), which does allow public (and private!) posting of visualizations and does not provide a download button. Of course, in order to make visualizations public, the data must sit on a server that the visualization can reach. All the free public-posting tools keep your data on the company's servers, so you must trust the company to protect the confidentiality and safety of your data. MicroStrategy uses a technology whereby the company itself cannot download your data (your Excel sheet is converted to in-memory cubes stored on the server). Unfortunately, the tool lacks the ability to create dashboards with multiple charts (combining multiple charts into a fully linked interactive view).

Speaking of features, Tableau Public is the only one with full-fledged functionality like its paid cousins. Spotfire Silver Personal is stripped of highly useful charts such as scatterplots and boxplots. MicroStrategy Cloud Personal lacks multi-view dashboards and, for now, accepts only Excel files as input.

Monday, June 20, 2011

Got Data?!

The American Statistical Association's store used to sell cool T-shirts with the old-time beggar-statistician question "Got Data?" Today it is much easier to find data, thanks to the Internet. Dozens of student teams taking my data mining course have been able to find data from various sources on the Internet for their team projects. Yet, I often receive queries from colleagues in search of data for their students' projects. This is especially true for short courses, where students don't have sufficient time to search and gather data (which is highly educational in itself!).

One solution that I often offer is data from data mining competitions. KDD Cup is a classic, but there are lots of other data mining competitions that make huge amounts of real or realistic data available: past INFORMS Data Mining Contests (2008, 2009, 2010), ENBIS Challenges, and more. Here's one new competition to add to the list:

The European Network for Business and Industrial Statistics (ENBIS) announced the 2011 Challenge (in collaboration with SAS JMP). The title is "Maximising Click Through Rates on Banner Adverts: Predictive Modeling in the On Line World". It's a bit complicated to find the full problem description and data on the ENBIS website (you'll find yourself clicking through endless "more" buttons - hopefully these are not data collected for the challenge!), so I linked them up.

It's time for T-shirts saying "Got Data! Want Knowledge?"

Friday, October 09, 2009

SAS On Demand: Enterprise Miner -- Update

Following up on my previous posting about using SAS Enterprise Miner via the On Demand platform: from continued communication with experts at SAS, it turns out that with EM version 5.3, which is the one available through On Demand, there is no way to work with (or even access) non-SAS files. Their suggested solution is to use some other SAS product, such as Base SAS or SAS JMP (which is available through the On Demand platform), to convert your CSV files to SAS data files...

From both a pedagogical and practical point of view, I am reluctant to introduce SAS EM through On Demand to my MBA students. They will dislike the idea of downloading, learning, and using yet another software package (even if it is a client) just for the purpose of file conversion (from ordinary CSV files into SAS data files).

So at this point it seems that SAS EM via the On Demand platform may be useful in SAS-based courses that use SAS data files. Hopefully SAS will upgrade to the latest version, which is supposed to handle non-SAS data files.

Saturday, October 03, 2009

SAS On Demand: Enterprise Miner

I am in the process of trying out SAS Enterprise Miner via the (relatively new) SAS On Demand for Academics. In our MBA data mining course at Smith, we introduce SAS EM. In the early days, we'd get individual student licenses and have each student install the software on their computer. However, the software took too much space, and circulating a packet of CDs among multiple students was very awkward. We then moved to the server option, where SAS EM is available on the Smith School portal. Although this solved the individual installation and storage issues, the portal version is too slow to be practically useful for even a modest project. Disconnects and other problems have kept students away. So now I am hoping that the On Demand service that SAS offers (which they call SODA) will work.

For the benefit of other struggling instructors, here's my experience thus far: I have been unable to access any non-SAS data files, and therefore unable to evaluate the product. The On Demand version installed is EM 5.3, which is still very awkward in terms of importing data, especially non-SAS data. It requires uploading files to the SAS server via FTP, then opening SAS EM, creating a new project, and inserting a line or two of SAS code into the non-obvious "startup code" tab. The code includes a LIBNAME statement for creating a path to one's library and a FILENAME statement for reaching files in that library (thank goodness I learned SAS programming as an undergrad!). Definitely not for the faint of heart, and I suspect that MBAs won't love this either.

I've been in touch with SAS support, and thus far we haven't solved the data access issue, although they helped me find the path where my files were sitting (after logging in to SAS On Demand for Academics and clicking on your course, click on "how to use this directory").

If you have been successful with this process, please let me know!
I will post updates when I conquer this, one way or another.


Thursday, August 20, 2009

Data Exploration Celebration: The ENBIS 2009 Challenge

The European Network for Business and Industrial Statistics (ENBIS) has released the 2009 ENBIS Challenge. The challenge this time is to use an exploratory data analysis (EDA) tool to answer a bunch of questions regarding sales of laptop computers in London. The data, on nearly 200,000 transactions, comprise 3 files: sales data (for each computer sold, with time stamps and zipcode locations of customer and store), computer configuration information, and geographic information linking zipcodes to GIS coordinates. Participants are challenged to answer a set of 11 questions using EDA.

The challenge is sponsored by JMP (by SAS), who are obviously promoting the EDA strengths of JMP (fair enough), yet analysis can be done using any software.

What I love about this competition is that unlike other data-based competitions such as the KDD Cup, INFORMS, or the many forecasting competitions (e.g. NN3), it focuses solely on exploratory analysis. No data mining, no statistical models. From my experience, the best analyses rely on a good investment of time and energy in data visualization. Some of today's data visualization tools go way beyond static boxplots and histograms. Interactive visualization software such as TIBCO Spotfire (and Tableau, which I haven't tried) allows many operations, such as zooming, filtering, and panning. These tools support multivariate exploration via color, shape, panels, etc., and include specialized visualizations such as treemaps and parallel coordinate plots.

And finally, although the focus is on data exploration, the business context and larger questions are stated:

In the spirit of a "virtuous circle of learning", the insights gained from this analysis could then be used to design an appropriate choice experiment for a consumer panel to determine which characteristics of the various configurations they actually value, thus helping determine product strategy and pricing policies that will maximise Acell's projected revenues in 2009. This latter aspect is not part of the challenge as such.

The Business Objective:
Determine product strategy and pricing policies that will maximise Acell's projected revenues in 2009.

Management's Charter:
Uncover any information in the available data that may be useful in meeting the business objective, and make specific recommendations to management that follow from this (85%). Also assess the relevance of the data provided, and suggest how Acell can make better use of data in 2010 to shape this aspect of their business strategy and operations (15%).

Monday, September 22, 2008

Dr. Doom and data mining

Last month The New York Times featured an article about Dr. Doom: Economics professor "Roubini, a respected but formerly obscure academic, has become a major figure in the public debate about the economy: the seer who saw it coming."

This article caught my statistician's eye due to its description of "data" and "models". While economists in the article portray Roubini as not using data and econometric models, a careful read shows that he actually does use data and models - just perhaps unusual data and unusual models!

Here are two interesting quotes:
“When I weigh evidence,” he told me, “I’m drawing on 20 years of accumulated experience using models” — but his approach is not the contemporary scholarly ideal in which an economist builds a model in order to constrain his subjective impressions and abide by a discrete set of data.
Later on, the article continues:
"After analyzing the markets that collapsed in the ’90s, Roubini set out to determine which country’s economy would be the next to succumb to the same pressures."
This might not be data mining per se, but note that Roubini's approach is at heart similar to the data mining approach: looking at unusual data (here, taking an international view rather than focusing only on national data) and finding patterns within them that predict economic downfalls. In a standard data mining framework we would, of course, also include all those markets that have not collapsed, and then set up the problem as a "direct marketing" problem: who is most likely to fall?

A final note: as a strong believer in the difference between the goals of explaining and forecasting, I think that econometricians should stop limiting their modeling to explanatory, causality-based models. Good forecasting models might not be revealing in terms of causality, but in many cases their forecasts will be far more accurate than those from explanatory models!

Tuesday, February 26, 2008

Data mining competition season

Those who've been following my postings probably recall "competition season", when all of a sudden there are multiple new interesting datasets out there, each framing a business problem that requires combining data mining with creativity.

Two such competitions are the SAS Data Mining Shootout and the 2008 Neural Forecasting Competition. The SAS problem concerns revenue management for an airline that wants to improve its customer satisfaction. The NN5 competition is about forecasting cash withdrawals from ATMs.

Here are the similarities between the two competitions: both provide real data and reasonably realistic business problems. Now to a more interesting similarity: both involve time series forecasting tasks. From a recent survey on the popularity of types of data mining techniques, it appears that time series are becoming more and more prominent. Both competitions also require registration to access the data (I didn't compare their terms of use, but that's another interesting comparison), and both welcome any type of modeling. Finally, each is tied to a conference where competitors can present their results and methods.

What would be really nice is if, as in the KDD Cup, the winners' papers were published online and made publicly available.