BzST | Business Analytics, Statistics, Teaching: SAS


Thursday, September 15, 2011

Mining health-related data: How to benefit scientific research

Image from KDnuggets.com

While debates over privacy issues related to electronic health records are still ongoing, predictive analytics are beginning to be used with administrative health data (available to health insurance companies, aka "health provider networks"). One such venue is large data mining contests. Let me describe a few and then get to my point about their contribution to public health, medicine, and data mining research.

The latest and grandest is the ongoing $3 million prize contest by Heritage Provider Network, which opened in 2010 and runs for two years. The contest's stated goal is to create "an algorithm that predicts how many days a patient will spend in a hospital in the next year". Participants get a dataset of de-identified medical records of 100,000 individuals, on which they can train their algorithms. The article on KDnuggets.com suggests that this competition's goal is "to spur development of new approaches in the analysis of health data and create new predictive algorithms."

The 2010 SAS Data Mining Shootout contest was also health-related. Unfortunately, the contest webpage is no longer available (the problem description and data were previously available here), and I couldn't find any information on the winning strategies. According to an article on KDnuggets, the contest involved:

"analyzing the medical, demographic, and behavioral data of 50,788 individuals, some of whom had diabetes. The task was to determine the economic benefit of reducing the Body Mass Indices (BMIs) of a selected number of individuals by 10% and to determine the cost savings that would accrue to the Federal Government's Medicare and Medicaid programs, as well as to the economy as a whole"

In 2009, the INFORMS data mining contest was co-organized by IBM Research and Health Care Intelligence, and focused on "health care quality". Strangely enough, this contest website is also gone. A brief description by the organizer (Claudia Perlich) is given on KDnuggets.com, stating the two goals:

modeling of a patient transfer guideline for patients with a severe medical condition from a community hospital setting to tertiary hospital provider and
assessment of the severity/risk of death of a patient's condition.

What about presentations/reports from the winners? I had a hard time finding any (here is a deck of slides by a group competing in the 2011 SAS Shootout, also health-related). But photos of winners holding awards and checks abound.

If these health-related data mining competitions are to promote research and solutions in these fields, the contest webpages with the problem description and data, as well as presentations/reports by the winners, should remain publicly available (as is the case for the annual KDD Cup competitions by the ACM). Posting only names and photos of the winners makes data mining competitions look more like a consulting job where the data provider is interested in solving one particular problem for its own (financial or other) benefit. There is definitely scope for a data mining group/organization to collect all this information while it is live and post it on one central website.

Wednesday, November 10, 2010

ASA's magazine: Excel's default charts

Being in Bhutan this year, I have asked the American Statistical Association (ASA) and INFORMS to mail the magazines that come with my membership to Bhutan. Although I can access the magazines online, I greatly enjoy receiving the issues by mail (even if a month late) and leafing through them leisurely. Not to mention the ability to share them with local colleagues who are seeing these magazines for the first time!

Now to the data-analytic reason for my post: The main article in the August 2010 issue of AMSTAT News (the ASA's magazine) on Fellow Award: Revisited (Again) presented an "update to previous articles about counts of fellow nominees and awardees." The article consisted of many tables and line charts. While charts are a great way to present a data-based story, the charts in this article were of low quality (see image below). Apparently, the authors used Excel 2003's defaults, which have poor graphic qualities and too much chart-junk: a dark grey background, horizontal gridlines, line colors not very suitable for black-and-white printing (such as the print issue), a redundant combination of line color and marker shape, and redundant decimals on several of the plots' y-axis labels.

Since AMSTAT News is the flagship magazine of the ASA, I hope that the editors will scrutinize the graphics and data visualizations used in its articles, and perhaps offer authors access to powerful data visualization software such as TIBCO Spotfire, Tableau, or SAS JMP. Major newspapers such as the New York Times and Washington Post now produce high-quality visualizations. Statistics magazines mustn't fall behind!

Wednesday, May 12, 2010

SAS On Demand Take 3: Success!

I am following up on two earlier posts regarding using SAS On Demand for Academics. The version of EM has been upgraded to 6.1, which means that I am now able to upload and reach non-SAS files on the SAS Server - hurray!

The process is quite cumbersome, and I do thank my SAS programming memory from a decade ago. Here's a description for those instructors who want to check it out (it took me quite a while to piece together all the different parts and figure out the right code):

1. Find the directory path for your course on the SAS server. Log in to SODA (https://support.sas.com/ctx3/sodareg/user.html). Next to the appropriate course that you registered, click on the "info" link. Scroll down to the line starting with "filename sample" and you'll find the directory path.
2. Upload the file of interest to the SAS server via FTP. Note that you can upload txt and csv files but not xls or xlsx files. The hostname is sascloudftp.sas.com . You will also need your username and password. Upload your file to the path that you found in #1.
3. To read the file in SAS SODA EM, start a new project. When you click on its name (top left), you should be able to see "Project Start Code" in the left side-bar. Click on the ...
4. Now enter the SAS code to run for this project. The following code will allow you to access your data. The trick is both to read the file and to put it into a SAS Library where you will be able to reach it for modeling. Let's assume that you uploaded the file sample.csv:

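/* Library that will hold the final table: this is your path */
libname mylibrary '/courses/.../';

/* File reference to the uploaded CSV: use the same path */
filename myfile '/courses/.../sample.csv';

/* Read the CSV, skipping the header row */
data mydata;
  infile myfile DLM='2C0D'x firstobs=2 missover;
  input x1 x2 x3 ...;
run;

/* Copy the table into the library so that EM can reach it */
data mylibrary.mydata;
  set mydata;
run;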

The options in the infile line will make sure that a CSV file is read correctly (commas and the carriage return at the end of the line! tricky!)
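
For readers unfamiliar with the hex notation: '2C'x is the comma and '0D'x is the carriage return, so DLM='2C0D'x treats both characters as delimiters. A possible alternative, sketched here under assumptions and not tested on the SODA server, is to let the DSD option handle the commas and TERMSTR=CRLF handle Windows-style line endings. The three numeric columns x1, x2, x3 are placeholders for illustration, and the path is elided exactly as in the code above.

```sas
/* Sketch only: same elided course path as in the code above */
filename myfile '/courses/.../sample.csv';

data mydata;
  /* DSD: comma-delimited, consecutive commas read as missing values  */
  /* TERMSTR=CRLF: treat carriage return + line feed as the line end  */
  /* FIRSTOBS=2: skip the header row                                  */
  /* TRUNCOVER: do not jump to the next line when a row is short      */
  infile myfile dsd dlm=',' termstr=crlf firstobs=2 truncover;
  input x1 x2 x3;   /* placeholder numeric columns */
run;
```

Whether this variant behaves better than the hex delimiter depends on how the file was created; if the CSV was saved with Unix line endings, TERMSTR=CRLF should be dropped.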

You can replace all the names that start with "my" with your favorite names.

Note that only instructors can upload data to the SAS server, not students. Also, if you plan to share data with your students, you might want to set the files as read-only.

5. The last step is to create a new data source. Choose SAS Table and find the new library that you created (called "mylibrary"). Double-click on it to see the table ("mydata") and choose it. You can now drag the new data source to the diagram.
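
Before building the data source, it can help to confirm that the table actually landed in the library. The following is a minimal sketch, assuming the names mylibrary and mydata from the code above and the same elided course path. Assigning the library with ACCESS=READONLY is one possible way for students to use the shared data without being able to overwrite it; it is an assumption here, not necessarily the read-only setup the post has in mind.

```sas
/* Sketch only: same elided course path as above, assigned read-only */
libname mylibrary '/courses/.../' access=readonly;

/* List the variables, their types, and the number of observations */
proc contents data=mylibrary.mydata;
run;

/* Print the first five rows as a quick sanity check */
proc print data=mylibrary.mydata(obs=5);
run;
```

If PROC CONTENTS reports zero observations or the variables look shifted, the INFILE options in the project start code are the first thing to revisit.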

Tuesday, February 26, 2008

Data mining competition season

Those who've been following my postings probably recall "competition season" when all of a sudden there are multiple new interesting datasets out there, each framing a business problem that requires the combination of data mining and creativity.

Two such competitions are the SAS Data Mining Shootout and the 2008 Neural Forecasting Competition (NN5). The SAS problem concerns revenue management for an airline that wants to improve its customer satisfaction. The NN5 competition is about forecasting cash withdrawals from ATMs.

Here are the similarities between the two competitions: they both provide real data and reasonably real business problems. Now to a more interesting similarity: they both have time series forecasting tasks. From a recent survey on the popularity of types of data mining techniques, it appears that time series are becoming more and more prominent. They also both require registration in order to get access to the data (I didn't compare their terms of use, but that's another interesting comparison), and welcome any type of modeling. Finally, they are both tied to a conference, where competitors can present their results and methods.

What would be really nice is if, as with the KDD Cup, the winners' papers were published online and made publicly available.