Tuesday, May 14, 2013

An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

A while ago I wrote about the wonderful new age of real-data-made-available-for-academic-use through the growing number of data mining contests on platforms such as Kaggle and CrowdANALYTIX. Such data provide excellent examples for courses on data mining that help train the next generation of data scientists, business analysts, and other data-savvy graduates. 

A couple of years later, I am discovering the painful truth that many of these datasets are no longer available. The most likely reason is that the companies who shared the data have pulled them out. This unexpected twist is extremely harmful to both academia and industry: instructors and teachers who design and teach data mining courses build course plans, assignments, and projects based on the assumption that the data will be available. And now, we have to revise all our materials! What a waste of time, resources, and energy. Obviously, after the first time this happens, I think twice about whether to use contest data for teaching. I will not try to convince you of the waste of faculty time, the loss to students, or even the loss to self-learners who are now all over the globe. I'll wear my b-school professor hat and speak the ROI language: The companies lose. In the short term, all these students will not be competing in their contests. The bigger loss, however, is to the entire business sector in the longer term: I constantly receive urgent requests from businesses and large corporations for graduates trained in business analytics. "We are in desperate need of data-savvy managers! Can you help us find good data scientists?" Yet, some of these same companies are pulling their data out of the hands of instructors and students.

I am perplexed as to why a company would pull its data away after the data were publicly available for a long time. It's not the fear of competitors, since the data were already publicly available. It's probably not the contest platform CEOs - they'd love the traffic. Then why? I smell lawyers...

One example is the beautiful Heritage Health contest that ran on Kaggle for 2 years. Now, there is no way to get the data, even if you were a registered contestant in the past (see screenshots). Yet, the healthcare sector has been begging for help in training "healthcare analytics" professionals. 


Data access is no longer available, although it was for 2 years.

I'd like to send out an appeal to Heritage Provider Network and other companies that have shared invaluable data through a publicly available contest platform. Please consider making the data publicly available forever. We will then be able to integrate them into data analytics courses, textbooks, research, and projects, thereby providing industry with not only insights, but also properly trained students who've learned using real data and real problems.

This time, industry can contribute in a major way to reducing the academia-industry gap. Even if only for their own good.

Wednesday, May 01, 2013

Collaborations of LaTeX and Word users

The two popular document preparation tools used by researchers in academia are LaTeX and Microsoft Word. Or, put differently: Microsoft Word and LaTeX. In more technical fields, LaTeX is king, while in less technical fields, it is Word. In the business school, the two worlds collide. Coming from a technical background, I am a heavy user of LaTeX, for research papers and even for book writing. However, many of my business school collaborators (e.g., from the fields of Information Systems and Marketing) are Word users.

While collaboration platforms such as Google Drive and Dropbox have greatly enhanced collaborative possibilities, including co-editing a document, the Word-or-LaTeX schism still poses a serious challenge. I've had to migrate to Word (and suffer) in some collaborations, while in others I convinced my co-authors to move the document to LaTeX, but then I was the one receiving text bits to incorporate back into the document and sharing the compiled PDF (which is easy via Dropbox).

For someone used to LaTeX, Word is quite awkward: handling different document components such as bibliography and sectioning is cumbersome; journal templates are easier to use in LaTeX; and writing formulas is much easier in LaTeX. For Word users, LaTeX usually seems intimidating, as it is not WYSIWYG (you must click a button to compile the text and then see the resulting PDF in a separate PDF viewer).

One solution is the open-source LyX package, which has a graphical interface with LaTeX "under the hood". I personally found it unsatisfactory, as it is "neither here nor there"...

So what to do if you're a LaTeX junkie and want to move a project from Word into LaTeX? Let's start with the initial migration. Here are a few useful tools that I discovered:
  • Tables: To convert a table from Word into a LaTeX table, copy-paste it into Excel and then use the Excel2LaTeX tool. Simply download the xla file and open it in Excel. It will add an add-in menu. Select the table, click the convert button, and you can choose either to copy the LaTeX code or to export it to a .tex file.
  • Bibliography: To convert a Word Bibliography file into a LaTeX bib file, use the neat Word2Bibtex tool. Download the bibtex.xsl file and follow the directions. Note: the tool will only work if you have administrator privileges on the computer, as it requires copying a file into an "admin only" folder.
  • Figures: unlike Word, where you copy-paste images, in LaTeX you'll need them as separate image files (png, jpg, eps, etc.). If you only have a few, right-click each figure in Word, then "Save as image" and choose png or jpg. If you have many figures in the Word doc, save the doc as "filtered HTML". This will create a separate folder with all the image files (if they are saved as gif, you'll have to convert them to png or jpg).
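For readers who cannot install an Excel add-in, the table step can also be approximated with a few lines of Python. This is only a minimal sketch and not how Excel2LaTeX works: the function name `tsv_to_tabular` is made up for illustration, it assumes the cells were copied from Excel as tab-separated text with the first row as a header, and it does not escape characters that are special in LaTeX.

```python
def tsv_to_tabular(tsv_text):
    """Turn tab-separated text (e.g. cells copied from Excel) into a LaTeX tabular.

    Hypothetical helper for illustration; cell contents are NOT escaped.
    """
    # Strip only surrounding newlines so an empty first cell (leading tab) survives.
    rows = [line.split("\t") for line in tsv_text.strip("\r\n").splitlines()]
    if not rows or rows == [[""]]:
        raise ValueError("no table rows found")
    n_cols = max(len(row) for row in rows)

    lines = ["\\begin{tabular}{" + "l" * n_cols + "}", "\\hline"]
    for i, row in enumerate(rows):
        # Pad short rows so every row has the same number of cells.
        cells = row + [""] * (n_cols - len(row))
        lines.append(" & ".join(cell.strip() for cell in cells) + " \\\\")
        if i == 0:
            lines.append("\\hline")  # rule under the (assumed) header row
    lines += ["\\hline", "\\end{tabular}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(tsv_to_tabular("Name\tScore\nAnn\t90\nBob\t85"))
```

The output can be pasted into the tex file as is; column alignment and captions still need to be adjusted by hand.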
Now, to the co-editing of the tex file. I have still not completely resolved the problem of the non-LaTeX collaborators editing the file. I always get the question: "Can you send me a Word version so that I can edit it?" Here are some options:
  • They can annotate the PDF file using highlighting and sticky notes.
  • They can copy-paste from the PDF file into Word. Figures can be copied using Acrobat Reader's Edit > Take a Snapshot, but they are usually not needed for editing.
  • They can open the .tex file with WordPad for editing the text.
  • It's also possible to convert back to Word: the latex2rtf tool should do that (though it clashed with my TeXstudio editor and erased my tex file!)
But then, even if you do convert to Word, what to do with the Word file once the collaborator has done his/her editing?
Another solution is to use a cloud LaTeX platform, such as ShareLaTeX.com. The advantage is that there's no need to install software, and the editor and viewer are nicely set side-by-side with a big green "recompile" button. The free version allows collaborating with one free user. The paid versions are more generous (I like the "coming soon" integrations with Google Drive and Dropbox!).
  • Catch #1: you must be online to compile. 
  • Catch #2: long documents such as books can take substantially longer to compile online compared to locally. 
  • Catch #3: if the non-LaTeX collaborator uses some tex-unfriendly text (such as a $ sign to denote USD), the compilation will fail. So, basic tex knowledge is needed - or babysitting by the LaTeX-head collaborator.
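One way to reduce the babysitting in Catch #3 is to escape the problem characters before pasting a collaborator's plain text into the tex file. The following is a minimal sketch, not a feature of any of the tools above: the names `LATEX_SPECIALS` and `escape_latex` are made up for illustration, and it assumes the input is plain prose. Running it on text that already contains LaTeX commands or math would break them.

```python
# Replacements for the ten characters that LaTeX treats specially in running text.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape LaTeX special characters in plain (non-LaTeX) text.

    Works character by character, so the backslashes it inserts
    are never escaped a second time.
    """
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(escape_latex("Revenue grew 10% to $5 million (R&D excluded)."))
```

For example, a sentence mentioning "$5" comes out with "\$5", which compiles as a literal dollar sign instead of starting math mode.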
I would love to hear from others who are tackling these collaboration issues and have found good solutions.