Thursday, May 20, 2010

Google's new prediction API

I just learned of the new Prediction API by Google -- in brief, you upload a training set with up to 1 million records and let Google's engine build an algorithm trained on the data. Then, upload a new dataset for prediction, and Google will apply the learned algorithm to score those data.

On the user's side, this is a total black box: you have no idea which algorithms are tried or which one is chosen (probably an ensemble). The predictions are therefore valuable only for their utility (accurate predictions), not for any insight into the model. For researchers, this is a great tool for getting a predictive accuracy benchmark. I foresee future data mining students uploading their data to the Google Prediction API to see how well they could potentially do by mining the data themselves!

From Google's perspective this API presents a terrific opportunity to improve their own algorithms on a wide set of data.

Someone mentioned that there are interesting bits in the FAQ. I like their answer to "How accurate are the predictions?", which is "more data and cleaner data always triumphs over clever algorithms".

Right now the service is free (if you get an invitation), but it looks like it will eventually be a paid service. Hopefully they will have an "academic version"!

Wednesday, May 12, 2010

SAS On Demand Take 3: Success!

I am following up on two earlier posts regarding using SAS On Demand for Academics. The version of EM has been upgraded to 6.1, which means that I am now able to upload and reach non-SAS files on the SAS Server - hurray!

The process is quite cumbersome, and I do thank my SAS programming memory from a decade ago. Here's a description for those instructors who want to check it out (it took me quite a while to piece together all the different parts and figure out the right code):
  1. Find the directory path for your course on the SAS server. Log in to SODA. Next to the appropriate course that you registered, click on the "info" link. Scroll down to the line starting with "filename sample" and you'll find the directory path.
  2. Upload the file of interest to the SAS server via FTP. Note that you can upload txt and csv files but not xls or xlsx files. The hostname is . You will also need your username and password. Upload your file to the path that you found in #1.
  3. To read the file in SAS SODA EM, start a new project. When you click on its name (top left), you should be able to see "Project Start Code" in the left side-bar. Click on the ...
  4. Now enter the SAS code to run for this project. The following code will allow you to access your data. The trick is both to read the file and to put it into a SAS Library where you will be able to reach it for modeling. Let's assume that you uploaded the file sample.csv:
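        libname mylib '/courses/.../';               /* THIS IS YOUR PATH */
        filename myfile '/courses/.../sample.csv';   /* USE THE SAME PATH */
        /* read the uploaded CSV file into a temporary data set */
        data mydata;
            infile myfile DLM='2C0D'x firstobs=2 missover;
            input x1 x2 x3 ...;
        run;
        /* copy the data set into the library so that it can be reached for modeling */
        data mylib.mydata;
            set mydata;
        run;

     The options in the infile line make sure that a CSV file is read correctly: DLM='2C0D'x treats both the comma and the carriage return at the end of the line as delimiters (tricky!), firstobs=2 starts reading at the second line (skipping a header row), and missover sets any remaining variables to missing when a line is short instead of continuing on to the next line.

     You can replace all the names that start with "my" with your favorite names. Keep in mind that SAS limits library and file reference names to 8 characters.

     Note that only instructors can upload data to the SAS server, not students. Also, if you plan to share data with your students, you might want to set the files as read-only.

  5. The last step is to create a new data source. Choose SAS Table and find the new library that you created (called "mylib"). Double-click on it to see the table ("mydata") and choose it. You can now drag the new data source to the diagram.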
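To see what the delimiter and missover options from step 4 do without uploading anything, here is a small self-contained sketch. The data set name demo and the inline values are made up for illustration, and the data are read from inline datalines rather than from a file on the server, so this is only an approximation of the real setup.

```sas
/* Hypothetical toy example: the second record is deliberately short */
data demo;
    /* same idea as in step 4: comma-delimited input, short lines padded with missing values */
    infile datalines dlm=',' missover;
    input x1 x2 x3;
    datalines;
1,2,3
4,5
;
run;

/* print the result to check how the short record was read */
proc print data=demo;
run;
```

In the printed output, the second row should show x3 as missing: missover keeps SAS from moving on to the next line to look for the third value.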

Saturday, May 08, 2010

Short data mining videos

I just discovered a short set of videos (currently 35) on different data mining methods on the StatSoft website. This accompanies their neat free online book (I admit, I did end up buying the print copy). The videos show up at the top of various data mining topics in the online book. You can also subscribe to the video series.