For anyone who isn’t familiar with Kaggle, it is a website that hosts data and modeling challenges from various organizations for data scientists who want to practice their skills and occasionally win money.  I recently participated in a Kaggle challenge sponsored by the African Soil Information Service entitled African Soil Property Prediction.  The goal was to predict five soil properties (soil organic carbon, pH, calcium, phosphorus and sand content) from diffuse reflectance infrared spectroscopy measurements along with a few spatial predictors for each soil sample.  The spectroscopy measurements were discretized into 3578 points, and adding the 16 spatial predictors resulted in 3594 columns in the data.  There were 1157 soil samples in the training set and 727 in the test set that was provided.

I like to have at least 10 samples per variable, so my first step was to reduce the number of columns to be more in line with the number of samples I had, ideally from 3594 down to around 115 (if anyone else has thoughts on rules of thumb for sample-to-predictor ratios, please share in the comments!).  To do this, I used a Haar wavelet transform to clean up and shrink the spectroscopy signal.  Here is a visual of how the wavelet transform cleaned the signal (original signal on the left, cleaned signal on the right):

[Six plots: the original spectroscopy signal alongside the progressively cleaned, downsampled signal after successive passes of the Haar wavelet transform.]

With each run of this transform, the signal shrank by half, so in the last frame it has been reduced from the original 3578 data points to 111, which is much more in line with the 10-samples-per-variable ratio I was aiming for.  The wavelet transform reduced the size and cleaned out much of the noise, yet most of the characteristics of the signal itself remained intact.  I then fed those points into a few different algorithms using the caret package in R.  Unfortunately, none of the methods I tried performed very well, and in the end I did not finish with a very impressive ranking.
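To make the pre-processing and modeling steps concrete, here is a minimal sketch in R of the idea (not my actual competition code): each pass of a Haar transform replaces adjacent pairs of points with their scaled averages, halving the signal, and the compressed spectra can then be handed to caret’s train().  The object names (train_spectra, train_ph) and the choice of partial least squares as the method are placeholders for illustration.

```r
library(caret)

# One pass of a Haar transform: replace each adjacent pair of points with its
# scaled average, halving the signal length and smoothing high-frequency noise.
haar_smooth <- function(x) {
  if (length(x) %% 2 == 1) x <- x[-length(x)]        # drop the last point if the length is odd
  (x[c(TRUE, FALSE)] + x[c(FALSE, TRUE)]) / sqrt(2)  # pairwise Haar approximation coefficients
}

# Repeated passes shrink the signal geometrically; five passes take the
# 3578-point spectrum down to 111 points.
compress_spectrum <- function(x, n_levels = 5) {
  for (i in seq_len(n_levels)) x <- haar_smooth(x)
  x
}

# Hypothetical usage: `train_spectra` is a samples-by-wavenumbers matrix and
# `train_ph` is one of the five soil property targets (both placeholder names).
# reduced <- t(apply(train_spectra, 1, compress_spectrum))
# fit <- train(x = reduced, y = train_ph, method = "pls",
#              trControl = trainControl(method = "cv", number = 10))
```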

At the end of the contest, the winner is required to post his/her code.  In this contest, I was surprised to see that the winner used this same methodology to pre-process his data, considering how poorly it had worked for me.  However, he used it in combination with other pre-processing methods and a variety of algorithms in an ensemble of roughly 50 models.
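For anyone unfamiliar with the term, an ensemble in this sense simply combines the predictions of many fitted models, often by a plain or weighted average.  Here is a toy sketch of the idea (not the winner’s code), reusing the placeholder objects from the snippet above plus a hypothetical test_reduced matrix of compressed test-set spectra:

```r
# Fit a handful of different model types on the same compressed features and
# average their predictions; a real competition ensemble would use many more
# models, different pre-processing pipelines, and tuned blending weights.
methods <- c("pls", "svmRadial", "cubist")
fits <- lapply(methods, function(m)
  train(x = reduced, y = train_ph, method = m,
        trControl = trainControl(method = "cv", number = 10)))
preds <- sapply(fits, function(f) predict(f, newdata = test_reduced))
ensemble_pred <- rowMeans(preds)  # simple unweighted average of the model predictions
```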

I have to applaud his efforts; it was certainly an impressive model.  However, I don’t necessarily feel that it should have been the winning solution.  A 50-algorithm ensemble is not really a practical solution to the problem.  Many of the contests on Kaggle are run for fun and for flexing analytical muscle, in which case more power to the person who can construct the most models in the least amount of time.  But for a contest in which the sponsor intends to use, and hopefully implement, the result, this approach isn’t ideal.

For example, take the Netflix Prize, a competition to improve upon the accuracy of the Netflix recommendation engine by 10%.  The competition ran for three years before a team finally achieved the 10% improvement and was awarded the $1 million prize.  However, the winning solution was never implemented by Netflix, because it was a 100-algorithm ensemble.  Netflix ended up implementing only two of the algorithms that the team came up with.  As for the rest of the algorithms in the ensemble, Netflix made this statement in a blog post about the competition:

“…the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”

Personally, I would be in favor of some sort of complexity penalty incorporated into the evaluation of contests in which the sponsor would like to actually implement the results.  That way, the final performance measure wouldn’t come down to model accuracy alone (which generally increases as you add algorithms to your ensemble); instead, it would reward the algorithm (and pre-processing steps) that provides the most accuracy with the least complexity, making it more likely that the solution could be implemented in a production environment.
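As a rough sketch of what I mean (the penalty weight and the complexity measure below are entirely made up for illustration), the leaderboard score could be the prediction error plus a term that grows with the number of models in the submission:

```r
# Hypothetical penalized leaderboard score: lower is better.  `lambda` sets how
# heavily complexity is penalized, and `complexity` could be something as crude
# as the number of models in the ensemble; both are illustrative choices only.
penalized_score <- function(rmse, complexity, lambda = 0.01) {
  rmse + lambda * complexity
}

# A single well-tuned model with RMSE 0.52 would then beat a 50-model
# ensemble with RMSE 0.50:
penalized_score(0.50, complexity = 50)  # 1.00
penalized_score(0.52, complexity = 1)   # 0.53
```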

It would be a challenge to come up with a good way to measure the complexity of a solution, and to automate that measure so that scoring competition entries would be as simple as submitting your code.  Any thoughts on this would be very welcome in the comments!