Tortured Data

If you torture the data long enough, it will confess

Wavelet Transforms and the Practicality of Ensemble Models

For anyone who isn’t familiar with Kaggle, it is a website that provides data and modeling challenges from various organizations to data scientists who want to practice their skills, and occasionally win money. I recently participated in a Kaggle challenge sponsored by the Africa Soil Information Service entitled African Soil Property Prediction. The goal was to predict five soil properties (soil organic carbon, pH, calcium, phosphorus and sand content) from diffuse reflectance infrared spectroscopy measurements, along with a few spatial predictors, for each soil sample. The spectroscopy measurements were discretized into 3578 points, and adding the 16 spatial predictors resulted in 3594 columns in the data. There were 1157 soil samples in the training set and 727 in the provided testing set.

I like to have at least 10 samples per variable, so my first step was to reduce the number of columns to something more in line with the number of samples I had – ideally from 3594 down to around 115 (if anyone else has thoughts on rules of thumb for sample-to-predictor ratios, please share in the comments!). To do this, I used a Haar wavelet transform to clean up and shrink the spectroscopy signal. Here is a visual of how the wavelet transform cleaned the signal (original signal on the left, cleaned signal on the right):

[Six plots: successive passes of the Haar wavelet transform applied to a sample spectrum – the original signal on the left, the cleaned, reduced signal on the right.]

With each pass of the transform the signal shrank by half, so in the last frame the signal has been reduced from the original 3578 points to 111, much more in line with the 10-samples-per-variable ratio I was aiming for. The wavelet transform reduced the size of the signal and cleaned out much of the noise, while leaving most of the characteristics of the signal itself intact. I then fed those points into a few different algorithms using the caret package in R. Unfortunately, none of the methods I tried performed very well, and in the end I did not finish with a very impressive ranking.
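If you are curious what that looks like in code, here is a minimal sketch in base R. It is my own illustration of the idea rather than the exact code I used, and spectrum stands in for a hypothetical numeric vector holding the 3578 reflectance values:

    # One smoothing pass of the Haar transform: keep the scaled pairwise
    # averages (the approximation coefficients) and discard the detail
    # coefficients, which carry most of the high-frequency noise.
    haar_smooth <- function(x) {
      if (length(x) %% 2 == 1) x <- x[-length(x)]  # drop a point if the length is odd
      odds  <- x[seq(1, length(x), by = 2)]
      evens <- x[seq(2, length(x), by = 2)]
      (odds + evens) / sqrt(2)
    }

    # Five passes take the 3578-point spectrum down to 111 points.
    smoothed <- spectrum
    for (i in 1:5) smoothed <- haar_smooth(smoothed)
    length(smoothed)  # 111

Each pass halves the length, which is why the signal in each successive frame above is half the size of the one before it.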

At the end of a contest, the winner is required to post his or her code. In this contest, I was surprised to see that the winner had used this same methodology to pre-process his data, considering how poorly it had worked for me. However, the winner used it in combination with other pre-processing methodologies and a variety of algorithms, in an ensemble of 50 or so models.

I have to applaud his efforts; it was certainly an impressive model. However, I don’t necessarily feel that it should have been the winning solution. A 50-algorithm ensemble is not really a practical solution to the problem. Many of the contests on Kaggle are run for fun and for flexing analytical muscle, in which case more power to the person who can construct the most models in the least amount of time. However, for a contest in which the sponsor intends to use and hopefully implement the result, this method isn’t ideal.

For example, take the Netflix Prize, a competition to improve the accuracy of the Netflix recommendation engine by 10%. The competition ran for three years before a team finally achieved the 10% improvement, and the winning team was awarded $1 million. However, the winning solution was never implemented by Netflix, because it was an ensemble of more than 100 algorithms. Netflix ended up implementing only two of the algorithms the team came up with. As for the rest of the ensemble, Netflix made this statement in a blog post about the competition:

“…the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”

Personally, I would be in favor of some sort of complexity penalty that could be incorporated into the evaluation for contests in which the sponsor would like to actually implement the results. This way, the final performance measure wouldn’t come down to just model accuracy (which generally increases as you increase the number of algorithms in your ensemble); instead, it would be a measure of which algorithm (and pre-processing steps) can provide the most accuracy with the least complexity, making it more likely that the solution could be implemented in a production environment.
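To sketch what I mean, the score could subtract a penalty that grows with the size of the ensemble. Everything below is invented for illustration, including the penalty weight lambda, which the sponsor would have to choose:

    # Hypothetical complexity-penalized score: accuracy minus a penalty
    # proportional to the number of base models in the ensemble.
    penalized_score <- function(accuracy, n_models, lambda = 0.001) {
      accuracy - lambda * n_models
    }

    penalized_score(0.85, n_models = 1)   # 0.849 – a simple model keeps nearly all of its accuracy
    penalized_score(0.87, n_models = 50)  # 0.820 – a slightly more accurate ensemble pays for its complexity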

It would be a challenge to come up with a good way to measure the complexity of a solution, and to automate that measure so that scoring competition entries would remain as simple as submitting your code. Any thoughts on this would be very welcome in the comments!

6 Comments

  1. “However, for a contest in which the sponsor intends to use and hopefully implement the result, this method isn’t ideal.”

    Depends completely on the application – in some environments, longer model training and scoring times aren’t acceptable; in others they are.

    In the Higgs challenge, the sponsors separately awarded the best-performing simple model. Perhaps this will become a trend.

    “This way, the final performance measure wouldn’t come down to just model accuracy (which is almost always increased as you increase the number of algorithms in your ensemble)”

    This is false. Ensemble performance does not reduce to “higher quantity equals higher performance” – it’s more complicated than that. I recommend Ensemble Methods: Foundations and Algorithms by Zhi-Hua Zhou.

    • Hi Dean,

      Thanks for reading!

      In response to your first point about the application – I wasn’t referring to training and scoring times when I said that this method isn’t ideal; I was referring to getting a model implemented at all. Many of the companies I’ve worked with implement models by re-creating them in ETL code, which would make a large ensemble all but impossible to productionalize, even if longer wait times for scores were acceptable for their particular application. I’m sure analytics-mature companies have integrated analytics platforms, but even then there may be overhead in implementing a Kaggle solution (e.g. if the winner created his/her ensemble in R, but the company works with SPSS Modeler).

      I am not familiar with the Higgs challenge, but I like the idea of awarding both the best-performing model and the model that best balances simplicity and performance. I hope that does catch on!

      Valid point that ensemble performance is more complex than simple algorithm quantity. I changed my statement to, “…model accuracy (which generally increases as you increase the number of algorithms in your ensemble).” There are certainly caveats, but if all other factors are held constant, I think it is fair to say that an ensemble with more algorithms will perform better than one with fewer.

      Thanks for the book recommendation, I will check it out!

      • “Many of the companies I’ve worked with implement models by re-creating them in ETL code, which would make a large ensemble all but impossible to productionalize, even if longer wait times for scores were acceptable for their particular application.”

        But these companies aren’t bringing their problems to Kaggle! “Ensembles aren’t useful for companies that can’t use ensembles” is a tautology.

        “I’m sure analytics-mature companies have integrated analytics platforms, but even then there may be overhead in implementing a Kaggle solution (e.g. if the winner created his/her ensemble in R, but the company works with SPSS Modeler).”

        This is just anecdotal, but I seem to notice competition sponsors specifying which software is not allowed – e.g. SAS is not valid for the click-through competition now running.

        “Valid point that ensemble performance is more complex than simple algorithm quantity. I changed my statement to, ‘…model accuracy (which generally increases as you increase the number of algorithms in your ensemble).’ There are certainly caveats, but if all other factors are held constant, I think it is fair to say that an ensemble with more algorithms will perform better than one with fewer.”

        This still isn’t true; you’re being misled by selection bias, because you notice that Kaggle winners tend to have quite a few base models. Increasing the number of algorithms in your ensemble will only make it perform better if each additional model “plays well” with the other models. If the errors of an additional model are highly correlated with those of an existing model, then performance will suffer. The key is finding a panel of experts (models) that complement each other – Kaggle winners in general are good at this. A toy simulation below makes the point concrete.
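        To illustrate (the numbers here are fabricated purely for the demonstration): averaging five models whose errors are independent cuts the error by roughly a factor of sqrt(5), while averaging five models that share a common error component barely helps.

            set.seed(42)
            n <- 10000
            truth <- rnorm(n)
            rmse <- function(pred) sqrt(mean((pred - truth)^2))

            # Five models with independent errors (sd = 1 each).
            indep <- replicate(5, truth + rnorm(n, sd = 1))

            # Five models sharing a common error component, so their
            # errors are highly correlated (each still has sd of about 1).
            shared <- rnorm(n, sd = 0.9)
            corr   <- replicate(5, truth + shared + rnorm(n, sd = 0.44))

            rmse(indep[, 1])       # ~1.0  a single base model
            rmse(rowMeans(indep))  # ~0.45 independent errors average away
            rmse(rowMeans(corr))   # ~0.92 correlated errors do not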

          • I would disagree with your assertion that companies who are implementing models in ETL aren’t bringing their data to Kaggle. Per the Kaggle website (this is their blurb to persuade companies to sponsor a competition):

          “Many organizations don’t have access to the advanced machine learning that provides the maximum predictive power from their data. Meanwhile, data scientists and statisticians crave real-world data to develop their techniques. Kaggle offers companies a cost-effective way to harness this ‘cognitive surplus’ of the world’s best data scientists.”

          Kaggle’s pitch is aimed at helping companies who can’t help themselves. I’m sure that many of the companies who decide to work with Kaggle are analytically savvy, but Kaggle doesn’t exclude those that are not; in fact, it is marketing to that audience.

          I would say my argument would be better phrased as:

          – Large ensembles are not useful to companies who don’t have state-of-the-art analytics platforms integrated into their database systems due to high implementation costs

          – Many companies do not have state-of-the-art analytics platforms integrated into their database systems

          – Therefore large ensembles are not useful for many companies

          Lastly, I have acknowledged that there are caveats to my statement that increasing the number of algorithms will increase ensemble performance, and I would prefer not to list all of those caveats in the comments. I still believe that, as a general rule, my statement is true, and we may just have to agree to disagree on that point.

  2. This was a multivariate model you were trying to run, because you are predicting 5 dependent variables at once, right? What methods did you employ to produce this type of model, and any lessons learned? Thanks!

    • I tried a few different algorithms for this (all through the caret package), and for each type of algorithm I built 5 separate models, one for each dependent variable. I tried linear regression, partial least squares and MARS, none of which were particularly successful for me. I’m sure that if I had spent more time on tuning them or on the pre-processing I could have gotten better results, but this contest ran during my wedding, which rather derailed my efforts. Lesson learned: don’t get married mid-Kaggle competition! A rough sketch of the setup follows.
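      Here is a minimal sketch of that setup with caret. The names train_x, train_y and test_x are hypothetical stand-ins for the wavelet-reduced training predictors, a data frame with one column per soil property, and the test predictors:

          library(caret)

          # 10-fold cross-validation for tuning each model.
          ctrl <- trainControl(method = "cv", number = 10)

          # One model per dependent variable; swap method for "lm"
          # (linear regression) or "earth" (MARS) to try the others.
          fits <- lapply(names(train_y), function(target) {
            train(x = train_x, y = train_y[[target]],
                  method = "pls", trControl = ctrl)
          })
          names(fits) <- names(train_y)

          # Predict all five soil properties for the test set.
          preds <- sapply(fits, predict, newdata = test_x)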
