In praise of gradient boosting

February 2015

A friend of mine who’s taking an introductory machine learning course recently emailed me with a bunch of questions about my experience doing machine learning. I thought I’d post parts of my response here to get other people’s opinions on them. Here’s one fragment.

When I’m trying to build a good predictive model for a dataset, I usually find it quite hard to do better than a fairly off-the-shelf gradient boosted decision tree machine. These work, in brief, as follows:

  1. Start by fitting a constant model to minimize some loss function (e.g. squared error for regression, log loss for classification).

  2. Fit a decision tree to the negative gradient of the loss function at each data point (optionally, on a random subset of the data). Use line search to find the optimal value to put in each leaf of the tree.

  3. Add that decision tree, times some shrinkage parameter known as the “learning rate,” to your estimator. Repeat steps 2 and 3 a number of times.
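
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions, not how any real library does it: it handles a single numeric feature, uses squared error as the loss, and restricts each tree to one split (a "stump"). With squared error, the negative gradient is simply the residual, and the line search for each leaf has a closed form (the mean residual in that leaf). The function names `fit_stump`, `fit_gbm`, and `predict_gbm` are made up for this example.

```python
# Toy gradient boosting for squared-error regression on one feature.
# Standard library only; every name here is illustrative, not a library API.


def fit_stump(xs, targets):
    """Fit a one-split tree; return (threshold, left_value, right_value)."""
    best = None
    order = sorted(set(xs))
    # Try a split halfway between each pair of adjacent distinct x values.
    for lo, hi in zip(order, order[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct x value: no split is possible, use a constant.
        mean = sum(targets) / len(targets)
        return (order[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbm(xs, ys, n_trees=100, learning_rate=0.1):
    # Step 1: the constant that minimizes squared error is the mean of y.
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Step 2: for the loss 0.5 * (y - f)^2, the negative gradient with
        # respect to the prediction f is the residual y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        # The leaf values stored by fit_stump are mean residuals, which is
        # the line-search optimum for squared error.
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Step 3: add the tree, scaled by the learning rate.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


if __name__ == "__main__":
    # A step function: 0 below x = 5, 10 from x = 5 upward.
    xs = [float(i) for i in range(10)]
    ys = [0.0 if x < 5 else 10.0 for x in xs]
    model = fit_gbm(xs, ys)
    print(predict_gbm(model, 2.0), predict_gbm(model, 7.0))
```

Real implementations differ in many ways: they grow deeper trees over many features, support other losses (where the leaf values need a genuine line search or a Newton step), and can fit each tree on a random subset of the rows.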

Gradient boosted trees have a number of great attributes.

Even in competitions, gradient boosting often does quite well (though it seems you still need more complex algorithms to get that extra 0.1% performance boost). For instance, a recent implementation called xgboost got very competitive results on the Higgs boson detection competition.

The fact that GBMs work so well seems fairly well-known to machine learning practitioners, but I haven’t found much discussion of it online. In fact, it seems to be generally hard to find discussion about the practicalities of machine learning; people seem much more interested in explaining how algorithms work, or giving tutorials on software libraries, than talking about the actual process they use when trying to answer data-related questions. I’d be interested to hear other folks’ perspective on their “workhorse” models and how they go about analyzing data.

Comments:


It’s similar for me: I try several models, but GBM and Cubist come out on top. Oddly, while they are mostly tied in CV, Cubist has so far always won out on the test set.

When I’m starting out, I always do some standard linear modelling; it is fast and allows me to get to know the data.

I also always include a neural network (a non-deep one), because people just expect to see a neural network…

xgboost is new to me; I will look into it. I’m also hoping to get some first experience with deep learning.