In praise of gradient boosting

A friend of mine who’s taking an introductory machine learning course recently emailed me with a bunch of questions about my experience doing machine learning. I thought I’d post parts of my response here to get other people’s opinions on them. Here’s one fragment.

When I’m trying to build a good predictive model for a dataset, I usually find it quite hard to do better than a fairly off-the-shelf gradient boosted decision tree machine. These work, in brief, as follows:

Start by fitting a constant model to minimize some loss function (e.g. squared error for regression, log loss for classification).
Fit a decision tree to the negative gradient of the loss function at each data point (optionally, on a random subset of the data). Use line search to find the optimal value to put in each leaf of the tree.
Add that decision tree, times some shrinkage parameter known as the “learning rate,” to your estimator. Repeat a number of times.
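To make the three steps concrete, here is a minimal sketch in plain Python. It is not how any particular library implements boosting; it assumes squared-error loss, a single input feature, and depth-1 trees ("stumps"), and it skips the optional random subsampling. The names fit_stump, gradient_boost and mse are made up for this illustration. With squared error the negative gradient is just the residual, and the mean residual in each leaf is already the value a line search would pick, so that step is implicit here.

```python
import math


def fit_stump(xs, targets):
    """Least-squares regression stump (a depth-1 tree) on one feature."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skipping the minimum keeps both sides non-empty
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_value = sum(left) / len(left)     # leaf value = mean target in the leaf
        right_value = sum(right) / len(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x < threshold else right_value


def gradient_boost(xs, ys, n_trees=200, learning_rate=0.1):
    # Step 1: the constant that minimizes squared error is the mean of y.
    constant = sum(ys) / len(ys)
    preds = [constant] * len(ys)
    trees = []
    for _ in range(n_trees):
        # Step 2: for 0.5 * (y - f)^2 the negative gradient w.r.t. f is y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        # Step 3: add the tree, scaled by the learning rate, to the estimator.
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: constant + learning_rate * sum(tree(x) for tree in trees)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (an assumption for the demo): one noiseless period of a sine wave.
xs = [i / 50 for i in range(51)]
ys = [math.sin(2 * math.pi * x) for x in xs]

model = gradient_boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
train_mse = mse(ys, [model(x) for x in xs])
print(f"constant model MSE: {baseline_mse:.4f}, boosted MSE: {train_mse:.4f}")
```

These are training errors on noiseless data, so they only show that the loop is doing what the steps describe; they say nothing about generalization.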

Gradient boosted trees have a number of great attributes.

They train relatively quickly compared to most things with competitive performance.
Like many tree-based methods, they’re very robust to parameter settings.
It’s very easy to regularize by turning up the shrinkage (that is, lowering the learning rate), and to compensate for this by adding more trees to the model.
Even though the individual trees are piecewise-constant, the machine can pick up fairly complex response shapes because it adds together a bunch of different trees.
If you allow the trees to be 2-3 levels deep you can pick up on fairly complex interactions.
Although the algorithm that I just described doesn’t parallelize, there are distributed and parallelized variants that scale fairly well.
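The point about tree depth and interactions can be seen on a deliberately extreme toy case. The sketch below is an illustration under stated assumptions, not a benchmark: the target is XOR of two binary features (a pure interaction with no additive signal), the loss is squared error, and build_tree and boosted_training_mse are hypothetical helpers written for this example. The tree builder always splits while depth remains, even when the first split gives no gain, which is what lets the depth-2 trees find the interaction here.

```python
def build_tree(rows, targets, depth):
    """Greedy least-squares regression tree that splits while depth remains."""
    mean = sum(targets) / len(targets)
    best = None
    if depth > 0:
        for feature in range(len(rows[0])):
            # Skipping the smallest value keeps both sides of the split non-empty.
            for threshold in sorted({row[feature] for row in rows})[1:]:
                left = [i for i, row in enumerate(rows) if row[feature] < threshold]
                right = [i for i, row in enumerate(rows) if row[feature] >= threshold]
                sse = 0.0
                for side in (left, right):
                    side_mean = sum(targets[i] for i in side) / len(side)
                    sse += sum((targets[i] - side_mean) ** 2 for i in side)
                if best is None or sse < best[0]:
                    best = (sse, feature, threshold, left, right)
    if best is None:  # depth exhausted, or no feature left to split on
        return lambda row: mean
    _, feature, threshold, left, right = best
    left_tree = build_tree([rows[i] for i in left],
                           [targets[i] for i in left], depth - 1)
    right_tree = build_tree([rows[i] for i in right],
                            [targets[i] for i in right], depth - 1)
    return lambda row: left_tree(row) if row[feature] < threshold else right_tree(row)


def boosted_training_mse(rows, ys, depth, n_trees=100, learning_rate=0.1):
    """Boost trees of the given depth on squared error; return training MSE."""
    preds = [sum(ys) / len(ys)] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = build_tree(rows, residuals, depth)
        preds = [p + learning_rate * tree(row) for p, row in zip(preds, rows)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# XOR: the response depends only on the interaction between the two features.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [0.0, 1.0, 1.0, 0.0]

mse_depth1 = boosted_training_mse(rows, ys, depth=1)
mse_depth2 = boosted_training_mse(rows, ys, depth=2)
print(f"depth 1: {mse_depth1:.6f}, depth 2: {mse_depth2:.6f}")
```

A sum of stumps is an additive model, so on this data the depth-1 run never moves off the constant fit, while the depth-2 run drives the training error toward zero. Real data is rarely this clean, but the same mechanism is why a couple of levels of depth help.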

Even in competitions, gradient boosting often does quite well (though it seems you still need more complex algorithms to get that extra 0.1% performance boost). For instance, a recent implementation called xgboost got very competitive results on the Higgs boson detection competition.

The fact that GBMs work so well seems fairly well-known to machine learning practitioners, but I haven’t found much discussion of it online. In fact, it seems to be generally hard to find discussion about the practicalities of machine learning; people seem much more interested in explaining how algorithms work, or giving tutorials on software libraries, than talking about the actual process they use when trying to answer data-related questions. I’d be interested to hear other folks’ perspective on their “workhorse” models and how they go about analyzing data.

Comments

Robert

April 2015

It’s similar for me: I try several models, but GBM and Cubist come out on top. Oddly, while they are mostly tied in CV, Cubist has so far always won out on the test set.

When I’m starting out, I always do some standard linear modelling; it is fast and allows me to get to know the data.

I also always include a neural network (a non-deep one), because people just expect to see a neural network…

xgboost is new for me; I will look into it. I’m also hoping to get some first experience in deep learning.
