Avoiding bugs in machine learning code

June 2014

At my work I’ve been writing a lot of machine learning code. Some of it is responsible for moving around a whole lot of money, so it behooves us to be really careful when writing and testing it to make sure no bugs make it into our production systems. Unfortunately, machine learning bugs are often quite hard to catch, for a couple of reasons.

Language design

Machine-learning code often depends on complex data-structure manipulation and linear algebra. That means that doing ML in a less-expressive programming language is a huge drag–you end up writing a lot of boilerplate: for-loops, null-value checks, etc. On the other hand, more-expressive languages tend to be less safe to work with, because it’s harder to shoehorn good expressivity into a robust type system.

First, it’s very hard to use statically typed languages for machine learning, because your algorithms need to be generic across many different data types, which you might not know until runtime. Try imagining a type-safe version of R’s data.frame (or its pandas equivalent, the DataFrame) to see what I mean.

Relatedly, machine learning involves converting so frequently between different data types (probabilities to generic floating-point numbers, enums to integers, booleans to probabilities, etc.) that one must sacrifice strong typing even if it is dynamic. For instance, although Python is usually quite strongly typed, all of the data science libraries I’ve used (numpy, scikit-learn, and pandas) do an unsettling amount of type coercion.
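To see how deep the coercion goes, even plain Python blurs booleans into numbers, no libraries required:

```python
# Python's bool is a subclass of int, so booleans silently act as numbers.
flags = [True, False, True, True]

count = sum(flags)           # bools coerced to 0/1
rate = count / len(flags)    # ...and then to a floating-point "probability"

print(count, rate)  # 3 0.75
```

Handy for counting, but it means a stray boolean can flow through an entire numeric pipeline without complaint.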

Finally, even if we could use strong or static typing to catch bugs, machine learning algorithms often involve many objects of the same data type (e.g. floating-point numbers), so large classes of errors can get past any sort of type-checking.
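As a minimal sketch of the problem (gaussian_pdf is an invented helper, not from any real library): both parameters below are plain floats, so swapping them at the call site type-checks perfectly and is still a bug.

```python
import math

def gaussian_pdf(x: float, mean: float, std: float) -> float:
    # Density of a Normal(mean, std) distribution at x.
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

right = gaussian_pdf(0.0, 1.0, 2.0)  # mean=1, std=2
wrong = gaussian_pdf(0.0, 2.0, 1.0)  # arguments swapped: well-typed, wrong answer
print(right, wrong)
```

No type checker that sees only float → float → float → float can tell these two calls apart.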


Testing

The typical answer to such errors is to catch them with unit testing, but this too is unusually difficult in machine learning. For one thing, it’s difficult to concoct test data to run your code on. For instance, if I were writing medical imaging diagnostics, I might want to test that my software classified certain “obvious” tumors as cancerous, but I’d have to construct such images first and assure myself that they were really so obvious that no possible algorithm should misclassify them.

Even once you have this test data, it’s sometimes difficult even to determine what counts as a “bug”. If my support vector machine has an area under the receiver operating characteristic curve (AUC) of 0.58 instead of the 0.63 I expected, maybe it’s just because SVMs don’t work very well on my dataset–but maybe it’s because I messed up my data pipeline somehow and the SVM is getting the wrong features.
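One partial remedy is to sanity-check your metric code itself on inputs whose answers are known. AUC, for instance, is just the probability that a random positive outscores a random negative (the Mann–Whitney statistic), which is easy to implement independently and test:

```python
def auc(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs where the positive scores
    # higher, counting ties as half; this equals area under the ROC curve.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8], [0.1, 0.2]))  # perfect separation: 1.0
print(auc([0.5, 0.5], [0.5, 0.5]))  # all ties: 0.5
```

This doesn’t tell you whether 0.58 is a bug or just a hard dataset, but it does rule out the metric computation itself.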


Debugging

Finally, even once you’ve identified a bug in your code, it can be hard to figure out exactly what’s going on. Typically programmers find thorny bugs by pausing the code while it’s running and inspecting the values of various variables to see if they make sense. But in an ML algorithm it’s quite hard to inspect values like this, both because there are too many of them to just read through and because it’s hard to tell what “makes sense” or not to the naked eye. It essentially becomes a data science/engineering problem just to understand what’s going on inside your algorithm well enough to debug it.
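What helps here is dumping cheap summary statistics of intermediate values instead of eyeballing them one at a time; a sketch (pure Python, the names are my own):

```python
import math

def summarize(name, values):
    # Compact health check for a big batch of intermediate values:
    # NaN counts and ranges are often enough to spot a broken pipeline stage.
    finite = [v for v in values if not math.isnan(v)]
    return {
        "name": name,
        "n": len(values),
        "nan": len(values) - len(finite),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
        "mean": sum(finite) / len(finite) if finite else None,
    }

weights = [0.1, 0.5, float("nan"), 0.4]
print(summarize("layer1_weights", weights))
```

A single NaN or an absurdly large weight in a one-line summary is much easier to notice than the same value buried in a million-element array.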


Because of these problems, it’s very hard to get the defect rate in machine learning code satisfactorily low with the typical tools for writing high-assurance software. So I’m trying to think of other, less conventional techniques and processes to catch bugs before they make it to production. In a future post, I’ll talk about some of the many exciting types of bugs I’ve experienced so far, and speculate about what can be done to mitigate (some of) them.



Can you try harder with types?

I don’t quite understand DataFrame from giving its docs a once-over, but it doesn’t sound crazy to me that the type of DataFrame could essentially be ‘tuple of arrays’ where you specify the types of elements, and the ranges, of each sub-array.

Obviously this would be work – potentially a lot of work – but it sounds like it might pay dividends, for the reasons that you present in the rest of the post (harder to unit test and harder to inspect).



This is a topic close to my heart!

Some suggestions:

–The highest-ROI change is to do as much of your data manipulation as possible in a database such as PostgreSQL. Use something like pyodbc to query tables from your database, rather than doing data manipulation in Python. This lets you keep the advantages of types (and the many other advantages of a relational database) for as long as possible. It’s obviously a major investment, but one that pays very high dividends.

–Concretely, while my job title is “Data Scientist”, the vast majority of my work is in SQL. I’d estimate I spend at least 10 hours in SQL for every hour in Python.

–Add a lot of statistical tests to your code, and use them as unit tests. For example, any time you run an SVM, also run a linear or logistic regression (depending on the problem) and make sure the SVM outperforms the simpler model. Add tests to check for outliers in all variables (or to remove them). If you suspect that a lot of variables in your data should be highly correlated, add tests there. Etc.

–Similarly, run tests on the output of your models. I find it helpful to create several different summaries of my models, e.g. for feature coefficients and importance.
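The “beat the baseline” tests above can be expressed as ordinary assertions. A self-contained sketch (the models are stubbed out with toy predictors, since the real ones depend on your stack):

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy data: label is 1 iff x > 0.  Stand-ins for a real train/test split.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

baseline_preds = [0] * len(xs)                  # constant baseline
model_preds = [1 if x > 0 else 0 for x in xs]   # the "fancy" model

# The statistical unit test: if the fancy model can't beat a constant
# predictor, something upstream is probably broken.
assert accuracy(model_preds, ys) > accuracy(baseline_preds, ys)
print(accuracy(baseline_preds, ys), accuracy(model_preds, ys))
```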



Lincoln, the problem with DataFrames is that the columns are heterogeneous (some might be boolean, some enum, some integer, some float…) and need to be accessed by name. I don’t know of any practical language (except possibly Haskell, depending on what you mean by “practical”) whose type system is rich enough to capture that, for instance, the following code is valid:

df = DataFrame({'foo': [1.2, 3.4], 'bar': [5, 6]})
df['bar'][0] = 7

but it wouldn’t be valid if you replaced the second 'bar' with 'foo', or the 7 with 7.8.

Satvik, those are all great points! I didn’t realize putting stuff in SQL would be such a win, so that’s good to know. The statistical tests too–more diagnostics would definitely be a good investment.

What I’m really worried about is the subtle bugs like writing a formula that’s approximately correct but not exactly, so the SVM or whatever is still better than logistic regression, just not as good as it could be… I’ve been doing some stuff with boosting algorithms, where this is especially problematic, because such a bug just means that your “weak learners” are slightly weaker than they would otherwise be!



I think Haskell would be powerful enough to implement your example. I’m definitely not a Haskell expert but it might be worth talking to one. I can probably find a Haskell person for you to talk to, if you want.

I do like the SQL idea. My main problem is maintaining the queries. Satvik, given that you’ve done this a bunch, would you write your queries in SQL directly? I wrote an entire machine learning system in SQL once and it was hellish to maintain, and some of the queries were 60-line spaghetti SQL. I could imagine it being slightly easier using a DSL like SQLAlchemy where you can factor bits of your query out into Python functions.
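Even without SQLAlchemy, you can get some of that factoring by building queries from plain Python functions (a rough sketch; the tables and columns are invented):

```python
def recent_users(days: int) -> str:
    # One reusable fragment per logical step, composed as a CTE.
    return f"""recent_users AS (
        SELECT user_id FROM logins
        WHERE login_date > CURRENT_DATE - {days}
    )"""

def purchases_per_user(user_cte: str) -> str:
    return f"""WITH {user_cte}
    SELECT u.user_id, COUNT(*) AS n_purchases
    FROM recent_users u
    JOIN purchases p ON p.user_id = u.user_id
    GROUP BY u.user_id"""

query = purchases_per_user(recent_users(days=30))
print(query)
```

(Real code should bind parameters rather than interpolate them; this only illustrates breaking a 60-line query into named pieces.)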



I maintain the raw data in SQL and the transformations to turn those into useful tables, but not the actual machine learning algorithms. So e.g., I might get raw data in the form of CSVs or API calls from several different sources, have stored procedures that transform said raw data into usable tables, and then query the usable tables from Python using pyodbc. Then I would use Python to write the actual machine learning algorithms. You can theoretically run machine learning algorithms in SQL Server using Microsoft Analysis Services, but I’ve tried this and it was truly an awful experience.
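The database-first part of this can be sketched with Python’s built-in sqlite3 (the setup described above uses pyodbc against a real server, but the idea is the same): declare constraints on your feature tables so malformed rows are rejected at insert time, before any model sees them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE features (
        user_id INTEGER NOT NULL,
        churn_prob REAL NOT NULL CHECK (churn_prob BETWEEN 0 AND 1)
    )
""")

conn.execute("INSERT INTO features VALUES (1, 0.37)")  # fine

try:
    conn.execute("INSERT INTO features VALUES (2, 1.7)")  # not a probability
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

On a real server you also get enforced column types; either way the bad row never reaches the Python side.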

Agree that Haskell is worth looking into, and there are definitely Haskell lovers amongst CFAR alumni (e.g. Nathan Bouscal).



Lincoln, a good trick is to use CREATE TEMPORARY VIEW or Postgres’ WITH queries to split up large queries. Unfortunately you occasionally need to undo this for Postgres’ optimiser to understand the whole query, but it usually makes queries clearer. Other database engines have similar features.
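For instance (sqlite3 here so the snippet runs standalone; the same WITH syntax works in Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 10), ('north', 20), ('south', 5);
""")

# WITH names each intermediate step instead of nesting subqueries three deep.
rows = conn.execute("""
    WITH totals AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    )
    SELECT region FROM totals WHERE total > 15
""").fetchall()

print(rows)  # [('north',)]
```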



I get a lot of mileage out of implementing algorithms such that they can be easily tested on small standalone examples, then coming up with small intuitive tests where I can pretty easily see whether the result is sensible or not. For example, run an SVM in 2D where you can explicitly visualize the decision boundary. Although it’s hard to write down strict unit tests, it’s easy to throw some data points onto a plane and check that the decision boundaries look reasonable.
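That style of test can even be automated. A sketch, with a nearest-centroid classifier standing in for the SVM so the snippet is self-contained:

```python
def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def classify(p, c_a, c_b):
    # Nearest-centroid rule: a linear decision boundary, like a linear SVM.
    da = (p[0] - c_a[0]) ** 2 + (p[1] - c_a[1]) ** 2
    db = (p[0] - c_b[0]) ** 2 + (p[1] - c_b[1]) ** 2
    return "a" if da < db else "b"

# Two blatantly separated 2D clusters: any sane classifier must split them.
cluster_a = [(0.0, 0.0), (0.1, 0.2), (-0.1, 0.1)]
cluster_b = [(5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
ca, cb = centroid(cluster_a), centroid(cluster_b)

assert all(classify(p, ca, cb) == "a" for p in cluster_a)
assert all(classify(p, ca, cb) == "b" for p in cluster_b)
```

The plot version of the same check (scatter the points, draw the boundary) catches a different class of bug: results that pass coarse assertions but look obviously wrong.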



WRT Haskell, I am reasonably experienced with the language (understand most of the abstractions, have implemented a couple small projects) and don’t see a way to get the types to check out appropriately. There may be some crazy GHC extension that does what I want, but at that point I think we’re definitely beyond the boundaries of what I could convince my bosses to use :P

My impression about the typing seems to be backed up by this discussion:

> However, as with a good data-frame-like, certain obstacles come up, partly because if we insist on a type-safe way to do things while being at least as high-level as R or Python, the absence of row types for frame column names makes specifying linear models that are statically well-formed (as in, only referencing column names that are actually in the underlying data frame) a bit tricky, and while there are approaches that do work some of the time, there’s not really a good general-purpose way (as far as I can tell) for that small problem of trying to resolve names as early as possible. Or at the very least I don’t see a simple approach that I’m happy with.

@Alex: we do do a bit of that (and we should do more; more graphical diagnostics are on our admittedly long to-do list). I guess I was hoping for something a bit more automated than that, but I’m not sure there’s a very good way. (If we were going to be really ridiculous, we could test that PAC-learning bounds held or something…)


Robbie Eginton

I’m probably gravely misunderstanding something here, but what would be the problem with using Haskell records or equivalent? It seems like you know the fields at compile/writing time.

If the issue is that then you lose the ability to iterate over the fields / otherwise refer to them not by name, i.e. by position… this is the lisper in me talking, but my instinct is: write some (very careful!) macro code so that when you’re defining your record type, you’re also defining a (not-sub)type polymorphic implementation of a generic accessor function to dress the whole shebang up as a tuple, so that you get bounds checking and types for free when you access by slice? I’m pretty sure someone has written macros for GHC. It’s a bit of a hack; you would probably need macros for this, although there might be some kind of function that takes types and returns types that could make writing this less repetitive. I’m insufficiently familiar with Haskell’s type system to say for certain.

Anyway, my instinct about the typing is that it should be solvable if you just write your code using more encapsulation around your accesses to the data, since what you’re really asking for is a name-accessed structure that sometimes pretends to be a positionally-accessed structure (which, as is pointed out in that discussion, is basically what a SQL table is). Strongly typed languages have long had records (structs). The problem is making it convenient to construct type-safe positional accessors.



Robbie, if you’re doing interactive data munging/feature engineering you actually don’t know the fields at compile time, or at least, expressing them at compile time slows down the interactive feedback loop by a lot and is a huge pain. There might be some way to get around this using sufficiently crazy macros, but it just seems painful to me.