The Weirdest Bugs I've Encountered in ML

December 12, 2025

Debugging machine learning code is different from debugging regular code. In regular code, bugs are usually obvious: wrong output, crashes, infinite loops. In ML, bugs are subtle. Your code runs fine, produces numbers, but those numbers are wrong in ways that take hours to notice.

Here are some of the weirdest bugs I've encountered, and what I learned from them.

The NaN that wasn't a NaN

I was training a neural network and getting NaN losses. Classic problem, right? Usually means a division by zero or exploding gradients. I added gradient clipping, checked for zeros, added epsilon values everywhere. Still NaN.

Turns out, one of my data files had the string "NaN" in it. Not an actual NaN value—the literal text "NaN". Pandas was reading it as a string, not converting it to a float NaN. So when I tried to do math on it, everything broke in weird ways.

The fix was simple: df.replace('NaN', np.nan). But finding it took me three days. The lesson: always check your raw data, not just the processed version.
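Here's a minimal sketch of the failure mode, building the bad column directly instead of reading it from a file (the column name is made up for the example):

```python
import numpy as np
import pandas as pd

# A literal "NaN" string survives as text, not as a float NaN.
df = pd.DataFrame({"value": ["1.5", "NaN", "2.0"]})
print(df["value"].dtype)           # object -- the whole column is strings

# The one-line fix from above, plus a cast back to float.
df["value"] = df["value"].replace("NaN", np.nan).astype(float)
print(df["value"].isna().sum())    # 1 -- now it's a real NaN
```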

The model that only worked on Tuesdays

I had a model that performed perfectly in development but failed in production. Not just worse—completely broken. After weeks of debugging, I realized: I was only testing on weekdays. The production system ran 24/7, including weekends.

The issue? One of my features was "day of week", encoded with Python's datetime.weekday(), which returns 0 for Monday through 6 for Sunday. My training data only contained weekdays (0-4), so the model never learned what to do with Saturday (5) or Sunday (6).

The model wasn't wrong—it was doing exactly what I trained it to do. I just didn't train it on the right data. Always test with production-like data distributions.
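A sketch of the check I wish I'd had, with tiny made-up DataFrames standing in for my real training and production data:

```python
from datetime import date
import pandas as pd

# datetime.weekday(): Monday=0 ... Sunday=6
print(date(2024, 1, 1).weekday())   # 0 -- Jan 1, 2024 was a Monday

# Cheap guard: does incoming data contain feature values training never saw?
train_df = pd.DataFrame({"day_of_week": [0, 1, 2, 3, 4]})   # weekdays only
live_df = pd.DataFrame({"day_of_week": [0, 3, 5, 6]})       # includes a weekend

train_days = set(train_df["day_of_week"].unique())
live_days = set(live_df["day_of_week"].unique())

# This fires here, because 5 and 6 never appeared in training.
assert live_days <= train_days, f"unseen day-of-week values: {live_days - train_days}"
```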

The floating point precision trap

I was comparing model predictions to ground truth. Simple equality check, right? if pred == target. Except it never matched, even when the numbers looked identical.

Floating point precision strikes again. Two numbers that should be equal often aren't: 0.1 + 0.2 evaluates to 0.30000000000000004, not 0.3, so the comparison fails even though both print identically at low precision. This is especially fun when you're doing operations that should be reversible but aren't due to precision loss.

The fix: use np.isclose() or check if the absolute difference is below a threshold. Never use == for floats. Never.
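A quick illustration with plain Python and NumPy:

```python
import numpy as np

pred = 0.1 + 0.2      # evaluates to 0.30000000000000004
target = 0.3

print(pred == target)               # False
print(np.isclose(pred, target))     # True
print(abs(pred - target) < 1e-9)    # True -- explicit tolerance check

# For whole arrays of predictions at once:
print(np.allclose(np.array([pred, 0.1 * 3]), np.array([0.3, 0.3])))   # True
```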

A universal truth

If your ML code works sometimes but not others, and you can't figure out why, it's probably a data issue. Or floating point precision. Or both.

The batch size bug

I was using batch normalization in a model. Training worked fine, but inference was broken. Predictions were way off. After digging into it, I realized: I was using different batch sizes for training and inference.

Batch normalization computes statistics (mean and variance) over the batch. During training it uses the current batch's statistics; during inference it should switch to the running statistics accumulated during training. My model was still in training mode at inference time, so with a batch size of 1 (common for single predictions) it was normalizing each prediction against itself instead of against the statistics it had learned from batches of 32.

The model was technically working correctly—it was just using different normalization than during training. The fix: make sure your model is in eval mode during inference, and use consistent batch sizes when possible.
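For concreteness, here's what the fix looks like in PyTorch (assuming PyTorch; this is a toy model, not the real one):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.ReLU())

x = torch.randn(1, 8)   # a single-example "production" batch

# In model.train() mode, BatchNorm would try to compute statistics from this
# one sample (BatchNorm1d actually refuses a batch of one in training mode).

model.eval()            # use the running statistics accumulated in training
with torch.no_grad():
    y = model(x)        # a batch of 1 is now normalized consistently
```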

The random seed that wasn't random

I set a random seed for reproducibility. Made sense. But then my model's performance was suspiciously consistent across different hyperparameters. Too consistent.

I had set the seed once at the top of my script, but I was also shuffling data and initializing weights in multiple places. Each operation consumed random numbers, so by the time I got to training, the "random" initialization was deterministic based on how many random numbers I'd already used.

The lesson: if you're using random seeds, set them right before each operation that needs randomness. Or better yet, use separate random number generators for different purposes.
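With NumPy, separate generators look like this (seeds and sizes are arbitrary):

```python
import numpy as np

# One generator per purpose, so the streams can't interfere with each other.
shuffle_rng = np.random.default_rng(seed=0)
init_rng = np.random.default_rng(seed=1)

indices = shuffle_rng.permutation(10_000)     # data shuffling
weights = init_rng.normal(size=(256, 128))    # weight initialization

# Changing how many numbers the shuffle consumes no longer shifts the
# weight initialization, because the two generators are independent.
```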

The feature that disappeared

I had a model with 50 features. Added one more. Model performance dropped. Removed it. Performance came back. Added it again. Dropped again. What was going on?

The new feature had a different scale than the others. I was standardizing everything (mean 0, std 1), and I thought I was fitting the scaler on the training set and reusing it everywhere. In fact, a step in my pipeline was quietly refitting it, so the new feature, whose distribution differed between training and test, ended up standardized with different statistics in the two splits.

The fix: always fit your scalers on training data only, then transform both training and test. And make sure you're not accidentally refitting on the full dataset.
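With scikit-learn (an assumption; the post never names a library), the correct and buggy versions side by side, with small random arrays standing in for the real splits:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny stand-ins for the real splits; the test distribution is shifted on purpose.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(100, 3))
X_test = rng.normal(loc=0.5, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit AND transform on train
X_test_scaled = scaler.transform(X_test)         # transform only -- no refit

# The bug, for contrast: each split gets its own mean and std,
# so the same raw value maps to different standardized values.
# X_test_bad = StandardScaler().fit_transform(X_test)
```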

Debugging tips that actually help (from someone who's been there)

After encountering all these bugs, here's what I've learned:

  • Check your data first: Print actual values, not just shapes. Look for weird strings, unexpected types, or distribution shifts.
  • Use assertions: Add checks like assert not np.isnan(x).any() everywhere. Fail fast (see the sketch after this list).
  • Test edge cases: What happens with empty batches? Single examples? All zeros? Models break in weird ways on edge cases.
  • Reproduce the bug: If you can't reproduce it consistently, it's probably a data or randomness issue.
  • Simplify: Remove features, reduce model complexity, use smaller data. If the bug disappears, you know where to look.
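As an example of the assertion habit, here's the kind of throwaway helper I mean (the function and its names are made up for illustration):

```python
import numpy as np

def check_batch(x, name="batch"):
    """Cheap sanity checks to sprinkle through preprocessing code."""
    x = np.asarray(x)
    assert x.size > 0, f"{name} is empty"
    assert not np.isnan(x).any(), f"{name} contains NaN"
    assert not np.isinf(x).any(), f"{name} contains inf"
    return x

# Passes silently on clean data, fails loudly and early on bad data.
features = check_batch(np.random.randn(32, 50), name="features")
```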

ML bugs are sneaky. They hide in data preprocessing, random seeds, floating point math, and assumptions you didn't know you were making. The best defense is skepticism: if something seems too good (or too bad) to be true, it probably is.