The Crisis of Confidence in Statistical Purity

Alright, we’ve journeyed through the surprising generalization of huge neural nets (first part) and navigated the murky waters of imbalanced data, using techniques like SMOTE that felt necessary but theoretically a bit… dodgy (second part). We tweaked our data $\mathcal{S}'$ or our loss function $L'$ to get better performance on metrics like F1 score, especially for those pesky rare classes. Great! Mission accomplished? Well, hold on. When our fraud detection model, trained on SMOTE’d data, now confidently outputs $P(\text{fraud} | x) = 0....

November 19, 2025 · 11 min · Pablo Olivares

When Data Doesn't Play Fair

Remember that cozy assumption from basic ML theory? That our training data $\mathcal{S} = \{(x_1, y_1), \dots, (x_N, y_N)\}$ is drawn independently and identically distributed (i.i.d.) from some underlying true distribution $\mathcal{D}$? And that minimizing the empirical risk $\mathcal{R}_{emp}(f)$ is a good proxy for minimizing the true risk $\mathcal{R}(f)$? It sounds so clean, so elegant. But what happens when $\mathcal{D}$ itself is… lopsided? Welcome to the Imbalance Zone. Think about real-world problems:...

July 23, 2025 · 10 min · Pablo Olivares

Is Statistical Learning 'All We Need'?

Hey everyone! Let’s talk about machine learning. It’s everywhere now, right? From your phone unlocking with your face to recommending weirdly specific t-shirts online. But what is it, fundamentally? At its core, it’s about teaching computers to do stuff by showing them examples, rather than programming explicit rules for every conceivable situation. Think about teaching a kid what a dog is. You don’t list out “has four legs, barks, wags tail, etc....

April 12, 2025 · 12 min · Pablo Olivares