When Data Doesn't Play Fair

Remember that cozy assumption from basic ML theory? That our training data $\mathcal{S} = \{(x_1, y_1), \dots, (x_N, y_N)\}$ is drawn independently and identically distributed (i.i.d.) from some underlying true distribution $\mathcal{D}$? And that minimizing the empirical risk $\mathcal{R}_{emp}(f)$ is a good proxy for minimizing the true risk $\mathcal{R}(f)$? It sounds so clean, so elegant. But what happens when $\mathcal{D}$ itself is… lopsided? Welcome to the Imbalance Zone. Think about real-world problems:...
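To see why a low empirical risk can be misleading when $\mathcal{D}$ is lopsided, here is a minimal sketch. The 1% positive rate and the always-predict-majority classifier are illustrative assumptions, not taken from the post:

```python
# Sketch: low empirical risk on imbalanced data can hide a useless model.
import random

random.seed(0)
N = 10_000
# Heavily imbalanced labels: roughly 1% positives (assumed rate)
y = [1 if random.random() < 0.01 else 0 for _ in range(N)]

# A trivial classifier that always predicts the majority class (0)
predictions = [0] * N

# Empirical risk under 0-1 loss: fraction of mistakes on the sample
emp_risk = sum(p != t for p, t in zip(predictions, y)) / N
print(f"Empirical risk: {emp_risk:.3f}")  # close to 0.01 -- looks great

# Yet it never identifies a single positive example
true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, y))
print(f"True positives found: {true_positives}")  # 0
```

The empirical risk sits near 1%, even though the model detects none of the rare class, which is exactly the failure mode the excerpt alludes to.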

July 23, 2025 · 10 min · Pablo Olivares

Is Statistical Learning 'All We Need'?

Hey everyone! Let’s talk about machine learning. It’s everywhere now, right? From your phone unlocking with your face to recommending weirdly specific t-shirts online. But what is it, fundamentally? At its core, it’s about teaching computers to do stuff by showing them examples, rather than programming explicit rules for every conceivable situation. Think about teaching a kid what a dog is. You don’t list out “has four legs, barks, wags tail, etc....

April 12, 2025 · 12 min · Pablo Olivares