When Data Doesn't Play Fair

July 23, 2025 · 10 min · Pablo Olivares

Remember that cozy assumption from basic ML theory? That our training data $\mathcal{S} = \lbrace (x_1, y_1), \dots, (x_N, y_N) \rbrace$ is drawn independently and identically distributed (i.i.d.) from some underlying true distribution $\mathcal{D}$? And that minimizing the empirical risk $\mathcal{R}_{emp}(f)$ is a good proxy for minimizing the true risk $\mathcal{R}(f)$? It sounds so clean, so elegant. But what happens when $\mathcal{D}$ itself is… lopsided?

Welcome to the Imbalance Zone

Think about real-world problems:...
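For concreteness, here is how the two risks above are usually defined, writing $\ell$ for a generic pointwise loss (a symbol I'm introducing here, not part of the setup so far):

$$
\mathcal{R}(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(f(x), y)\big], \qquad \mathcal{R}_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), y_i).
$$

Under the i.i.d. assumption, $\mathcal{R}_{emp}(f)$ is an unbiased estimate of $\mathcal{R}(f)$. But when $\mathcal{D}$ is heavily skewed, that average is dominated by majority-class terms, so a minimizer of $\mathcal{R}_{emp}$ can look excellent while ignoring the rare class entirely.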
