Optimization Theory for Machine Learning

Why training works, when it fails, and what gradients are really doing.

Optimization in one sentence

Given a loss function \(L(\theta)\), optimization seeks parameters \(\theta^*\) that minimize it. The quantity we ultimately care about is expected error on unseen data, but in practice we minimize empirical risk on sampled mini-batches, so each gradient is a noisy estimate of the full-data gradient.
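
To make the source of stochasticity concrete, here is a minimal sketch in plain Python. It assumes a toy one-parameter squared-error loss and synthetic Gaussian data; the function names, batch size, and seed are illustrative choices, not part of any particular library. It compares the full-dataset gradient with gradients estimated from random mini-batches.

```python
import random

def full_gradient(theta, data):
    """Gradient of L(theta) = mean((theta - x)^2) over the whole dataset."""
    return sum(2.0 * (theta - x) for x in data) / len(data)

def minibatch_gradient(theta, data, batch_size, rng):
    """The same gradient, estimated from a random mini-batch."""
    batch = rng.sample(data, batch_size)
    return sum(2.0 * (theta - x) for x in batch) / batch_size

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(1.0, 2.0) for _ in range(1000)]  # synthetic toy data (assumption)
theta = 0.0

exact = full_gradient(theta, data)
estimates = [minibatch_gradient(theta, data, 32, rng) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
spread = max(estimates) - min(estimates)

print(f"full-batch gradient:      {exact:.3f}")
print(f"mean mini-batch estimate: {mean_estimate:.3f}")
print(f"range of estimates:       {spread:.3f}")
```

Individual mini-batch gradients scatter around the full-batch value, while their average stays close to it. That scatter is the noise every SGD step has to work with.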

Convex vs non-convex landscapes

Convex objectives have a single-basin geometry: any local minimum is also a global minimum. The loss surfaces of deep networks are non-convex, with many saddle points and flat valleys, yet first-order methods still perform surprisingly well on them.
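
The difference can be seen in one dimension. The sketch below runs plain gradient descent from two starting points on a convex function and on a non-convex one. Both functions, the step size, and the starting points are arbitrary choices made for illustration.

```python
def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Plain gradient descent on a 1-D function, given its derivative."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def convex_grad(x):
    """Derivative of the convex function f(x) = (x - 2)^2."""
    return 2.0 * (x - 2.0)

def nonconvex_grad(x):
    """Derivative of the non-convex function f(x) = x^4 - 3x^2 + x."""
    return 4.0 * x**3 - 6.0 * x + 1.0

starts = [-2.0, 2.0]
convex_ends = [gradient_descent(convex_grad, s) for s in starts]
nonconvex_ends = [gradient_descent(nonconvex_grad, s) for s in starts]

print("convex:    ", [round(x, 4) for x in convex_ends])
print("non-convex:", [round(x, 4) for x in nonconvex_ends])
```

On the convex function both runs end at the same point. On the non-convex one, each run settles into the basin nearest its starting point, so initialization decides which minimum is found.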

Gradient descent family

Batch GD: full dataset per update; stable but costly.
SGD: noisy updates from one example or mini-batch.
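Momentum/Nesterov: accumulate a velocity from past gradients, smoothing noise and speeding progress along consistent directions.
Adam/AdamW: per-parameter adaptive steps with momentum; AdamW decouples weight decay from the gradient update.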
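
The update rules in this family can be sketched side by side. The code below is an illustrative pure-Python version, not any library's implementation. It assumes a toy ill-conditioned quadratic loss and exact (noise-free) gradients, so it shows the update rules themselves rather than their behavior under mini-batch noise. The hyperparameters are arbitrary, and the momentum form used (velocity = beta * velocity + gradient) is one common convention among several.

```python
import math

def loss(w):
    """Toy ill-conditioned quadratic: f(w) = 0.5 * (w0^2 + 25 * w1^2)."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    """Exact gradient of the toy loss."""
    return [w[0], 25.0 * w[1]]

def run_gd(w, lr=0.03, steps=200):
    w = list(w)
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def run_momentum(w, lr=0.03, beta=0.9, steps=200):
    w = list(w)
    v = [0.0] * len(w)  # velocity: decaying sum of past gradients
    for _ in range(steps):
        g = grad(w)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w

def run_adam(w, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    w = list(w)
    m = [0.0] * len(w)  # first-moment (mean) estimate
    v = [0.0] * len(w)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(w)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]  # bias correction
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        w = [wi - lr * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w

start = [5.0, 1.0]
results = {
    "gd": loss(run_gd(start)),
    "momentum": loss(run_momentum(start)),
    "adam": loss(run_adam(start)),
}
for name, value in results.items():
    print(f"{name:>8}: final loss {value:.3e}")
```

On this problem, momentum makes faster progress than plain gradient descent along the shallow direction at the same learning rate. The comparison with Adam depends heavily on the chosen step sizes, so the numbers should not be read as a ranking of the methods.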

Learning rates and schedules

The learning rate determines step size in parameter space. Too small: slow convergence. Too large: divergence or chaotic oscillation. Schedules adjust it across training phases: warmup ramps the rate up over the first steps to avoid early instability, while cosine decay and step schedules lower it later so training can settle.
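
Both points can be illustrated with a small sketch. The first function runs gradient descent on f(x) = x^2 with different learning rates. The second is one possible linear-warmup plus cosine-decay schedule; its name, signature, and exact shape are assumptions made for illustration, and real libraries differ in the details.

```python
import math

def final_distance(lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = x^2 and return |x| after `steps` updates."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x  # the gradient of x^2 is 2x
    return abs(x)

def warmup_cosine(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for lr in (0.001, 0.4, 1.1):
    print(f"lr={lr}: |x| after 50 steps = {final_distance(lr):.3e}")

schedule = [warmup_cosine(s, total_steps=100, warmup_steps=10, peak_lr=0.1)
            for s in range(101)]
print("lr at steps 0, 9, 55, 100:",
      [round(schedule[s], 4) for s in (0, 9, 55, 100)])
```

On this function each update multiplies x by (1 - 2 * lr). A tiny rate barely moves, a moderate rate converges quickly, and any rate above 1 makes |x| grow with every step.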

Regularization as optimization geometry

Weight decay, early stopping, dropout, and data augmentation change the effective objective or the path the optimizer takes through it. They often bias solutions toward flatter minima, which are associated with stronger out-of-sample performance.
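
Weight decay gives the simplest illustration of a changed effective objective. The sketch below assumes a one-parameter loss 0.5 * (w - 3)^2 and adds the decay term directly to the gradient (L2-style, as in plain SGD). The function name and constants are illustrative. It shows only the shift of the minimizer, not the flat-minima effect.

```python
def minimize(weight_decay, lr=0.1, steps=500, target=3.0):
    """Gradient descent on 0.5 * (w - target)^2 + 0.5 * weight_decay * w^2."""
    w = 0.0
    for _ in range(steps):
        grad = (w - target) + weight_decay * w  # data term plus decay term
        w -= lr * grad
    return w

for wd in (0.0, 0.5, 2.0):
    # The closed-form minimizer of the penalized loss is target / (1 + weight_decay).
    print(f"weight_decay={wd}: w = {minimize(wd):.4f}, "
          f"closed form = {3.0 / (1.0 + wd):.4f}")
```

With no decay the optimizer reaches the unpenalized minimum at 3. With decay it converges to a point pulled toward zero, because it is now minimizing a different function.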

Takeaway: Practical ML optimization is less about solving for exact minima and more about following noisy gradients toward solutions that generalize.