Gradient descent, the smallest possible mental model

Almost everything intimidating about training neural networks is bookkeeping — shapes, batches, autodiff plumbing. The idea underneath is embarrassingly simple, and you can hold all of it in your head at once.

The whole idea

You have a function that measures how wrong you are: the loss. It depends on some parameters. You want it small. So you ask, locally, “which way is down?” The gradient answers that: it points uphill, so you take a small step in the opposite direction. Repeat.
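Here is that loop as a minimal sketch. The loss is a made-up one, (w - 3)^2, chosen only because its gradient is easy to write by hand; the names `loss`, `gradient`, and `learning_rate` are illustrative, not taken from any library.

```python
def loss(w):
    # Hypothetical loss: squared distance of the parameter w from 3.
    return (w - 3.0) ** 2


def gradient(w):
    # Derivative of the loss above with respect to w. It points uphill.
    return 2.0 * (w - 3.0)


w = 0.0              # starting guess (arbitrary)
learning_rate = 0.1  # step size (arbitrary, small enough to be stable here)

for step in range(100):
    # Step against the gradient, i.e. downhill.
    w = w - learning_rate * gradient(w)

print(f"w = {w:.6f}, loss = {loss(w):.3e}")
```

After enough steps, `w` settles near 3, the bottom of this particular valley. Real training swaps the hand-written `gradient` for autodiff, but the update line is the same.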

That’s it. A network with a billion parameters is doing the same thing as a ball on a hillside, just in a billion dimensions instead of one.
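To make the "same thing, more dimensions" point concrete, here is a sketch with three parameters instead of one. The `target` values and the quadratic loss are assumptions picked for illustration; the only change from the one-parameter case is that the update is applied to every coordinate.

```python
# Hypothetical setup: the loss is the sum of squared distances to `target`.
target = [1.0, -2.0, 0.5]


def gradient(params):
    # Partial derivative of sum((p - t)^2) with respect to each parameter.
    return [2.0 * (p - t) for p, t in zip(params, target)]


params = [0.0, 0.0, 0.0]  # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    g = gradient(params)
    # Same rule as in one dimension, applied coordinate by coordinate.
    params = [p - learning_rate * gi for p, gi in zip(params, g)]

print(params)
```

Make the lists a billion entries long and nothing about the loop changes, only the bookkeeping around it.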

Two things that bite everyone

Step size (learning rate). Too small and training crawls. Too large and you overshoot the valley and bounce — or fly off entirely.
Where you start. A bumpy loss surface has more than one valley. Where you begin decides which one you fall into.
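A minimal sketch of the two pitfalls, assuming a hypothetical two-valley loss, f(x) = (x^2 - 1)^2, with valleys at x = -1 and x = +1. This is not the function behind the demo, and `descend`, `limit`, and the specific learning rates are illustrative choices.

```python
def grad(x):
    # Derivative of the hypothetical loss (x^2 - 1)^2, which is 4x(x^2 - 1).
    return 4.0 * x * (x * x - 1.0)


def descend(x, lr, steps=20, limit=1e6):
    """Run gradient descent from x; stop early if x runs away past `limit`."""
    for _ in range(steps):
        x = x - lr * grad(x)
        if abs(x) > limit:
            break  # the iterate has flown off; no point continuing
    return x


# Step size: same start, same number of steps, three learning rates.
crawl = descend(-1.7, lr=0.01)  # too small: still short of the valley floor
good = descend(-1.7, lr=0.05)   # reasonable: settles close to x = -1
wild = descend(-1.7, lr=0.30)   # too large: overshoots and diverges

# Starting point: same learning rate, two starts on either side of the bump.
left = descend(-0.5, lr=0.05, steps=100)   # falls into the valley at -1
right = descend(0.5, lr=0.05, steps=100)   # falls into the valley at +1

print(f"crawl={crawl:.4f} good={good:.4f} wild={wild:.3e}")
print(f"left={left:.4f} right={right:.4f}")
```

For this particular function the curvature at each valley floor is 8, so steps near a minimum stay stable only while the learning rate is below 2/8 = 0.25. A different loss surface will have a different threshold.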

Both are much easier to feel than to read about. The demo below is a one-dimensional loss surface with two valleys; drag the sliders and watch:

[Interactive demo. Initial readout: step 0, x = -1.700, loss f(x) = 0.388, gradient = -3.063. Sliders: learning rate (0.050) and start x (-1.70).]

Crank the learning rate past ~0.45 to see it overshoot and bounce; drop it low to watch it crawl.

Interactive — runs entirely in your browser.