The Training Example Lie Bracket

April 9, 2026
Close-up of a weathered metal clamp in a lush, outdoor industrial setting.
Photo by Сергей ЮССтудия on Pexels

A small twist with big consequences

It has been reported that a recent write-up explains why the order of training examples can matter for neural networks — and not just a little. From a Bayesian ideal, data are unordered; updates should commute. In practice, gradient-based training of neural nets does not obey that rule. The punchline: swapping two examples can nudge the model in measurably different directions. Who knew a tiny reorder could be the butterfly flapping inside your weights?

The math in plain English

Think of each training example as a vector field over parameter space: the per-example gradient points the way the parameters want to move. The Lie bracket of two such vector fields measures their non-commutativity — in short, how much updating on x then y differs from y then x. A Taylor expansion shows the difference in parameters is O(ε^2) and is governed by that bracket; swap two minibatches and the net effect averages over all pairwise brackets. Slight change. Not subtle.

The experiment

It has been reported that the author trained an MXResNet (no attention) on CelebA for 5,000 steps with batch size 32 using Adam, and then computed Lie brackets at checkpoints. To keep disk use sane they computed brackets only between the first six test examples; each bracket tensor is as large as a full checkpoint, so cost adds up fast. The results show how swapping two examples perturbs all 40 attribute logits across a 32-example batch — visualized with an interactive slider on the page. Peek under the hood and you’ll see concrete, parameter-level fingerprints of order dependence.

So what now?

Order dependence is more than a math quirk. It has been reported that early literature — Dherin 2023 — linked these brackets to implicit bias in training; this work takes that a step further by computing the brackets on a real convnet. Practically: shuffling strategy, curriculum learning, reproducibility and debugging all get a new wrinkle. Should you panic? No — but maybe pay attention. Tiny asymmetries can compound, and that’s the emotional core here: a small swap, a different model. Explore the interactive results and you might never look at a data loader the same way again.

Sources: pbement.com, Hacker News