Background: Understanding Learned Optimizers

What are Learned Optimizers?

Traditional optimization algorithms like Adam, SGD, and Adafactor rely on hand-designed update rules that have been carefully crafted by researchers over decades. While these optimizers work well across many tasks, they come with limitations:

  • They require careful hyperparameter tuning (learning rate, momentum, weight decay)

  • They use the same update rule for all parameters and all tasks

Learned optimizers take a different approach: they use machine learning to discover optimization algorithms. Instead of hand-designing update rules, we train a small neural network (the “learned optimizer”) to predict good parameter updates based on the training history.

The key insight is that optimization itself is a learnable problem. By training an optimizer on thousands of different tasks, it can learn patterns about how to efficiently navigate loss landscapes that generalize to new, unseen tasks.

How Do Learned Optimizers Work?

The Core Idea

At each training step, a learned optimizer processes rich information about the current state of training and predicts how to update each parameter. Let’s compare the approaches:

Traditional Optimizer (Adam):

# Simple hand-designed update rule (bias correction omitted for brevity)
m = beta1 * m + (1 - beta1) * gradient          # first moment (momentum)
v = beta2 * v + (1 - beta2) * gradient ** 2     # second moment estimate
update = learning_rate * m / (v ** 0.5 + eps)
parameter = parameter - update

Learned Optimizer:

# Neural network predicts the update
features = construct_features(gradient, momentum, second_moment,
                               parameter, timestep, ...)
features = normalize(features)
direction, magnitude = MLP(features)
update = direction * exp(magnitude * alpha) * beta   # alpha, beta: small fixed scaling constants
parameter = parameter - update

The learned optimizer’s MLP is small (typically 2-3 hidden layers with 32-64 units) but has been meta-trained on thousands of optimization tasks to make good predictions.
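To make the size concrete, here is a minimal PyTorch sketch of such a per-parameter MLP. The feature count, hidden width, and two-headed output (direction and magnitude) are illustrative assumptions based on the description above, not the exact PyLO or VeLO architecture:

import torch.nn as nn

class PerParameterMLP(nn.Module):
    """Tiny MLP applied independently to each parameter's feature vector.

    Sketch only: the feature count, hidden width, and two-headed output
    are illustrative, not the exact PyLO/VeLO architecture.
    """

    def __init__(self, num_features=30, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # per parameter: (direction, magnitude)
        )

    def forward(self, features):
        # features: (num_params, num_features), one row per scalar parameter
        out = self.net(features)
        direction, magnitude = out[..., 0], out[..., 1]
        return direction, magnitude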

Feature Construction

Learned optimizers construct a rich set of features to inform their decisions:

Accumulator-based features:

  • Momentum accumulators (like Adam’s first moment): m_t = β × m_{t-1} + (1-β) × ∇

  • Second moment estimates (like Adam’s second moment): v_t = β × v_{t-1} + (1-β) × ∇²

  • Factored second moments (inspired by Adafactor): row and column factors for efficient memory usage

Derived features:

  • Normalized momentum: m / √v

  • Reciprocal square roots: 1 / √v

  • Interaction features: combinations of gradients with momentum and factored terms

  • Parameter values: the current weights being optimized

  • Gradient values: raw and clipped gradient information

Temporal features:

  • Time-based features tanh(t / τ) computed at several time scales τ

  • These help the optimizer adapt its behavior over the course of training
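To make this concrete, here is a minimal sketch (assuming PyTorch) of how such features could be assembled for one parameter tensor. The decay rates, time scales, and exact feature set are illustrative rather than the precise VeLO or small_fc_lopt recipe; see the Note below for the authoritative feature tables:

import torch

def build_features(grad, param, m, v, step, eps=1e-8):
    """Illustrative feature construction for a single parameter tensor.

    grad, param, m, v share the parameter's shape; step is the global
    training step. Returns a (num_elements, num_features) tensor.
    """
    rsqrt_v = torch.rsqrt(v + eps)          # 1 / sqrt(v)
    per_param = [
        grad,                               # raw gradient
        param,                              # current weights
        m,                                  # momentum accumulator
        v,                                  # second-moment accumulator
        m * rsqrt_v,                        # normalized momentum m / sqrt(v)
        rsqrt_v,                            # reciprocal square root
        grad * rsqrt_v,                     # gradient x second-moment interaction
    ]
    feats = torch.stack([f.flatten() for f in per_param], dim=-1)

    # Time features tanh(t / tau) at several time scales tau.
    scales = torch.tensor([1.0, 10.0, 100.0, 1e3, 1e4, 1e5])
    time_feats = torch.tanh(step / scales).expand(feats.shape[0], -1)
    return torch.cat([feats, time_feats], dim=-1)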

Note

See Appendix G of the paper for complete feature tables for VeLO and small_fc_lopt.

The Optimization Pipeline

Here’s what happens during a single optimizer step:

1. Update Accumulators
   ├─→ momentum: m = β₁m + (1-β₁)∇
   ├─→ second moment: v = β₂v + (1-β₂)∇²
   └─→ factored moments: r, c (row/column factors)

2. Construct Features
   ├─→ Raw values: ∇, θ, m, v, r, c
   ├─→ Combinations: m/√v, ∇·√(r⊗c), etc.
   └─→ Time features: tanh(t/1), tanh(t/10), ..., tanh(t/10⁵)

3. Normalize Features
   └─→ Divide by √(mean(feature²) + ε)

4. Apply Learned Optimizer MLP
   ├─→ hidden = ReLU(Linear1(features))
   ├─→ hidden = ReLU(Linear2(hidden))
   └─→ direction, magnitude = Linear3(hidden)

5. Compute and Apply Update
   └─→ θ ← θ - direction × exp(magnitude × α) × β
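A condensed Python sketch of this pipeline follows, reusing the hypothetical build_features and PerParameterMLP from the sketches above. The decay rates and the constants alpha and beta are illustrative stand-ins for the fixed scaling constants in the update rule, not the actual PyLO implementation:

import torch

@torch.no_grad()
def learned_optimizer_step(param, grad, state, mlp, step,
                           beta1=0.9, beta2=0.999,
                           alpha=1e-3, beta=1e-3, eps=1e-8):
    """One illustrative update of a single parameter tensor."""
    # 1. Update accumulators (factored row/column moments omitted for brevity).
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2

    # 2. Construct features (see the build_features sketch above).
    feats = build_features(grad, param, state["m"], state["v"], step)

    # 3. Normalize each feature by its root mean square across elements.
    feats = feats / torch.sqrt(feats.pow(2).mean(dim=0, keepdim=True) + eps)

    # 4. Apply the learned optimizer MLP.
    direction, magnitude = mlp(feats)

    # 5. Compute and apply the update; alpha and beta are small fixed
    #    constants that keep the update scale conservative.
    update = direction * torch.exp(magnitude * alpha) * beta
    param -= update.reshape(param.shape)
    return param, state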

Meta-Training: How Learned Optimizers Learn

Learned optimizers are trained through a process called meta-learning:

  1. Sample a random task (e.g., train a small CNN on CIFAR-10)

  2. Unroll optimization for K steps using the learned optimizer

  3. Compute meta-loss based on the final validation/training loss

  4. Update the optimizer’s weights via gradient estimation strategies such as Evolution Strategies

  5. Repeat for thousands of diverse tasks

Meta-training is computationally expensive: the state-of-the-art learned optimizer VeLO was meta-trained for roughly 4,000 TPU-months. However, this cost is paid only once; the resulting learned optimizer can then be applied to new tasks without retraining, much like pretraining a large language model.
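For intuition, here is a heavily simplified sketch of one meta-training iteration using Evolution Strategies to estimate the meta-gradient. Here sample_task and unroll are hypothetical helpers standing in for the task distribution and the K-step inner training loop; real meta-training distributes thousands of such unrolls across many workers and adds further refinements (truncated unrolls, loss normalization):

import torch

def meta_train_step(meta_params, sample_task, unroll, meta_lr=1e-3,
                    sigma=0.01, num_perturbations=8, unroll_steps=100):
    """One illustrative Evolution Strategies update of the optimizer's weights.

    sample_task() returns a fresh inner task; unroll(task, params, K) trains
    it for K steps with a learned optimizer built from params and returns the
    final meta-loss. Both are hypothetical stand-ins.
    """
    task = sample_task()                    # 1. sample a random task
    grad_estimate = torch.zeros_like(meta_params)
    for _ in range(num_perturbations):
        noise = torch.randn_like(meta_params)
        # 2-3. Unroll K inner steps at perturbed weights and score the result
        #      (antithetic pair: +noise and -noise).
        loss_pos = unroll(task, meta_params + sigma * noise, unroll_steps)
        loss_neg = unroll(task, meta_params - sigma * noise, unroll_steps)
        grad_estimate += (loss_pos - loss_neg) / (2 * sigma) * noise
    grad_estimate /= num_perturbations
    # 4. Descend the estimated meta-gradient; 5. repeat over thousands of tasks.
    return meta_params - meta_lr * grad_estimate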

Important

You don’t need to meta-train learned optimizers yourself. PyLO provides pre-trained optimizers ready to use. Think of it like using a pre-trained language model: the expensive training has already been done.

References and Further Reading

Key Papers:

Codebases: