Key Concepts and Glossary

This page provides definitions and explanations of key terms used in learned optimization and PyLO.

Core Concepts

Learned Optimizer

A neural network (typically a small MLP) that predicts parameter updates during training. Instead of using hand-designed update rules (like Adam’s momentum and adaptive learning rates), learned optimizers are trained via meta-learning to generate good updates across many tasks.

Example: small_fc_lopt uses a 2-3 layer MLP with ~32-64 hidden units to predict updates for each parameter.

Meta-Learning (Learning to Learn)

The process of training an optimizer by optimizing its performance across many different tasks. The meta-objective is to find optimizer parameters that lead to good performance when the optimizer is applied to new, unseen tasks.

Key idea: Instead of hand-designing optimization rules, we learn them from data.

Meta-training process:

  1. Sample a random task (e.g., train a CNN on CIFAR-10)

  2. Apply the learned optimizer for K steps

  3. Compute the meta-loss (e.g., final validation loss or the sum of training losses)

  4. Update the optimizer’s weights via a gradient estimation strategy such as Evolution Strategies (ES)

  5. Repeat across many tasks (a minimal sketch of this loop follows)
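
A minimal sketch of this loop using antithetic Evolution Strategies on a toy quadratic task; every name here (inner_train, theta, etc.) is illustrative, not PyLO's meta-training code:

import numpy as np

def inner_train(theta, steps=20):
    # Toy task: minimize 0.5 * ||w - target||^2 with an "optimizer" whose
    # only meta-parameter is a learned log step size.
    rng = np.random.default_rng(0)
    w, target = rng.normal(size=10), rng.normal(size=10)
    for _ in range(steps):
        grad = w - target
        w = w - np.exp(theta) * grad          # update produced by the "optimizer"
    return 0.5 * np.sum((w - target) ** 2)    # meta-loss: final training loss

theta, sigma, meta_lr, pop = -3.0, 0.1, 0.05, 32
rng = np.random.default_rng(1)
for _ in range(100):
    noise = rng.normal(size=pop)
    # Antithetic ES estimate of d(meta_loss)/d(theta)
    deltas = np.array([inner_train(theta + sigma * n) - inner_train(theta - sigma * n)
                       for n in noise])
    theta -= meta_lr * (deltas * noise).mean() / (2 * sigma)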

Optimizee

The model being optimized. In the context of learned optimization:

  • Optimizee: The neural network you’re training (e.g., a ResNet, GPT model)

  • Optimizer: The learned optimizer doing the training

Example: When you train a Vision Transformer with VeLO, the ViT is the optimizee and VeLO is the optimizer.

Accumulator

State variables maintained by the optimizer across training steps. These store historical information about gradients and parameters.

Common accumulators:

  • Momentum (m): Exponential moving average of gradients

  • Second moment (v): Exponential moving average of squared gradients

  • Row/column factors (r, c): Factored second moment estimates (Adafactor-style)

Why they matter: Learned optimizers construct features from these accumulators to make informed update decisions.
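
A sketch of what this state can look like for one parameter tensor; the field names m, v, r, c mirror the list above but are illustrative, not PyLO's internals:

import torch

p = torch.randn(128, 64)              # one parameter matrix
state = {
    "m": torch.zeros_like(p),         # momentum: EMA of gradients
    "v": torch.zeros_like(p),         # second moment: EMA of squared gradients
    "r": torch.zeros(p.shape[0]),     # Adafactor-style row factor
    "c": torch.zeros(p.shape[1]),     # Adafactor-style column factor
}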

Features

Input values fed to the learned optimizer’s neural network. Features are constructed from:

  • Current gradients

  • Parameter values

  • Accumulator states

  • Derived quantities (e.g., normalized momentum)

  • Time information

VeLO uses 29 input features; small_fc_lopt uses 39.

Example features (see the sketch after this list):

  • gradient - the raw gradient value

  • 1 / sqrt(second_moment) - reciprocal square root (rsqrt) of the second moment

  • tanh(step / 1000) - Temporal feature indicating training progress
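
A sketch of building these three features for one parameter tensor; the stacking layout and buffer names are assumptions for illustration:

import math
import torch

def build_features(grad, second_moment, step, eps=1e-8):
    progress = math.tanh(step / 1000.0)
    return torch.stack([
        grad,                                # raw gradient
        torch.rsqrt(second_moment + eps),    # rsqrt of second moment
        torch.full_like(grad, progress),     # temporal feature
    ], dim=-1)                               # shape: [*param_shape, 3]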

Feature Normalization

The process of scaling each feature to have unit root-mean-square (RMS) magnitude before feeding it to the MLP. This helps the learned optimizer generalize across different parameter scales.

Normalization formula:

normalized_feature = feature / sqrt(mean(feature²) + ε)

Computed separately for each feature dimension across all parameters.
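
A sketch of this formula applied per feature column, assuming features are laid out as [num_params, num_features]:

import torch

def normalize_features(feats, eps=1e-8):
    # RMS over all parameters, computed separately per feature dimension
    rms = torch.sqrt((feats ** 2).mean(dim=0, keepdim=True) + eps)
    return feats / rms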

Training Horizon

The number of optimization steps the learned optimizer is expected to run for. Different learned optimizers are meta-trained for different horizons:

  • VeLO: Long horizons (150K steps) - suitable for large-scale pre-training

  • small_fc_lopt: Short-to-medium horizons (1K-10K steps)

Why it matters: Using an optimizer outside its meta-training horizon can lead to instability or divergence.

Optimizer Components

MLP (Multi-Layer Perceptron)

The neural network inside a learned optimizer that predicts parameter updates. Typically:

  • Architecture: 2-3 hidden layers with 32-64 units

  • Activations: ReLU

  • Output: 2 values per parameter (direction and magnitude)

Size: Small (on the order of a few thousand parameters) compared to the models being optimized (millions to billions).

Direction and Magnitude

The two outputs of a learned optimizer’s MLP for each parameter:

  • Direction: Suggests the sign and relative direction of the update

  • Magnitude: Suggests the scale of the update

Update formula:

update = direction × exp(magnitude × α) × β
parameter = parameter - update

Where α and β are fixed hyperparameters (typically 0.01).
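
A sketch of this update applied in PyTorch; the function name and α/β values (taken from the defaults stated above) are illustrative:

import torch

ALPHA, BETA = 0.01, 0.01   # the fixed hyperparameters above

def apply_update(param, direction, magnitude):
    update = direction * torch.exp(magnitude * ALPHA) * BETA
    return param - update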

Momentum

An exponential moving average of past gradients, helping optimization “build up speed” in consistent directions.

Update rule:

momentum_t = β × momentum_{t-1} + (1 - β) × gradient_t

Typical value: β ≈ 0.9-0.99

Used by Adam, RMSprop, SGD with momentum, and learned optimizers.

Second Moment

An exponential moving average of squared gradients, used for adaptive learning rates.

Update rule:

second_moment_t = β × second_moment_{t-1} + (1 - β) × gradient_t²

Typical value: β ≈ 0.999

Used by Adam, RMSprop, and learned optimizers for gradient normalization.
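
A sketch implementing both the momentum and second-moment update rules above in-place, with beta values in the typical ranges stated:

import torch

def update_accumulators(grad, m, v, beta1=0.9, beta2=0.999):
    # momentum_t = beta1 * momentum_{t-1} + (1 - beta1) * gradient_t
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    # second_moment_t = beta2 * second_moment_{t-1} + (1 - beta2) * gradient_t^2
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    return m, v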

Factored Second Moment (Adafactor-style)

An efficient approximation of the full second moment matrix using row and column factors. Instead of storing an m×n matrix, store:

  • Row factors (r): Vector of length m

  • Column factors (c): Vector of length n

Memory savings: O(m×n) → O(m + n)

Approximation: second_moment[i,j] ≈ row_factor[i] × col_factor[j]

Used by Adafactor and learned optimizers for memory efficiency.
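
A sketch of the factored estimate and its reconstruction; Adafactor additionally normalizes the outer product by the mean row factor, which is included here, and all names are illustrative:

import torch

g = torch.randn(1024, 512)             # gradient of an m x n parameter matrix
r = (g * g).mean(dim=1)                # row factors, length m
c = (g * g).mean(dim=0)                # column factors, length n
# Rank-1 reconstruction; dividing by r.mean() makes the row and column
# means of the approximation match r and c.
v_approx = torch.outer(r, c) / r.mean()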

Performance Concepts

CUDA Kernel

A function that runs on the GPU. Each kernel launch has overhead, so reducing kernel count improves performance.

Kernel Fusion

Combining multiple operations into a single CUDA kernel to reduce memory traffic and launch overhead.

Example: PyLO fuses feature construction, normalization, MLP evaluation, and parameter update into one kernel.

Memory Hierarchy

The different levels of memory on a GPU, with varying speeds and sizes:

Memory Type     Size           Latency      Usage in PyLO
Registers       ~256 KB/SM     1 cycle      Features, MLP activations
Shared Memory   48-164 KB/SM   ~20 cycles   Normalization stats
L2 Cache        6-40 MB        ~200 cycles  MLP weights
Global Memory   40-80 GB       ~400 cycles  Parameters, gradients

Optimization goal: Keep frequently accessed data in fast memory (registers, shared memory).

Memory Bandwidth

The rate at which data can be transferred to/from GPU memory. Often the bottleneck for learned optimizers.

Occupancy

The ratio of active warps to the maximum possible warps on an SM. Higher occupancy generally means better GPU utilization.

Grid-Stride Loop

A CUDA programming pattern that allows a kernel to process more elements than there are threads.

// Each thread starts at its global index and strides forward by the
// total number of threads in the grid until all elements are covered.
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
     i < total_elements;
     i += blockDim.x * gridDim.x)
{
    process_element(i);
}

Benefits: Flexible thread configuration, better instruction-level parallelism.

PyLO-Specific Terms

VeLO (Versatile Learned Optimizer)

A learned optimizer meta-trained by Google for 4000 TPU-months on diverse tasks. The most robust learned optimizer currently available.

Key properties:

  • 29 input features

  • Meta-trained for long horizons (150K steps)

  • Works well without hyperparameter tuning

Paper: VeLO: Training Versatile Learned Optimizers by Scaling Up

small_fc_lopt

A small MLP-based learned optimizer.

Key properties:

  • 39 input features

  • Meta-trained for short horizons

  • More memory-efficient during meta-training

Paper: Practical tradeoffs between memory, compute, and performance in learned optimizers

Distributed Optimizer Step

An optimization for multi-GPU training that distributes optimizer computation across devices instead of redundantly computing on all devices.

How it works:

  1. Reduce-scatter gradients (instead of all-reduce)

  2. Each GPU computes optimizer step for its shard

  3. All-gather updated parameters
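
A sketch of these three steps with torch.distributed collectives, assuming the default process group is initialized and flat params/grads tensors whose length divides evenly across ranks; the function and the plain SGD step are illustrative, not PyLO's implementation:

import torch
import torch.distributed as dist

def distributed_step(params, grads, lr=1e-3):
    world, rank = dist.get_world_size(), dist.get_rank()
    shard = torch.empty(grads.numel() // world, device=grads.device)
    dist.reduce_scatter_tensor(shard, grads)       # 1. each rank gets one summed gradient shard
    n = shard.numel()
    local = params[rank * n:(rank + 1) * n] - lr * shard / world  # 2. local step on the mean gradient
    dist.all_gather_into_tensor(params, local)     # 3. reassemble the full updated parameters
    return params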

Decoupled Weight Decay

Weight decay applied directly to parameters rather than through gradients. This separates regularization from gradient-based optimization.

Standard weight decay (L2 regularization):

gradient = gradient + λ × parameter

Decoupled weight decay:

parameter = (1 - λ) × parameter - learning_rate × gradient

Why it matters: Decoupled weight decay interacts better with adaptive learning rates and learned optimizers.
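
A sketch contrasting the two formulations with a plain SGD step; the function names are illustrative:

import torch

def coupled_step(p, g, lr=0.1, wd=0.01):
    g = g + wd * p            # L2: decay folded into the gradient
    return p - lr * g

def decoupled_step(p, g, lr=0.1, wd=0.01):
    p = (1 - wd) * p          # decay applied directly to the weights
    return p - lr * g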

Training Concepts

Meta-Training vs Training

  • Meta-training: Training the optimizer itself on many tasks (done once, expensive)

  • Training (or optimization): Using the optimizer to train a model (done for each new model)

Hyperparameter Tuning

The process of searching for good hyperparameters (learning rate, weight decay, etc.) for a given task.

Traditional optimizers: Require extensive hyperparameter tuning

Learned optimizers: Work well with default settings (one of their key benefits!)

Technical Jargon

Warp

A group of 32 threads on NVIDIA GPUs that execute in lockstep. The fundamental unit of execution.

Warp shuffle: Communication between threads in a warp without using shared memory.

Thread Block (or Block)

A group of threads (typically 128-1024) that can cooperate via shared memory.

Streaming Multiprocessor (SM)

A GPU compute unit. Modern GPUs have 40-140 SMs.

Each SM has:

  • Registers (~256 KB)

  • Shared memory (48-164 KB)

  • L1 cache

  • Tensor cores (on recent GPUs)

Atomic Operation

An operation that completes without interruption. Used for thread-safe updates to shared variables.

Example: atomicAdd(&global_sum, thread_value)