Key Concepts and Glossary
This page provides definitions and explanations of key terms used in learned optimization and PyLO.
Core Concepts
Learned Optimizer
A neural network (typically a small MLP) that predicts parameter updates during training. Instead of using hand-designed update rules (like Adam’s momentum and adaptive learning rates), learned optimizers are trained via meta-learning to generate good updates across many tasks.
Example: small_fc_lopt uses a 2-3 layer MLP with ~32-64 hidden units to predict updates for each parameter.
Meta-Learning (Learning to Learn)
The process of training an optimizer by optimizing its performance across many different tasks. The meta-objective is to find optimizer parameters that lead to good performance when the optimizer is applied to new, unseen tasks.
Key idea: Instead of hand-designing optimization rules, we learn them from data.
Meta-training process:
Sample a random task (e.g., train a CNN on CIFAR-10)
Apply the learned optimizer for K steps
Compute the meta-loss (e.g., final validation loss or the sum of training losses)
Update the optimizer’s weights via gradient estimation strategies such as Evolution Strategies.
Repeat for a large number of tasks (sketched below)
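A minimal sketch of one such meta-update using antithetic Evolution Strategies. Here `sample_task` and `unroll` are hypothetical helpers (not PyLO API): `unroll(theta, task)` applies the learned optimizer with weights `theta` for K steps and returns the meta-loss.

```python
import numpy as np

def es_meta_step(theta, sample_task, unroll, sigma=0.01, meta_lr=1e-3, pop=8):
    """One Evolution Strategies meta-update of the optimizer weights theta."""
    task = sample_task()                     # 1. sample a random task
    grad_est = np.zeros_like(theta)
    for _ in range(pop):
        eps = np.random.randn(*theta.shape)  # random perturbation direction
        # 2./3. unroll K steps with perturbed weights and measure the meta-loss
        loss_pos = unroll(theta + sigma * eps, task)
        loss_neg = unroll(theta - sigma * eps, task)
        # Antithetic ES gradient estimate
        grad_est += (loss_pos - loss_neg) / (2.0 * sigma) * eps
    return theta - meta_lr * grad_est / pop  # 4. update the optimizer's weights
```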
Optimizee
The model being optimized. In the context of learned optimization:
Optimizee: The neural network you’re training (e.g., a ResNet, GPT model)
Optimizer: The learned optimizer doing the training
Example: When you train a Vision Transformer with VeLO, the ViT is the optimizee and VeLO is the optimizer.
Accumulator
State variables maintained by the optimizer across training steps. These store historical information about gradients and parameters.
Common accumulators:
Momentum (m): Exponential moving average of gradients
Second moment (v): Exponential moving average of squared gradients
Row/column factors (r, c): Factored second moment estimates (Adafactor-style)
Why they matter: Learned optimizers construct features from these accumulators to make informed update decisions.
Features
Input values fed to the learned optimizer’s neural network. Features are constructed from:
Current gradients
Parameter values
Accumulator states
Derived quantities (e.g., normalized momentum)
Time information
VeLO uses 29 input features, while small_fc_lopt uses 39.
Example features:
gradient - The raw gradient
1 / sqrt(second_moment) - Rsqrt of the second moment
tanh(step / 1000) - Temporal feature indicating training progress
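A hedged sketch of how features like these might be assembled per parameter (the names and exact feature set are illustrative, not PyLO's internals):

```python
import math
import torch

def build_features(grad, momentum, second_moment, step, eps=1e-8):
    """Stack one feature vector per parameter from gradients and accumulators."""
    rsqrt_v = torch.rsqrt(second_moment + eps)
    feats = [
        grad,                                             # raw gradient
        momentum,                                         # first-moment accumulator
        rsqrt_v,                                          # rsqrt of second moment
        grad * rsqrt_v,                                   # normalized gradient
        torch.full_like(grad, math.tanh(step / 1000.0)),  # temporal feature
    ]
    return torch.stack(feats, dim=-1)  # shape: (*param_shape, num_features)
```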
Feature Normalization
The process of scaling features to have unit variance before feeding them to the MLP. This helps the learned optimizer generalize across different parameter scales.
Normalization formula:
normalized_feature = feature / sqrt(mean(feature²) + ε)
Computed separately for each feature dimension across all parameters.
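A minimal sketch of this normalization in PyTorch, assuming `features` has shape `(num_params, num_features)`:

```python
import torch

def normalize_features(features, eps=1e-8):
    # RMS normalization per feature dimension, across all parameters:
    # feature / sqrt(mean(feature**2) + eps)
    rms = features.pow(2).mean(dim=0, keepdim=True)
    return features / torch.sqrt(rms + eps)
```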
Training Horizon
The number of optimization steps the learned optimizer is expected to run for. Different learned optimizers are meta-trained for different horizons:
VeLO: Long horizons (150K steps) - suitable for large-scale pre-training
small_fc_lopt: Short-medium horizons (1K-10K steps)
Why it matters: Using an optimizer outside its meta-training horizon can lead to instability or divergence.
Optimizer Components
MLP (Multi-Layer Perceptron)
The neural network inside a learned optimizer that predicts parameter updates. Typically:
Architecture: 2-3 hidden layers with 32-64 units
Activations: ReLU
Output: 2 values per parameter (direction and magnitude)
Size: Small (a few thousand parameters) compared to the models being optimized (millions to billions).
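A hedged sketch of such an MLP in PyTorch (the 39-feature input follows small_fc_lopt; exact sizes vary by optimizer):

```python
import torch.nn as nn

# Maps one 39-dimensional feature vector to two outputs (direction, magnitude)
update_mlp = nn.Sequential(
    nn.Linear(39, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)
```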
Direction and Magnitude
The two outputs of a learned optimizer’s MLP for each parameter:
Direction: Suggests the sign and relative direction of the update
Magnitude: Suggests the scale of the update
Update formula:
update = direction × exp(magnitude × α) × β
parameter = parameter - update
Where α and β are fixed hyperparameters (typically 0.01).
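A minimal sketch applying this rule to the MLP's two outputs per parameter (α and β defaults follow the values quoted above):

```python
import torch

def apply_update(param, direction, magnitude, alpha=0.01, beta=0.01):
    # update = direction * exp(magnitude * alpha) * beta
    update = direction * torch.exp(magnitude * alpha) * beta
    return param - update
```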
Momentum
An exponential moving average of past gradients, helping optimization “build up speed” in consistent directions.
Update rule:
momentum_t = β × momentum_{t-1} + (1 - β) × gradient_t
Typical value: β ≈ 0.9-0.99
Used by Adam, RMSprop, SGD with momentum, and learned optimizers.
Second Moment
An exponential moving average of squared gradients, used for adaptive learning rates.
Update rule:
second_moment_t = β × second_moment_{t-1} + (1 - β) × gradient_t²
Typical value: β ≈ 0.999
Used by Adam, RMSprop, and learned optimizers for gradient normalization.
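The two update rules above, written as a single sketch (all arguments are torch tensors of the same shape):

```python
def update_accumulators(grad, momentum, second_moment, beta1=0.9, beta2=0.999):
    """Exponential moving averages of gradients and squared gradients."""
    momentum = beta1 * momentum + (1 - beta1) * grad
    second_moment = beta2 * second_moment + (1 - beta2) * grad.pow(2)
    return momentum, second_moment
```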
Factored Second Moment (Adafactor-style)
An efficient approximation of the full second moment matrix using row and column factors. Instead of storing an m×n matrix, store:
Row factors (r): Vector of length m
Column factors (c): Vector of length n
Memory savings: O(m×n) → O(m + n)
Approximation: second_moment[i,j] ≈ row_factor[i] × col_factor[j]
Used by Adafactor and learned optimizers for memory efficiency.
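A hedged sketch of the factored update for an m×n weight matrix, following the approximation above (not PyLO's exact implementation):

```python
def update_factored_second_moment(grad, row_factor, col_factor, beta=0.999):
    """EMA of row/column mean squared gradients for an m x n torch tensor.

    row_factor has shape (m,), col_factor has shape (n,); the full matrix is
    approximated as second_moment[i, j] ~= row_factor[i] * col_factor[j].
    """
    g2 = grad.pow(2)
    row_factor = beta * row_factor + (1 - beta) * g2.mean(dim=1)  # shape (m,)
    col_factor = beta * col_factor + (1 - beta) * g2.mean(dim=0)  # shape (n,)
    return row_factor, col_factor
```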
Performance Concepts
CUDA Kernel
A function that runs on the GPU. Each kernel launch has overhead, so reducing kernel count improves performance.
Kernel Fusion
Combining multiple operations into a single CUDA kernel to reduce memory traffic and launch overhead.
Example: PyLO fuses feature construction, normalization, MLP evaluation, and parameter update into one kernel.
Memory Hierarchy
The different levels of memory on a GPU, with varying speeds and sizes:
| Memory Type | Size | Latency | Usage in PyLO |
|---|---|---|---|
| Registers | ~256 KB/SM | 1 cycle | Features, MLP activations |
| Shared Memory | 48-164 KB/SM | ~20 cycles | Normalization stats |
| L2 Cache | 6-40 MB | ~200 cycles | MLP weights |
| Global Memory | 40-80 GB | ~400 cycles | Parameters, gradients |
Optimization goal: Keep frequently accessed data in fast memory (registers, shared memory).
Memory Bandwidth
The rate at which data can be transferred to/from GPU memory. Often the bottleneck for learned optimizers.
Occupancy
The ratio of active warps to the maximum possible warps on a GPU. Higher occupancy generally means better GPU utilization.
Grid-Stride Loop
A CUDA programming pattern that allows a kernel to process more elements than there are threads.
```cuda
// Each thread starts at its global index and strides by the total thread
// count, so any grid size can cover all elements.
__global__ void process_kernel(int total_elements) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < total_elements;
         i += blockDim.x * gridDim.x) {
        process_element(i);  // placeholder for the per-element work
    }
}
```
Benefits: Flexible thread configuration, better instruction-level parallelism.
PyLO-Specific Terms
VeLO (Versatile Learned Optimizer)
A learned optimizer meta-trained by Google for 4000 TPU-months on diverse tasks. The most robust learned optimizer currently available.
Key properties:
29 input features
Meta-trained for long horizons (150K steps)
Works well without hyperparameter tuning
Paper: VeLO: Training Versatile Learned Optimizers by Scaling Up
small_fc_lopt
A small MLP-based learned optimizer.
Key properties:
39 input features
Meta-trained for short horizons
More memory-efficient during meta-training
Paper: Practical tradeoffs between memory, compute, and performance in learned optimizers
Distributed Optimizer Step
An optimization for multi-GPU training that distributes optimizer computation across devices instead of redundantly computing on all devices.
How it works (sketched in code after these steps):
Reduce-scatter gradients (instead of all-reduce)
Each GPU computes optimizer step for its shard
All-gather updated parameters
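A minimal sketch of this three-step pattern with `torch.distributed`, assuming flat 1-D `params`/`grads` tensors whose length is divisible by the world size; `learned_opt_step` is a hypothetical per-shard optimizer step, not PyLO's actual API:

```python
import torch
import torch.distributed as dist

def distributed_optimizer_step(params, grads, learned_opt_step):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard = params.numel() // world_size
    # 1. Reduce-scatter: each rank receives only its shard of the summed gradients
    grad_shard = torch.empty(shard, device=grads.device, dtype=grads.dtype)
    dist.reduce_scatter_tensor(grad_shard, grads)
    # 2. Run the (expensive) optimizer step on this rank's shard only
    param_shard = params[rank * shard : (rank + 1) * shard]
    new_shard = learned_opt_step(param_shard, grad_shard)
    # 3. All-gather: reassemble the full updated parameter tensor on every rank
    dist.all_gather_into_tensor(params, new_shard)
    return params
```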
Decoupled Weight Decay
Weight decay applied directly to parameters rather than through gradients. This separates regularization from gradient-based optimization.
Standard weight decay (L2 regularization):
gradient = gradient + λ × parameter
Decoupled weight decay:
parameter = (1 - λ) × parameter - learning_rate × gradient
Why it matters: Decoupled weight decay interacts better with adaptive learning rates and learned optimizers.
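A minimal sketch of the decoupled variant, following the formula above (the λ and learning-rate values are illustrative):

```python
def decoupled_weight_decay_step(param, grad, lr=0.01, wd=1e-4):
    # Decay acts directly on the parameters instead of being added to the gradient
    return (1 - wd) * param - lr * grad
```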
Training Concepts
Meta-Training vs Training
Meta-training: Training the optimizer itself on many tasks (done once, expensive)
Training (or optimization): Using the optimizer to train a model (done for each new model)
Hyperparameter Tuning
The process of searching for good hyperparameters (learning rate, weight decay, etc.) for a given task.
Traditional optimizers: Require extensive hyperparameter tuning
Learned optimizers: Work well with default settings (one of their key benefits!)
Technical Jargon
Warp
A group of 32 threads on NVIDIA GPUs that execute in lockstep. The fundamental unit of execution.
Warp shuffle: Communication between threads in a warp without using shared memory.
Thread Block (or Block)
A group of threads (typically 128-1024) that can cooperate via shared memory.
Streaming Multiprocessor (SM)
A GPU compute unit. Modern GPUs have 40-140 SMs.
Each SM has:
Registers (~256 KB)
Shared memory (48-164 KB)
L1 cache
Tensor cores (on recent GPUs)
Atomic Operation
An operation that completes without interruption. Used for thread-safe updates to shared variables.
Example: atomicAdd(&global_sum, thread_value)