Key Concepts and Glossary
This page provides definitions and explanations of key terms used in learned optimization and PyLO.
Core Concepts
Learned Optimizer
A neural network (typically a small MLP) that predicts parameter updates during training. Instead of using hand-designed update rules (like Adam’s momentum and adaptive learning rates), learned optimizers are trained via meta-learning to generate good updates across many tasks.
Example: small_fc_lopt uses a 2-3 layer MLP with ~32-64 hidden units to predict updates for each parameter.
Meta-Learning (Learning to Learn)
The process of training an optimizer by optimizing its performance across many different tasks. The meta-objective is to find optimizer parameters that lead to good performance when the optimizer is applied to new, unseen tasks.
Key idea: Instead of hand-designing optimization rules, we learn them from data.
Meta-training process:
Sample a random task (e.g., train a CNN on CIFAR-10)
Apply the learned optimizer for K steps
Compute the meta-loss (e.g., final validation loss or the sum of training losses)
Update the optimizer’s weights via gradient estimation strategies such as Evolution Strategies.
Repeat for a large number of tasks (sketched below)
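A minimal sketch of one such meta-update using antithetic Evolution Strategies. Here `sample_task` and `unroll` are hypothetical helpers (not PyLO API): `unroll(theta, task)` applies the learned optimizer with weights `theta` for K steps and returns the meta-loss.

```python
import numpy as np

def es_meta_step(theta, sample_task, unroll, sigma=0.01, meta_lr=1e-3, pop=8):
    """One Evolution Strategies meta-update of the optimizer weights theta."""
    task = sample_task()                     # 1. sample a random task
    grad_est = np.zeros_like(theta)
    for _ in range(pop):
        eps = np.random.randn(*theta.shape)  # random perturbation direction
        # 2./3. unroll K steps with perturbed weights and measure the meta-loss
        loss_pos = unroll(theta + sigma * eps, task)
        loss_neg = unroll(theta - sigma * eps, task)
        # Antithetic ES gradient estimate
        grad_est += (loss_pos - loss_neg) / (2.0 * sigma) * eps
    return theta - meta_lr * grad_est / pop  # 4. update the optimizer's weights
```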
Optimizee
The model being optimized. In the context of learned optimization:
Optimizee: The neural network you’re training (e.g., a ResNet, GPT model)
Optimizer: The learned optimizer doing the training
Example: When you train a Vision Transformer with VeLO, the ViT is the optimizee and VeLO is the optimizer.
Accumulator
State variables maintained by the optimizer across training steps. These store historical information about gradients and parameters.
Common accumulators:
Momentum (m): Exponential moving average of gradients
Second moment (v): Exponential moving average of squared gradients
Row/column factors (r, c): Factored second moment estimates (Adafactor-style)
Why they matter: Learned optimizers construct features from these accumulators to make informed update decisions.
Features
Input values fed to the learned optimizer’s neural network. Features are constructed from:
Current gradients
Parameter values
Accumulator states
Derived quantities (e.g., normalized momentum)
Time information
VeLO uses 29 input features, while small_fc_lopt uses 39.
Example features:
gradient - The raw gradient
1 / sqrt(second_moment) - Rsqrt of the second moment
tanh(step / 1000) - Temporal feature indicating training progress
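A hedged sketch of how features like these might be assembled per parameter (the names and exact feature set are illustrative, not PyLO's internals):

```python
import math
import torch

def build_features(grad, momentum, second_moment, step, eps=1e-8):
    """Stack one feature vector per parameter from gradients and accumulators."""
    rsqrt_v = torch.rsqrt(second_moment + eps)
    feats = [
        grad,                                             # raw gradient
        momentum,                                         # first-moment accumulator
        rsqrt_v,                                          # rsqrt of second moment
        grad * rsqrt_v,                                   # normalized gradient
        torch.full_like(grad, math.tanh(step / 1000.0)),  # temporal feature
    ]
    return torch.stack(feats, dim=-1)  # shape: (*param_shape, num_features)
```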
Feature Normalization
The process of scaling features to have unit variance before feeding them to the MLP. This helps the learned optimizer generalize across different parameter scales.
Normalization formula:
normalized_feature = feature / sqrt(mean(feature²) + ε)
Computed separately for each feature dimension across all parameters.
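A minimal sketch of this normalization in PyTorch, assuming `features` has shape `(num_params, num_features)`:

```python
import torch

def normalize_features(features, eps=1e-8):
    # RMS normalization per feature dimension, across all parameters:
    # feature / sqrt(mean(feature**2) + eps)
    rms = features.pow(2).mean(dim=0, keepdim=True)
    return features / torch.sqrt(rms + eps)
```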
Training Horizon
The number of optimization steps the learned optimizer is expected to run for. Different learned optimizers are meta-trained for different horizons:
VeLO: Long horizons (150K steps) - suitable for large-scale pre-training
small_fc_lopt: Short-medium horizons (1K-10K steps)
Why it matters: Using an optimizer outside its meta-training horizon can lead to instability or divergence.
Optimizer Components
MLP (Multi-Layer Perceptron)
The neural network inside a learned optimizer that predicts parameter updates. Typically:
Architecture: 2-3 hidden layers with 32-64 units
Activations: ReLU
Output: 2 values per parameter (direction and magnitude)
Size: Small (a few thousand parameters) compared to the models being optimized (millions to billions).
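A hedged sketch of such an MLP in PyTorch (the 39-feature input follows small_fc_lopt; exact sizes vary by optimizer):

```python
import torch.nn as nn

# Maps one 39-dimensional feature vector to two outputs (direction, magnitude)
update_mlp = nn.Sequential(
    nn.Linear(39, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)
```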
Direction and Magnitude
The two outputs of a learned optimizer’s MLP for each parameter:
Direction: Suggests the sign and relative direction of the update
Magnitude: Suggests the scale of the update
Update formula:
update = direction × exp(magnitude × α) × β
parameter = parameter - update
Where α and β are fixed hyperparameters (typically 0.01).
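A minimal sketch applying this rule to the MLP's two outputs per parameter (α and β defaults follow the values quoted above):

```python
import torch

def apply_update(param, direction, magnitude, alpha=0.01, beta=0.01):
    # update = direction * exp(magnitude * alpha) * beta
    update = direction * torch.exp(magnitude * alpha) * beta
    return param - update
```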
Momentum
An exponential moving average of past gradients, helping optimization “build up speed” in consistent directions.
Update rule:
momentum_t = β × momentum_{t-1} + (1 - β) × gradient_t
Typical value: β ≈ 0.9-0.99
Used by Adam, RMSprop, SGD with momentum, and learned optimizers.
Second Moment
An exponential moving average of squared gradients, used for adaptive learning rates.
Update rule:
second_moment_t = β × second_moment_{t-1} + (1 - β) × gradient_t²
Typical value: β ≈ 0.999
Used by Adam, RMSprop, and learned optimizers for gradient normalization.
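The two update rules above, written as a single sketch (all arguments are torch tensors of the same shape):

```python
def update_accumulators(grad, momentum, second_moment, beta1=0.9, beta2=0.999):
    """Exponential moving averages of gradients and squared gradients."""
    momentum = beta1 * momentum + (1 - beta1) * grad
    second_moment = beta2 * second_moment + (1 - beta2) * grad.pow(2)
    return momentum, second_moment
```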
Factored Second Moment (Adafactor-style)
An efficient approximation of the full second moment matrix using row and column factors. Instead of storing an m×n matrix, store:
Row factors (r): Vector of length m
Column factors (c): Vector of length n
Memory savings: O(m×n) → O(m + n)
Approximation: second_moment[i,j] ≈ row_factor[i] × col_factor[j]
Used by Adafactor and learned optimizers for memory efficiency.
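A hedged sketch of the factored update for an m×n weight matrix, following the approximation above (not PyLO's exact implementation):

```python
def update_factored_second_moment(grad, row_factor, col_factor, beta=0.999):
    """EMA of row/column mean squared gradients for an m x n torch tensor.

    row_factor has shape (m,), col_factor has shape (n,); the full matrix is
    approximated as second_moment[i, j] ~= row_factor[i] * col_factor[j].
    """
    g2 = grad.pow(2)
    row_factor = beta * row_factor + (1 - beta) * g2.mean(dim=1)  # shape (m,)
    col_factor = beta * col_factor + (1 - beta) * g2.mean(dim=0)  # shape (n,)
    return row_factor, col_factor
```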
Performance Concepts
CUDA Kernel
A function that runs on the GPU. Each kernel launch has overhead, so reducing kernel count improves performance.
Kernel Fusion
Combining multiple operations into a single CUDA kernel to reduce memory traffic and launch overhead.
Example: PyLO fuses feature construction, normalization, MLP evaluation, and parameter update into one kernel.
Memory Hierarchy
The different levels of memory on a GPU, with varying speeds and sizes:
| Memory Type | Size | Latency | Usage in PyLO |
|---|---|---|---|
| Registers | ~256 KB/SM | 1 cycle | Features, MLP activations |
| Shared Memory | 48-164 KB/SM | ~20 cycles | Normalization stats |
| L2 Cache | 6-40 MB | ~200 cycles | MLP weights |
| Global Memory | 40-80 GB | ~400 cycles | Parameters, gradients |
Optimization goal: Keep frequently accessed data in fast memory (registers, shared memory).
Memory Bandwidth
The rate at which data can be transferred to/from GPU memory. Often the bottleneck for learned optimizers.
Occupancy
The ratio of active warps to the maximum possible warps on a GPU. Higher occupancy generally means better GPU utilization.
Grid-Stride Loop
A CUDA programming pattern that allows a kernel to process more elements than there are threads.
```cuda
// Each thread starts at its global index and strides by the total thread
// count, so any grid size can cover all elements.
__global__ void process_kernel(int total_elements) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < total_elements;
         i += blockDim.x * gridDim.x) {
        process_element(i);  // placeholder for the per-element work
    }
}
```
Benefits: Flexible thread configuration, better instruction-level parallelism.
PyLO-Specific Terms
VeLO (Versatile Learned Optimizer)
A learned optimizer meta-trained by Google for 4000 TPU-months on diverse tasks. The most robust learned optimizer currently available.
Key properties:
29 input features
Meta-trained for long horizons (150K steps)
Works well without hyperparameter tuning
Paper: VeLO: Training Versatile Learned Optimizers by Scaling Up
small_fc_lopt
A small MLP-based learned optimizer.
Key properties:
39 input features
Meta-trained for short horizons
More memory-efficient during meta-training
Paper: Practical tradeoffs between memory, compute, and performance in learned optimizers
Distributed Optimizer Step
An optimization for multi-GPU training that distributes optimizer computation across devices instead of redundantly computing on all devices.
How it works (sketched in code after these steps):
Reduce-scatter gradients (instead of all-reduce)
Each GPU computes optimizer step for its shard
All-gather updated parameters
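A minimal sketch of this three-step pattern with `torch.distributed`, assuming flat 1-D `params`/`grads` tensors whose length is divisible by the world size; `learned_opt_step` is a hypothetical per-shard optimizer step, not PyLO's actual API:

```python
import torch
import torch.distributed as dist

def distributed_optimizer_step(params, grads, learned_opt_step):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard = params.numel() // world_size
    # 1. Reduce-scatter: each rank receives only its shard of the summed gradients
    grad_shard = torch.empty(shard, device=grads.device, dtype=grads.dtype)
    dist.reduce_scatter_tensor(grad_shard, grads)
    # 2. Run the (expensive) optimizer step on this rank's shard only
    param_shard = params[rank * shard : (rank + 1) * shard]
    new_shard = learned_opt_step(param_shard, grad_shard)
    # 3. All-gather: reassemble the full updated parameter tensor on every rank
    dist.all_gather_into_tensor(params, new_shard)
    return params
```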
Decoupled Weight Decay
Weight decay applied directly to parameters rather than through gradients. This separates regularization from gradient-based optimization.
Standard weight decay (L2 regularization):
gradient = gradient + λ × parameter
Decoupled weight decay:
parameter = (1 - λ) × parameter - learning_rate × gradient
Why it matters: Decoupled weight decay interacts better with adaptive learning rates and learned optimizers.
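A minimal sketch of the decoupled variant, following the formula above (the λ and learning-rate values are illustrative):

```python
def decoupled_weight_decay_step(param, grad, lr=0.01, wd=1e-4):
    # Decay acts directly on the parameters instead of being added to the gradient
    return (1 - wd) * param - lr * grad
```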
Training Concepts
Meta-Training vs Training
Meta-training: Training the optimizer itself on many tasks (done once, expensive)
Training (or optimization): Using the optimizer to train a model (done for each new model)
Hyperparameter Tuning
The process of searching for good hyperparameters (learning rate, weight decay, etc.) for a given task.
Traditional optimizers: Require extensive hyperparameter tuning
Learned optimizers: Work well with default settings (one of their key benefits!)
Technical Jargon
Warp
A group of 32 threads on NVIDIA GPUs that execute in lockstep. The fundamental unit of execution.
Warp shuffle: Communication between threads in a warp without using shared memory.
Thread Block (or Block)
A group of threads (typically 128-1024) that can cooperate via shared memory.
Streaming Multiprocessor (SM)
A GPU compute unit. Modern GPUs have 40-140 SMs.
Each SM has:
Registers (~256 KB)
Shared memory (48-164 KB)
L1 cache
Tensor cores (on recent GPUs)
Atomic Operation
An operation that completes without interruption. Used for thread-safe updates to shared variables.
Example: atomicAdd(&global_sum, thread_value)