Models

class pylo.models.MetaMLP(*args, **kwargs)[source]

A Multi-Layer Perceptron model used for meta-learning.

This MLP architecture is designed specifically for learned optimizers, with configurable input size, hidden layer size, and number of hidden layers. It follows the architecture described for small_fc_mlp_lopt in the paper "Practical Tradeoffs Between Memory, Compute and Performance in Learned Optimizers". The model implements PyTorch's Module interface and can be pushed to or loaded from the Hugging Face Hub.

__init__(input_size, hidden_size, hidden_layers)[source]

Initialize the MetaMLP model.

Parameters:
  • input_size (int) – The size of the input features.

  • hidden_size (int) – The size of the hidden layers.

  • hidden_layers (int) – The number of hidden layers in the network.

forward(x)[source]

Forward pass through the network.

Parameters:

x (torch.Tensor) – Input tensor of shape [batch_size, input_size].

Returns:

Output tensor of shape [batch_size, 2].

Return type:

torch.Tensor
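
The architecture described above can be sketched in plain PyTorch as follows. This is an illustrative re-implementation, not the actual pylo source: the layer names, the ReLU activation, and the use of nn.Sequential are assumptions; only the constructor signature and the [batch_size, 2] output shape come from the documentation.

```python
import torch
import torch.nn as nn

class MetaMLPSketch(nn.Module):
    """Sketch of MetaMLP: input_size -> hidden_layers x hidden_size -> 2."""

    def __init__(self, input_size, hidden_size, hidden_layers):
        super().__init__()
        layers = [nn.Linear(input_size, hidden_size), nn.ReLU()]
        for _ in range(hidden_layers - 1):
            layers += [nn.Linear(hidden_size, hidden_size), nn.ReLU()]
        # Final head produces the documented [batch_size, 2] output.
        layers.append(nn.Linear(hidden_size, 2))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

mlp = MetaMLPSketch(input_size=39, hidden_size=32, hidden_layers=2)
out = mlp(torch.randn(8, 39))
assert out.shape == (8, 2)
```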

class pylo.models.VeLOMLP(*args, **kwargs)[source]

Versatile Learned Optimizer MLP (VeLO-MLP).

This class implements a multi-layer perceptron based on Google's VeLO paper, which can adapt its parameters based on a control vector. It maintains two sets of parameters:
  • Storage parameters (with underscore suffix) that maintain a collection of possible parameter values

  • Actual parameters that are used during forward computation

The model is designed to be used as a learned optimizer where parameters can be dynamically updated based on optimization context, implementing the versatile parameter adaptation approach described in the VeLO paper.

__init__(param_inits=256, input_size=30, hidden_size=4, hidden_layers=1, output_size=3)[source]

Initialize the VeLOMLP model.

Parameters:
  • param_inits (int, optional) – Number of parameter initializations to maintain in storage. Defaults to 256.

  • input_size (int, optional) – Size of the input dimension. Defaults to 30.

  • hidden_size (int, optional) – Size of the hidden dimensions. Defaults to 4.

  • hidden_layers (int, optional) – Number of hidden layers. Defaults to 1.

  • output_size (int, optional) – Size of the output dimension. Defaults to 3.

update_params(control)[source]

Update the actual parameters based on the control vector.

This method computes a weighted average of the storage parameters based on the control vector, and updates the actual parameters with the result. The weighted average is scaled by a factor of 100.0.

Parameters:

control (torch.Tensor) – Control vector that determines the weights for parameter averaging. Shape should be compatible with the first dimension of storage parameters.

Returns:

None
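
The weighted-average update can be sketched as below. The storage-tensor shape, the normalization of the control vector, and the einsum formulation are assumptions for illustration; only the averaging over the first (param_inits) dimension and the 100.0 scale factor come from the description above.

```python
import torch

param_inits = 256
# Storage parameters: one candidate value per initialization slot
# (the trailing [4, 30] weight shape is an assumption).
weight_ = torch.randn(param_inits, 4, 30)

def update_params(control):
    # Normalize so the result is a weighted *average* (assumption),
    # then scale by the documented factor of 100.0.
    weights = control / control.sum()
    return 100.0 * torch.einsum('p,pij->ij', weights, weight_)

control = torch.rand(param_inits)
weight = update_params(control)
assert weight.shape == (4, 30)
```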

forward(x)[source]

Forward pass through the network.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

Output of the network

Return type:

torch.Tensor

class pylo.models.VeLORNN(*args, **kwargs)[source]

VeLO RNN module that processes tensor-specific features as described in Google’s Versatile Learned Optimizer paper.

This module implements the per-tensor RNN component of the VeLO architecture. It processes tensor-specific features and outputs control vectors and learning rate multipliers that determine how parameters are adapted during optimization.

The VeLORNN applies a feature mixing network (when enabled) followed by an LSTM to produce control vectors that weight different parameter initializations and learning rate multipliers that scale step sizes.

__init__(input_size=30, lstm_hidden_size=512, param_inits=256, mix_layers=True)[source]

Initialize the VeLORNN module.

Parameters:
  • input_size (int, optional) – Dimension of input features. Defaults to 30.

  • lstm_hidden_size (int, optional) – Size of LSTM hidden state. Defaults to 512.

  • param_inits (int, optional) – Number of parameter initializations to control. Determines the dimension of the output control vector. Defaults to 256.

  • mix_layers (bool, optional) – Whether to use feature mixing layers before LSTM. Defaults to True.

forward(x, state)[source]

Forward pass of the VeLORNN.

This method processes tensor-specific features through optional mixing layers and an LSTM to produce control vectors and learning rate multipliers.

Parameters:
  • x (torch.Tensor) – Input tensor containing tensor-specific features

  • state (tuple) – Previous LSTM state (h, c)

Returns:

Tuple containing:
  • controls (torch.Tensor): Control vector for weighting parameter initializations

  • lr_mult (torch.Tensor): Learning rate multiplier for scaling step size

  • state (tuple): Updated LSTM state for next iteration

Return type:

tuple
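
The per-tensor RNN described above can be sketched as follows. This is not the pylo implementation: the use of nn.LSTMCell, the mixing-layer depth, and the two linear output heads are assumptions; the constructor defaults and the (controls, lr_mult, state) outputs come from the documentation.

```python
import torch
import torch.nn as nn

class VeLORNNSketch(nn.Module):
    """Sketch of VeLORNN: optional feature mixing, an LSTM cell, and two
    heads producing the control vector and the learning-rate multiplier."""

    def __init__(self, input_size=30, lstm_hidden_size=512,
                 param_inits=256, mix_layers=True):
        super().__init__()
        # Feature mixing before the LSTM (single layer here is an assumption).
        self.mix = (nn.Sequential(nn.Linear(input_size, input_size), nn.ReLU())
                    if mix_layers else nn.Identity())
        self.cell = nn.LSTMCell(input_size, lstm_hidden_size)
        self.control_head = nn.Linear(lstm_hidden_size, param_inits)
        self.lr_head = nn.Linear(lstm_hidden_size, 1)

    def forward(self, x, state):
        h, c = self.cell(self.mix(x), state)
        # Control vector weights the parameter initializations;
        # lr_mult scales the step size.
        return self.control_head(h), self.lr_head(h), (h, c)

rnn = VeLORNNSketch()
h0 = torch.zeros(1, 512)
controls, lr_mult, state = rnn(torch.randn(1, 30), (h0, h0.clone()))
assert controls.shape == (1, 256) and lr_mult.shape == (1, 1)
```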