Command-line usage guide for minimax.train
Parsing of command-line arguments is handled by `Parsnip`. You can quickly generate batches of training commands from a JSON configuration file using `minimax.config.make_cmd`.
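As a rough sketch, a typical workflow might look like the following; the `--config` flag name, the config path, and the assumption that `make_cmd` prints the generated commands to stdout are all assumptions rather than verbatim documentation, so consult the minimax README for the exact interface.

```bash
# Hypothetical sketch: flag name and config path are assumptions, not verbatim from the docs.
# make_cmd reads a JSON configuration and emits the corresponding minimax.train commands.
python -m minimax.config.make_cmd --config config/maze/plr.json > train_cmds.sh

# Each generated line is a full `python -m minimax.train ...` invocation.
bash train_cmds.sh
```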
| Argument | Description |
| --- | --- |
| `seed` | Random seed; should be unique per experimental run |
| `agent_rl_algo` | Base RL algorithm used for training (e.g. PPO) |
| `n_total_updates` | Total number of updates for the training run |
| `train_runner` | Which training runner to use, e.g. `dr`, `plr`, or `paired` |
| `n_devices` | Number of devices over which to shard the environment batch dimension |
| `n_students` | Number of students in the autocurriculum |
| `n_parallel` | Number of parallel environments |
| `n_eval` | Number of parallel trials per environment (the environment batch dimension is then `n_parallel*n_eval`) |
| `n_rollout_steps` | Number of steps per rollout (used for each update cycle) |
| `lr` | Learning rate |
| `lr_final` | Final learning rate, based on a linear schedule. Defaults to `None`, corresponding to no schedule. |
| `lr_anneal_steps` | Number of steps over which to linearly anneal from `lr` to `lr_final` |
| `student_value_coef` | Value loss coefficient |
| `student_entropy_coef` | Entropy bonus coefficient |
| `student_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed-ups) |
| `max_grad_norm` | Clip gradients beyond this magnitude |
| `adam_eps` | Value of the $\epsilon$ numerical-stability constant for Adam |
| `discount` | Discount factor $\gamma$ for the student's RL optimization |
| `n_unroll_rollout` | Unroll rollout scans this many times (can lead to speed-ups) |
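To make this concrete, the sketch below assembles several of these top-level arguments into a single `minimax.train` invocation. The flag names come from the table above; the numeric values are purely illustrative, and environment-selection flags (documented separately under the environment-specific arguments) are omitted.

```bash
# Illustrative values only; environment-specific flags are omitted here.
python -m minimax.train \
  --seed=1 \
  --agent_rl_algo=ppo \
  --train_runner=dr \
  --n_total_updates=30000 \
  --n_students=1 \
  --n_parallel=32 \
  --n_eval=1 \
  --n_rollout_steps=256 \
  --lr=3e-4 \
  --discount=0.995 \
  --max_grad_norm=0.5 \
  --adam_eps=1e-5
```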
Logging arguments

| Argument | Description |
| --- | --- |
| `verbose` | Controls verbosity of console output during training |
| `track_env_metrics` | Track per-rollout-batch environment metrics if `True` |
| `log_dir` | Path to the directory storing all experiment folders |
| `xpid` | Unique name for the experiment folder, stored in `--log_dir` |
| `log_interval` | Log training statistics every this many rollout cycles |
| `wandb_base_url` | Base API URL if logging with `wandb` |
| `wandb_api_key` | API key for `wandb` |
| `wandb_entity` | `wandb` entity associated with the experiment run |
| `wandb_project` | `wandb` project for the experiment run |
| `wandb_group` | `wandb` group for the experiment run |
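The sketch below shows how these logging flags might be appended to a training command. The `log_dir`, `xpid`, and `wandb` values are placeholders to replace with your own paths and account details, and boolean flags are written here as `=True`, which is an assumption about how Parsnip parses booleans.

```bash
# Placeholder logging settings appended to a minimax.train command.
python -m minimax.train \
  --train_runner=dr \
  --log_dir=~/logs/minimax \
  --xpid=dr_maze_seed1 \
  --log_interval=10 \
  --track_env_metrics=True \
  --wandb_project=my_project \
  --wandb_entity=my_entity \
  --wandb_group=dr_baseline
```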
Checkpointing arguments

| Argument | Description |
| --- | --- |
| `checkpoint_interval` | Checkpoint the latest model every this many rollout cycles |
| `from_last_checkpoint` | Begin training from the latest `checkpoint.pkl`, if any, in the experiment folder |
| `archive_interval` | Save an additional archived checkpoint of the model every this many rollout cycles |
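Checkpointing and resumption might then look like the sketch below. The interval values and the experiment name are illustrative, and passing `from_last_checkpoint=True` assumes the flag takes an explicit boolean value rather than acting as a bare switch.

```bash
# Write a checkpoint every 25 rollout cycles and archive one every 500 (illustrative values).
python -m minimax.train \
  --xpid=dr_maze_seed1 \
  --checkpoint_interval=25 \
  --archive_interval=500

# Resume the same experiment from its latest checkpoint.pkl, if one exists.
python -m minimax.train \
  --xpid=dr_maze_seed1 \
  --from_last_checkpoint=True
```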
Evaluation arguments

| Argument | Description |
| --- | --- |
| `test_env_names` | Comma-separated list of test environment names to evaluate on |
| `test_n_episodes` | Average test results over this many episodes per test environment |
| `test_agent_idxs` | Test agents at these indices (csv of indices or `*` for all indices) |
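A sketch of how these evaluation flags might be set is below; the environment names are assumed examples, so substitute the test environments that actually exist in your installation.

```bash
# Assumed environment names; replace with the test environments available in your setup.
# The '*' is quoted so the shell does not expand it as a glob.
python -m minimax.train \
  --test_env_names=Maze-SixteenRooms,Maze-Labyrinth \
  --test_n_episodes=10 \
  --test_agent_idxs='*'
```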
PPO arguments

These arguments activate when `--agent_rl_algo=ppo`.
| Argument | Description |
| --- | --- |
| `student_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `student_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `student_ppo_clip_eps` | Clip coefficient for PPO |
| `student_ppo_clip_value_loss` | Perform value clipping if `True` |
| `gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
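As an illustration, PPO settings for the student might be passed as below; the values are common PPO hyperparameter choices rather than values prescribed by these docs.

```bash
# Illustrative PPO hyperparameters for the student agent.
python -m minimax.train \
  --agent_rl_algo=ppo \
  --student_ppo_n_epochs=5 \
  --student_ppo_n_minibatches=1 \
  --student_ppo_clip_eps=0.2 \
  --student_ppo_clip_value_loss=True \
  --gae_lambda=0.98
```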
PAIRED arguments

The arguments in this section activate when `--train_runner=paired`.
| Argument | Description |
| --- | --- |
| `teacher_lr` | Learning rate for the teacher |
| `teacher_lr_final` | Anneal the teacher's learning rate to this value (defaults to `teacher_lr`) |
| `teacher_lr_anneal_steps` | Number of steps over which to linearly anneal from `teacher_lr` to `teacher_lr_final` |
| `teacher_discount` | Discount factor $\gamma$ for the teacher's RL optimization |
| `teacher_value_loss_coef` | Value loss coefficient |
| `teacher_entropy_coef` | Entropy bonus coefficient |
| `teacher_n_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed-ups) |
| `ued_score` | Name of the UED objective, e.g. `relative_regret` |
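A minimal sketch of a PAIRED run using these flags is shown below. The values are illustrative; `n_students=2` reflects PAIRED's use of a protagonist and an antagonist student alongside the teacher.

```bash
# Illustrative PAIRED configuration: two students plus a teacher optimizing relative regret.
python -m minimax.train \
  --train_runner=paired \
  --n_students=2 \
  --ued_score=relative_regret \
  --teacher_lr=3e-4 \
  --teacher_discount=0.995 \
  --teacher_entropy_coef=0.01 \
  --teacher_value_loss_coef=0.5
```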
These PPO-specific arguments for teacher optimization further activate when `--agent_rl_algo=ppo`.
| Argument | Description |
| --- | --- |
| `teacher_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `teacher_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `teacher_ppo_clip_eps` | Clip coefficient for PPO |
| `teacher_ppo_clip_value_loss` | Perform value clipping if `True` |
| `teacher_gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
PLR and ACCEL arguments

The arguments in this section activate when `--train_runner=plr`.
| Argument | Description |
| --- | --- |
| `ued_score` | Name of the UED objective (aka the PLR scoring function) |
| `plr_replay_prob` | Replay probability |
| `plr_buffer_size` | Size of the level replay buffer |
| `plr_staleness_coef` | Staleness coefficient |
| `plr_temp` | Score distribution temperature |
| `plr_use_score_ranks` | Use rank-based prioritization (rather than proportional) |
| `plr_min_fill_ratio` | Only replay once the level replay buffer is filled above this ratio |
| `plr_use_robust_plr` | Use robust PLR (i.e. only update the policy on replay levels) |
| `plr_force_unique` | Force level replay buffer members to be unique |
| `plr_use_parallel_eval` | Use Parallel PLR or Parallel ACCEL (if `plr_mutation_fn` is set) |
| `plr_mutation_fn` | If set, PLR becomes ACCEL. Use `'default'` for the default mutation operator per environment. |
| `plr_n_mutations` | Number of applications of `plr_mutation_fn` per mutation cycle |
| `plr_mutation_criterion` | How replay levels are selected for mutation (e.g. `batch`, `easy`, `hard`) |
| `plr_mutation_subsample_size` | Number of replay levels selected for mutation according to the criterion (ignored if using the `batch` criterion) |
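To make the PLR-to-ACCEL relationship concrete, the sketch below first configures a robust PLR run and then turns it into ACCEL by adding a mutation function. All numeric values are illustrative, and boolean flags are again written as `=True` on the assumption that Parsnip accepts that form.

```bash
# Illustrative robust PLR settings.
python -m minimax.train \
  --train_runner=plr \
  --plr_replay_prob=0.8 \
  --plr_buffer_size=4000 \
  --plr_staleness_coef=0.3 \
  --plr_temp=0.1 \
  --plr_use_score_ranks=True \
  --plr_use_robust_plr=True

# Adding a mutation function turns PLR into ACCEL (illustrative values).
python -m minimax.train \
  --train_runner=plr \
  --plr_use_robust_plr=True \
  --plr_mutation_fn=default \
  --plr_n_mutations=20 \
  --plr_mutation_criterion=batch
```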
Environment-specific arguments
See the AMaze docs for details on how to specify training, evaluation, and teacher-specific environment parameters via the command line.