Skip to main content

Configuration

AlphaApollo uses Hydra as its configuration management framework. All training, generation, and evaluation entry points are driven by composable YAML config files with full support for CLI overrides.

Hydra Basics

Each trainer entry point is decorated with @hydra.main, which loads a default YAML config and merges any command-line overrides on top.

Entry PointConfig FileDescription
verl.trainer.main_ppoverl/trainer/config/ppo_trainer.yamlPPO / GRPO RL training
verl.trainer.main_generationverl/trainer/config/generation.yamlOffline generation / inference
verl.trainer.fsdp_sft_trainerverl/trainer/config/sft_trainer.yamlSupervised Fine-Tuning
verl.trainer.main_evalverl/trainer/config/evaluation.yamlEvaluation pipeline

For example, the PPO trainer is launched as:

# verl/trainer/main_ppo.py
@hydra.main(config_path="config", config_name="ppo_trainer")
def main(config):
...

CLI Overrides

Hydra allows you to override any config parameter directly from the command line using dot-separated paths. This is the primary way to customize training runs without editing YAML files.

Basic Override Syntax

python3 -m verl.trainer.main_ppo \
data.train_batch_size=16 \
data.max_prompt_length=4096 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-1.5B-Instruct \
algorithm.adv_estimator=grpo

Common Override Patterns

Nested keys — use dot notation to reach any depth:

actor_rollout_ref.actor.optim.lr=1e-6
actor_rollout_ref.rollout.val_kwargs.temperature=0.4

Lists — use bracket syntax:

trainer.logger=['console','wandb']

Boolean values:

actor_rollout_ref.actor.use_kl_loss=True
env.informal_math.enable_python_code=true

Adding new keys — prefix with +:

+data.response_dict_keys=['answer']

Null values:

trainer.default_hdfs_dir=null

Variable Interpolation

Config files use OmegaConf's ${...} syntax for cross-referencing values, keeping configs DRY:

critic:
rollout_n: ${actor_rollout_ref.rollout.n}
ppo_epochs: ${actor_rollout_ref.actor.ppo_epochs}
shuffle: ${actor_rollout_ref.actor.shuffle}
loss_agg_mode: ${actor_rollout_ref.actor.loss_agg_mode}

This means changing actor_rollout_ref.rollout.n automatically updates critic.rollout_n as well.

Config File Structure

A typical RL training config (ppo_trainer.yaml) is organized into the following top-level sections:

data:              # Dataset paths, tokenization, batch sizes
actor_rollout_ref: # Actor model, rollout engine, reference model
critic: # Critic model (for PPO)
reward_model: # Reward model settings
algorithm: # RL algorithm hyperparameters
trainer: # Training loop, logging, checkpointing
env: # Environment configuration
ray_init: # Ray cluster settings

See the dedicated pages for detailed breakdowns:

Recipe Configs

Algorithm-specific recipes inherit from the base ppo_trainer.yaml using Hydra's defaults mechanism:

# recipe/dapo/config/dapo_trainer.yaml
defaults:
- ppo_trainer
- _self_

This allows recipes to only specify the parameters that differ. For example:

RecipeKey Overrides
DAPOfilter_groups.enable=True, custom reward manager
SPPOsppo_eta parameter
SPINdpo_beta parameter
PRIMERLOO advantage estimator, custom reward model

Environment Variables

Several environment variables affect training behavior:

# Debugging
export HYDRA_FULL_ERROR=1 # Show full Hydra error traces

# Ray
export RAY_DISABLE_DASHBOARD=1 # Disable Ray dashboard
export RAY_USAGE_STATS_ENABLED=0 # Disable telemetry
export RAY_memory_usage_threshold=0.98 # OOM kill threshold

# vLLM
export VLLM_ATTENTION_BACKEND=XFORMERS # Attention backend
export CUDA_VISIBLE_DEVICES=0,1,2,3 # GPU selection

Typical Training Launch

A complete GRPO training launch typically looks like:

set -x
ENGINE=${1:-vllm}

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/train.parquet \
data.val_files=$HOME/data/test.parquet \
data.train_batch_size=8 \
data.max_prompt_length=4096 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-1.5B-Instruct \
actor_rollout_ref.rollout.name=$ENGINE \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
env.env_name=informal_math_training \
env.max_steps=4 \
env.rollout.n=8 \
trainer.n_gpus_per_node=2 \
trainer.nnodes=1 \
trainer.total_epochs=1
Debugging Hydra errors

Set HYDRA_FULL_ERROR=1 before launching to get the full Python traceback instead of Hydra's condensed error output.