RL Training
AlphaApollo integrates with verl (Volcano Engine Reinforcement Learning) for production-grade RL-based LLM post-training. It exposes PPO, GRPO, DAPO, and RLOO through a single unified entry point. This page covers each algorithm, the shared training architecture, and key configuration parameters.
Supported Algorithms
All algorithms share one entry point (verl.trainer.main_ppo). The advantage estimator is selected via the algorithm.adv_estimator config parameter; DAPO reuses the grpo estimator and additionally enables algorithm.filter_groups.
PPO (Proximal Policy Optimization)
PPO is the standard RL algorithm for LLM post-training. It uses a critic model to estimate the value function and computes advantages via Generalized Advantage Estimation (GAE).
Key characteristics:
- Requires a critic model in addition to the actor
- Uses GAE for advantage estimation
- Supports clipped surrogate objective with configurable clip ratio
- Single rollout per prompt (n=1)
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
algorithm.gamma=1.0 \
algorithm.lam=1.0 \
actor_rollout_ref.rollout.n=1 \
...
When to use:
- When you have a trained reward model
- When sample efficiency matters (PPO can learn from fewer samples)
- When you need fine-grained value estimation at each token
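The GAE computation that PPO relies on can be sketched in a few lines. This is a minimal illustration, not verl's implementation: the function name `gae_advantages` and the per-token list layout are assumptions, the value after the final token is taken as zero, and any advantage normalization the trainer may apply is omitted.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation for a single response.

    rewards[t] is the reward at token t and values[t] is the critic's
    estimate at token t. Hypothetical helper, for illustration only.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # value after the last token (assumed to be 0)
    running = 0.0      # discounted sum of TD errors accumulated so far
    for t in reversed(range(len(rewards))):
        # TD error: how much better this step was than the critic expected.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Sparse reward: only the final token carries a reward.
adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.8])
```

With gamma=1.0 and lam=1.0, as in the command above, each advantage reduces to the remaining return minus the critic's value at that token.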
GRPO (Group Relative Policy Optimization)
GRPO eliminates the critic by estimating advantages from a group of rollouts for each prompt. For each prompt, multiple responses are generated, and their rewards are compared within the group to compute relative advantages.
Key characteristics:
- No critic model needed (reduces memory and compute)
- Generates multiple responses per prompt (n > 1)
- Uses KL divergence loss against the reference model
- Advantages are normalized within each group
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.01 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
env.rollout.n=8 \
...
When to use:
- When you want simpler training without a critic
- When GPU memory is a constraint
- For agentic tasks where group comparison is natural
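The within-group normalization described above can be sketched as follows. The helper name `group_relative_advantages`, the use of the population standard deviation, and the `eps` value are assumptions made for illustration; the trainer's exact statistics may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of rollouts.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rollouts for one prompt, matching n=8 in the command above.
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(group_rewards)

# A group with identical rewards yields all-zero advantages.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Responses that beat the group average get positive advantages and the rest get negative ones, so no critic is needed to provide a baseline.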
DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)
DAPO builds on GRPO with an additional group filtering mechanism that regenerates rollouts when a group provides no learning signal (all rewards are the same).
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
algorithm.filter_groups.enable=True \
algorithm.filter_groups.max_num_gen_batches=10 \
...
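The group filtering idea can be sketched as a loop that keeps only groups with mixed rewards and regenerates until enough are collected. Everything here is hypothetical scaffolding: `collect_informative_groups`, `sample_batch`, and `target_groups` are illustrative names, and what the trainer does when the generation cap is reached is not covered here, so the sketch simply returns what it has.

```python
from typing import Callable, Dict, List


def collect_informative_groups(
    sample_batch: Callable[[], Dict[str, List[float]]],
    target_groups: int,
    max_num_gen_batches: int = 10,
) -> Dict[str, List[float]]:
    """Keep only groups whose rewards differ, regenerating as needed."""
    kept: Dict[str, List[float]] = {}
    for _ in range(max_num_gen_batches):
        for prompt_id, rewards in sample_batch().items():
            # Identical rewards give zero group-relative advantage: no signal.
            if len(set(rewards)) > 1:
                kept[prompt_id] = rewards
        if len(kept) >= target_groups:
            break
    return kept


# Two fake generation batches standing in for real rollouts.
batches = iter([
    {"p0": [1.0, 1.0], "p1": [0.0, 1.0]},
    {"p2": [0.0, 0.0], "p3": [1.0, 0.0]},
])
kept = collect_informative_groups(lambda: next(batches), target_groups=2)
```

In this toy run, the groups for p0 and p2 are dropped because every rollout received the same reward.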
RLOO (REINFORCE Leave-One-Out)
RLOO uses a leave-one-out baseline for advantage estimation: each rollout's reward is compared against the mean reward of the other rollouts for the same prompt.
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=rloo \
env.rollout.n=8 \
...
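A leave-one-out baseline can be sketched directly from its definition. The function name `rloo_advantages` is an assumption for illustration, and the sketch works on one scalar reward per rollout.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each rollout against the mean of the other rollouts."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two rollouts per prompt")
    total = sum(rewards)
    # Baseline for rollout i is the mean reward of the remaining n - 1 rollouts.
    return [r - (total - r) / (n - 1) for r in rewards]


advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because a rollout's own reward never enters its baseline, the baseline stays independent of the sample it is applied to.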
Algorithm Comparison
| Feature | PPO | GRPO | DAPO | RLOO |
|---|---|---|---|---|
| Critic required | Yes | No | No | No |
| Group rollouts (n > 1) | No | Yes | Yes | Yes |
| KL loss | Optional | Yes | Yes | Optional |
| Step-level advantage | No | No | No | No |
| Discount factor (gamma < 1) | Via GAE | N/A | N/A | N/A |
| Group filtering | No | No | Yes | No |
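For scripting, the per-algorithm overrides shown in the commands above can be collected into a lookup. This is a convenience sketch: `ALGORITHM_OVERRIDES` and `build_command` are hypothetical names, only overrides that appear on this page are included, and the options elided with `...` in the commands are left out, so the result is not a complete launch command.

```python
from typing import Dict, List, Sequence

# Overrides copied from the example commands on this page.
ALGORITHM_OVERRIDES: Dict[str, List[str]] = {
    "ppo": [
        "algorithm.adv_estimator=gae",
        "algorithm.gamma=1.0",
        "algorithm.lam=1.0",
        "actor_rollout_ref.rollout.n=1",
    ],
    "grpo": [
        "algorithm.adv_estimator=grpo",
        "actor_rollout_ref.actor.use_kl_loss=True",
        "actor_rollout_ref.actor.kl_loss_coef=0.01",
        "actor_rollout_ref.actor.kl_loss_type=low_var_kl",
        "env.rollout.n=8",
    ],
    "dapo": [
        "algorithm.adv_estimator=grpo",
        "algorithm.filter_groups.enable=True",
        "algorithm.filter_groups.max_num_gen_batches=10",
    ],
    "rloo": [
        "algorithm.adv_estimator=rloo",
        "env.rollout.n=8",
    ],
}


def build_command(algorithm: str, extra: Sequence[str] = ()) -> List[str]:
    """Return an argv list for the shared entry point plus overrides."""
    return ["python3", "-m", "verl.trainer.main_ppo",
            *ALGORITHM_OVERRIDES[algorithm], *extra]


cmd = build_command("rloo", extra=["trainer.nnodes=1"])
```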
Training Architecture
AlphaApollo uses verl's HybridFlow architecture, which combines single-controller orchestration with multi-controller execution:
┌──────────────────────────────────────────────────────┐
│ PPO Ray Trainer │
│ (Single Controller / Driver) │
├──────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Actor │ │ Rollout │ │
│ │ (FSDP) │◄──│ (vLLM) │ │
│ │ Training │ │ Generation │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ Reference │ │ Critic │ │
│ │ Model │ │ (PPO only) │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Reward │ │ Environment │ │
│ │ Model/Fn │ │ Workers │ │
│ └─────────────┘ └─────────────┘ │
└──────────────────────────────────────────────────────┘
Training Loop
For each training iteration:
- Rollout: Generate responses using the rollout engine (vLLM/SGLang) within the environment
- Reward: Compute rewards using reward model and/or custom reward functions
- Advantage: Estimate advantages (GAE for PPO, group-relative for GRPO and DAPO, leave-one-out for RLOO)
- Update: Update actor (and critic for PPO) using the clipped surrogate objective
- Sync: Synchronize updated weights to the rollout engine
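The clipped surrogate objective used in the update step can be sketched per token as below. This is a plain-Python illustration rather than the trainer's code; `clipped_surrogate_loss` and the `clip_ratio` default of 0.2 are assumptions.

```python
import math
from typing import List


def clipped_surrogate_loss(logp_new: List[float], logp_old: List[float],
                           advantages: List[float], clip_ratio: float = 0.2) -> float:
    """Mean PPO-style clipped policy loss over tokens (lower is better)."""
    losses = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # pi_new(token) / pi_old(token)
        clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
        # Take the more pessimistic of the clipped and unclipped objectives.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


# The policy doubled a token's probability; the gain is capped at 1 + clip_ratio.
loss = clipped_surrogate_loss([math.log(2.0)], [0.0], [1.0])
```

Clipping removes the incentive to move the probability ratio outside the interval around 1, which keeps each update close to the policy that generated the rollouts.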
Multi-Turn Interaction
For agentic tasks, the rollout phase involves multi-turn environment interaction:
┌───────┐ ┌───────────┐ ┌─────────────┐
│ Model │────▶│Environment│────▶│ Reward │
│(Actor)│ │ (e.g., │ │ (per-step │
│ │◀────│ Math) │◀────│ or final) │
└───────┘ └───────────┘ └─────────────┘
│ │
│ step 1..N │
│◀─────────────▶│
Each episode consists of multiple steps where the model generates actions, the environment processes them, and observations are returned.
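The step loop can be sketched with a toy environment. `GuessNumberEnv`, `run_episode`, and the `reset`/`step` interface are hypothetical stand-ins chosen for illustration; they are not AlphaApollo's environment API, and a scripted policy replaces the model.

```python
from typing import Callable, List, Tuple


class GuessNumberEnv:
    """Toy environment: the episode ends when the target number is guessed."""

    def __init__(self, target: int) -> None:
        self.target = target

    def reset(self) -> str:
        return "Guess the number."

    def step(self, action: str) -> Tuple[str, float, bool]:
        try:
            guess = int(action)
        except ValueError:
            return "Invalid action.", 0.0, False
        if guess == self.target:
            return "Correct.", 1.0, True
        hint = "higher" if guess < self.target else "lower"
        return f"Try {hint}.", 0.0, False


def run_episode(policy: Callable[[str], str], env: GuessNumberEnv,
                max_steps: int = 5) -> List[Tuple[str, str, float]]:
    """Alternate model actions and environment observations for one episode."""
    observation = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(observation)                   # model generates an action
        observation, reward, done = env.step(action)   # environment responds
        trajectory.append((action, observation, reward))
        if done:
            break
    return trajectory


guesses = iter(["3", "7", "5"])
trajectory = run_episode(lambda obs: next(guesses), GuessNumberEnv(target=5))
```

Here the reward arrives only on the final step; a per-step reward would fill in the third element of every tuple instead.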
Key Training Parameters
Learning Rate & Optimization
actor_rollout_ref.actor.optim.lr=1e-6
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1
actor_rollout_ref.actor.optim.warmup_style=cosine # or constant
actor_rollout_ref.actor.optim.weight_decay=0.01
actor_rollout_ref.actor.grad_clip=1.0
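To make the warmup settings concrete, the sketch below computes a learning rate per step under one common interpretation: linear warmup over the first lr_warmup_steps_ratio of training, followed by either cosine decay or a constant rate. That interpretation, the function name `lr_at_step`, and decay to zero are assumptions; check the scheduler actually used by the trainer before relying on exact values.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-6,
               warmup_ratio: float = 0.1, style: str = "cosine") -> float:
    """Learning rate at a given step: linear warmup, then cosine or constant."""
    warmup_steps = int(round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    if style == "constant":
        return base_lr
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_step(s, total_steps=100) for s in (0, 5, 10, 55, 100)]
```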
KL Divergence Control
A KL divergence term keeps the policy from drifting too far from the reference model. It can be applied in one of two places:
# Option 1: KL loss in the actor objective (recommended for GRPO)
actor_rollout_ref.actor.use_kl_loss=True
actor_rollout_ref.actor.kl_loss_coef=0.01
actor_rollout_ref.actor.kl_loss_type=low_var_kl
# Option 2: KL penalty in the reward
algorithm.use_kl_in_reward=True
algorithm.kl_ctrl.type=fixed
algorithm.kl_ctrl.kl_coef=0.001
KL loss types:
- kl (k1): Standard KL divergence
- abs: Absolute difference
- mse (k2): Mean squared error
- low_var_kl (k3): Low-variance KL estimator (recommended)
- full: Full KL divergence
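The per-token, sample-based estimators in this list can be sketched as follows, using the standard k1/k2/k3 forms computed from the policy and reference log-probabilities of the sampled token. `kl_penalty` is a hypothetical helper; the `full` variant needs the whole vocabulary distribution and is omitted, as is any clamping a production implementation may apply.

```python
import math


def kl_penalty(logp: float, ref_logp: float, kind: str = "low_var_kl") -> float:
    """Per-token KL estimate between the policy and the reference model."""
    log_ratio = logp - ref_logp  # log(pi / pi_ref) for the sampled token
    if kind == "kl":             # k1: unbiased, but can be negative per sample
        return log_ratio
    if kind == "abs":
        return abs(log_ratio)
    if kind == "mse":            # k2: half the squared log-ratio
        return 0.5 * log_ratio ** 2
    if kind == "low_var_kl":     # k3: non-negative, lower variance
        neg = -log_ratio         # log(pi_ref / pi)
        return math.exp(neg) - neg - 1.0
    raise ValueError(f"unsupported kl_loss_type: {kind}")


estimates = {k: kl_penalty(-1.0, -1.5, k) for k in ("kl", "abs", "mse", "low_var_kl")}
```

All four estimates are zero when the two log-probabilities match, and the k3 form stays non-negative for every sample, which is one reason it is the recommended choice.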
Invalid Action Penalty
For agentic environments, penalize malformed actions:
actor_rollout_ref.actor.use_invalid_action_penalty=True
actor_rollout_ref.actor.invalid_action_penalty_coef=0.1
Data Preparation
Before training, prepare your dataset using the data preprocessing scripts:
# Prepare informal math dataset
python3 -m examples.data_preprocess.prepare_informal_math \
--data_source DigitalLearningGmbH/MATH-lighteval
# Prepare text-based environment dataset
python3 -m examples.data_preprocess.prepare \
--mode 'text' \
--train_data_size 16 \
--val_data_size 128
Datasets are expected in Parquet format with at least a prompt column.
Multi-GPU and Multi-Node Training
AlphaApollo supports distributed training across multiple GPUs and nodes:
# Single node, 2 GPUs
trainer.n_gpus_per_node=2 trainer.nnodes=1
# Multi-node (4 nodes, 8 GPUs each)
trainer.n_gpus_per_node=8 trainer.nnodes=4
Key distributed training settings:
# Tensor parallelism for rollout
actor_rollout_ref.rollout.tensor_model_parallel_size=2
# Sequence parallelism for training
actor_rollout_ref.actor.ulysses_sequence_parallel_size=2
# FSDP offloading (trade speed for memory)
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
actor_rollout_ref.ref.fsdp_config.param_offload=True # recommended for ref model >7B
For reference models larger than 7B, enable param_offload to reduce peak GPU memory usage during training.
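Before launching, it can help to sanity-check how the parallelism settings fit the GPU count. The sketch below assumes that both the rollout tensor-parallel size and the Ulysses sequence-parallel size must divide the total number of GPUs; that constraint and the helper `check_parallel_layout` are illustrative assumptions, not documented behavior.

```python
from typing import Dict


def check_parallel_layout(n_gpus_per_node: int, nnodes: int,
                          rollout_tp_size: int, ulysses_sp_size: int) -> Dict[str, int]:
    """Derive the GPU count and group counts, rejecting sizes that do not divide it."""
    world_size = n_gpus_per_node * nnodes
    for name, size in (("tensor_model_parallel_size", rollout_tp_size),
                       ("ulysses_sequence_parallel_size", ulysses_sp_size)):
        if size < 1 or world_size % size != 0:
            raise ValueError(f"{name}={size} does not divide {world_size} GPUs")
    return {
        "world_size": world_size,
        "rollout_replicas": world_size // rollout_tp_size,
        "sequence_parallel_groups": world_size // ulysses_sp_size,
    }


# The multi-node example above: 4 nodes with 8 GPUs each.
layout = check_parallel_layout(n_gpus_per_node=8, nnodes=4,
                               rollout_tp_size=2, ulysses_sp_size=2)
```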
Related Pages
- RL Training Config — Detailed parameter reference
- Configuration Overview — Hydra basics and CLI overrides
- SFT — Supervised Fine-Tuning process
- Evolving Pipeline — Inference-time self-improvement via policy-verifier loops