Skip to main content

Algorithms

AlphaApollo supports multiple training and inference paradigms for LLM post-training. This section covers the core algorithms and pipelines available in the framework.

Training Pipelines

PipelineDescriptionEntry Point
Supervised Fine-Tuning (SFT)Train on curated instruction-response pairsverl.trainer.fsdp_sft_trainer
RL TrainingReinforcement learning with PPO, GRPO, etc.verl.trainer.main_ppo
Evolving PipelineInference-time self-improvement via policy-verifier loopsexamples/evolving/evolving_main.py

Supported RL Algorithms

Algorithmadv_estimatorCritic RequiredGroup RolloutsKey Feature
PPOgaeYesNoStandard RL with value function
GRPOgrpoNoYes (n > 1)Group-relative advantage estimation
DAPOgrpo + filterNoYes (n > 1)Group filtering for better learning signal
RLOOrlooNoYes (n > 1)Leave-one-out baseline

Typical Workflow

A typical AlphaApollo post-training workflow follows these stages:

┌─────────┐     ┌──────────────┐     ┌──────────────────┐
│ SFT │────▶│ RL Training │────▶│ Evolving │
│ │ │ (GRPO) │ │ (Self-Improve) │
└─────────┘ └──────────────┘ └──────────────────┘
Stage 1 Stage 2 Stage 3
  1. SFT — Fine-tune a pretrained model on task-specific instruction-response data
  2. RL Training — Further optimize the model using environment rewards (GRPO, PPO)
  3. Evolving — Iteratively improve solutions at inference time through policy-verifier self-improvement loops
tip

Each stage is optional — you can start from any point depending on your needs.

Example Scripts

AlphaApollo provides ready-to-use scripts for various environments and algorithms:

RL Training

# GRPO on MATH-lighteval with Qwen2.5-3B-Instruct and evaluate on MATH-500
cd examples/rl
bash run_rl_informal_math_tool.sh

SFT

# SFT on NuminaMath-TIR with Qwen2.5-3B-Instruct
bash examples/sft/run_sft_informal_math_tool.sh

Evolving

# Before running the self-evolution scripts, make sure to serve the corresponding number of models.
python alphaapollo/utils/ray_serve_llm.py --model_path Qwen/Qwen3-4B-Instruct-2507 --gpus "0,1" --port 8000 --model_id "qwen3_4b_inst"

# single-model evolution
python3 -m alphaapollo.workflows.evo \
--preprocess.data_source=math-ai/aime24 \
--run.dataset_name=aime24 \
--policy_model_cfg.model_name=qwen3_4b_inst \
--policy_model_cfg.base_url=http://localhost:8000/v1 \
--verifier_cfg.model_name=qwen3_4b_inst \
--verifier_cfg.base_url=http://localhost:8000/v1