Evolving Pipeline

The Evolving Pipeline is AlphaApollo's inference-time self-improvement framework. It runs a policy–verifier loop across multiple rounds: the Policy Agent generates a candidate solution, the Verifier Agent evaluates it, and the result is stored in Solution Memory for the next round. No model weights are updated — improvement comes entirely from structured feedback and accumulated solution history.

Terminology

This feature is called the Evolving Pipeline throughout AlphaApollo's documentation. External papers and scripts may also use "self-improvement loop", "self-evolving", or "evo" — all refer to the same system.

Overview

The evolving pipeline implements a multi-round self-improvement loop:

┌─────────────────────────────────────────────────────────────┐
│                    Evolving Loop                             │
│                                                             │
│  For each problem × evolving_round:                         │
│                                                             │
│  ┌──────────┐    ┌───────────────┐    ┌──────────────────┐  │
│  │  Policy   │───▶│  Environment  │───▶│    Verifier      │  │
│  │  Agent    │    │  (tool use,   │    │  (evaluate &     │  │
│  │          │◀───│   code exec)  │◀───│   report)        │  │
│  └──────────┘    └───────────────┘    └──────────────────┘  │
│       │                                        │            │
│       ▼                                        ▼            │
│  ┌────────────────────────────────────────────────────┐     │
│  │              Solution Memory                        │     │
│  │  (retains top solutions across rounds)              │     │
│  └────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────┘

Key Concepts

Policy Agent

The policy agent generates solutions to problems. It can use tools (Python code execution, RAG retrieval) and iterates across multiple steps within each round.

Verifier Agent

The verifier agent evaluates policy-generated solutions. It:

Inspects <answer> tags in the policy's output
Can execute Python code to check mathematical correctness
Returns a structured <report> with pass/fail judgment
Multiple verifier instances run in parallel for majority voting

Solution Memory

The solution memory retains top-scoring solutions across evolving rounds, enabling the policy to learn from its best past attempts:

Memory Type	Description
`simple`	Retains recent history (FIFO)
`score`	Retains highest-scoring solutions
`ndimensional`	Multi-dimensional scoring with `scored_history_length` control

Execution Modes

Mode	Description
Agentic	Interleaves verification when `<answer>` tags appear. Force-verifies on the final step of each round. Advances to the next round after verification.
Mechanical	Runs a fixed number of steps without interleaved verification. Performs a single verifier session at the end of each round.

Pipeline Steps

1. Data Loading

Problems are loaded from processed parquet files via examples/evolving/utils/dataset_loader.py:

# Load problems from parquet
problems = load_dataset(
    data_root="./data/math-ai/",
    dataset_name="aime24",
    file_name="test.parquet"
)

2. Environment Setup

The pipeline creates separate environments for policy and verifier roles:

# Policy environment: generates solutions
policy_env = InformalMathEnvironmentManager(
    max_steps=4,
    history_length=4,
    memory_type="simple",
    enable_python_code=True,
    enable_local_rag=True
)

# Verifier environment: evaluates solutions
verifier_env = InformalMathEnvironmentManager(
    max_steps=4,
    history_length=4,
    memory_type="simple",
    enable_python_code=True
)

3. Evolving Loop

For each problem and each evolving round:

Round 1:  Policy generates solution → Verifier evaluates → Update memory
Round 2:  Policy generates (with memory) → Verifier evaluates → Update memory
  ...
Round N:  Policy generates (with rich memory) → Verifier evaluates → Final result

Each round benefits from the accumulated solution memory, allowing the policy to refine its approach based on previous attempts and verifier feedback.

4. Verification

The verifier runs multiple instances in parallel and uses majority voting:

verifier_env_num: 5  # Odd number for majority voting
concurrency:
  verifier_max_workers: 5

tip

Use an odd number for verifier_env_num (e.g., 3, 5, 7) to ensure majority voting always produces a definitive result.

5. Output Collection

The pipeline produces detailed outputs:

outputs/<dataset>/<tag>/<model>/test_*/
├── step_outputs/        # Per-step trajectory data
├── problem_results/     # Per-problem results with correctness
└── metrics.json         # Aggregate metrics

Metrics tracked:

Policy accuracy: Does the policy's answer match ground truth?
Verifier accuracy: Does the verifier's judgment agree with ground truth?
Success rate: Fraction of problems solved across evolving rounds

Single-Model Evolving

The standard single-model setup uses one policy model and one verifier model (can be the same model):

# 1. Host the model
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --served-model-name qwen25_7b_inst \
    --port 8000

# 2. Run evolving
python examples/evolving/evolving_main.py \
    --config configs/vllm_informal_math.yaml

Config highlights:

env.config.informal_math_evolving:
  evolving_round: 10        # 10 rounds of self-improvement
  enable_verify: true        # Enable verifier
  policy_env.max_steps: 4    # 4 steps per round
  policy_env.enable_python_code: true

policy_model_cfg:
  temperature: 0.7           # Higher for diversity
  max_tokens: 8192

verifier_cfg:
  temperature: 0.4           # Lower for precision
  max_tokens: 4096

Multi-Model K-Branch Evolving

K-branch evolving runs multiple policy models in parallel, with a shared solution memory for cross-pollination:

┌─────────────────────────────────────────────────┐
│           Shared Solution Memory                 │
│     (ThreadSafeSolutionMemory)                  │
├─────────┬────────────┬────────────┬─────────────┤
│         │            │            │             │
│  Branch 1     Branch 2     Branch 3    ...     │
│  (Qwen-7B)   (Qwen-14B)  (Qwen-3B)           │
│  Policy +     Policy +     Policy +             │
│  Verifier     Verifier     Verifier             │
└─────────┴────────────┴────────────┴─────────────┘

Key benefits:

Different models bring different strengths and perspectives
Cross-pollination via shared memory accelerates convergence
Branches run in parallel for efficiency

# 1. Host multiple models
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-4B-Instruct \
    --served-model-name qwen3_4b_inst \
    --port 8000 &

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-14B-Instruct \
    --served-model-name qwen2_5_14b_inst \
    --port 8001 &

# 2. Run multi-model evolving
python examples/evolving/evolving_multi_models.py \
    --config configs/vllm_informal_math_multi_models.yaml

Config highlights:

concurrency:
  branch_max_workers: 2     # Run 2 branches in parallel

branches:
  - branch_id: "qwen25_branch"
    policy_model_cfg:
      model_name: "qwen2_5_14b_inst"
      base_url: "http://localhost:8001/v1"
      temperature: 0.7
  - branch_id: "qwen3_branch"
    policy_model_cfg:
      model_name: "qwen3_4b_inst"
      base_url: "http://localhost:8000/v1"
      temperature: 0.7

Concurrency and Parallelism

The evolving pipeline supports multiple levels of parallelism:

Level	Parameter	Description
Problem-level	`problem_max_workers`	Process multiple problems simultaneously
Verifier-level	`verifier_max_workers`	Run multiple verifier instances per problem
Branch-level	`branch_max_workers`	Run multiple model branches in parallel
Environment-level	`policy_env_num`	Multiple policy environments per problem

Recommended settings for a single GPU:

concurrency:
  verifier_max_workers: 5
  problem_max_workers: 30
  branch_max_workers: 2  # multi-model only

Evolving Config — Detailed configuration reference
RL Training — RL algorithms for post-training
SFT — Supervised Fine-Tuning process
Configuration Overview — Hydra basics and CLI overrides

Overview​

Key Concepts​

Policy Agent​

Verifier Agent​

Solution Memory​

Execution Modes​

Pipeline Steps​

1. Data Loading​

2. Environment Setup​

3. Evolving Loop​

4. Verification​

5. Output Collection​

Single-Model Evolving​

Multi-Model K-Branch Evolving​

Concurrency and Parallelism​

Related Pages​