Agent System

AlphaApollo is built around an environment-driven, multi-turn agentic reasoning system that follows the Gym-style interface pattern. At its core, a language model interacts with a structured environment over multiple turns: at each step the model produces an action (potentially including tool calls), the environment executes it, returns an observation, and the loop continues until the problem is solved or a budget is exhausted.
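
The loop described above can be sketched in a few lines. This is a minimal illustration, not AlphaApollo's implementation: `StepOutput`, `ToyEnv`, and `run_episode` are hypothetical names, and a scripted function stands in for the language model.

```python
from dataclasses import dataclass, field
from typing import Callable, Tuple


@dataclass
class StepOutput:
    # Hypothetical container; the real step output type may differ.
    observation: str
    reward: float
    done: bool
    metadata: dict = field(default_factory=dict)


class ToyEnv:
    """Toy environment: the episode is solved when the action is 'answer'."""

    def reset(self) -> str:
        return "Question: say 'answer' to finish."

    def step(self, action: str) -> StepOutput:
        solved = action == "answer"
        obs = "Correct." if solved else "Try again."
        return StepOutput(obs, 1.0 if solved else 0.0, solved)


def run_episode(env: ToyEnv, policy: Callable[[str], str],
                max_steps: int) -> Tuple[float, int]:
    """Run the act -> step -> observe loop until done or the budget runs out."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(obs)      # the model produces an action
        out = env.step(action)    # the environment executes it
        total_reward += out.reward
        steps += 1
        obs = out.observation     # the observation feeds the next turn
        if out.done:
            break
    return total_reward, steps


# A scripted stand-in for the language model: wrong once, then right.
replies = iter(["think", "answer"])
result = run_episode(ToyEnv(), lambda obs: next(replies), max_steps=5)
```

Here `max_steps` plays the role of the budget: the loop ends either when the environment reports `done` or when the budget is used up.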

Architecture Overview

The system is organized as a layered hierarchy, from the lowest-level abstraction up to the orchestration layer.

Each layer adds a well-defined concern:

| Layer | Responsibility |
| --- | --- |
| Env | Abstract step / reset / close interface |
| BaseTextEnv | Tool group management and tool dispatch |
| InformalMathEnv | Reward computation, action parsing, tool pattern matching |
| MultiProcessEnv | Thread-pool parallelism over a batch of environments |
| EnvironmentManager | Prompt construction, memory read/write, projection, success evaluation |
| Generation | Tokenization, model inference integration (verl / evolving) |

Source file mapping
  • Env → core/environments/informal_math_training/core.py
  • BaseTextEnv → core/environments/informal_math_training/base_text_env.py
  • InformalMathEnv → core/environments/informal_math_training/env.py
  • MultiProcessEnv → core/environments/informal_math_training/envs.py
  • EnvironmentManagerBase → core/environments/base.py
  • Training/Evolving Managers → core/environments/env_manager.py

Core Abstraction: Environment

Env — the Gym-style base

alphaapollo/core/environments/informal_math_training/core.py defines the minimal Env[ObsType, ActType] generic class:

  • step(action) → EnvStepOutput containing observations, reward, done, metadata.
  • init(**kwargs) — initialize the environment.
  • close() — clean up resources.
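
A generic base class with this shape might look like the sketch below. The field names come from the list above; the field types, the default method bodies, and the `EchoEnv` subclass are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Generic, TypedDict, TypeVar

ObsType = TypeVar("ObsType")
ActType = TypeVar("ActType")


class EnvStepOutput(TypedDict):
    # Field names follow the list above; the exact types are assumptions.
    observations: Any
    reward: float
    done: bool
    metadata: Dict[str, Any]


class Env(ABC, Generic[ObsType, ActType]):
    @abstractmethod
    def step(self, action: ActType) -> EnvStepOutput:
        """Apply one action and report what happened."""

    def init(self, **kwargs: Any) -> None:
        """Initialize the environment; the default does nothing."""

    def close(self) -> None:
        """Release resources; the default does nothing."""


class EchoEnv(Env[str, str]):
    """Hypothetical subclass that echoes the action and ends immediately."""

    def step(self, action: str) -> EnvStepOutput:
        return {"observations": f"echo: {action}", "reward": 0.0,
                "done": True, "metadata": {}}


out = EchoEnv().step("hello")
```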

BaseTextEnv — tool-aware text environment

alphaapollo/core/environments/informal_math_training/base_text_env.py extends Env[str, str] with tool management:

  • init_tool_groups(tool_groups) — registers one or more ToolGroup instances.
  • _execute_tool(group_name, tool_name, tool_input) — dispatches a tool call by looking up the correct group and invoking the named tool.
  • Returns BaseTextEnvStepOutput with an additional postprocessed_action field.
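
The two-level lookup (group first, then tool) can be illustrated as follows. Only the method names `init_tool_groups` and `_execute_tool` come from the text above; the `ToolGroup` stand-in, the `ToolDispatcher` class, and the error handling are hypothetical.

```python
from typing import Callable, Dict, List


class ToolGroup:
    """Hypothetical stand-in: a named collection of callables."""

    def __init__(self, name: str, tools: Dict[str, Callable[[str], str]]) -> None:
        self.name = name
        self.tools = tools


class ToolDispatcher:
    def __init__(self) -> None:
        self._groups: Dict[str, ToolGroup] = {}

    def init_tool_groups(self, tool_groups: List[ToolGroup]) -> None:
        # Register each group under its name so it can be looked up later.
        for group in tool_groups:
            self._groups[group.name] = group

    def _execute_tool(self, group_name: str, tool_name: str, tool_input: str) -> str:
        group = self._groups.get(group_name)
        if group is None:
            raise KeyError(f"unknown tool group: {group_name}")
        tool = group.tools.get(tool_name)
        if tool is None:
            raise KeyError(f"unknown tool: {group_name}.{tool_name}")
        return tool(tool_input)


dispatcher = ToolDispatcher()
dispatcher.init_tool_groups([ToolGroup("text", {"upper": str.upper})])
shout = dispatcher._execute_tool("text", "upper", "hello")
```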

InformalMathEnv — domain environment

alphaapollo/core/environments/informal_math_training/env.py implements the concrete math-solving environment:

  • Tool patterns: TOOL_PATTERNS is an extensible list of (tool_name, regex_pattern) tuples covering python_code and local_rag.
  • Reset: sets the question, ground truth, and max steps; initializes the chat history.
  • Step: parses the model's action with _parse_action(), checks termination via _is_done(), executes matched tools, and returns observations.
  • Reward: _get_reward() calls compute_score() on termination (binary 0/1); intermediate steps yield no reward.
  • RAG hint: if a python_code execution fails (score = 0) and RAG is enabled, the environment appends a suggestion to try the RAG tool.
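
As a rough sketch of the pattern-matching and reward steps: the XML-style tag syntax, the regular expressions, and the exact-match scoring below are assumptions, since the source only states that patterns are (tool_name, regex_pattern) tuples and that the reward is binary.

```python
import re
from typing import Optional, Tuple

# Assumed tag syntax: <python_code>...</python_code>, <local_rag>...</local_rag>.
TOOL_PATTERNS = [
    ("python_code", r"<python_code>(.*?)</python_code>"),
    ("local_rag", r"<local_rag>(.*?)</local_rag>"),
]
ANSWER_PATTERN = r"<answer>(.*?)</answer>"


def parse_action(action: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (tool_name, tool_input) for the first matching pattern."""
    for tool_name, pattern in TOOL_PATTERNS:
        match = re.search(pattern, action, re.DOTALL)
        if match:
            return tool_name, match.group(1).strip()
    return None, None


def is_done(action: str) -> bool:
    """An episode ends once the action contains a final answer."""
    return re.search(ANSWER_PATTERN, action, re.DOTALL) is not None


def get_reward(action: str, ground_truth: str) -> float:
    """Binary reward on termination; intermediate steps get 0.0."""
    match = re.search(ANSWER_PATTERN, action, re.DOTALL)
    if match is None:
        return 0.0
    # Stand-in for compute_score(): exact string match after stripping.
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0


tool_call = parse_action("Let me check <python_code>print(2+2)</python_code>")
final_reward = get_reward("<answer> 4 </answer>", "4")
```

Keeping the patterns in a list means a new tool can be supported by appending one tuple, which is presumably what "extensible" refers to.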

Vectorized Parallel Environments

alphaapollo/core/environments/informal_math_training/envs.py provides InformalMathTrainingMultiProcessEnv, which runs multiple environment instances in parallel using a ThreadPoolExecutor:

  • batch_size = env_num × group_n — total number of concurrent environments.
  • reset(kwargs) / step(actions) — broadcast operations across all instances with automatic padding and valid_mask tracking.
  • close() — shuts down the thread pool and event loops.
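
The padding and valid_mask behaviour can be illustrated with a toy batch wrapper. `SquareEnv` and `ThreadedBatchEnv` are hypothetical, and padding with a dummy action of 0 is an assumption; only `batch_size = env_num × group_n`, the thread pool, and the mask idea come from the text above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


class SquareEnv:
    """Toy environment: the observation is the square of the integer action."""

    def step(self, action: int) -> int:
        return action * action


class ThreadedBatchEnv:
    def __init__(self, env_num: int, group_n: int) -> None:
        self.batch_size = env_num * group_n
        self.envs = [SquareEnv() for _ in range(self.batch_size)]
        self.pool = ThreadPoolExecutor(max_workers=self.batch_size)

    def step(self, actions: List[int]) -> Tuple[List[int], List[bool]]:
        if len(actions) > self.batch_size:
            raise ValueError("more actions than environments")
        n_real = len(actions)
        # Pad a short batch with a dummy action and record which slots are real.
        padded = actions + [0] * (self.batch_size - n_real)
        valid_mask = [i < n_real for i in range(self.batch_size)]
        # Executor.map returns results in input order, so slot i maps to env i.
        observations = list(
            self.pool.map(lambda env, act: env.step(act), self.envs, padded)
        )
        return observations, valid_mask

    def close(self) -> None:
        self.pool.shutdown(wait=True)


batch = ThreadedBatchEnv(env_num=2, group_n=2)
batch_obs, batch_mask = batch.step([1, 2, 3])
batch.close()
```

Downstream code would then ignore every slot where the mask is False.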

A corresponding InformalMathEvolvingMultiProcessEnv exists for the evolution workflow with additional support for policy_solution and previous_solutions fields.

Environment Manager

The environment manager is the high-level orchestrator that wires together prompts, memory, projection, and the vectorized environment.

EnvironmentManagerBase

Defined in alphaapollo/core/environments/base.py, this is the abstract orchestrator:

| Method | Description |
| --- | --- |
| reset(kwargs) | Resets all environments, returns observation dict {text, image, anchor} |
| step(text_actions) | Runs projection → env step → returns (obs, rewards, dones, infos) |
| build_text_obs() | Constructs the text observation (abstract, implemented by subclasses) |
| success_evaluator(**kwargs) | Checks info['won'] across a batch to compute success rates |
| save_image(image, step) | Debug utility: saves an observation image to images/<env_name>/step{N}.png |
| close() | Delegates to self.envs.close() |

Helper: to_numpy(data) converts torch.Tensor, lists, and scalars to np.ndarray — used extensively in step() and success_evaluator().

Training Environment Manager

InformalMathTrainingEnvironmentManager in alphaapollo/core/environments/env_manager.py:

  • Selects the memory type based on config.env.informal_math.memory_type:
    • "score" → EvolvingMemory
    • "ndimensional" → NDimensionalMemory
    • "simple" → SimpleMemory
  • Reads execution_mode from config.env.informal_math.execution_mode (default: "agentic").
  • On reset(): resets the environment, initializes memory, and constructs a prompt-augmented text observation via get_policy_training_prompt().
  • On step(): runs projection → environment step → stores the transition in memory → builds a new observation.
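
The step() pipeline in the last bullet can be sketched as below. Everything here is a simplified stand-in: `ListMemory`, `ManagerSketch`, the toy projection and environment functions, and the `is_action_valid` info key are hypothetical and only mirror the order of operations (projection, environment step, memory store, observation build).

```python
from typing import Callable, Dict, List, Tuple

Projection = Callable[[List[str]], Tuple[List[str], List[bool]]]
EnvStep = Callable[[List[str]], Tuple[List[str], List[float], List[bool]]]


class ListMemory:
    """Hypothetical memory: one (action, observation) history per environment."""

    def __init__(self, batch_size: int) -> None:
        self.records: List[List[Tuple[str, str]]] = [[] for _ in range(batch_size)]

    def store(self, actions: List[str], observations: List[str]) -> None:
        for history, action, obs in zip(self.records, actions, observations):
            history.append((action, obs))


class ManagerSketch:
    def __init__(self, projection: Projection, env_step: EnvStep,
                 batch_size: int) -> None:
        self.projection = projection
        self.env_step = env_step
        self.memory = ListMemory(batch_size)

    def build_text_obs(self, raw_obs: List[str]) -> List[str]:
        return [f"Turn {len(history)}. Latest observation: {obs}"
                for history, obs in zip(self.memory.records, raw_obs)]

    def step(self, text_actions: List[str]):
        actions, valids = self.projection(text_actions)    # 1. projection
        raw_obs, rewards, dones = self.env_step(actions)   # 2. environment step
        self.memory.store(actions, raw_obs)                # 3. store transition
        infos: List[Dict[str, bool]] = [{"is_action_valid": v} for v in valids]
        return self.build_text_obs(raw_obs), rewards, dones, infos  # 4. new obs


def toy_projection(text_actions: List[str]) -> Tuple[List[str], List[bool]]:
    cleaned = [a.strip() for a in text_actions]
    return cleaned, [bool(a) for a in cleaned]


def toy_env_step(actions: List[str]) -> Tuple[List[str], List[float], List[bool]]:
    return [a.upper() for a in actions], [0.0] * len(actions), [False] * len(actions)


manager = ManagerSketch(toy_projection, toy_env_step, batch_size=2)
step_obs, step_rewards, step_dones, step_infos = manager.step(["  hi ", ""])
```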

Evolving Environment Manager

InformalMathEvolvingEnvironmentManager adds:

  • Verifier mode: distinguishes policy and verifier roles, selecting prompts via get_policy_prompt() or get_verifier_prompt().
  • Previous solutions: injects prior solutions from memory into the prompt.
  • Force done: terminates on empty actions or <report> tags.
  • Action sanitization: _sanitize_action_for_memory() cleans invalid actions before storing.
  • Per-source tracking: _process_batch() computes success rates grouped by data_source.
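
Two of these behaviours are easy to illustrate in isolation. The function names below are hypothetical, the info dict layout (`data_source` and `won` keys) is inferred from the descriptions in this section, and the dataset names in the example are made up.

```python
from collections import defaultdict
from typing import Dict, List


def should_force_done(action: str) -> bool:
    """Terminate on an empty action or when a <report> tag is present."""
    return not action.strip() or "<report>" in action


def success_by_source(infos: List[dict]) -> Dict[str, float]:
    """Group episodes by data_source and compute the fraction that were won."""
    totals = defaultdict(lambda: [0, 0])  # source -> [wins, episodes]
    for info in infos:
        counter = totals[info["data_source"]]
        counter[0] += int(bool(info["won"]))
        counter[1] += 1
    return {source: wins / n for source, (wins, n) in totals.items()}


example_infos = [
    {"data_source": "set_a", "won": True},
    {"data_source": "set_a", "won": False},
    {"data_source": "set_b", "won": True},
]
rates = success_by_source(example_infos)
```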

Factory

make_envs(config) in env_manager.py is the entry point that instantiates the correct environment manager based on config.env.env_name. Currently supported values:

| env_name (case-insensitive match) | Manager Class |
| --- | --- |
| informal_math_training | InformalMathTrainingEnvironmentManager |
| informal_math_evolving | InformalMathEvolvingEnvironmentManager |

The factory also reads config.env.rollout.n (the group_n parameter) to determine the number of rollout groups per environment.
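
A factory of this kind might be structured as in the sketch below. The stub classes stand in for the real managers, `make_envs_sketch` is a hypothetical name, and treating the case-insensitive match as equality after lowercasing is an assumption (the real code may match differently).

```python
from types import SimpleNamespace


class _ManagerStub:
    def __init__(self, group_n: int) -> None:
        self.group_n = group_n


class TrainingManagerStub(_ManagerStub):
    """Stand-in for InformalMathTrainingEnvironmentManager."""


class EvolvingManagerStub(_ManagerStub):
    """Stand-in for InformalMathEvolvingEnvironmentManager."""


def make_envs_sketch(config) -> _ManagerStub:
    env_name = config.env.env_name.lower()  # case-insensitive comparison
    group_n = config.env.rollout.n          # rollout groups per environment
    if env_name == "informal_math_training":
        return TrainingManagerStub(group_n)
    elif env_name == "informal_math_evolving":
        return EvolvingManagerStub(group_n)
    # A new domain would add another elif branch here.
    raise ValueError(f"unsupported env_name: {config.env.env_name!r}")


cfg = SimpleNamespace(
    env=SimpleNamespace(env_name="Informal_Math_Training",
                        rollout=SimpleNamespace(n=4))
)
made = make_envs_sketch(cfg)
```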

Adding new environments

To add support for a new domain, add an elif branch in make_envs(). See Adding a New Environment for the full guide.

Utility Functions

env_manager.py also exports two helper functions:

| Function | Description |
| --- | --- |
| parse_gamefile(infos) | Extracts game file paths from environment info dicts |
| set_gamefile(infos, gamefile) | Injects a game file path into environment info dicts |

Projection

The projection layer (alphaapollo/core/environments/informal_math_training/projection.py) maps raw LLM output into a structured action:

  • Supported tool tokens: python_code, local_rag.
  • Priority: <answer> tags take precedence over tool-call tags.
  • Post-processing: _postprocess_action() truncates at the first matching close tag to prevent hallucinated continuations.
  • Validity checks:
    • An action with both a tool tag and <answer> is marked invalid.
    • Multiple instances of the same tag are marked invalid.
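
The rules above can be sketched as a single function. This is an illustration under assumptions: the XML-style tag syntax, the `project` name, and running the validity checks on the raw text before truncation are not confirmed by the source. Only the two validity rules listed above are implemented.

```python
from typing import Tuple

TOOL_TAGS = ["python_code", "local_rag"]
# <answer> is checked first so that it takes precedence over tool tags.
PRIORITY = ["answer"] + TOOL_TAGS


def _postprocess_action(action: str) -> str:
    """Cut the text right after the highest-priority closing tag found."""
    for tag in PRIORITY:
        close = f"</{tag}>"
        idx = action.find(close)
        if idx != -1:
            return action[: idx + len(close)]
    return action


def project(action: str) -> Tuple[str, bool]:
    """Return (truncated_action, is_valid)."""
    counts = {tag: action.count(f"<{tag}>") for tag in PRIORITY}
    has_tool = any(counts[tag] > 0 for tag in TOOL_TAGS)
    valid = True
    if counts["answer"] > 0 and has_tool:
        valid = False  # a tool tag together with <answer>
    if any(count > 1 for count in counts.values()):
        valid = False  # the same tag appears more than once
    return _postprocess_action(action), valid


clean = project("<answer>4</answer> and then more text")
mixed = project("<python_code>1</python_code><answer>4</answer>")
repeated = project("<local_rag>a</local_rag><local_rag>b</local_rag>")
```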

The evolving projection (informal_math_evolving/projection.py) adds support for the <report> tag (highest priority), used by the verifier to terminate.

Memory System

Memory stores past interactions and enables the model to reference prior attempts. All memory types implement the BaseMemory interface (reset, store, fetch).

SimpleMemory

Plain sequential storage. fetch() returns the most recent N entries formatted as:

[Action X: '...', Observation X: '...']
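
A minimal sketch of such a memory follows, assuming one formatted entry per line and a 1-based index that counts from the start of the episode. The class name and the `history_length` parameter are hypothetical.

```python
from typing import List, Tuple


class SimpleMemorySketch:
    def __init__(self) -> None:
        self._data: List[Tuple[str, str]] = []

    def reset(self) -> None:
        self._data = []

    def store(self, action: str, observation: str) -> None:
        self._data.append((action, observation))

    def fetch(self, history_length: int) -> str:
        """Format the most recent history_length entries, oldest first."""
        start = max(0, len(self._data) - history_length)
        lines = []
        for offset, (action, obs) in enumerate(self._data[start:]):
            step = start + offset + 1  # 1-based step number
            lines.append(f"[Action {step}: '{action}', Observation {step}: '{obs}']")
        return "\n".join(lines)


memory = SimpleMemorySketch()
for act, ob in [("a", "A"), ("b", "B"), ("c", "C")]:
    memory.store(act, ob)
recent = memory.fetch(2)
```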

SearchMemory

Retrieval-based memory that supports semantic search over stored entries. Useful for finding relevant past interactions based on content similarity rather than recency.

EvolvingMemory

Uses an OrderedRecordList that keeps entries sorted by a configurable score key. fetch() returns the top-K entries with their scores — useful for showing the model its best prior solutions.
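
One way to keep records ordered by score is sorted insertion, as in this sketch. `OrderedRecords` is a hypothetical stand-in for OrderedRecordList, and the assumption here is that a higher score is better.

```python
import bisect
import itertools
from typing import Any, Dict, List


class OrderedRecords:
    def __init__(self, score_key: str = "score") -> None:
        self.score_key = score_key
        self._items: List[tuple] = []
        self._counter = itertools.count()

    def add(self, record: Dict[str, Any]) -> None:
        # Negate the score so the best record sorts first; the unique counter
        # breaks ties, so the dicts themselves are never compared.
        key = (-record[self.score_key], next(self._counter), record)
        bisect.insort(self._items, key)

    def top_k(self, k: int) -> List[Dict[str, Any]]:
        return [record for _, _, record in self._items[:k]]


records = OrderedRecords()
records.add({"solution": "a", "score": 0.2})
records.add({"solution": "b", "score": 0.9})
records.add({"solution": "c", "score": 0.5})
best_two = [r["solution"] for r in records.top_k(2)]
```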

NDimensionalMemory

Stores entries in an N-dimensional grid (NDimensionalSpaceList) with deduplication. Supports two retrieval strategies:

  • min_combined — rank-sum sorting across dimensions (e.g., performance + complexity).
  • random — uniform sampling.
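
Rank-sum sorting can be illustrated as follows: each entry is ranked separately in every dimension, the ranks are added, and the entries with the smallest totals are returned. The function name, the dict-based entries, and the convention that lower values are better in every dimension are assumptions made for this sketch.

```python
from typing import Any, Dict, List, Sequence


def min_combined(entries: List[Dict[str, Any]], dimensions: Sequence[str],
                 k: int) -> List[Dict[str, Any]]:
    rank_sum = [0] * len(entries)
    for dim in dimensions:
        # Rank 0 goes to the smallest value in this dimension.
        order = sorted(range(len(entries)), key=lambda i: entries[i][dim])
        for rank, index in enumerate(order):
            rank_sum[index] += rank
    best = sorted(range(len(entries)), key=lambda i: rank_sum[i])
    return [entries[i] for i in best[:k]]


candidates = [
    {"name": "A", "error": 0.1, "complexity": 2},  # ranks 0 + 1 = 1
    {"name": "B", "error": 0.3, "complexity": 1},  # ranks 2 + 0 = 2
    {"name": "C", "error": 0.2, "complexity": 3},  # ranks 1 + 2 = 3
]
top_names = [e["name"] for e in min_combined(candidates, ["error", "complexity"], k=2)]
```

Because only ranks are summed, dimensions with very different scales contribute equally to the ordering.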

Memory type is selected via the memory_type field in the environment config (simple, score, or ndimensional).

Prompt System

Prompt templates live in alphaapollo/core/environments/prompts/ and are selected dynamically based on tool configuration and workflow type.

Training Prompts

informal_math_training.py provides templates for:

| Template | Tools | History |
| --- | --- | --- |
| INFORMAL_MATH_TEMPLATE_NO_TOOL | None | N/A |
| INFORMAL_MATH_TEMPLATE_NO_HIS / WITH_HIS | python_code | No / Yes |
| INFORMAL_MATH_TEMPLATE_RAG_NO_HIS / RAG_WITH_HIS | python_code + local_rag | No / Yes |
| INFORMAL_MATH_TEMPLATE_RAG_ONLY_* | local_rag only | No / Yes |

get_policy_training_prompt(use_history, max_steps, tool_config) selects the appropriate template.

Evolving Prompts

informal_math_evolving.py extends the training templates with:

  • Previous-solutions variants — inject prior solutions into the prompt.
  • Force-answer variants — require the model to produce a final <answer>.
  • Verifier prompts — instruct the verifier to evaluate a policy solution and output <report>...\boxed{1} or \boxed{0}</report>.
  • Report aggregation template — merges multiple verifier reports via majority voting.
  • Summarizer template — condenses a policy trajectory into a Verification Brief.

Selection functions: get_policy_prompt(...) and get_verifier_prompt(...).
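
In the source, report aggregation is done through a prompt template. The sketch below shows the equivalent computation in plain Python for illustration only: the helper names, the regular expressions, using the last \boxed value in a report, and resolving ties to 0 are all assumptions.

```python
import re
from collections import Counter
from typing import List, Optional

REPORT_RE = re.compile(r"<report>(.*?)</report>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([01])\}")


def parse_verdict(text: str) -> Optional[int]:
    """Extract the 0/1 verdict from a verifier output, or None if absent."""
    report = REPORT_RE.search(text)
    if report is None:
        return None
    boxed = BOXED_RE.findall(report.group(1))
    return int(boxed[-1]) if boxed else None


def majority_vote(outputs: List[str]) -> Optional[int]:
    """Combine several verifier outputs; unparseable ones are ignored."""
    verdicts = [v for v in map(parse_verdict, outputs) if v is not None]
    if not verdicts:
        return None
    counts = Counter(verdicts)
    # Ties resolve to 0 (reject), a conservative assumption.
    return 1 if counts[1] > counts[0] else 0


verifier_outputs = [
    r"<report>Steps check out. \boxed{1}</report>",
    r"<report>Sign error in step 2. \boxed{0}</report>",
    r"<report>Looks fine. \boxed{1}</report>",
    "no report produced",
]
decision = majority_vote(verifier_outputs)
```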

Reward Manager

EpisodeRewardManager in alphaapollo/core/reward_manager/episode.py handles reward assignment for the verl training framework:

  • If pre-computed rm_scores exist, they are used directly.
  • Otherwise, rewards are extracted from episode_rewards and placed at the last valid token position of the response.
  • Supports optional normalize_by_length to avoid length bias.
  • Provides a debug mode that prints prompt / response / score for inspection.
  • When called with return_dict=True, returns {"reward_tensor": ..., "reward_extra_info": {}} instead of just the tensor.
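
The placement rule in the second bullet can be shown with plain lists instead of tensors. This is a sketch, not the EpisodeRewardManager code: the function name is hypothetical, masks are 0/1 lists over response tokens, and dividing by the valid length is an assumed reading of normalize_by_length.

```python
from typing import List


def place_episode_rewards(episode_rewards: List[float],
                          response_masks: List[List[int]],
                          normalize_by_length: bool = False) -> List[List[float]]:
    """Put each episode reward on the last valid response token; zeros elsewhere."""
    rows = []
    for reward, mask in zip(episode_rewards, response_masks):
        row = [0.0] * len(mask)
        valid_positions = [i for i, m in enumerate(mask) if m]
        if valid_positions:
            value = reward / len(valid_positions) if normalize_by_length else reward
            row[valid_positions[-1]] = value
        rows.append(row)
    return rows


placed = place_episode_rewards([1.0, 0.5], [[1, 1, 1, 0], [1, 1, 0, 0]])
normalized = place_episode_rewards([2.0], [[1, 1, 0, 0]], normalize_by_length=True)
```

Placing a single scalar at the final token gives a sparse, episode-level signal; padded positions never receive a reward.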

For custom reward logic, see Adding a New Algorithm.

Generation Layer

The generation layer bridges the environment system with the model inference backend.

TrajectoryCollector

alphaapollo/core/generation/multi_turn_rollout/rollout_loop.py defines TrajectoryCollector:

  • preprocess_single_sample() — constructs a chat from the observation, applies the chat template, tokenizes, and handles multimodal inputs.
  • preprocess_batch() — batches multiple samples with proper padding and truncation.
  • Integrates with verl's DataProto, compute_position_id_with_mask, and related utilities.

Training vs. Evolving Environments

While the training and evolving environment packages (informal_math_training/ and informal_math_evolving/) share the same layered structure, the evolving variants introduce several key differences:

| Aspect | Training | Evolving |
| --- | --- | --- |
| Termination tags | <answer> | <answer> + <report> |
| Roles | Policy only | Policy + Verifier |
| Extra fields | | policy_solution, done_reason |
| Previous solutions | Not used | Injected from shared memory |
| Force-done logic | | Empty action or <report> triggers termination |
| Success tracking | Global | Per data_source |

Episode Data Flow

A complete episode flows through the system as follows: