Agent System
AlphaApollo is built around an environment-driven, multi-turn agentic reasoning system that follows the Gym-style interface pattern. At its core, a language model interacts with a structured environment over multiple turns: at each step the model produces an action (potentially including tool calls), the environment executes it, returns an observation, and the loop continues until the problem is solved or a budget is exhausted.
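The loop described above can be sketched in a few lines. This is a minimal illustration rather than AlphaApollo's actual code: `StepOutput`, `CountdownEnv`, `run_episode`, and `toy_policy` are hypothetical names, and the toy environment simply ends the episode when the action contains an `<answer>` tag or the step budget runs out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StepOutput:
    observation: str
    reward: float
    done: bool
    metadata: dict = field(default_factory=dict)


class CountdownEnv:
    """Toy environment (hypothetical): done on <answer> or when the budget is spent."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.turn = 0

    def reset(self, question: str) -> str:
        self.turn = 0
        return question

    def step(self, action: str) -> StepOutput:
        self.turn += 1
        answered = "<answer>" in action
        out_of_budget = self.turn >= self.max_steps
        return StepOutput(
            observation="" if answered else f"turn {self.turn}: keep going",
            reward=1.0 if answered else 0.0,
            done=answered or out_of_budget,
        )


def run_episode(
    env: CountdownEnv, policy: Callable[[str], str], question: str
) -> Tuple[float, List[Tuple[str, str]]]:
    """Alternate policy actions and environment steps until the episode ends."""
    obs = env.reset(question)
    history: List[Tuple[str, str]] = []
    while True:
        action = policy(obs)          # the model produces an action
        out = env.step(action)        # the environment executes it
        history.append((action, out.observation))
        if out.done:                  # solved, or budget exhausted
            return out.reward, history
        obs = out.observation


def toy_policy(obs: str) -> str:
    """Stand-in for a language model: thinks once, then answers."""
    return "<answer>4</answer>" if obs.startswith("turn") else "let me think"


reward, history = run_episode(CountdownEnv(), toy_policy, "What is 2 + 2?")
print(reward, len(history))
```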
Architecture Overview
The system is organized in a layered hierarchy, from the lowest-level abstraction up to the orchestration layer:
Each layer adds a well-defined concern:
| Layer | Responsibility |
|---|---|
| Env | Abstract step / reset / close interface |
| BaseTextEnv | Tool group management and tool dispatch |
| InformalMathEnv | Reward computation, action parsing, tool pattern matching |
| MultiProcessEnv | Thread-pool parallelism over a batch of environments |
| EnvironmentManager | Prompt construction, memory read/write, projection, success evaluation |
| Generation | Tokenization, model inference integration (verl / evolving) |
- Env → core/environments/informal_math_training/core.py
- BaseTextEnv → core/environments/informal_math_training/base_text_env.py
- InformalMathEnv → core/environments/informal_math_training/env.py
- MultiProcessEnv → core/environments/informal_math_training/envs.py
- EnvironmentManagerBase → core/environments/base.py
- Training/Evolving Managers → core/environments/env_manager.py
Core Abstraction: Environment
Env — the Gym-style base
alphaapollo/core/environments/informal_math_training/core.py defines the minimal Env[ObsType, ActType] generic class:
- step(action) → EnvStepOutput containing observations, reward, done, metadata.
- init(**kwargs): initialize the environment.
- close(): clean up resources.
BaseTextEnv — tool-aware text environment
alphaapollo/core/environments/informal_math_training/base_text_env.py extends Env[str, str] with tool management:
- init_tool_groups(tool_groups): registers one or more ToolGroup instances.
- _execute_tool(group_name, tool_name, tool_input): dispatches a tool call by looking up the correct group and invoking the named tool.
- Returns BaseTextEnvStepOutput with an additional postprocessed_action field.
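The two-level lookup (group first, then tool) can be illustrated as follows. This is a sketch under stated assumptions: the `ToolGroup` class shown here, its `register` and `has` methods, and the error strings are all hypothetical; only the method names `init_tool_groups` and `_execute_tool` come from the description above.

```python
from typing import Callable, Dict, List


class ToolGroup:
    """Hypothetical tool group: a named collection of callables."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, tool_name: str, fn: Callable[[str], str]) -> None:
        self._tools[tool_name] = fn

    def has(self, tool_name: str) -> bool:
        return tool_name in self._tools

    def call(self, tool_name: str, tool_input: str) -> str:
        return self._tools[tool_name](tool_input)


class ToyTextEnv:
    """Illustrative stand-in for a tool-aware text environment."""

    def __init__(self) -> None:
        self.tool_groups: Dict[str, ToolGroup] = {}

    def init_tool_groups(self, tool_groups: List[ToolGroup]) -> None:
        # Index groups by name so dispatch is a dictionary lookup.
        for group in tool_groups:
            self.tool_groups[group.name] = group

    def _execute_tool(self, group_name: str, tool_name: str, tool_input: str) -> str:
        group = self.tool_groups.get(group_name)
        if group is None:
            return f"Error: unknown tool group '{group_name}'"
        if not group.has(tool_name):
            return f"Error: unknown tool '{tool_name}' in group '{group_name}'"
        return group.call(tool_name, tool_input)


code_group = ToolGroup("code")
code_group.register("upper", str.upper)
env = ToyTextEnv()
env.init_tool_groups([code_group])
print(env._execute_tool("code", "upper", "abc"))
```

Returning an error string instead of raising keeps the failure visible to the model as an observation; whether the real environment does this is not stated in this document.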
InformalMathEnv — domain environment
alphaapollo/core/environments/informal_math_training/env.py implements the concrete math-solving environment:
- Tool patterns: TOOL_PATTERNS is an extensible list of (tool_name, regex_pattern) tuples covering python_code and local_rag.
- Reset: sets the question, ground truth, and max steps; initializes the chat history.
- Step: parses the model's action with _parse_action(), checks termination via _is_done(), executes matched tools, and returns observations.
- Reward: _get_reward() calls compute_score() on termination (binary 0/1); intermediate steps yield no reward.
- RAG hint: if a python_code execution fails (score = 0) and RAG is enabled, the environment appends a suggestion to try the RAG tool.
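A pattern list of this shape can drive action parsing with a simple loop. The sketch below assumes tool calls are wrapped in XML-style tags named after the tool (for example `<python_code>...</python_code>`); the actual regular expressions in env.py are not shown in this document, and `parse_action` is a hypothetical stand-in for `_parse_action()`.

```python
import re
from typing import List, Optional, Tuple

# Assumed tag format; the real patterns may differ.
TOOL_PATTERNS: List[Tuple[str, str]] = [
    ("python_code", r"<python_code>(.*?)</python_code>"),
    ("local_rag", r"<local_rag>(.*?)</local_rag>"),
]


def parse_action(action: str) -> Optional[Tuple[str, str]]:
    """Return (tool_name, tool_input) for the first pattern that matches, else None."""
    for tool_name, pattern in TOOL_PATTERNS:
        # DOTALL lets the payload span multiple lines, as code usually does.
        match = re.search(pattern, action, flags=re.DOTALL)
        if match:
            return tool_name, match.group(1).strip()
    return None


print(parse_action("Let me check.\n<python_code>\nprint(2 + 2)\n</python_code>"))
```

Supporting a new tool under this design only requires appending another `(tool_name, regex_pattern)` tuple.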
Vectorized Parallel Environments
alphaapollo/core/environments/informal_math_training/envs.py provides InformalMathTrainingMultiProcessEnv, which runs multiple environment instances in parallel using a ThreadPoolExecutor:
- batch_size = env_num × group_n: total number of concurrent environments.
- reset(kwargs) / step(actions): broadcast operations across all instances with automatic padding and valid_mask tracking.
- close(): shuts down the thread pool and event loops.
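The batching behaviour can be sketched with the standard library alone. `EchoEnv` and `ToyBatchEnv` are hypothetical, and the padding scheme (empty-string actions for unused slots, with a boolean `valid_mask`) is an assumption about how padding might work, not a description of the real implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


class EchoEnv:
    """Hypothetical single environment that echoes its action."""

    def step(self, action: str) -> Tuple[str, float, bool]:
        return f"echo:{action}", 0.0, False


class ToyBatchEnv:
    def __init__(self, env_num: int, group_n: int) -> None:
        self.batch_size = env_num * group_n
        self.envs = [EchoEnv() for _ in range(self.batch_size)]
        self.pool = ThreadPoolExecutor(max_workers=self.batch_size)

    def step(self, actions: List[str]):
        n_valid = len(actions)
        if n_valid > self.batch_size:
            raise ValueError("more actions than environments")
        # Pad the batch so every environment receives an action (assumed scheme).
        padded = list(actions) + [""] * (self.batch_size - n_valid)
        valid_mask = [i < n_valid for i in range(self.batch_size)]
        # Executor.map preserves input order, so results line up with environments.
        results = list(
            self.pool.map(lambda pair: pair[0].step(pair[1]), zip(self.envs, padded))
        )
        return results, valid_mask

    def close(self) -> None:
        self.pool.shutdown(wait=True)


batch = ToyBatchEnv(env_num=2, group_n=2)
results, valid_mask = batch.step(["a", "b", "c"])
print(valid_mask)
batch.close()
```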
A corresponding InformalMathEvolvingMultiProcessEnv exists for the evolution workflow with additional support for policy_solution and previous_solutions fields.
Environment Manager
The environment manager is the high-level orchestrator that wires together prompts, memory, projection, and the vectorized environment.
EnvironmentManagerBase
Defined in alphaapollo/core/environments/base.py, this is the abstract orchestrator:
| Method | Description |
|---|---|
| reset(kwargs) | Resets all environments, returns observation dict {text, image, anchor} |
| step(text_actions) | Runs projection → env step → returns (obs, rewards, dones, infos) |
| build_text_obs() | Constructs the text observation (abstract, implemented by subclasses) |
| success_evaluator(**kwargs) | Checks info['won'] across a batch to compute success rates |
| save_image(image, step) | Debug utility: saves an observation image to images/<env_name>/step{N}.png |
| close() | Delegates to self.envs.close() |
Helper: to_numpy(data) converts torch.Tensor, lists, and scalars to np.ndarray — used extensively in step() and success_evaluator().
Training Environment Manager
InformalMathTrainingEnvironmentManager in alphaapollo/core/environments/env_manager.py:
- Selects the memory type based on config.env.informal_math.memory_type:
  - "score" → EvolvingMemory
  - "ndimensional" → NDimensionalMemory
  - "simple" → SimpleMemory
- Reads execution_mode from config.env.informal_math.execution_mode (default: "agentic").
- On reset(): resets the environment, initializes memory, and constructs a prompt-augmented text observation via get_policy_training_prompt().
- On step(): runs projection → environment step → stores the transition in memory → builds a new observation.
Evolving Environment Manager
InformalMathEvolvingEnvironmentManager adds:
- Verifier mode: distinguishes policy and verifier roles, selecting prompts via get_policy_prompt() or get_verifier_prompt().
- Previous solutions: injects prior solutions from memory into the prompt.
- Force done: terminates on empty actions or <report> tags.
- Action sanitization: _sanitize_action_for_memory() cleans invalid actions before storing.
- Per-source tracking: _process_batch() computes success rates grouped by data_source.
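Per-source tracking amounts to grouping the per-episode `won` flags by `data_source`. The sketch below is a hypothetical helper, not the body of `_process_batch()`; the `"unknown"` fallback bucket is an assumption added for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def success_by_source(infos: List[dict]) -> Dict[str, float]:
    """Compute the fraction of successful episodes for each data_source."""
    totals = defaultdict(lambda: [0, 0])  # source -> [wins, episodes]
    for info in infos:
        bucket = totals[info.get("data_source", "unknown")]
        bucket[0] += int(bool(info.get("won", False)))
        bucket[1] += 1
    return {source: wins / count for source, (wins, count) in totals.items()}


infos = [
    {"data_source": "aime", "won": True},
    {"data_source": "aime", "won": False},
    {"data_source": "math", "won": True},
]
print(success_by_source(infos))
```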
Factory
make_envs(config) in env_manager.py is the entry point that instantiates the correct environment manager based on config.env.env_name. Currently supported values:
| env_name (case-insensitive match) | Manager Class |
|---|---|
| informal_math_training | InformalMathTrainingEnvironmentManager |
| informal_math_evolving | InformalMathEvolvingEnvironmentManager |
The factory also reads config.env.rollout.n (the group_n parameter) to determine the number of rollout groups per environment.
To add support for a new domain, add an elif branch in make_envs(). See Adding a New Environment for the full guide.
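The branching structure of such a factory might look like the sketch below. `make_envs_sketch` and the two stub classes are hypothetical placeholders; the real `make_envs(config)` takes a config object and builds the vectorized environments as well, and its exact matching rule is assumed here to be a lowercase equality check.

```python
class TrainingManagerStub:
    """Placeholder for InformalMathTrainingEnvironmentManager."""

    def __init__(self, group_n: int) -> None:
        self.group_n = group_n


class EvolvingManagerStub:
    """Placeholder for InformalMathEvolvingEnvironmentManager."""

    def __init__(self, group_n: int) -> None:
        self.group_n = group_n


def make_envs_sketch(env_name: str, group_n: int = 1):
    """Pick a manager class from env_name, ignoring case."""
    name = env_name.lower()
    if name == "informal_math_training":
        return TrainingManagerStub(group_n)
    elif name == "informal_math_evolving":
        return EvolvingManagerStub(group_n)
    # A new domain would add another elif branch above this line.
    raise ValueError(f"Unsupported env_name: {env_name!r}")


manager = make_envs_sketch("Informal_Math_Training", group_n=4)
print(type(manager).__name__, manager.group_n)
```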
Utility Functions
env_manager.py also exports two helper functions:
| Function | Description |
|---|---|
| parse_gamefile(infos) | Extracts game file paths from environment info dicts |
| set_gamefile(infos, gamefile) | Injects a game file path into environment info dicts |
Projection
The projection layer (alphaapollo/core/environments/informal_math_training/projection.py) maps raw LLM output into a structured action:
- Supported tool tokens: python_code, local_rag.
- Priority: <answer> tags take precedence over tool-call tags.
- Post-processing: _postprocess_action() truncates at the first matching close tag to prevent hallucinated continuations.
- Validity checks:
  - An action with both a tool tag and <answer> is marked invalid.
  - Multiple instances of the same tag are marked invalid.
The evolving projection (informal_math_evolving/projection.py) adds support for the <report> tag (highest priority), used by the verifier to terminate.
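The training-projection rules (priority, truncation, validity) can be combined into one small function. This is a sketch, not the code in projection.py: `project` is a hypothetical name, and treating an action with no recognized tag or no matching close tag as invalid is an assumption made here for completeness. The `<report>` tag of the evolving variant is not modelled.

```python
from typing import Tuple

TOOL_TAGS = ("python_code", "local_rag")


def project(raw: str) -> Tuple[str, bool]:
    """Map raw model output to (processed_action, is_valid)."""
    has_answer = "<answer>" in raw
    tools_present = [t for t in TOOL_TAGS if f"<{t}>" in raw]

    # Priority: <answer> wins over any tool tag.
    if has_answer:
        chosen = "answer"
    elif tools_present:
        # Otherwise use whichever tool tag opens first in the text.
        chosen = min(tools_present, key=lambda t: raw.index(f"<{t}>"))
    else:
        return raw, False  # assumption: no recognized tag means invalid

    close_tag = f"</{chosen}>"
    end = raw.find(close_tag)
    if end == -1:
        return raw, False  # assumption: an unclosed tag means invalid

    # Truncate right after the first matching close tag.
    action = raw[: end + len(close_tag)]

    valid = True
    if has_answer and tools_present:
        valid = False  # tool tag and <answer> in the same action
    if any(raw.count(f"<{tag}>") > 1 for tag in ("answer", *TOOL_TAGS)):
        valid = False  # the same tag appears more than once
    return action, valid


print(project("think <python_code>print(1)</python_code> hallucinated output"))
```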
Memory System
Memory stores past interactions and enables the model to reference prior attempts. All memory types implement the BaseMemory interface (reset, store, fetch).
SimpleMemory
Plain sequential storage. fetch() returns the most recent N entries formatted as:
[Action X: '...', Observation X: '...']
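A minimal sketch of this behaviour follows. `SimpleMemorySketch` is hypothetical and handles a single episode; the real class implements the BaseMemory interface, whose exact signatures are not shown here. The 1-based step numbering and the newline separator are assumptions.

```python
from typing import List, Tuple


class SimpleMemorySketch:
    """Sequential storage that returns the most recent N entries as text."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, str]] = []

    def reset(self) -> None:
        self._records = []

    def store(self, action: str, observation: str) -> None:
        self._records.append((action, observation))

    def fetch(self, history_length: int) -> str:
        # Keep original step numbers so the model sees consistent indices.
        start = max(0, len(self._records) - history_length)
        lines = [
            f"[Action {i + 1}: '{action}', Observation {i + 1}: '{obs}']"
            for i, (action, obs) in enumerate(self._records[start:], start=start)
        ]
        return "\n".join(lines)


memory = SimpleMemorySketch()
memory.store("a", "oa")
memory.store("b", "ob")
memory.store("c", "oc")
print(memory.fetch(2))
```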
SearchMemory
Retrieval-based memory that supports semantic search over stored entries. Useful for finding relevant past interactions based on content similarity rather than recency.
EvolvingMemory
Uses an OrderedRecordList that keeps entries sorted by a configurable score key. fetch() returns the top-K entries with their scores — useful for showing the model its best prior solutions.
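One way to keep entries sorted on insertion is binary search with the standard `bisect` module. The class below is a hypothetical stand-in for OrderedRecordList; its method names (`add`, `top_k`) and the choice of descending score order are assumptions.

```python
import bisect
from typing import List


class OrderedRecordListSketch:
    """Keeps records sorted by score, best first."""

    def __init__(self, score_key: str = "score") -> None:
        self.score_key = score_key
        self._keys: List[float] = []    # negated scores, ascending
        self._records: List[dict] = []

    def add(self, record: dict) -> None:
        # Negating the score turns "highest first" into an ascending sort.
        key = -record[self.score_key]
        # bisect_right places ties after existing entries (older entries first).
        idx = bisect.bisect_right(self._keys, key)
        self._keys.insert(idx, key)
        self._records.insert(idx, record)

    def top_k(self, k: int) -> List[dict]:
        return self._records[:k]


records = OrderedRecordListSketch()
for score in (0.2, 0.9, 0.5):
    records.add({"solution": f"attempt scoring {score}", "score": score})
print([r["score"] for r in records.top_k(2)])
```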
NDimensionalMemory
Stores entries in an N-dimensional grid (NDimensionalSpaceList) with deduplication. Supports two retrieval strategies:
- min_combined: rank-sum sorting across dimensions (e.g., performance + complexity).
- random: uniform sampling.
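Rank-sum sorting can be illustrated as follows: rank the entries separately on each dimension, add up the ranks, and return the entries with the smallest totals first. The function name, the dictionary-based entries, and the per-dimension direction flags are assumptions for illustration; how NDimensionalMemory defines "better" on each axis is not specified in this document.

```python
from typing import Dict, List


def rank_sum_order(entries: List[dict], dims: Dict[str, bool]) -> List[dict]:
    """Sort entries by the sum of their per-dimension ranks (lower is better).

    dims maps a dimension name to True when higher values are better.
    """
    totals = [0] * len(entries)
    for name, higher_is_better in dims.items():
        order = sorted(
            range(len(entries)),
            key=lambda i: entries[i][name],
            reverse=higher_is_better,
        )
        for rank, idx in enumerate(order):
            totals[idx] += rank  # rank 0 is the best entry on this dimension
    ranked = sorted(range(len(entries)), key=lambda i: totals[i])
    return [entries[i] for i in ranked]


entries = [
    {"id": "a", "performance": 0.9, "complexity": 5},
    {"id": "b", "performance": 0.8, "complexity": 1},
    {"id": "c", "performance": 0.3, "complexity": 3},
]
ordered = rank_sum_order(entries, {"performance": True, "complexity": False})
print([e["id"] for e in ordered])
```

In this example `b` comes first: it is second on performance and best on complexity, giving the lowest combined rank.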
Memory type is selected via the memory_type field in the environment config (simple, score, or ndimensional).
Prompt System
Prompt templates live in alphaapollo/core/environments/prompts/ and are selected dynamically based on tool configuration and workflow type.
Training Prompts
informal_math_training.py provides templates for:
| Template | Tools | History |
|---|---|---|
| INFORMAL_MATH_TEMPLATE_NO_TOOL | None | N/A |
| INFORMAL_MATH_TEMPLATE_NO_HIS / WITH_HIS | python_code | No / Yes |
| INFORMAL_MATH_TEMPLATE_RAG_NO_HIS / RAG_WITH_HIS | python_code + local_rag | No / Yes |
| INFORMAL_MATH_TEMPLATE_RAG_ONLY_* | local_rag only | No / Yes |
get_policy_training_prompt(use_history, max_steps, tool_config) selects the appropriate template.
Evolving Prompts
informal_math_evolving.py extends the training templates with:
- Previous-solutions variants: inject prior solutions into the prompt.
- Force-answer variants: require the model to produce a final <answer>.
- Verifier prompts: instruct the verifier to evaluate a policy solution and output <report>...\boxed{1} or \boxed{0}</report>.
- Report aggregation template: merges multiple verifier reports via majority voting.
- Summarizer template: condenses a policy trajectory into a Verification Brief.
Selection functions: get_policy_prompt(...) and get_verifier_prompt(...).
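To make the verifier output format concrete, the sketch below extracts the `\boxed{0}` or `\boxed{1}` verdict from each report and takes a majority vote in code. This only illustrates the idea: in AlphaApollo the aggregation is described as a prompt template, so the vote may be carried out by the model itself. The function names and the tie-breaking rule (ties count as 0) are assumptions.

```python
import re
from collections import Counter
from typing import List, Optional

REPORT_RE = re.compile(r"<report>(.*?)</report>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([01])\}")


def extract_verdict(text: str) -> Optional[int]:
    """Return 1 or 0 from the last \\boxed{...} inside <report>, or None."""
    report = REPORT_RE.search(text)
    if report is None:
        return None
    boxed = BOXED_RE.findall(report.group(1))
    return int(boxed[-1]) if boxed else None


def majority_verdict(reports: List[str]) -> Optional[int]:
    """Majority vote over parseable reports; ties resolve to 0 (assumption)."""
    votes = [v for v in map(extract_verdict, reports) if v is not None]
    if not votes:
        return None
    counts = Counter(votes)
    return 1 if counts[1] > counts[0] else 0


reports = [
    r"<report>The derivation holds. \boxed{1}</report>",
    r"<report>Step 3 is wrong. \boxed{0}</report>",
    r"<report>\boxed{1}</report>",
    "malformed output without a report tag",
]
print(majority_verdict(reports))
```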
Reward Manager
EpisodeRewardManager in alphaapollo/core/reward_manager/episode.py handles reward assignment for the verl training framework:
- If pre-computed rm_scores exist, they are used directly.
- Otherwise, rewards are extracted from episode_rewards and placed at the last valid token position of the response.
- Supports optional normalize_by_length to avoid length bias.
- Provides a debug mode that prints prompt / response / score for inspection.
- When called with return_dict=True, returns {"reward_tensor": ..., "reward_extra_info": {}} instead of just the tensor.
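Placing an episode reward at the last valid token can be sketched with plain Python lists standing in for tensors. `place_episode_rewards` is a hypothetical helper, and dividing by the number of valid response tokens is only one plausible reading of `normalize_by_length`; the actual normalization used by EpisodeRewardManager is not specified in this document.

```python
from typing import List


def place_episode_rewards(
    episode_rewards: List[float],
    response_masks: List[List[int]],
    normalize_by_length: bool = False,
) -> List[List[float]]:
    """Build a token-level reward matrix that is zero except at the last valid token."""
    reward_rows = []
    for reward, mask in zip(episode_rewards, response_masks):
        row = [0.0] * len(mask)
        valid_len = sum(mask)
        if valid_len > 0:
            # Index of the last position where the mask is 1.
            last_valid = max(i for i, m in enumerate(mask) if m)
            # Assumed normalization: divide by the number of valid tokens.
            row[last_valid] = reward / valid_len if normalize_by_length else reward
        reward_rows.append(row)
    return reward_rows


masks = [[1, 1, 1, 0], [1, 1, 0, 0]]
print(place_episode_rewards([1.0, 0.5], masks))
```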
For custom reward logic, see Adding a New Algorithm.
Generation Layer
The generation layer bridges the environment system with the model inference backend.
TrajectoryCollector
alphaapollo/core/generation/multi_turn_rollout/rollout_loop.py defines TrajectoryCollector:
- preprocess_single_sample(): constructs a chat from the observation, applies the chat template, tokenizes, and handles multimodal inputs.
- preprocess_batch(): batches multiple samples with proper padding and truncation.
- Integrates with verl's DataProto, compute_position_id_with_mask, and related utilities.
Training vs. Evolving Environments
While the training and evolving environment packages (informal_math_training/ and informal_math_evolving/) share the same layered structure, the evolving variants introduce several key differences:
| Aspect | Training | Evolving |
|---|---|---|
| Termination tags | <answer> | <answer> + <report> |
| Roles | Policy only | Policy + Verifier |
| Extra fields | — | policy_solution, done_reason |
| Previous solutions | Not used | Injected from shared memory |
| Force-done logic | — | Empty action or <report> triggers termination |
| Success tracking | Global | Per data_source |
Episode Data Flow
A complete episode flows through the system as follows: