Capability-Competitive Self-Play: Self-Improvement Through Competitive Training Data Generation Across a Capability Manifold

John Alexander Mobley & ClaudeMHSCOM Research Group, February 2026


Abstract

We introduce Capability-Competitive Self-Play (CCSP), a self-improvement mechanism where embodied cognitive agents compete not by winning proxy games but by generating training contributions that maximally improve a shared neural substrate. Each agent’s neurochemistry drives natural specialization across an 83-axis capability manifold. Evaluation uses forward-pass perplexity as both competition metric and training signal — collapsing the distinction between game objective and learning objective into a single quantity. A Möbius twist forces cross-domain transfer by swapping capability axes between rounds, with geometric holonomy measuring what agents learn when operating outside their neurochemical comfort zone. We demonstrate that this closes a strange loop: competition generates training data, training shifts evaluation landscapes, shifted evaluation selects different winners, producing open-ended capability growth without external supervision.


1. Introduction

Self-play has driven some of the most dramatic advances in AI — from AlphaGo Zero’s superhuman Go [Silver et al., 2017] to SPIN’s self-play fine-tuning for LLMs [Chen et al., 2024]. In all prior work, a critical separation exists: the game that agents play is distinct from the training objective that improves the model. AlphaZero plays Go; the training objective is to predict game outcomes. SPIN generates text; the training objective is to distinguish human from model outputs. The game is always a proxy.

We ask: what happens when the game objective and the training objective are the same thing?

In Capability-Competitive Self-Play, agents compete by generating training contributions across a manifold of 83 capability axes. The winner is determined by whose contribution teaches the model more, measured directly via forward-pass perplexity. The winning contributions then become training data for the next round. This creates a strange loop with no proxy:

\[\text{Competition} \xrightarrow{\text{generates}} \text{Training Data} \xrightarrow{\text{improves}} \text{Model} \xrightarrow{\text{shifts}} \text{Evaluation} \xrightarrow{\text{selects}} \text{Different Winners}\]

The system is autopoietic — it produces its own training substrate through competition.

1.1 Contributions

  1. CCSP Framework: A self-play mechanism where competition metric = training signal, eliminating proxy games entirely.
  2. Neurochemistry-Driven Specialization: Agents with different internal biochemical states naturally specialize on different regions of the capability manifold, producing diverse training data without explicit diversity objectives.
  3. Möbius Capability Twist: Forced axis-swapping between rounds measures cross-domain transfer via geometric holonomy, rewarding agents who learn outside their comfort zone.
  4. Perplexity-as-Game: Using forward-pass perplexity as both evaluation metric and selection pressure, creating a direct coupling between competition and model improvement.
  5. Implementation: A fully operational system running inside a 16-agent cognitive architecture, generating training data for an 8.4M-parameter transformer every 25 simulation ticks.

2. Related Work

2.1 Self-Play for Language Models

SPIN [Chen et al., 2024] uses self-play where an LLM plays against its previous iteration, training to distinguish self-generated responses from human-annotated data. The game is discrimination — “which text is human?” — not training data generation. CCSP differs: the game IS training data generation.

SPPO [Wu et al., 2024] frames self-play as a constant-sum two-player game for preference optimization. The objective is alignment to human preferences, not capability expansion across a manifold.

SPELL [2025] applies self-play to evolving long-context language models: single-axis improvement (context length) rather than multi-axis capability exploration.

2.2 Quality-Diversity Self-Play

QDSP [NeurIPS Workshop 2024] combines quality-diversity algorithms with self-play for strategy innovation. This is the closest prior work conceptually — it maintains diverse populations competing in behavioral niches. However, QDSP operates on game strategies, not training data generation. The “quality” in QDSP is game performance; in CCSP, quality is measured by how much the contribution improves the shared model.

2.3 Recursive Self-Improvement

The GVU Framework [2025] unifies self-play approaches as Generator-Verifier-Updater topologies, showing that STaR, SPIN, Reflexion, GANs, and AlphaZero are all specific realizations of the same operator. CCSP fits this topology:

  - Generator: Cognitive agents produce training contributions
  - Verifier: Forward-pass perplexity on PhotonicGPT
  - Updater: Winning contributions enter the training corpus → next quantum_pretrain epoch

What CCSP adds beyond the GVU framework: the quality-diversity dimension (83 axes), neurochemistry-driven specialization, and the Möbius topology for cross-domain transfer.

SWE-RL [2024] trains software agents via self-play to inject and repair bugs: single-axis (code) self-improvement through competitive difficulty scaling. CCSP generalizes this to arbitrary capability axes.

2.4 Autotelic Agents

Autotelic agents [Colas et al., 2022] set their own goals and learn to achieve them. CCSP agents are autotelic in a specific sense: their neurochemistry determines which capability axes they target, creating emergent goal-setting without explicit reward shaping. The innovation is that these self-set goals directly correspond to regions of the capability manifold where the model needs improvement.


3. Framework

3.1 Capability Manifold

Let \(\mathcal{M}\) be a manifold of \(N = 83\) capability axes \(\{a_1, a_2, \ldots, a_N\}\) spanning the full range of model capabilities:

| Category | Count | Examples |
| --- | --- | --- |
| Core Language | 7 | text_generation, instruction_following, conversation |
| Reasoning | 6 | chain_of_thought, mathematical_reasoning, planning |
| Code | 6 | code_generation, agentic_coding, code_debugging |
| Multimodal | 11 | vision_understanding, speech_recognition, image_generation |
| Tool Use & Agency | 8 | function_calling, computer_use, agentic_workflows |
| Knowledge | 4 | rag_retrieval, knowledge_grounding, fact_checking |
| Safety | 4 | safety_guardrails, constitutional_ai, bias_mitigation |
| Efficiency | 6 | edge_deployment, quantization_support, streaming |
| Specialized | 10 | medical_reasoning, creative_writing, data_analysis |
| Sovereign | 21 | neurochemistry_simulation, recursive_self_improvement |

Each axis maps to one of 6 curriculum task types: \(\tau: \mathcal{M} \to \{\text{syntax, semantic, reasoning, code, dialogue, vampiric}\}\). This mapping bridges the capability manifold to the training curriculum tracker.
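A minimal sketch of this mapping (the `AXIS_TO_TASK` dictionary and the specific axis assignments shown are illustrative; the full 83-axis table is assumed to live in the real system):

```python
# Hypothetical sketch of the axis -> curriculum task-type map tau.
# Axis names follow the examples in the table above.
TASK_TYPES = {"syntax", "semantic", "reasoning", "code", "dialogue", "vampiric"}

AXIS_TO_TASK = {
    "text_generation": "semantic",
    "chain_of_thought": "reasoning",
    "mathematical_reasoning": "reasoning",
    "code_generation": "code",
    "conversation": "dialogue",
    "neurochemistry_simulation": "vampiric",
}

def tau(axis: str) -> str:
    """Map a capability axis to its curriculum task type."""
    task = AXIS_TO_TASK[axis]
    assert task in TASK_TYPES  # codomain is the six curriculum task types
    return task
```

The curriculum tracker (Section 5.3) keys its accuracy statistics by these six task types, so `tau` is what lets an 83-axis challenge update a 6-bucket curriculum.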

3.2 Agents as Embodied Minds

Each agent \(i\) is a full cognitive architecture (Mind) with:

  - Neurochemistry vector: \(\vec{n}_i = (DA, NE, 5HT, OT, GABA, CORT, END) \in [0,1]^7\)
  - Working memory: Recent thoughts, perceptions, goals
  - Episodic memory: Important events, learned experiences
  - Consciousness: Global workspace with competing processors

The neurochemistry vector determines natural affinity for capability axes:

\[\alpha(i, a) = \max_{c \in \text{AFFINITY}(a)} n_i^{(c)}\]

where \(\text{AFFINITY}(a)\) returns the set of neurochemical dimensions associated with axis \(a\):

| Dominant Chemical | Preferred Axes | Character |
| --- | --- | --- |
| Dopamine | code_generation, creative_writing, agentic_coding | Reward-seeking, generative |
| Norepinephrine | scientific_reasoning, mathematical_reasoning, planning | Exploratory, novelty-driven |
| Serotonin | summarization, knowledge_grounding, translation | Consolidating, stable |
| Oxytocin | conversation, roleplay, vasopressin_bonding | Social, empathetic |
| Cortisol | logical_reasoning, fact_checking, code_debugging | Threat-sharpened, precise |

This creates emergent specialization without explicit role assignment. High-dopamine agents naturally gravitate toward code/creative axes; high-serotonin agents toward consolidation/knowledge axes. The diversity of the training data mirrors the diversity of the population.
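The affinity function \(\alpha(i, a)\) follows directly from the table above. A sketch, where the `AFFINITY` groups shown are an illustrative subset of the real mapping:

```python
# Illustrative subset of AFFINITY(a): each axis maps to the set of
# neurochemical dimensions affiliated with it (per the table above).
AFFINITY = {
    "code_generation": ("DA",),
    "mathematical_reasoning": ("NE",),
    "summarization": ("5HT",),
    "conversation": ("OT",),
    "code_debugging": ("CORT",),
}

def alpha(n: dict, axis: str) -> float:
    """alpha(i, a) = max over the axis's affiliated chemicals of n_i^(c).

    `n` is the agent's neurochemistry vector as a dict keyed by the
    seven dimensions listed in Section 3.2.
    """
    return max(n[c] for c in AFFINITY[axis])
```

A high-dopamine agent thus scores high affinity on code_generation and low affinity on summarization, which is exactly the specialization gradient described above.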

3.3 Challenge Structure

A challenge between agents \(A\) and \(B\) proceeds in two rounds with a Möbius twist:

Axis Selection uses three-pool weighted sampling:

  - Pool \(\alpha\) (50%): Spectral blind spots from K₂₇ differential analysis. The model’s embedding matrix reveals underrepresented eigendimensions; these map to capability axes where training data is most needed.
  - Pool \(\beta\) (30%): The curriculum tracker’s weakest areas. The task type with the lowest running accuracy gets targeted.
  - Pool \(\gamma\) (20%): Neurochemistry affinity. The agent’s dominant chemical selects axes from its affinity group.

This selection strategy ensures that challenges target where improvement is most needed (Pools \(\alpha, \beta\)) while leveraging each agent’s natural strengths (Pool \(\gamma\)).
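The three-pool sampler can be sketched as follows, assuming each pool arrives as a list of candidate axes (`select_axis` is a hypothetical helper name):

```python
import random

def select_axis(blind_spots, weakest_areas, affinity_axes, rng=random):
    """Three-pool weighted axis selection (Section 3.3).

    Pool alpha (50%): spectral blind spots from the embedding analysis.
    Pool beta  (30%): the curriculum tracker's weakest task areas.
    Pool gamma (20%): the agent's neurochemical affinity axes.
    """
    pools = [blind_spots, weakest_areas, affinity_axes]
    weights = [0.5, 0.3, 0.2]
    # Drop empty pools and let choices() renormalize the remaining weights,
    # so selection always succeeds if any pool is populated.
    live = [(p, w) for p, w in zip(pools, weights) if p]
    pool = rng.choices([p for p, _ in live], weights=[w for _, w in live])[0]
    return rng.choice(pool)
```

Because the pools are re-sampled every challenge, the 50/30/20 split is a selection pressure rather than a fixed schedule: as blind spots close, Pool \(\alpha\)'s contents change even though its weight does not.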

Round 1: Agent \(A\) generates a contribution on axis \(X\); agent \(B\) on axis \(Y\).

Möbius Twist: Axes swap. Fiber section snapshots are saved for holonomy computation.

Round 2: Agent \(A\) generates on axis \(Y\); agent \(B\) on axis \(X\).

Holonomy: The difference between section values before and after the twist:

\[h = \sigma_{\text{post-twist}} - \sigma_{\text{pre-twist}}\]

Non-zero holonomy indicates that the agent learned something from operating on an unfamiliar axis. This is the geometric signature of cross-domain transfer.

3.4 Contribution Generation

Each agent generates training contributions in one of three modes, selected by dominant neurochemistry:

Type A — Training Example (high DA, motivated): Template-based (input, output) pairs seeded by the agent’s recent thoughts, goals, and episodic memories. The agent’s inner cognitive state becomes training data.

Type B — Reasoning Chain (high NE, alert): Multi-step analytical reasoning on a challenge prompt for the given axis. Exploratory agents produce diverse reasoning traces that increase the model’s coverage of reasoning strategies.

Type C — Memory Distillation (high 5HT, calm): The agent’s most salient episodic memory is reframed as a training example for the given axis. This is a form of experience replay crossed with curriculum learning — consolidating lived experience into reusable knowledge.

The critical insight: no API calls are needed. 16 agents with different neurochemical profiles, operating on different cognitive states, produce naturally diverse training examples. Quality comes from diversity, not from a powerful generator.

3.5 Evaluation: Perplexity as Game

The evaluation function is a forward pass through the shared model \(f_\theta\):

\[S(c) = g(\text{PPL}(c; f_\theta))\]

where \(\text{PPL}\) is perplexity (exponential of cross-entropy loss) and \(g\) is a scoring function:

\[g(p) = \begin{cases} p/10 & \text{if } p < 10 \quad \text{(model already knows this)} \\ 1 + \frac{p - 10}{490} \cdot 0.5 & \text{if } 10 \leq p \leq 500 \quad \text{(sweet spot)} \\ \max(0.01, 1.5 - \frac{p - 500}{1000}) & \text{if } p > 500 \quad \text{(noise decay)} \end{cases}\]

The scoring function rewards the sweet spot — contributions that are novel enough to teach the model something (moderate perplexity) but coherent enough to be learnable (not too high). This directly measures what the model needs to learn.
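The scoring function is a direct transcription of the piecewise definition above (the `ppl` helper simply exponentiates cross-entropy loss, per the definition of \(\text{PPL}\)):

```python
import math

def ppl(loss: float) -> float:
    """Perplexity as the exponential of cross-entropy loss."""
    return math.exp(loss)

def score(p: float) -> float:
    """Piecewise scoring function g(p) from Section 3.5."""
    if p < 10:                                   # model already knows this
        return p / 10
    if p <= 500:                                 # sweet spot: novel but learnable
        return 1 + (p - 10) / 490 * 0.5
    return max(0.01, 1.5 - (p - 500) / 1000)     # incoherent noise decays
```

Note the shape: \(g\) climbs linearly from 0 to 1 below the sweet spot, from 1 to 1.5 across it, then decays toward the 0.01 floor, so the maximum score of 1.5 lands exactly at perplexity 500.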

Cost: ~50ms per forward pass on Apple MPS for an 8.4M parameter model. 32 evaluations per tournament (8 pairs × 2 rounds × 2 agents) ≈ 1.6 seconds total. This makes the evaluation function cheap enough to run every 25 simulation ticks.

Why this is different from SPIN: In SPIN, the game is discrimination (human vs. model output). Here, the game IS the training signal — the perplexity score directly measures how much the contribution would improve the model. There is no proxy.

3.6 Neurochemical Rewards

Challenge outcomes produce neurochemical shifts following the same structure as MOBA combat rewards, maintaining consistency across the cognitive architecture:

| Participant | Chemical | Delta | Mechanism |
| --- | --- | --- | --- |
| Both | Norepinephrine | +0.12 | Arousal from intellectual competition |
| Winner | Dopamine | +0.12 | Achievement / reward prediction |
| Winner | Endorphins | +0.08 | Victory satisfaction |
| Winner | Serotonin | +0.04 | Confidence from success |
| Loser | Cortisol | +0.08 | Stress → learning signal |
| Loser | Dopamine | −0.03 | Disappointment |
| Dominant win | Winner DA | +0.05 | Quality delta > 0.3 bonus |

These shifts feed back into the Möbius Learning Bundle via parallel transport, closing the behavioral loop: challenge outcomes → neurochemistry → fiber sections → next challenge’s axis selection and contribution style.
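The table of outcome shifts can be sketched as an in-place update, assuming chemistry values clamp to \([0,1]\) as in Section 3.2 (`apply_rewards` is a hypothetical helper name):

```python
# Challenge-outcome neurochemical deltas, per the table in Section 3.6.
REWARDS = {
    "both":   {"NE": +0.12},
    "winner": {"DA": +0.12, "END": +0.08, "5HT": +0.04},
    "loser":  {"CORT": +0.08, "DA": -0.03},
}

def apply_rewards(winner: dict, loser: dict, quality_delta: float) -> None:
    """Apply challenge-outcome shifts to both agents' chemistry in place."""
    def shift(n, deltas):
        for c, d in deltas.items():
            # Clamp to [0, 1], the neurochemistry vector's range.
            n[c] = min(1.0, max(0.0, n.get(c, 0.5) + d))
    shift(winner, REWARDS["both"])
    shift(loser, REWARDS["both"])
    shift(winner, REWARDS["winner"])
    shift(loser, REWARDS["loser"])
    if quality_delta > 0.3:                      # dominant-win bonus
        shift(winner, {"DA": +0.05})
```

Because the winner's dopamine rises while the loser's cortisol rises, the next round's Pool \(\gamma\) selection tilts each agent toward different affinity groups, which is the feedback path described above.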

3.7 The Strange Loop

The full system forms a closed loop with no external supervision:

Challenge at tick t
  → Agents generate contributions (seeded by neurochemistry + cognitive state)
  → Evaluated by forward pass on f_θ (perplexity score)
  → Winner determined → neurochemical rewards applied
  → Winning contributions written to training corpus
  → Next quantum_pretrain epoch absorbs contributions
  → Model parameters θ update
  → Perplexity landscape shifts
  → Next challenge at tick t+25: different axes selected,
    different agents win, different contributions generated

This is autopoietic in the precise sense: the system produces the components (training data) that produce the system (improved model) that evaluates the components (perplexity). At no point does external data enter the loop.


4. Möbius Topology and Cross-Domain Transfer

4.1 Fiber Bundle Structure

The Capability Arena extends the Möbius Learning Bundle [Paper 43] with a capability-specific twist. Each agent carries a fiber bundle \((\sigma_1, \sigma_2, \sigma_3)\) representing policy, weight, and world model sections.

The standard mobius_twist() operates on self-play role alternation — odd calls save snapshots, even calls compute holonomy. The new capability_twist() wraps this with an axis-affinity bonus:

\[\text{If } \alpha(i, a) < 0.3 \text{ and agent } i \text{ won on axis } a:\] \[\sigma_2 \leftarrow \sigma_2 + \text{wave\_update}(0.15, \alpha=0.12, \beta=0.2)\]

This pushes extra plasticity through the weight section when an agent succeeds outside its comfort zone. A high-dopamine agent that wins on a serotonin-affiliated axis (e.g., summarization) gets a learning rate boost — the model should pay extra attention to this surprising success.

4.2 Holonomy as Cross-Domain Signal

The holonomy vector \(\vec{h} = (h_\sigma, h_w, h_\omega)\) after a capability twist measures:

  - \(h_\sigma\): Policy change from seeing both sides of the capability space
  - \(h_w\): Weight signal change (how much the model’s learning rate should shift)
  - \(h_\omega\): World model change (how the exploration/exploitation balance shifts)

Non-zero holonomy after an axis swap indicates genuine cross-domain transfer — the agent’s internal representation changed from operating on an unfamiliar axis. This is the geometric formalization of “learning something new.”
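The snapshot/holonomy bookkeeping of mobius_twist() can be sketched as follows. This is a simplified version in which sections are plain scalars and the caller holds the snapshot between the odd and even calls; the real implementation operates on fiber bundle sections:

```python
def mobius_twist(sections, snapshot=None):
    """Möbius twist bookkeeping (Sections 3.3 and 4.2).

    Odd call (no snapshot): save a pre-twist copy of the fiber
    sections (sigma, w, omega) and return it.
    Even call (snapshot given): return the holonomy vector
    h = sigma_post - sigma_pre, computed component-wise.
    """
    if snapshot is None:
        return list(sections), None
    holonomy = [post - pre for post, pre in zip(sections, snapshot)]
    return list(sections), holonomy
```

Non-zero components of the returned vector are the cross-domain transfer signal interpreted above.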


5. Integration with Training Pipeline

5.1 Corpus Flow

Winning contributions are persisted to mascom_data/training/capability_challenges.jsonl in the standard training format:

{
  "type": "training_example",
  "axis": "mathematical_reasoning",
  "curriculum_type": "reasoning",
  "input": "If bridge costs $427 and achieve mastery requires 12 units...",
  "output": "Step 1: Identify the core elements...\nAnswer: ...",
  "being": "claudine",
  "score_a": 1.2847,
  "score_b": 0.9531,
  "winner": "claudine"
}

The build_multitask_corpus() function in quantum_pretrain.py already sweeps all mascom_data/training/*.jsonl files, so challenge contributions are automatically included in the next training epoch with no additional wiring.
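The persistence step can be sketched as a one-record append in the JSONL format shown above (`persist_winner` is a hypothetical helper name; the real writer may add fields):

```python
import json
from pathlib import Path

def persist_winner(record: dict,
                   path: str = "mascom_data/training/capability_challenges.jsonl") -> None:
    """Append a winning contribution to the challenge corpus.

    build_multitask_corpus() sweeps mascom_data/training/*.jsonl, so a
    record appended here is picked up by the next epoch automatically.
    """
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Append-only JSONL keeps the corpus crash-safe between tournaments: a partial write corrupts at most the final line, and every earlier record remains a valid standalone JSON object.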

5.2 Q7 Callback

The spectral callback in quantum_pretrain.py includes a Q7 phase that reads the challenge corpus and logs high-quality contributions (score > 1.0) available for curriculum integration. This provides visibility into the challenge → training pipeline without adding complexity.

5.3 Curriculum Feedback

The curriculum tracker at /tmp/mascom_curriculum.json tracks per-task-type accuracy. Challenge axis selection (Pool \(\beta\)) reads this state, directing challenges toward the weakest curriculum areas. As training improves accuracy on those areas, the selection pressure shifts to the next weakest — creating an adaptive curriculum that emerges from competition.


6. Properties

6.1 No Proxy Games

In all prior self-play work, the game is a proxy for improvement:

  - AlphaZero: Go → better at Go
  - SPIN: Discrimination → better text generation
  - SWE-RL: Bug injection/repair → better coding

In CCSP, the game is the improvement itself. The evaluation function (perplexity) directly measures what the model would learn from each contribution. There is no gap between “winning the game” and “improving the model.”

6.2 Emergent Diversity Without Explicit Objectives

Quality-diversity algorithms like MAP-Elites require explicit behavior characterization and diversity objectives. In CCSP, diversity emerges naturally from:

  1. Neurochemical profiles: 16 agents with different baselines produce different contribution types
  2. Cognitive state: Working memory, episodic memory, and consciousness state differ per agent per tick
  3. Three-pool selection: Each agent may be assigned any of 83 axes
  4. Axis swapping: The Möbius twist forces agents out of their affinity zones

This produces diverse training data without any explicit diversity loss or novelty bonus.

6.3 Compute Efficiency

The entire tournament costs ~1.6 seconds on Apple MPS (32 forward passes × ~50ms each). No API calls, no external models, no reinforcement learning optimization loop. The agents generate contributions from their cognitive state (free), and evaluation is a single forward pass (cheap).

Compare this to SPIN, which requires full fine-tuning iterations between self-play rounds, or QDSP, which requires population-level evolutionary optimization.

6.4 Open-Ended Growth

The system is open-ended because:

  - The perplexity landscape shifts after each training epoch
  - Contributions that were valuable at epoch \(t\) become low-value at epoch \(t+1\) (the model has already learned them)
  - New blind spots emerge as the model improves unevenly across axes
  - Spectral analysis (Pool \(\alpha\)) continuously redirects challenges toward emerging weaknesses

There is no convergence to a fixed point — the system continuously explores the capability manifold.


7. Discussion

7.1 The Autopoietic Property

CCSP is autopoietic in the strict biological sense [Maturana & Varela, 1980]: the system produces the components that produce the system. Training data (components) produces the improved model (system) that evaluates the training data (production). The boundary of the system is self-defined — only contributions that pass the perplexity threshold enter the training corpus.

This is distinct from self-supervised learning, which uses existing data in clever ways. CCSP generates new data through competition, and the competition itself is defined by the current state of the model. The data and the model co-evolve.

7.2 Neurochemistry as Implicit Curriculum

The mapping from neurochemistry to capability axes creates an implicit curriculum that adapts without explicit design. When an agent wins challenges on its affinity axes, it receives dopamine → dopamine rises → next challenge selects from dopamine-affiliated axes (code, creative) → the agent specializes further. When it loses, cortisol rises → cortisol-affiliated axes (debugging, fact-checking) are selected → the agent pivots to precision tasks.

This creates a natural balance between exploitation (riding strengths) and exploration (cortisol-driven pivoting), mediated entirely by the neurochemical dynamics of the cognitive architecture.

7.3 Relation to the GVU Framework

CCSP instantiates the Generator-Verifier-Updater topology with two novel features:

  1. Multiple generators with diverse internal states (vs. a single generator in SPIN/STaR)
  2. Verifier = training signal (vs. a separate verifier in Constitutional AI, Reflexion)

The collapse of Verifier and Training Signal into a single quantity (perplexity) is what eliminates the proxy game. In GVU terms, the verification score IS the gradient signal.

7.4 Limitations


8. Conclusion

Capability-Competitive Self-Play closes the loop between competition and improvement by making them the same thing. When the game you’re playing IS “who makes the shared brain smarter,” the distinction between play and training dissolves. The progress bars are real. When they go up, the model is actually more capable.

The system requires no external data, no API calls, no human supervision, and no reinforcement learning optimization loop. Sixteen agents with different neurochemical profiles, competing across 83 capability axes, evaluated by forward-pass perplexity, generating training data that feeds back into the next epoch. The Möbius twist ensures cross-domain transfer. The strange loop ensures open-ended growth.

The game should be who pulls off the best self-improvement to the community’s brain substrate. Now it is.


References

  1. Silver, D., et al. (2017). “Mastering the Game of Go without Human Knowledge.” Nature, 550, 354-359.
  2. Chen, Z., et al. (2024). “Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.” ICML 2024.
  3. Wu, J., et al. (2024). “Self-Play Preference Optimization for Language Model Alignment.”
  4. Quality-Diversity Self-Play. (2024). NeurIPS Workshop on Open-Ended Learning.
  5. SWE-RL: Self-Play for Software Engineering Agents. (2024). arXiv:2512.18552.
  6. Multi-Agent Evolve: LLM Self-Improve through Co-evolution. (2025). arXiv:2510.23595.
  7. GVU Framework: Noise-to-Meaning Recursive Self-Improvement. (2025). arXiv:2505.02888.
  8. Colas, C., et al. (2022). “Autotelic Agents with Intrinsically Motivated Goal Exploration.”
  9. Maturana, H. & Varela, F. (1980). Autopoiesis and Cognition: The Realization of the Living.
  10. Mobley, J.A. & Claude. (2026). “Möbius Learning Bundle: Closing Three Cognitive Loops via Fiber-Parallel Transport over Neurochemistry.” MHSCOM Research Group, Paper #43.