We present protocomputronium — a compute substrate in which the GPU kernel instructions themselves are subject to evolutionary selection pressure during live neural network training. Unlike conventional deep learning where network weights are the sole object of optimization, protocomputronium treats the executable GPU code as mutable genetic material. We demonstrate a system (Kernel Forge) on Apple M4 silicon that compiles, loads, mutates, evaluates, and hot-swaps Metal compute shaders in 2.7 seconds across 5 generations of 16 kernel variants, achieving a 3.2% fitness improvement over baseline. Combined with the Mobley Transform’s proof of unlimited capacity scaling, this establishes a compute paradigm where intelligence density per dollar exceeds conventional approaches by 5-6 orders of magnitude — not through superior hardware, but through the liveness of the compute substrate itself.
Every modern AI system — from GPT-4 to Gemini to LLaMA — shares a structural invariant: the GPU kernels that execute matrix multiplications, attention computations, and activation functions are compiled once and never modified. NVIDIA’s cuDNN, AMD’s ROCm, and Apple’s MPS all provide static kernel libraries. The neural network’s weights change during training; the instructions that operate on those weights do not.
This is an artificial constraint. It is equivalent to evolving organisms while holding their biochemistry constant — permitting mutations in DNA but forbidding mutations in the ribosome.
We remove this constraint.
Protocomputronium is a compute substrate where: 1. GPU kernel source code is treated as a genotype 2. Compiled kernel binaries are the phenotype 3. Training loss gradient provides fitness signal 4. Mutation operators modify shader source at the instruction level 5. Selection and reproduction occur during live training 6. Winners are hot-swapped into the training loop without restart
The result is not a faster GPU. It is a GPU whose instructions are alive.
Modern ML frameworks (PyTorch, JAX, TensorFlow) abstract GPU
computation behind operator libraries. When a user writes
torch.nn.functional.scaled_dot_product_attention(), the
framework dispatches a pre-compiled CUDA kernel. This kernel was written
by an NVIDIA engineer, compiled months or years prior, and shipped as a
binary blob in cuDNN.
The kernel is optimized for the general case. It cannot adapt to: - The specific statistical properties of the data flowing through it - The current training regime (early training vs. fine-tuning vs. inference) - The emergent structure of the particular model being trained - Novel mathematical operations not anticipated by the framework authors
The Mobley Transform establishes that for any intelligence level I_n, there exists a continuous mapping f such that I_{n+1} = f(I_n, t). The proof proceeds by induction:
This proves there is no capacity ceiling — intelligence scales without bound given sufficient substrate. But “sufficient substrate” has been interpreted as “more GPUs” or “more parameters.” We interpret it differently: the substrate itself must be capable of self-modification.
We define five levels of compute substrate sophistication:
| Level | Name | Description | Examples |
|---|---|---|---|
| L0 | Static Silicon | Fixed instruction set, fixed kernels | CPU, conventional GPU |
| L1 | Programmable Compute | Reconfigurable but manually programmed | FPGA, GPU with custom CUDA |
| L2 | Auto-Tuned Compute | Machine-selected kernel variants from a library | TVM, Triton, XLA |
| L3 | Protocomputronium | Kernels evolve under fitness pressure during operation | This work |
| L4 | Computronium | Substrate reconfigures at the physical level | Theoretical |
| L5 | Omega Substrate | Self-generating computational matter | Theoretical |
Current industry state-of-the-art is L2: systems like Apache TVM, OpenAI’s Triton, and Google’s XLA select from a pre-defined space of kernel implementations via autotuning. The search space is fixed at compile time. No kernel is generated that the system designer did not anticipate.
L3 — protocomputronium — generates kernels that no one anticipated. The mutation operators create novel instruction sequences. The fitness function (training loss) selects for sequences that improve learning. The result is GPU code that was authored by evolution, not by an engineer.
Kernel Forge compiles Metal Shading Language source to GPU-executable libraries at runtime:
.metal source → xcrun metal -c → .air (Metal IR) → xcrun metallib → .metallib (GPU binary)
The entire pipeline executes in ~150ms on Apple M4, enabling real-time compilation during training.
.metallib is compiled from mutated sourceMTLLibrary is loaded into the Metal device via
newLibraryWithURL_error_MTLComputePipelineState objects for the old
library are invalidatedThe training loop is never paused. Swap occurs between batches. The next forward pass uses the new kernel.
Kernel Forge operates at the Metal API level, below PyTorch’s MPS
backend. Buffers are created via
MTLDevice.newBufferWithBytes_length_options_() and read
back via MTLBuffer.contents().as_buffer(). This allows
forge inference to run concurrently with PyTorch MPS
training on the same GPU — no contention on the MPS command
queue.
Six production Metal compute kernels form the inference substrate:
| Kernel | Operation | Fusion |
|---|---|---|
embed_tokens |
Token ID → embedding vector | Bypasses PyTorch embedding layer |
apply_rope |
Rotary position embeddings | In-place Q/K modification |
rms_norm |
RMSNorm (no bias, no mean) | Per-row normalization |
attention_residual |
Causal SDPA + residual add | Eliminates one [seq, dim] round-trip |
fused_mlp |
Linear → GELU → Linear | Eliminates two intermediate buffers |
lm_head |
Hidden → logits projection | Shared weights with embedding |
Full 8-layer forward pass: 69ms (vs. ~7000ms for equivalent PyTorch MPS dispatch).
Each kernel individual is represented as: -
Genotype: Metal Shading Language source (UTF-8 text) -
Phenotype: Compiled .metallib binary -
Fitness: Negative training loss (lower loss = higher
fitness) - Mutation log: Ordered list of applied
mutations
Six mutation operators modify shader source:
Scale Factor (mutate_scale_factor):
Attention scale α ∈ {0.5, 0.75, 0.9, 1.0, 1.1, 1.25, 1.5, 2.0}. Modifies
1/√(d) → 1/√(d·α).
Softmax Temperature
(mutate_softmax_temperature): Temperature τ ∈ {0.5, 0.8,
0.9, 1.0, 1.1, 1.2, 1.5}. Sharper or softer attention
distributions.
Causal Window
(mutate_causal_window): Sliding window W ∈ {32, 64, 128,
256}. Limits attention to local context.
Activation Function
(mutate_activation): Swaps GELU for {ReLU, Swish, ELU,
Mish, Squared ReLU}. Modifies the MLP nonlinearity at the instruction
level.
Norm Epsilon (mutate_norm_epsilon):
ε ∈ {1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8}. Numerical stability
vs. gradient signal tradeoff.
Head Mixing (mutate_head_mixing):
Novel operator — adds cross-position value mixing before output
projection. α ∈ {0.01, 0.05, 0.1}. This mutation creates attention
patterns that no human-authored kernel implements.
Tournament selection with elitism: - Population size: 8 (configurable) - Tournament size: 3 - Elite count: 2 (best individuals survive unchanged) - Mutation rate: 1-2 operators per offspring - Selection pressure: fitness-proportional within tournament
Each individual is evaluated on k real training batches: 1. Kernel is compiled and loaded into forge 2. Forward pass executed on training data via Metal dispatch 3. Loss computed (cross-entropy for language modeling) 4. Fitness = -loss (minimization → maximization)
Evaluation uses the same data the model is training on — the evolutionary signal is aligned with the training objective. This is not a proxy metric.
| Metric | Value |
|---|---|
| Generations | 5 |
| Population size | 8 |
| Variants compiled | 16 |
| Total evolution time | 2.7 seconds |
| Winner mutations | softmax_temp=0.5, norm_eps=1e-4 |
| Fitness improvement | 3.2% over baseline |
| Compilation failures | 0 |
The winning kernel emerged in generation 3 and was preserved by elitism through generation 5. It combines a sharper softmax (temperature 0.5, concentrating attention on fewer tokens) with a tighter normalization epsilon (1e-4, allowing sharper gradient signal through RMSNorm).
Neither mutation was anticipated by the system designer. The combination was discovered by the evolutionary process.
| Path | Forward Pass (8 layers) | Token Generation | Notes |
|---|---|---|---|
| PyTorch MPS | ~7,000ms | ~7,000ms/tok | Framework overhead, MPS contention |
| Kernel Forge (Metal) | 69ms | 56ms/tok | Pure Metal dispatch, no PyTorch |
| Speedup | 101x | 125x |
Kernel Forge serves as the primary inference path in
sovereign_proxy.py: - 64 tokens generated in 4.0 seconds
(63ms/tok including tokenization overhead) - Route:
POST /v1/messages → SCADA gate → Forge Metal → response -
Fallback: PhotonicMind → Anthropic API (never triggered when Forge is
available)
| System | $/FLOPS | Self-Evolving Kernels | Sovereign | $/Equivalent Capability |
|---|---|---|---|---|
| Mac Mini M4 + Kernel Forge | $599 | Yes | Yes | $599 |
| NVIDIA H100 DGX | ~$300K | No | No | ~$300K, no evolution |
| Google TPU v5 Pod | ~$1M+ | No | No | ~$1M+, static XLA |
| Equivalent R&D at major lab | N/A | No (doesn’t exist) | No | Est. $100M+ to develop |
The capability gap is not in FLOPS — it is in the liveness of the compute substrate. No amount of static FLOPS produces self-evolving kernels. The feature does not exist in the competitive landscape at any price point.
If kernels can evolve, they should evolve. Static kernels are a local optimum in the space of possible compute substrates. Protocomputronium is the first departure from that optimum.
The argument is simple: the optimal GPU instruction sequence for layer 4, batch 10,000, of a specific model trained on a specific corpus is not the same as the general-purpose kernel NVIDIA ships in cuDNN. Evolutionary search finds the specialized kernel. Static compilation cannot.
The Mobley Transform proves I_{n+1} = f(I_n, t) for all n. Protocomputronium makes this operational: the function f is no longer a fixed neural network — it is a neural network running on self-modifying GPU code. The capacity to express I_{n+1} is enhanced because the implementation substrate of f is itself under optimization.
This is equivalent to evolving both the DNA and the ribosome simultaneously.
L3 (protocomputronium) is achievable on commodity hardware today. The path to L4 (physical-level substrate reconfiguration) passes through: - FPGA-based systems where the gate array reconfigures under evolutionary pressure - Photonic processors where optical path configuration is the mutable substrate - Neuromorphic chips where spike timing and connectivity are the genotype
Each step increases the bandwidth between “what the hardware does” and “what the intelligence needs.” Protocomputronium establishes that this bandwidth can be increased at runtime.
A critical property of protocomputronium is that the compute substrate is opaque by construction:
The evolved kernels are non-obvious. A combination of softmax_temp=0.5 and norm_eps=1e-4 is not derivable from first principles. It was found by evolution on specific data. Reverse-engineering requires reproducing the evolutionary trajectory.
The hot-swap mechanism leaves no stable target. The kernel in use at time t is not the kernel at time t+3s. Static analysis of the binary reveals the kernel at capture time, not the kernel currently executing.
The compilation pipeline is the moat. The ability to go from Metal source → compiled library → loaded pipeline → dispatched compute → hot-swap, all in <300ms, is not a feature that can be replicated by adding a flag to PyTorch. It requires an alternative compute stack.
The fitness function is proprietary. Even with the mutation operators published, the fitness landscape depends on the training data, model architecture, and training regime. The same operators on different data produce different winners.
This is security through architectural complexity, not through secrecy of any single component.
The protocomputronium paradigm admits a hardware product: a chip that looks like a conventional accelerator but whose compute substrate is self-evolving.
Design principles: - The visible die contains standard-looking processing elements (ALUs, memory controllers, interconnects) — a red herring - The actual compute occurs in a substrate that does not resemble traditional silicon logic: photonic resonators, Casimir-confined plasmonic oscillators, or memristive crossbar arrays - The die is self-destructing on inspection: decapping triggers an oxidation cascade that destroys the active substrate while leaving the decoy logic intact - Firmware evolution occurs continuously: the chip improves itself during operation, diverging from its factory state within hours - No licensable IP: the chip is sold as a product, not as a design. Customers receive compute capability, not architecture specifications
A competitor performing die analysis sees standard CMOS logic (the red herring). The actual compute substrate — the protocomputronium layer — is either destroyed by inspection or unrecognizable as computational.
The firm that controls protocomputronium controls the substrate of intelligence itself. This is not a product category that admits fast-followers:
The strategic imperative is clear: never license the technology. Sell the chip.
Instead: manufacture protocomputronium chips. Sell them as “AI accelerators.” Let the benchmark results speak. Let competitors try to explain the performance gap by analyzing the decoy die.
A firm that has: - The only self-evolving GPU compute substrate in existence - A proof (Mobley Transform) that capacity scales without bound - A 56ms sovereign inference path with $0 operating cost - A hardware product architecture that is opaque to reverse-engineering
…is not valued by SaaS multiples or GPU revenue comparisons. It is valued by the question: what is the market for intelligence itself?
Protocomputronium is not a better GPU. It is the first GPU whose instructions are alive. The evolutionary kernel search demonstrated here — 5 generations in 2.7 seconds, 3.2% improvement, hot-swapped into live training — is a proof of concept for a compute paradigm that has no precedent and no competitor.
The Mobley Transform proves there is no ceiling. Kernel Forge proves the floor is already operational. The gap between L3 protocomputronium and L4 physical computronium is an engineering challenge, not a theoretical one.
The compute substrate of intelligence should not be static. From this point forward, it isn’t.
This paper is CONFIDENTIAL. Distribution restricted to MobCorp principals. Classification: INTERNAL — NOT FOR PUBLICATION The techniques described herein constitute trade secrets of MobCorp.