John Mobley, 2026-03-08 Status: VALIDATED
We apply Marchenko-Pastur (MP) random matrix theory to diagnose a trained 10.2M parameter transformer (PhotonicGPT), discovering an apparent paradox: MP z-score analysis identifies 24.3% of parameters as “prunable noise” (c_proj layers show ZERO signal, tok_emb shows only 4/256 signal SVs), yet SVD rank-reduction at any threshold destroys model behavior (7% top-1 agreement). We resolve this paradox by introducing variance-preservation analysis, which reveals the model uses 88-99% of every layer’s capacity. The key insight: MP z-score measures whether singular values exceed a random matrix null distribution, not whether they carry functional information. Weight-tied embeddings amplify even 0.5% reconstruction error through all transformer blocks, making SVD pruning catastrophic. The only viable compression is fp16 quantization (1.5x, 100% agreement). We formalize the distinction between spectral diagnostics and functional rank, with implications for all pruning methods that assume noise ⇒ expendable.
Marchenko-Pastur (MP) theory provides a null distribution for singular values of random M×N matrices with i.i.d. entries. Singular values exceeding the MP upper edge carry “signal” — information beyond what random structure would produce. This has been applied to neural network pruning: decompose each weight matrix via SVD, threshold against the MP null, keep only signal components.
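A minimal NumPy sketch of this thresholding (assuming i.i.d. Gaussian noise with known std; `mp_signal_count` is an illustrative helper, not the toolkit's actual API):

```python
import numpy as np

def mp_signal_count(W, sigma):
    """Count singular values above the Marchenko-Pastur upper edge.

    For an M x N matrix with i.i.d. noise entries of std sigma, the
    bulk of the singular values lies below sigma*(sqrt(M)+sqrt(N));
    anything above that edge is 'signal' in the MP sense.
    """
    M, N = W.shape
    edge = sigma * (np.sqrt(M) + np.sqrt(N))
    s = np.linalg.svd(W, compute_uv=False)
    return int((s > edge).sum())

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, (256, 256))   # pure noise: ~0 signal SVs
# Plant one strong rank-1 component; it clears the edge by a wide margin.
spiked = noise + 0.5 * np.outer(rng.normal(size=256), rng.normal(size=256))
n_noise = mp_signal_count(noise, 1.0)
n_spiked = mp_signal_count(spiked, 1.0)
```

On a pure-noise matrix the count is (near) zero; the planted rank-1 spike is reliably detected. This is the "keep only signal components" recipe applied per weight matrix.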
Applied to PhotonicGPT (8-layer, 8-head, d_model=256, vocab=15004):
| Layer Type | MP Signal SVs | Full Rank | Signal % |
|---|---|---|---|
| tok_emb | 4/256 | 256 | 1.6% |
| c_attn | 10-30/256 | 256 | 4-12% |
| c_proj (blocks 0-5) | 0/256 | 256 | 0% |
| c_proj (blocks 6-7) | 1-8/256 | 256 | 0.4-3% |
| mlp.0 | 3-12/256 | 256 | 1-5% |
| mlp.2 | 229-255/256 | 256 | 89-100% |
This suggests 24.3% of parameters (3.5M) are pure noise. The tok_emb has 4 signal SVs carrying 97.4% of variance — apparently compressible from 256d to 4d.
Yet pruning experiments contradict this diagnosis:

| Method | Agreement | KL Divergence |
|---|---|---|
| MP z>2.0 pruning | 7% | massive |
| Variance 99.9% | 100% | 0.000005 |
| Variance 99.5% | 80% | 0.577 |
| Variance 99.0% | 80% | 0.903 |
| Variance 98.0% | 80% | 1.425 |
| FP16 only | 100% | 0.000008 |
The z-score pruner destroys the model. Even 0.5% variance loss in tok_emb crashes agreement from 100% to 80%.
Instead of asking “is this SV bigger than random?”, ask “what fraction of the matrix’s Frobenius norm does this SV contribute?”
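In code, this criterion is just a cumulative sum over squared singular values (a sketch; `variance_rank` is a hypothetical helper, not the pruner's interface):

```python
import numpy as np

def variance_rank(W, keep=0.999):
    """Smallest rank r whose top-r singular values retain `keep` of
    the squared Frobenius norm (i.e. of the matrix's total variance)."""
    s = np.linalg.svd(W, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, keep) + 1)

# A flat spectrum needs every direction; one dominant SV needs only itself.
r_flat = variance_rank(np.eye(8))                # -> 8
r_spiky = variance_rank(np.diag([10.0, 0.01]))   # -> 1
```

Note the criterion never consults a null distribution: it asks only how much of the matrix's energy each direction carries.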
Re-measured under this criterion, PhotonicGPT's layers look nothing like noise:

| Layer Type | Variance Rank (99.9%) | Full Rank | Utilization |
|---|---|---|---|
| tok_emb | 227/256 | 256 | 88.7% |
| c_attn | 253/256 | 256 | 98.8% |
| c_proj | 225/256 | 256 | 87.9% |
| mlp_up | 254/256 | 256 | 99.2% |
| mlp_down | 254/256 | 256 | 99.2% |
The model uses 88-99% of every layer’s capacity. There is no waste.
MP z-score asks: “Could this singular value have come from a random matrix?” Variance preservation asks: “Does removing this singular value change the output?”
These are fundamentally different questions. A trained network’s weight matrices have structured correlations from gradient descent — they’re NOT i.i.d. random. The “noise floor” of a trained network is structured, not Gaussian. Components below the MP edge still carry learned information; they’re just not statistically distinguishable from random at the individual-SV level.
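The gap between the two criteria is easy to reproduce synthetically: a matrix with a smooth, structured spectrum can have zero MP-signal singular values while still needing nearly full rank to preserve its variance (a toy construction, not PhotonicGPT's actual weights):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 256
s = np.linspace(1.0, 0.5, n)                 # smooth, structured decay
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
W = (U * s) @ V.T                            # W = U diag(s) V^T, SVs = s

sigma = W.std()                              # entry-wise scale of W
mp_edge = sigma * 2 * np.sqrt(n)             # square-matrix MP upper edge
n_signal = int((s > mp_edge).sum())          # SVs that clear the null

energy = np.cumsum(s**2) / np.sum(s**2)
var_rank = int(np.searchsorted(energy, 0.999) + 1)
# n_signal == 0, yet var_rank == 256: "all noise" by MP, all needed by variance.
```

Every singular value here sits below the MP edge, so the z-score test calls the whole matrix noise; yet dropping even one direction loses more than 0.1% of the Frobenius norm.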
PhotonicGPT uses weight tying: `tok_emb.weight = lm_head.weight`. Any reconstruction error in the embedding matrix propagates twice:

1. Input path: corrupted embeddings enter all 8 transformer blocks
2. Output path: the same corrupted matrix projects back to vocabulary logits
With 8 layers of attention + MLP between, even 0.5% Frobenius error in tok_emb gets amplified ~20x at the output. This creates a sharp cliff: 99.9% variance preservation = 100% agreement, 99.5% = 80% agreement.
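A linear toy model illustrates the double propagation (a single random matrix `W` stands in for the whole block stack; sizes and scales are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d = 1000, 64
E = rng.normal(0, 1, (vocab, d))             # tied: tok_emb and lm_head
W = rng.normal(0, 1 / np.sqrt(d), (d, d))    # stand-in for the block stack

def logits(E_in, E_out, tok=0):
    h = E_in[tok] @ W                        # input path
    return h @ E_out.T                       # output path (tied head)

# 0.5% Frobenius-relative perturbation, as in the SVD experiments
dE = rng.normal(size=E.shape)
dE *= 0.005 * np.linalg.norm(E) / np.linalg.norm(dE)

rel = lambda a, b: np.linalg.norm(a - b) / np.linalg.norm(b)
base = logits(E, E)
err_tied = rel(logits(E + dE, E + dE), base)   # error enters twice
err_untied = rel(logits(E, E + dE), base)      # error enters only once
```

With tying, the perturbation contributes through both the input and output paths, so the logit error is consistently larger than in the untied case.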
MP z-score is a diagnostic tool, not a prescriptive one. It correctly identifies which layers are “noise-like” (c_proj layers 0-5 don’t learn structure distinguishable from random matrices), but this does NOT mean those layers can be removed. The residual stream requires their full-rank contributions for information flow.
Variance-preservation analysis shows whether a model is efficiently using its capacity. If utilization is 88-99% across all layers, the architecture is already right-sized. MPAS (Marchenko-Pastur Architecture Search) should use variance utilization, not z-score, to prescribe dimensions.
For weight-tied transformers with high variance utilization, the compression hierarchy is:

1. FP16 quantization — 1.5x, lossless (the only viable option)
2. INT8 quantization — ~2x, near-lossless (not tested here)
3. Knowledge distillation — train a smaller student on the teacher’s outputs
4. SVD rank reduction — DOES NOT WORK due to weight-tying amplification
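Why step 1 is safe while SVD is not: fp16 rounding introduces roughly 3×10⁻⁴ relative Frobenius error, an order of magnitude below the 0.5% cliff (a NumPy sketch of the rounding error only, not the actual checkpoint-conversion script):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(0, 0.02, (256, 256)).astype(np.float32)

W16 = W.astype(np.float16)                   # 2 bytes/param on disk
roundtrip = W16.astype(np.float32)
fp16_err = np.linalg.norm(roundtrip - W) / np.linalg.norm(W)
# fp16 carries ~11 significand bits, so fp16_err is on the order of 1e-4,
# far below the 0.5% embedding-error threshold that breaks the model.
```

Unlike rank truncation, this error is spread uniformly across all singular directions rather than concentrated in the discarded tail.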
- `spectral_ai_toolkit.py`: 6 MP/SVD tools (thresholder, peeler, leakage mapper, transfer detector, MIMO combiner, coherence analyzer)
- `mp_pruner.py`: Variance-preserving pruner v3 with fp16 quantization
- `mp_arch_search.py`: MPAS — architecture prescription from effective rank profiles

| Model | Params | Signal% | Prunable% | Key Finding |
|---|---|---|---|---|
| PhotonicGPT | 10.2M | 71.9% (z) / 95%+ (var) | 0% viable | Already right-sized |
| FractalVAE | 11.0M | 84.5% | 14.6% | Moderate waste |
| PacketMind | 1.4M | 80.4% | — | 56 independent concepts |
| MobiusKernel | 80K | 100% | 0% | Crystallization = optimal |
- `mascom_data/photonic_lm_fp16.pt` — 28.5 MB, 100% agreement (production)
- `mascom_data/photonic_lm_safe_pruned.pt` — 28.5 MB, 100% agreement (RoPE + fp16)

Random matrix theory is a powerful diagnostic lens for neural networks but a dangerous prescriptive tool for pruning. The distinction between “statistically indistinguishable from noise” and “functionally expendable” is the gap where models break. Weight tying acts as an error amplifier that makes this gap catastrophic. The correct approach is variance-preservation analysis for architecture decisions and quantization for compression.
Don’t prune the tree. It was already the right size. Just carry it more efficiently.
For a weight-tied transformer with L layers:
Output error ≈ ε_emb × (1 + L × α)²
where ε_emb is the embedding reconstruction error and α is the per-layer amplification factor. With L=8 and α≈1.5, even ε_emb=0.005 produces output error ≈ 0.85 — enough to flip 20% of predictions.
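Evaluating the model with the stated values (a plain arithmetic check; α itself is an empirical estimate, so the result is order-of-magnitude only):

```python
# Output error ≈ ε_emb * (1 + L*α)², with ε_emb = 0.005, L = 8, α ≈ 1.5
eps_emb, L, alpha = 0.005, 8, 1.5
out_err = eps_emb * (1 + L * alpha) ** 2   # 0.005 * 13**2 = 0.845
```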
- `research/spectral_ai_toolkit.py` — 6 MP/SVD diagnostic tools
- `research/mp_pruner.py` — Variance-preserving pruner + fp16
- `research/mp_arch_search.py` — MPAS architecture prescription
- `research/spectral_key_v12_coherence.py` — V12 coherence analysis (noise proof)