Paper 67: MP Z-Score ≠ Effective Rank — Why Random Matrix Pruning Fails on Weight-Tied Transformers

MOBLEYSOFT

AUTONOMOUS SYSTEMS

TECHNOLOGY DIVISION

John Mobley, 2026-03-08 Status: VALIDATED

Abstract

We apply Marchenko-Pastur (MP) random matrix theory to diagnose a trained 10.2M parameter transformer (PhotonicGPT), discovering an apparent paradox: MP z-score analysis identifies 24.3% of parameters as “prunable noise” (c_proj layers show ZERO signal, tok_emb shows only 4/256 signal SVs), yet SVD rank-reduction at any threshold destroys model behavior (7% top-1 agreement). We resolve this paradox by introducing variance-preservation analysis, which reveals the model uses 88-99% of every layer’s capacity. The key insight: MP z-score measures whether singular values exceed a random matrix null distribution, not whether they carry functional information. Weight-tied embeddings amplify even 0.5% reconstruction error through all transformer blocks, making SVD pruning catastrophic. The only viable compression is fp16 quantization (1.5x, 100% agreement). We formalize the distinction between spectral diagnostics and functional rank, with implications for all pruning methods that assume noise ↔︎ expendable.

1. Background

Marchenko-Pastur (MP) theory provides a null distribution for singular values of random M×N matrices with i.i.d. entries. Singular values exceeding the MP upper edge carry “signal” — information beyond what random structure would produce. This has been applied to neural network pruning: decompose each weight matrix via SVD, threshold against the MP null, keep only signal components.

2. The Paradox

2.1 MP Z-Score Analysis (from spectral_ai_toolkit.py)

Applied to PhotonicGPT (8-layer, 8-head, d_model=256, vocab=15004):

Layer Type	MP Signal SVs	Full Rank	Signal %
tok_emb	4/256	256	1.6%
c_attn	10-30/256	256	4-12%
c_proj (blocks 0-5)	0/256	256	0%
c_proj (blocks 6-7)	1-8/256	256	0.4-3%
mlp.0	3-12/256	256	1-5%
mlp.2	229-255/256	256	89-100%

This suggests 24.3% of parameters (3.5M) are pure noise. The tok_emb has 4 signal SVs carrying 97.4% of variance — apparently compressible from 256d to 4d.

2.2 Pruning Results

Method	Agreement	KL Divergence
MP z>2.0 pruning	7%	massive
Variance 99.9%	100%	0.000005
Variance 99.5%	80%	0.577
Variance 99.0%	80%	0.903
Variance 98.0%	80%	1.425
FP16 only	100%	0.000008

The z-score pruner destroys the model. Even 0.5% variance loss in tok_emb crashes agreement from 100% to 80%.

3. Resolution: Two Metrics, Two Truths

3.1 Variance-Preservation Analysis

Instead of asking “is this SV bigger than random?”, ask “what fraction of the matrix’s Frobenius norm does this SV contribute?”

Layer Type	Variance Rank (99.9%)	Full Rank	Utilization
tok_emb	227/256	256	88.7%
c_attn	253/256	256	98.8%
c_proj	225/256	256	87.9%
mlp_up	254/256	256	99.2%
mlp_down	254/256	256	99.2%

The model uses 88-99% of every layer’s capacity. There is no waste.

3.2 Why The Metrics Disagree

MP z-score asks: “Could this singular value have come from a random matrix?” Variance preservation asks: “Does removing this singular value change the output?”

These are fundamentally different questions. A trained network’s weight matrices have structured correlations from gradient descent — they’re NOT i.i.d. random. The “noise floor” of a trained network is structured, not Gaussian. Components below the MP edge still carry learned information; they’re just not statistically distinguishable from random at the individual-SV level.

3.3 The Weight-Tying Amplifier

PhotonicGPT uses weight tying: tok_emb.weight = lm_head.weight. Any reconstruction error in the embedding matrix propagates twice: 1. Input path: corrupted embeddings enter all 8 transformer blocks 2. Output path: the same corrupted matrix projects back to vocabulary logits

With 8 layers of attention + MLP between, even 0.5% Frobenius error in tok_emb gets amplified ~20x at the output. This creates a sharp cliff: 99.9% variance preservation = 100% agreement, 99.5% = 80% agreement.

4. Implications

4.1 For Pruning

MP z-score is a diagnostic tool, not a prescriptive one. It correctly identifies which layers are “noise-like” (c_proj layers 0-5 don’t learn structure distinguishable from random matrices), but this does NOT mean those layers can be removed. The residual stream requires their full-rank contributions for information flow.

4.2 For Architecture Search

Variance-preservation analysis shows whether a model is efficiently using its capacity. If utilization is 88-99% across all layers, the architecture is already right-sized. MPAS (Marchenko-Pastur Architecture Search) should use variance utilization, not z-score, to prescribe dimensions.

4.3 For Compression

For weight-tied transformers with high variance utilization, the compression hierarchy is: 1. FP16 quantization — 1.5x, lossless (the only viable option) 2. INT8 quantization — ~2x, near-lossless (not tested here) 3. Knowledge distillation — train a smaller student on the teacher’s outputs 4. SVD rank reduction — DOES NOT WORK due to weight-tying amplification

5. Methods

5.1 Tools Built

spectral_ai_toolkit.py: 6 MP/SVD tools (thresholder, peeler, leakage mapper, transfer detector, MIMO combiner, coherence analyzer)
mp_pruner.py: Variance-preserving pruner v3 with fp16 quantization
mp_arch_search.py: MPAS — architecture prescription from effective rank profiles

5.2 Models Analyzed

Model	Params	Signal%	Prunable%	Key Finding
PhotonicGPT	10.2M	71.9% (z) / 95%+ (var)	0% viable	Already right-sized
FractalVAE	11.0M	84.5%	14.6%	Moderate waste
PacketMind	1.4M	80.4%	—	56 independent concepts
MobiusKernel	80K	100%	0%	Crystallization = optimal

5.3 Artifacts

mascom_data/photonic_lm_fp16.pt — 28.5 MB, 100% agreement (production)
mascom_data/photonic_lm_safe_pruned.pt — 28.5 MB, 100% agreement (RoPE + fp16)

6. Conclusion

Random matrix theory is a powerful diagnostic lens for neural networks but a dangerous prescriptive tool for pruning. The distinction between “statistically indistinguishable from noise” and “functionally expendable” is the gap where models break. Weight tying acts as an error amplifier that makes this gap catastrophic. The correct approach is variance-preservation analysis for architecture decisions and quantization for compression.

Don’t prune the tree. It was already the right size. Just carry it more efficiently.

Key Formula

For a weight-tied transformer with L layers:

Output error ≈ ε_emb × (1 + L × α)²

where ε_emb is embedding reconstruction error and α is the per-layer amplification factor. With L=8 and α≈1.5, even ε_emb=0.005 produces output error ≈ 0.72 — enough to flip 20% of predictions.

Files

research/spectral_ai_toolkit.py — 6 MP/SVD diagnostic tools
research/mp_pruner.py — Variance-preserving pruner + fp16
research/mp_arch_search.py — MPAS architecture prescription
research/spectral_key_v12_coherence.py — V12 coherence analysis (noise proof)