Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07
Status: PARTIAL (corpus explains 38.3% of weight variance; the remaining 61.7% requires training)
Experiment:
mascom_data/ct_experiment/weight_from_pr_exp.py
The intersection of Papers 55 and 56 predicted that the entire weight tensor might be derivable from just PR (participation ratio) and V (vocabulary size). This paper tests that hypothesis.

Result: FALSIFIED for full weight prediction (R^2 = -0.97 for direct prediction), but PARTIALLY VALIDATED for optimal linear projection (R^2 = 0.383 at K=32). The corpus embedding covariance basis explains 38.3% of weight variance; the remaining 61.7% requires training.

QK^T does NOT correlate with the projected transition matrix (cos ≈ 0 across all layers): attention learns its own routing, independent of corpus bigrams.
| Layer | QK^T ~ T_proj cosine |
|---|---|
| L0 | 0.001 |
| L1 | -0.005 |
| L2 | 0.001 |
| L3 | -0.016 |
| L4 | -0.007 |
| L5 | -0.004 |
| L6 | 0.000 |
| L7 | -0.001 |
All correlations are essentially zero: the attention pattern QK^T does NOT learn to mimic the corpus bigram transition matrix T. Whatever routing attention learns, it is not corpus bigram transition routing.
This falsifies the hypothesis that attention weights are “just” transition matrix projections. Paper 57 showed the ASYMMETRY source aligns (cos=0.923 at L0), but the actual weight values do not.
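The near-zero values in the table can be read as plain flattened-matrix cosine similarity. A minimal sketch with toy matrices (the actual QK^T and T_proj are not reproduced here):

```python
import numpy as np

def flat_cosine(A: np.ndarray, B: np.ndarray) -> float:
    """Cosine similarity between two matrices, flattened to vectors."""
    a, b = A.ravel(), B.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy check: unrelated random matrices give a near-zero cosine, the same
# signature as the QK^T ~ T_proj rows in the table above.
rng = np.random.default_rng(0)
QK = rng.standard_normal((64, 64))
Tp = rng.standard_normal((64, 64))
print(round(flat_cosine(QK, QK), 3))     # 1.0 (identical matrices)
print(abs(flat_cosine(QK, Tp)) < 0.1)    # True (unrelated matrices)
```

For 64x64 matrices, the cosine of two independent Gaussian matrices concentrates around 0 with standard deviation roughly 1/64, so values like those in the table are indistinguishable from chance.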
| Source | Variance explained | R^2 |
|---|---|---|
| Corpus (linear, K=32) | 38.3% | 0.383 |
| Training | 61.7% | — |
| Direct corpus prediction | -96.8% | -0.968 |
| Random | -100.1% | -1.001 |
Optimal linear projection from corpus features explains 38.3%. Direct (nonlinear) corpus prediction fails catastrophically. Training contributes the majority of weight variance.
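The "optimal linear projection" measurement can be sketched as projecting weight rows onto the top-K eigenvectors of the corpus embedding covariance and scoring the reconstruction with R^2. This is an illustrative reconstruction under assumed shapes and names, not the experiment script itself:

```python
import numpy as np

def corpus_basis_r2(W: np.ndarray, E: np.ndarray, K: int) -> float:
    """Fraction of variance of W explained by projecting its rows onto
    the top-K eigenvectors of the corpus covariance E^T @ E."""
    eigvecs = np.linalg.eigh(E.T @ E)[1]   # columns, ascending eigenvalue
    U = eigvecs[:, -K:]                    # top-K basis, shape (d, K)
    W_hat = W @ U @ U.T                    # reconstruct from the basis
    ss_res = np.sum((W - W_hat) ** 2)
    ss_tot = np.sum((W - W.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy demo: weights built inside the top-8 subspace reconstruct perfectly;
# random weights mostly do not.
rng = np.random.default_rng(1)
E = rng.standard_normal((1000, 64))           # toy "corpus embeddings"
U8 = np.linalg.eigh(E.T @ E)[1][:, -8:]
W_in = rng.standard_normal((32, 8)) @ U8.T    # lives in the top-8 subspace
print(corpus_basis_r2(W_in, E, 8) > 0.99)     # True
print(corpus_basis_r2(rng.standard_normal((32, 64)), E, 8) < 0.5)  # True
```

Note that R^2 defined this way can go negative when the reconstruction is worse than predicting the mean, which is exactly what the "direct corpus prediction" and "random" rows show.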
From just PR=1.01 and V=15007:
- Architecture: L_opt = ceil(1.01/3) = 1 layer (actual: 8)
- Basis: 32 corpus eigenvectors (8,448 params, free)
- Scores: 1,484,736 params (requires training)
- Compression: 8x (1.48M vs 11.9M raw weight params)
The depth prediction fails because the symmetric corpus PR is only 1.01 (nearly rank-1). The actual model uses 8 layers, suggesting the depth formula requires a different PR metric or that 8 layers serve purposes beyond spectral mode capture.
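The depth formula L_opt = ceil(PR/3) operates on the participation ratio of a spectrum. A minimal sketch of PR, showing how a near-rank-1 spectrum yields PR ≈ 1 and hence a one-layer prescription:

```python
import math
import numpy as np

def participation_ratio(eigvals) -> float:
    """PR = (sum lambda)^2 / sum lambda^2: the effective rank of a spectrum."""
    lam = np.asarray(eigvals, dtype=float)
    return float(lam.sum() ** 2 / np.square(lam).sum())

# A spectrum dominated by one mode gives PR close to 1, so ceil(PR / 3)
# prescribes a single layer, while the trained model uses 8.
pr = participation_ratio([100.0, 0.5, 0.3, 0.2])
print(1.0 < pr < 1.1)     # True: near-rank-1, like the symmetric corpus PR
print(math.ceil(pr / 3))  # 1
# A flat spectrum recovers the full rank:
print(participation_ratio([1.0, 1.0, 1.0, 1.0]))  # 4.0
```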
Q weight vectors at L0 align at cosine 0.75 with T_proj's leading SVD vector. This is the only strong alignment found: L0's query matrix partially inherits the dominant corpus direction, while all other layers diverge. The corpus signal is consumed by L0 and does not propagate deeper.
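A synthetic illustration of what cosine-0.75 alignment with the leading singular direction means (dimensions, names, and matrices here are toy assumptions, not the measured model):

```python
import numpy as np

def leading_svd_alignment(W_Q: np.ndarray, T_proj: np.ndarray) -> float:
    """Max |cosine| between rows of a query weight matrix and the
    leading left singular vector of T_proj."""
    u1 = np.linalg.svd(T_proj)[0][:, 0]   # unit leading direction
    rows = W_Q / np.linalg.norm(W_Q, axis=1, keepdims=True)
    return float(np.max(np.abs(rows @ u1)))

# Synthetic check: plant one row at exactly cosine 0.75 to u1.
rng = np.random.default_rng(3)
T = rng.standard_normal((64, 64))
u1 = np.linalg.svd(T)[0][:, 0]
n = rng.standard_normal(64)
n -= (n @ u1) * u1                        # remove the u1 component
n /= np.linalg.norm(n)                    # unit vector orthogonal to u1
W = rng.standard_normal((7, 64)) * 0.01   # weak background rows
W[0] = 0.75 * u1 + np.sqrt(1 - 0.75 ** 2) * n
print(round(leading_svd_alignment(W, T), 2))  # 0.75
```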
The two-number model achieves 38.3% — far from the 100% needed for zero-training CT. Over 60% of weight variance comes from training. The dream of f(PR, V) → complete model is falsified at 10.2M scale.
The 38/62 split is remarkably consistent with Paper 71’s finding that corpus basis is 94% as good as weight basis (for the basis) but Paper 73’s finding that the residual is 197-dimensional and diffuse. The corpus provides the coordinate system; training provides the coordinates.
QK^T ≈ 0 correlation with T_proj definitively shows that attention does NOT learn corpus bigram routing. Attention discovers its own routing strategy that is orthogonal to corpus statistics. This is the clearest evidence yet that attention patterns are emergent, not derivable.
PR_symmetric = 1.01 gives L_opt = 1, but the model has 8 layers. Either:
- The depth formula should use a different PR (perhaps from trained weights, not corpus)
- 8 layers serve a different purpose than spectral mode capture
- The formula only applies to certain PR regimes (PR >> 1)
| What | Source | Params | Free? |
|---|---|---|---|
| Architecture | PR → L_opt (wrong) | 0 | Partially |
| Basis directions | E^T@E eigenvectors | 8,448 | Yes |
| Basis weights (scores) | Training | 1,484,736 | No |
| Non-weight params | Training | 4,352,512 | No |
| Weight structure | Basis reconstruction | 10,393,152 | Yes |
With the K=32 basis, 87.5% of weight params are derivable; 12.5% require training.
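A quick arithmetic check of the 87.5/12.5 split and the 8x compression, assuming (as the table suggests) that the raw weight count is the sum of the derivable structure and the trained scores:

```python
# Consistency check on the parameter budget above.
score_params = 1_484_736        # basis weights (scores), trained
derivable_params = 10_393_152   # weight structure, reconstructed for free
raw_weight_params = score_params + derivable_params

print(raw_weight_params)                     # 11877888, i.e. the "11.9M" figure
print(derivable_params / raw_weight_params)  # 0.875 -> 87.5% derivable
print(raw_weight_params / score_params)      # 8.0  -> the 8x compression
```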
“The corpus gives you the stage. The script requires rehearsal.”