Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07
Status: PARTIAL – PR locked at 2.0, but corpus init still helps SFT
Experiment: mascom_data/ct_experiment/pr_spreading_exp.py
Paper 57 showed corpus bigram asymmetry matches trained attention asymmetry (spectral cosine 0.923 at L0) but the Participation Ratio gap (PR=2.0 corpus vs PR=8.75 trained) prevents direct injection. This paper tests whether higher-order transition powers (T^2 through T^8) can spread the PR. They CANNOT – PR is locked at exactly 2.00 across all powers. The antisymmetric component of the projected transition matrix is fundamentally 2-dimensional. However, corpus-derived asymmetry injection still improves SFT convergence by 3.2% over random initialization.
| Transition Power | Antisymmetric PR |
|---|---|
| T^1 | 2.00 |
| T^2 | 2.00 |
| T^3 | 2.00 |
| T^4 | 2.00 |
| T^5 | 2.00 |
| T^6 | 2.00 |
| T^7 | 2.00 |
| T^8 | 2.00 |
Target PR: 8.75 (mean trained attention asymmetry PR)
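The power sweep can be reproduced with a small sketch. The matrices below are toy stand-ins (the real experiment lives in pr_spreading_exp.py), and the PR definition is an assumption – (Σσ²)²/Σσ⁴ over singular values – which Paper 57 may state differently:

```python
import numpy as np

def participation_ratio(s):
    # Assumed PR definition: (sum s_i^2)^2 / (sum s_i^4) over singular values.
    s2 = s ** 2
    return s2.sum() ** 2 / (s2 ** 2).sum()

rng = np.random.default_rng(0)
vocab, d_model = 200, 16

# Toy stand-in for the corpus bigram transition matrix (row-stochastic).
T = rng.random((vocab, vocab))
T /= T.sum(axis=1, keepdims=True)

# Toy embeddings with one dominant direction, mimicking the near-rank-1
# gram structure of trained embeddings described later in this note.
E = np.outer(rng.normal(size=vocab), rng.normal(size=d_model))
E += 0.05 * rng.normal(size=(vocab, d_model))

prs = []
Tk = np.eye(vocab)
for k in range(1, 9):
    Tk = Tk @ T                          # T^k
    T_proj = E.T @ Tk @ E                # project down to d_model x d_model
    A = 0.5 * (T_proj - T_proj.T)        # antisymmetric component
    s = np.linalg.svd(A, compute_uv=False)
    prs.append(participation_ratio(s))
    print(f"T^{k}: antisymmetric PR = {prs[-1]:.2f}")
```

With a concentrated embedding spectrum, the printed PR stays pinned near 2 for every power, matching the table.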
The PR=2.0 lock is a fundamental property: when the transition matrix T is projected through the embeddings E (vocab x d_model), giving T_proj = E^T @ T @ E, the antisymmetric component lives in a 2D subspace regardless of the power of T. The projection through E imposes a rank-2 constraint on the antisymmetric part – the embedding basis itself, not the corpus statistics, determines the PR.
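The exactness of the 2.00 follows from a general linear-algebra fact: the nonzero singular values of a real antisymmetric matrix come in equal pairs, so a rank-2 antisymmetric matrix has PR of exactly 2 under a (Σσ²)²/Σσ⁴ definition. A minimal check with generic vectors (nothing corpus-specific):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=8)
y = rng.normal(size=8)

A = np.outer(x, y) - np.outer(y, x)          # rank-2 antisymmetric matrix
s = np.linalg.svd(A, compute_uv=False)

# Two equal nonzero singular values; the remaining six are numerically zero,
# so the participation ratio is exactly 2 up to float precision.
s2 = s ** 2
pr = s2.sum() ** 2 / (s2 ** 2).sum()
print(f"sigma_1 = {s[0]:.6f}, sigma_2 = {s[1]:.6f}, PR = {pr:.6f}")
```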
| Init Strategy | Loss @ 100 steps | vs Random |
|---|---|---|
| Random | 2.156 | baseline |
| Direct T^1 | 2.095 | -2.8% |
| PR-matched blend | 2.086 | -3.2% |
| Trained oracle | 1.951 | -9.5% |
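The PR-matched blend is not spelled out in this note; one plausible sketch (the construction below is an assumption, not the recipe from pr_spreading_exp.py) blends the rank-2 corpus-derived asymmetry with a random antisymmetric matrix and tunes the mixing weight until the blend's PR hits the trained target of 8.75. Here d_model = 32 is chosen so the random matrix's PR sits above the target:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, target_pr = 32, 8.75

def antisym_pr(A):
    # PR over singular values: (sum s^2)^2 / (sum s^4).
    s2 = np.linalg.svd(A, compute_uv=False) ** 2
    return s2.sum() ** 2 / (s2 ** 2).sum()

# Stand-in for the corpus-derived asymmetry: rank-2 antisymmetric, PR = 2.
x, y = rng.normal(size=(2, d_model))
A_corpus = np.outer(x, y) - np.outer(y, x)
A_corpus /= np.linalg.norm(A_corpus)

# Random antisymmetric matrix: PR well above the target.
R = rng.normal(size=(d_model, d_model))
A_rand = (R - R.T) / np.linalg.norm(R - R.T)

# Grid-search the mixing weight so the blend's PR matches the target.
alphas = np.linspace(0.0, 1.0, 2001)
prs = np.array([antisym_pr((1 - a) * A_corpus + a * A_rand) for a in alphas])
best = alphas[np.abs(prs - target_pr).argmin()]
blend = (1 - best) * A_corpus + best * A_rand
print(f"alpha = {best:.3f}, blend PR = {antisym_pr(blend):.2f}")
```

The point of the sketch is the interpolation: PR moves continuously from 2 (pure corpus) toward the random matrix's PR, so some mixing weight lands on 8.75.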
| Layer | Cos(blend asym, trained asym) |
|---|---|
| L0 | 0.907 |
| L7 | 0.718 |
| L5 | 0.703 |
| Mean | 0.678 |
The antisymmetric PR=2.0 is not a coincidence – it reflects a deep structural property. When T is projected through E (vocab x d_model), the resulting d_model x d_model matrix's antisymmetric part is constrained by E's spectral structure. The embedding matrix E has PR ~ 1.0 in its gram matrix (E^T @ E), meaning most embedding variance lies along a single direction, and the antisymmetric projection inherits this concentration.
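The claimed concentration is easy to check on a toy embedding with one dominant direction (a stand-in, not the trained E; PR here is computed over the gram eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, d_model = 200, 16

# Toy embedding: one dominant direction plus small isotropic noise.
E = np.outer(rng.normal(size=vocab), rng.normal(size=d_model))
E += 0.05 * rng.normal(size=(vocab, d_model))

G = E.T @ E                              # d_model x d_model gram matrix
lam = np.linalg.eigvalsh(G)              # eigenvalues, ascending
pr = lam.sum() ** 2 / (lam ** 2).sum()   # PR over the gram spectrum
print(f"gram PR = {pr:.3f}")             # ~1: variance in one direction
```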
The PR gap is not solvable by corpus statistics alone. The spreading from PR=2 to PR=8.75 is a learned property of attention that cannot be derived from the corpus. This is a fundamental limit of training-free initialization.
All corpus-derived asymmetry lives in a 2D subspace. The remaining 6.75 dimensions of trained asymmetry are CREATED by training, not inherited from the corpus. This is the irreducible training contribution to attention asymmetry.
Zero-SFT attention remains blocked. The corpus provides the DIRECTION of asymmetry (cosine 0.907 at L0) but not the SPREAD. Training must discover how to distribute the asymmetric signal across more dimensions.
Despite the fundamental limit, corpus-derived init provides a 3.2% head start. For large-scale training where every step is expensive, this is worth using.
“The corpus shows you two directions. Training discovers the other seven.”