Paper 65: PR Spreading – Bridging the Concentration Gap

Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07
Status: PARTIAL – PR locked at 2.0, but corpus init still helps SFT
Experiment: mascom_data/ct_experiment/pr_spreading_exp.py

Abstract

Paper 57 showed corpus bigram asymmetry matches trained attention asymmetry (spectral cosine 0.923 at L0) but the Participation Ratio gap (PR=2.0 corpus vs PR=8.75 trained) prevents direct injection. This paper tests whether higher-order transition powers (T^2 through T^8) can spread the PR. They CANNOT – PR is locked at exactly 2.00 across all powers. The antisymmetric component of the projected transition matrix is fundamentally 2-dimensional. However, corpus-derived asymmetry injection still improves SFT convergence by 3.2% over random initialization.

Key Results

PR is Locked at 2.0

| Transition Power | Antisymmetric PR |
|------------------|------------------|
| T^1              | 2.00             |
| T^2              | 2.00             |
| T^3              | 2.00             |
| T^4              | 2.00             |
| T^5              | 2.00             |
| T^6              | 2.00             |
| T^7              | 2.00             |
| T^8              | 2.00             |

Target PR: 8.75 (mean trained attention asymmetry PR)
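The paper does not restate the definition it uses; presumably PR here is the standard spectral participation ratio, which measures how many dimensions a spectrum is effectively spread over. A minimal sketch over singular values (my helper name, not the experiment's):

```python
import numpy as np

def participation_ratio(M):
    """PR = (sum of singular values)^2 / (sum of squared singular values).

    A rank-k matrix with k equal singular values has PR exactly k;
    a spectrum concentrated on one value has PR near 1.
    """
    s = np.linalg.svd(M, compute_uv=False)
    return (s.sum() ** 2) / (s ** 2).sum()

# Two equal nonzero singular values -> PR = 2.0
print(participation_ratio(np.diag([3.0, 3.0, 0.0])))  # -> 2.0
```

Under this definition, PR = 2.00 means the antisymmetric signal occupies exactly two effective dimensions, while the trained target of 8.75 occupies almost nine.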

The PR=2.0 lock is a fundamental property: when the transition matrix T is projected through embeddings E (T_proj = E^T @ T @ E), the antisymmetric component lives in a 2D subspace regardless of T’s power. Taking powers of T reshapes the transition spectrum, but the projection through E imposes a rank constraint, so the effective dimensionality of the antisymmetric part is set by the embedding basis, not by T.
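The lock is easy to reproduce in a toy setting. This is a minimal sketch, not the paper's pr_spreading_exp.py: it assumes a row-stochastic bigram-style T, an embedding E whose variance concentrates in one direction (matching the gram PR ~ 1.0 reported below), and PR computed over singular values.

```python
import numpy as np

def participation_ratio(M):
    # PR of the singular-value spectrum: (sum s)^2 / sum s^2
    s = np.linalg.svd(M, compute_uv=False)
    return (s.sum() ** 2) / (s ** 2).sum()

rng = np.random.default_rng(0)
vocab, d_model = 200, 32

# Embedding dominated by a single direction (gram PR ~ 1), plus small noise
E = np.outer(rng.normal(size=vocab), rng.normal(size=d_model))
E += 1e-3 * rng.normal(size=(vocab, d_model))

# Row-stochastic transition matrix standing in for corpus bigram statistics
T = rng.random((vocab, vocab))
T /= T.sum(axis=1, keepdims=True)

prs = []
for k in range(1, 9):  # T^1 .. T^8
    P = E.T @ np.linalg.matrix_power(T, k) @ E  # d_model x d_model projection
    A = 0.5 * (P - P.T)                         # antisymmetric component
    prs.append(participation_ratio(A))
    print(f"T^{k}: antisymmetric PR = {prs[-1]:.2f}")
```

With a concentrated E, every power prints a PR pinned near 2.00; replacing E with an isotropic Gaussian matrix lets the antisymmetric PR spread, which is consistent with the claim that the embedding basis, not the transition power, sets the lock.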

SFT Convergence

| Init Strategy    | Loss @ 100 steps | Δ vs Random |
|------------------|------------------|-------------|
| Random           | 2.156            | baseline    |
| Direct T^1       | 2.095            | -2.8%       |
| PR-matched blend | 2.086            | -3.2%       |
| Trained oracle   | 1.951            | -9.5%       |

Spectral Alignment

| Layer | Blend <-> Trained cos |
|-------|-----------------------|
| L0    | 0.907                 |
| L7    | 0.718                 |
| L5    | 0.703                 |
| Mean  | 0.678                 |

Why PR is Locked

The antisymmetric PR=2.0 is not a coincidence – it reflects a deep structural property. When T is projected through E (vocab x d_model), the resulting d_model x d_model matrix’s antisymmetric part is constrained by E’s spectral structure. The embedding matrix E has PR ~ 1.0 in its symmetric gram matrix (E^T@E), meaning most embedding variance is in a single direction, so the projection T_proj is approximately rank-1. The antisymmetric part of a rank-1 matrix has rank at most 2, with two equal singular values – which forces PR to be exactly 2.
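The final step is a small linear-algebra fact worth checking directly: for any rank-1 matrix x y^T, the antisymmetric part (x y^T − y x^T)/2 has exactly two nonzero singular values, and they are equal, so its PR is exactly 2. A quick numerical check (my notation, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=32), rng.normal(size=32)

# Antisymmetric part of the rank-1 matrix x y^T
A = 0.5 * (np.outer(x, y) - np.outer(y, x))
s = np.linalg.svd(A, compute_uv=False)

nonzero = s[s > 1e-10 * s[0]]
print(len(nonzero))          # 2 nonzero singular values
pr = (s.sum() ** 2) / (s ** 2).sum()
print(round(pr, 6))          # -> 2.0
```

Since an antisymmetric real matrix has purely imaginary eigenvalues in conjugate pairs, its nonzero singular values always come in equal pairs; a rank-2 antisymmetric matrix therefore cannot have PR other than 2.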

The PR gap is not solvable by corpus statistics alone. The spreading from PR=2 to PR=8.75 is a learned property of attention that cannot be derived from the corpus. This is a fundamental limit of training-free initialization.

Implications

The 2D Asymmetry Floor

All corpus-derived asymmetry lives in a 2D subspace. The remaining 6.75 dimensions of trained asymmetry are CREATED by training, not inherited from the corpus. This is the irreducible training contribution to attention asymmetry.

For Zero-SFT

Zero-SFT attention remains blocked. The corpus provides the DIRECTION of asymmetry (cosine 0.907 at L0) but not the SPREAD. Training must discover how to distribute the asymmetric signal across more dimensions.

Practical Value

Despite the fundamental limit, corpus-derived init provides a 3.2% head start. For large-scale training where every step is expensive, this is worth using.


“The corpus shows you two directions. Training discovers the other seven.”