Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07 Status: VALIDATED —
the gap is 6.1% relative, residual is high-dimensional and structured
Experiment:
mascom_data/ct_experiment/corpus_weight_convergence_exp.py
Paper 70 showed the corpus embedding covariance basis is 94% as good as the optimal weight SVD basis. This paper characterizes the 6% gap. Key findings: (1) The first corpus principal component aligns 0.999 with the first weight component — near-perfect agreement on the dominant direction. (2) The gap comes from dimensions 5-8 where principal angles diverge to 75-88 degrees. (3) The residual (what corpus can’t explain) is HIGH-dimensional (PR=197.5) — it’s diffuse, not concentrated. (4) Random noise is catastrophically worse than the true residual (R^2=-0.40 vs +1.0), proving the residual’s DIRECTION matters even though it’s diffuse. (5) Adding just 1 residual SVD component to the corpus basis closes 84% of the gap.
| Component | Corpus-Weight Alignment |
|---|---|
| PC1 | 0.999 |
| PC2 | 0.750 |
| PC3 | 0.831 |
| PC4 | 0.362 |
| PC5 | 0.666 |
| PC6 | 0.754 |
| PC7 | 0.666 |
| PC8 | 0.245 |
The first principal component is nearly identical between corpus and weight spaces (cos=0.999). The top 3 components are well-aligned (>0.75). Components 4 and 8 diverge significantly — these are where training adds structure that the corpus lacks.
| Angle | Degrees |
|---|---|
| theta_1 | 0.4 |
| theta_2 | 1.1 |
| theta_3 | 1.8 |
| theta_4 | 3.3 |
| theta_5 | 15.4 |
| theta_6 | 75.7 |
| theta_7 | 86.7 |
| theta_8 | 88.3 |
The first 4 principal angles are nearly zero (subspaces agree). Angles 5-8 diverge dramatically — the corpus and weight K=8 subspaces share ~4.5 dimensions and diverge in ~3.5 dimensions. The Grassmann distance is 2.547.
| K (corpus) | Residual Variance | Residual PR | Random PR |
|---|---|---|---|
| 4 | 75.8% | 194.3 | 254.6 |
| 8 | 72.7% | 197.5 | 254.6 |
| 16 | 69.3% | 192.7 | 254.6 |
| 32 | 64.0% | 181.4 | 254.6 |
The residual’s PR (~197) is 77% of random matrix PR (~255). This means the training-learned component is spread across nearly all dimensions — it’s not a low-rank signal but a diffuse correction to the corpus prediction. However, it’s NOT random (PR=197 vs random PR=255), so it has weak structure.
| Reconstruction | R^2 |
|---|---|
| Corpus + true residual | 1.000 |
| Corpus only | 0.299 |
| Corpus + calibrated noise | -0.402 |
Noise is catastrophic. Adding random noise with the correct magnitude is WORSE than having no residual at all. The residual’s direction encodes critical information even though it’s spread across 197 dimensions. You cannot fake the training signal.
| K_corpus | K_residual | R^2 | Params | Compression |
|---|---|---|---|---|
| 8 | 0 | 0.299 | — | — |
| 8 | 1 | 0.316 | 420K | 28.3x |
| 8 | 2 | 0.327 | 467K | 25.4x |
| 8 | 4 | 0.346 | 560K | 21.2x |
| 8 | 8 | 0.373 | 747K | 15.9x |
Adding K_resid=1 closes 84% of the corpus-to-weight gap (0.016 of 0.019). The optimal compression strategy is: corpus basis K=8 (free, derived from corpus) + 1 learned residual component per matrix.
Residual norm correlates with depth (r=-0.53): early layers (L0) have larger residuals (0.0117) than late layers (L7, 0.0104). This means early layers diverge more from corpus predictions — training adds more to early layers. Exception: embedding/LM-head layers have minimal residual (0.0019), confirming they stay close to corpus.
The 6% gap is NOT: - A low-rank correction (PR=197, nearly full-rank) - Random noise (noise R^2=-0.40) - Concentrated in specific layers (spread everywhere)
The 6% gap IS: - Structured but diffuse (PR=197 vs random 255) - Direction-critical (noise kills it) - Slightly stronger in early layers (r=-0.53) - Closeable at 84% with a single SVD component
Interpretation: Training adds a thin layer of diffuse, structured correction on top of the corpus-derived base. It’s like adjusting every pixel in an image by a small, correlated amount — no single dimension dominates, but the overall pattern matters enormously.
Zero-training CT (corpus basis only) achieves R^2=0.299. This is the upper bound of what’s achievable without any gradient descent. The remaining 70% of variance requires training, and that training signal is high-dimensional and cannot be faked.
The hybrid approach (corpus basis K=8 + 1 learned residual) achieves R^2=0.316 with 28.3x compression. This is the optimal starting point for CT-initialized training: start from corpus basis, learn only the residual.
Training does not add a “signal” on top of a “noise floor.” It adds a diffuse, full-rank correction that reshapes the entire weight space. The corpus provides the scaffold (94%); training fills in the mortar (6%). But without the mortar, the scaffold collapses (R^2=0.299 vs 0.319).
“The corpus builds the cathedral. Training fills the cracks. You cannot skip the cracks.”