Paper 54 – A Negative Result That Reveals Deeper Structure

Author: Claudine + John Mobley
Date: 2026-03-07
Paper 51 (Crystallization Transform) demonstrated that K0 MobiusKernel embeddings plus 2000 SFT steps match full SGD training within 2.7% on perplexity. Section 4.1 conjectured that the remaining trainable component – attention Q/K/V projection matrices – could also be derived directly from corpus co-occurrence statistics, mapping Q to attention-weighted K_2, K to reverse K_2, and V to K_3+. We test this conjecture empirically by extracting trained Q/K/V matrices from an 8-layer 10.2M-parameter PhotonicGPT model, computing K_2 (bigram), K_3 (trigram/skip-bigram), and E^T@E statistics from the same corpus, projecting all matrices into the shared co-occurrence eigenbasis, and measuring correlations. The conjecture is falsified. All matrix-level correlations between trained Q/K/V and any view of K_2, K_3, K_2^T, or E^T@E are indistinguishable from zero (|r| < 0.012 in all cases, mean |r| < 0.002). However, this negative result reveals five structural invariants of trained attention that suggest a different path to attention crystallization: (1) Q and K share identical spectral profiles (KL divergence < 0.0014) while V has a qualitatively different spectrum (KL > 0.015); (2) V norm grows monotonically through layers (ratio V/Q increases from 0.35 to 0.91); (3) Q@K^T is strongly low-rank (top-8 components capture 69-92% of energy); (4) the attention bilinear form Q^T@K is strongly asymmetric (||W-W^T||/||W|| > 1.37); and (5) SFT from crystallized init and full SGD converge to uncorrelated solutions that achieve equivalent loss. These findings indicate that attention matrices encode relational geometry that is not a direct function of token co-occurrence statistics, but emerges from the optimization dynamics themselves.
The Crystallization Transform (CT) derives neural network weights
directly from corpus statistics without gradient descent. Paper 51
validated this for embedding layers: MobiusKernel K0 embeddings, derived
from bigram co-occurrence via
W = ifft2(fft2(D) * fft2(circ(k[0]))), when used as frozen
token embeddings in a PhotonicGPT model with 2000 steps of SFT on the
remaining parameters, achieved:
| Method | Loss | Perplexity |
|---|---|---|
| Random init (no training) | 9.62 | 15007 |
| CT embeddings (no training) | 7.97 | 2881 |
| CT + 2000 SFT steps | 4.41 | 82.0 |
| Full SGD (V1 baseline) | 4.38 | 79.9 |
| Gap | – | 2.7% |
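The Paper 51 formula can be sketched as a 2-D circular convolution implemented in the Fourier domain. This is a minimal NumPy reading of the expression above; the shapes and kernel here are illustrative toys, not the actual MobiusKernel:

```python
import numpy as np

def crystallize_embeddings(D, k0):
    """W = ifft2(fft2(D) * fft2(circ(k0))): multiplying in the Fourier
    domain circularly convolves D with the kernel k0.  D and k0 are
    illustrative stand-ins for the co-occurrence-derived matrix and the
    MobiusKernel k[0] of Paper 51."""
    circ = np.zeros_like(D, dtype=float)
    circ[:k0.shape[0], :k0.shape[1]] = k0      # zero-pad kernel to D's shape
    W = np.fft.ifft2(np.fft.fft2(D) * np.fft.fft2(circ))
    return np.real(W)                          # imaginary residue is numerical noise

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 8))                # toy "corpus statistics" matrix
k0 = np.array([[1.0, 0.5], [0.5, 0.25]])       # toy 2x2 kernel
W = crystallize_embeddings(D, k0)              # same shape as D
```

The Fourier route computes the full circular convolution in O(n^2 log n) rather than O(n^4), which is why the transform is practical at vocabulary scale.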
Paper 51 Section 4.1 conjectured a direct mapping from corpus statistics to attention weights:
| Weight Matrix | Conjectured Corpus View |
|---|---|
| Q projection | K_2 weighted by attention pattern |
| K projection | K_2^T (reverse co-occurrence) |
| V projection | K_3+ (higher-order statistics) |
If validated, this would eliminate the remaining 2000 SFT steps, achieving true zero-gradient model synthesis. This paper tests the conjecture.
We extract the trained Q/K/V projections from each layer's c_attn.weight (768 x 256, concatenating the three 256 x 256 blocks for Q, K, and V) of the 8-layer PhotonicGPT model, which uses RoPE positional encoding. The corpus statistics are computed from 500,000 tokens of the enwik corpus (covering 6,047 unique tokens):
The conjecture is falsified. Mean correlations across all 8 layers:
| Matrix | corr(K_2) | corr(K_2^T) | corr(K_3) | corr(K_3_skip) | corr(E^T@E) | corr(I) |
|---|---|---|---|---|---|---|
| Q | -0.0003 | 0.0004 | -0.0004 | -0.0004 | 0.0004 | -0.0002 |
| K | 0.0003 | 0.0005 | 0.0001 | 0.0001 | 0.0002 | 0.0013 |
| V | 0.0009 | 0.0017 | 0.0012 | 0.0012 | 0.0001 | 0.0004 |
No correlation exceeds |r| = 0.012 in any individual layer. These values are indistinguishable from noise for matrices of size 256 x 256 (expected noise floor: r ~ 1/sqrt(65536) ~ 0.004).
Specifically tested:

- Q does NOT correlate with attention-weighted K_2 (r < 0.009)
- K does NOT correlate with K_2^T (r < 0.007)
- V does NOT correlate with K_3 or K_3_skip (r < 0.012)
- No matrix correlates with the E^T@E eigenbasis
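The matrix-level correlation used in these tests can be computed as a plain Pearson correlation over flattened entries. This is our reading of the metric; the analysis script may differ in detail:

```python
import numpy as np

def matrix_corr(A, B):
    """Pearson correlation between two same-shaped matrices, treated as
    flat vectors.  For independent 256 x 256 matrices this fluctuates
    around zero with scale ~ 1/sqrt(65536) ~ 0.004 - the noise floor
    quoted in the text."""
    a = A.ravel() - A.mean()
    b = B.ravel() - B.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
r = matrix_corr(A, B)    # |r| on the order of 0.004 for unrelated matrices
```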
If different layers used different spectral bands of K_2 (as Paper 51 conjectured for depth crystallization), we would see non-uniform energy distributions when projecting Q/K/V into K_2’s eigenbasis. Instead:
| Band (of K_2 eigenbasis) | Q energy | K energy | V energy | Expected (uniform) |
|---|---|---|---|---|
| 0-32 (top eigenvectors) | 11.8% | 11.9% | 12.8% | 12.5% |
| 32-64 | 12.3% | 12.4% | 12.5% | 12.5% |
| 64-128 | 25.2% | 25.1% | 24.7% | 25.0% |
| 128-256 | 50.6% | 50.6% | 50.0% | 50.0% |
The distribution is essentially uniform – each spectral band captures energy proportional to its dimensionality. The trained Q/K/V matrices are spectrally unstructured with respect to the K_2 eigenbasis.
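The band test amounts to projecting each weight matrix onto the K_2 eigenvectors and summing squared coefficients per band. A sketch (the orthonormal basis below is random, standing in for the real K_2 eigenbasis):

```python
import numpy as np

def band_energy(W, eigvecs, bands=((0, 32), (32, 64), (64, 128), (128, 256))):
    """Fraction of W's Frobenius energy landing in each band of the
    eigenbasis whose columns are 'eigvecs'.  A spectrally unstructured W
    puts energy in each band proportional to the band's width."""
    C = eigvecs.T @ W                   # coefficients of W's columns in the basis
    total = np.sum(C ** 2)
    return [float(np.sum(C[lo:hi] ** 2) / total) for lo, hi in bands]

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((256, 256)))   # stand-in eigenbasis
W = rng.standard_normal((256, 256))
energies = band_energy(W, basis)        # ~ [0.125, 0.125, 0.25, 0.50]
```

A random W reproduces the uniform distribution in the table above, which is exactly what makes the observed uniformity evidence of *absent* structure.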
While Q/K/V do not correlate with corpus statistics, they exhibit strong internal structure:
Q and K have nearly identical singular value profiles (KL divergence < 0.0014 in all layers), while both are significantly different from V (KL(Q||V) > 0.015):
| Layer | KL(Q \|\| K) | KL(Q \|\| V) | KL(Q \|\| MP) | KL(K \|\| MP) | KL(V \|\| MP) |
|---|---|---|---|---|---|
| 0 | 0.0006 | 0.0424 | 0.0603 | 0.0574 | 0.0021 |
| 1 | 0.0014 | 0.0241 | 0.0484 | 0.0653 | 0.0090 |
| 4 | 0.0007 | 0.0156 | 0.0317 | 0.0312 | 0.0034 |
| 7 | 0.0009 | 0.0214 | 0.0278 | 0.0219 | 0.0008 |
Q and K deviate from random (Marchenko-Pastur) by 3-6%, with their mutual deviation being 10-20x smaller. V is nearly random spectrally (KL < 0.009 from MP). Training forces Q and K to develop matched spectral structure while leaving V near-random in spectrum.
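The spectral-twinning metric can be read as a KL divergence between normalized singular-value profiles (our assumed definition; the MP columns would substitute a Marchenko-Pastur reference profile for one argument):

```python
import numpy as np

def spectral_kl(A, B, eps=1e-12):
    """KL divergence between the normalized singular-value profiles of A
    and B.  Scale-invariant, so KL(Q||K) ~ 0 means matched spectral
    *shape*, not matched norms."""
    p = np.linalg.svd(A, compute_uv=False)
    q = np.linalg.svd(B, compute_uv=False)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
```

Scale invariance matters here: Q and K can twin spectrally (KL < 0.0014) even though, per the norm table below, their Frobenius norms differ.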
The Frobenius norm of V grows monotonically through layers while Q and K remain stable:
| Layer | \|\|Q\|\|_F | \|\|K\|\|_F | \|\|V\|\|_F | V/Q ratio |
|---|---|---|---|---|
| 0 | 16.47 | 17.17 | 5.83 | 0.354 |
| 2 | 15.02 | 14.23 | 9.31 | 0.620 |
| 5 | 15.33 | 14.30 | 11.73 | 0.765 |
| 6 | 15.01 | 13.95 | 13.70 | 0.913 |
| 7 | 15.66 | 14.68 | 13.04 | 0.833 |
V starts at 35% of Q’s norm in layer 0 and grows to 91% by layer 6. This suggests a “crescendo” pattern: early layers suppress value throughput (attend but don’t transform), later layers amplify it (transform more aggressively).
The product Q@K^T (which determines attention score geometry) is strongly low-rank:
| Layer | Effective Rank | Top-8 Energy | Top-32 Energy |
|---|---|---|---|
| 0 | 38.4 | 78.1% | 95.1% |
| 1 | 19.2 | 91.9% | 97.4% |
| 6 | 53.6 | 69.8% | 90.2% |
| Mean | 36.2 | 81.7% | 94.1% |
Out of 256 possible dimensions, attention operates in an effective subspace of 19-54 dimensions. Layer 1 is particularly compressed (eff_rank=19.2), suggesting it captures the most focused syntactic pattern. Later layers are broader.
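Both statistics follow from the singular spectrum of Q @ K^T. A sketch, assuming the entropy-based definition of effective rank (exponential of the spectral entropy), which may differ from the analysis script's choice:

```python
import numpy as np

def kernel_stats(Q, K, top=8):
    """Effective rank (exp of the entropy of the normalized squared
    singular values) and the top-k share of spectral energy for the
    attention kernel Q @ K.T."""
    s = np.linalg.svd(Q @ K.T, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)
    eff_rank = float(np.exp(-np.sum(p * np.log(p + 1e-12))))
    return eff_rank, float(np.sum(p[:top]))

# Sanity check: an isotropic 256-d kernel has full effective rank and a
# top-8 energy share of exactly 8/256.
I = np.eye(256)
eff, top8 = kernel_stats(I, I)
```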
The attention bilinear form W_attn = Q^T @ K is strongly asymmetric:
| Layer | \|\|W - W^T\|\|_F / \|\|W\|\|_F | Real eigenvalue mass |
|---|---|---|
| All layers | 1.37 - 1.44 | 0.61 - 0.70 |
The norm of W - W^T exceeds the norm of W itself by 37-44% – the antisymmetric part dominates. Only 61-70% of eigenvalue mass lies on the real axis. Attention is therefore fundamentally directional: score(A->B) differs from score(B->A). This directionality cannot come from symmetric co-occurrence statistics.
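Both asymmetry measures are simple to compute. A sketch, where "real eigenvalue mass" is taken as the share of summed |Re(lambda)| in summed |lambda| (our assumed reading of that column):

```python
import numpy as np

def bilinear_asymmetry(W):
    """||W - W^T||_F / ||W||_F, plus the fraction of eigenvalue mass on
    the real axis.  A symmetric W scores (0.0, 1.0); an i.i.d. random W
    scores close to sqrt(2) ~ 1.414 on the first measure, so the
    observed 1.37-1.44 is near-maximal asymmetry."""
    ratio = np.linalg.norm(W - W.T) / np.linalg.norm(W)
    ev = np.linalg.eigvals(W)
    real_mass = np.sum(np.abs(ev.real)) / np.sum(np.abs(ev))
    return float(ratio), float(real_mass)

rng = np.random.default_rng(0)
S = rng.standard_normal((64, 64))
S = S + S.T                                  # symmetric: no antisymmetric part
ratio_sym, mass_sym = bilinear_asymmetry(S)  # ~ (0.0, 1.0)
```

That an i.i.d. random matrix already sits near 1.41 suggests the trained bilinear form is about as directional as a bilinear form can be.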
Comparing the CT+SFT model (crystallized embeddings + 2000 SFT steps) to the V1 model (full SGD from random init):
| Layer | SFT Drift (Q) | SFT Drift (V) | corr(SFT_Q, V1_Q) | corr(SFT_V, V1_V) |
|---|---|---|---|---|
| 0 | 0.918 | 4.308 | -0.0028 | -0.0056 |
| 4 | 0.781 | 5.292 | 0.0093 | -0.0002 |
| 7 | 0.714 | 4.232 | -0.0038 | 0.0016 |
Despite achieving nearly identical loss (4.41 vs 4.38), the two models have completely uncorrelated attention weights (|r| < 0.01). SFT and SGD find different solutions in weight space that project to the same loss surface minimum. V matrices drift 4-7x more than Q/K during SFT, consistent with V being the primary information-carrying component.
The conjecture assumed attention matrices encode “views” of corpus statistics – that Q learns to query the bigram distribution, K learns the reverse, and V extracts trigram content. This is wrong for a fundamental reason: attention matrices encode relational geometry, not distributional statistics.
Consider what Q and K actually compute. For tokens x_i and x_j, the attention score is:
score(i, j) = (x_i @ Q^T) @ K @ x_j^T / sqrt(d)
This is a bilinear form in embedding space. It does not ask “how often does token j follow token i in the corpus” (that would be K_2). It asks “given the geometric relationship between the embedding vectors of tokens i and j, should token j attend to token i?” These are different questions.
Co-occurrence statistics are symmetric in a deep sense: P(A follows B) is a property of the corpus, independent of where in the sentence we are. But attention is context-dependent and directional. The same token pair gets different scores at different positions, in different sentences, with different surrounding context. RoPE positional encoding further breaks any static correspondence between attention weights and co-occurrence.
Why do Q and K develop matched spectral profiles? Because attention
scores are computed as softmax(Q @ K^T / sqrt(d)). If Q and
K had mismatched singular value profiles, some directions in embedding
space would be amplified by Q but attenuated by K (or vice versa),
creating dead dimensions in attention. Training naturally eliminates
this waste by aligning the spectral envelopes.
This is analogous to impedance matching in electrical engineering: Q and K are two halves of a bilinear circuit, and training matches their impedance for maximum signal throughput.
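A toy diagonal example makes the impedance argument concrete: with the norms of Q and K held fixed, the kernel Q @ K^T carries far more energy when the two spectral envelopes are aligned than when a Q-loud direction is K-quiet. Diagonal matrices here are purely illustrative:

```python
import numpy as np

s = np.array([10.0, 1.0, 0.1])           # shared singular-value envelope
Q = np.diag(s)
K_matched = np.diag(s)                    # K mirrors Q's envelope
K_mismatched = np.diag(s[::-1])           # Q-amplified dims are K-attenuated
# ||Q|| and ||K|| are identical in both cases - only the alignment differs.
throughput_matched = np.linalg.norm(Q @ K_matched.T)        # ~ 100.0
throughput_mismatched = np.linalg.norm(Q @ K_mismatched.T)  # ~ 1.73
```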
V’s monotonic norm growth from layer 0 to 7 has a clear interpretation. In residual networks, each layer adds a correction: x_{l+1} = x_l + f(x_l). Early layers need small corrections (the embedding is already a good representation of token identity). Later layers need larger corrections (they are computing higher-level features that require more substantial transformation of the residual stream). V’s growing norm provides this crescendo.
The failure of the corpus-statistics approach does not mean attention crystallization is impossible. It means the correct basis is not K_2/K_3. The five invariants, summarized in the tables below, suggest a constructive path.
Mean absolute correlations across 8 layers (noise floor ~0.004):
| Test | Mean \|r\| | Significant? |
|---|---|---|
| Q vs K_2 | 0.0023 | No |
| Q vs K_2^T | 0.0033 | No |
| Q vs K_3 | 0.0026 | No |
| K vs K_2^T | 0.0033 | No |
| V vs K_3 | 0.0033 | No |
| V vs K_3_skip | 0.0033 | No |
| Q vs attn-weighted K_2 | 0.0029 | No |
| K vs (attn-weighted K_2)^T | 0.0036 | No |
| SFT_Q vs V1_Q | 0.0050 | No |
| SFT_V vs V1_V | 0.0040 | No |
| Invariant | Measurement | All Layers |
|---|---|---|
| Q-K spectral twinning | KL(Q \|\| K) | < 0.0014 |
| V spectral randomness | KL(V \|\| MP) | < 0.009 |
| V/Q norm ratio growth | layer 0 -> 7 | 0.354 -> 0.833 |
| QK^T effective rank | dim/256 | 19.2 - 53.6 |
| QK^T top-8 energy | % | 69.8 - 91.9% |
| W_attn asymmetry | \|\|W - W^T\|\| / \|\|W\|\| | 1.37 - 1.44 |
| V SFT drift / Q SFT drift | ratio | 4.7 - 6.6x |
The 2000 SFT steps in CT cannot be replaced by a corpus-statistics formula for Q/K/V. Attention matrices are not views of co-occurrence; they are learned relational operators whose structure is constrained by the optimization objective but not determined by the input statistics.
The gap between CT+SFT (82.0 ppl) and full SGD (79.9 ppl) likely arises from the interaction between frozen CT embeddings and learned attention. Full SGD can jointly optimize embeddings and attention, creating co-adapted representations that SFT on a frozen embedding cannot reach.
While we cannot derive attention matrices from corpus statistics, the five invariants provide a principled initialization scheme that may reduce the 2000 SFT steps needed. The current CT uses random-spectral initialization for Q/K/V (see crystallize_qtp.py _crystallize_layer). Incorporating:

- matched spectral envelopes for Q and K (invariant 1),
- a layer-dependent V norm crescendo (invariant 2),
- a low-rank Q@K^T product (invariant 3), and
- an explicitly antisymmetric component in the bilinear form (invariant 4)
could provide a better starting point that converges faster, potentially reducing SFT from 2000 to hundreds of steps.
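A hypothetical initializer folding in those invariants might look like the following. All names, scales, and the envelope shape are our illustrative choices, not a recipe taken from crystallize_qtp.py:

```python
import numpy as np

def init_attention_layer(d=256, eff_rank=32, v_scale=0.35, asym=0.7, rng=None):
    """Invariant-informed init sketch:
    (1) Q and K share one spectral envelope;
    (3) the envelope decays so Q @ K.T is effectively low-rank;
    (4) K carries an explicit antisymmetric perturbation;
    (2) V is spectrally random, with norm set by the layer's crescendo."""
    rng = rng or np.random.default_rng()
    env = np.exp(-np.arange(d) / eff_rank)          # shared decaying spectrum
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    Vb, _ = np.linalg.qr(rng.standard_normal((d, d)))
    Q = U @ np.diag(env) @ Vb.T
    A = rng.standard_normal((d, d))
    K = U @ np.diag(env) @ Vb.T + asym * (A - A.T) / (2 * np.sqrt(d))
    V = rng.standard_normal((d, d))
    V *= v_scale * np.linalg.norm(Q) / np.linalg.norm(V)
    return Q, K, V

Q, K, V = init_attention_layer(d=64, rng=np.random.default_rng(0))
```

In a full model, v_scale would ramp with depth (e.g. 0.35 at layer 0 toward 0.9 at layer 7) to match the observed crescendo.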
The independence of attention weights from input statistics connects to the solution non-uniqueness documented above: SFT and full SGD reach uncorrelated attention weights at equivalent loss, so input statistics constrain the solution set without selecting a member of it.
Paper 51’s conjecture that Q maps to attention-weighted K_2, K to reverse K_2, and V to K_3+ is empirically falsified. Trained attention matrices have zero correlation with any tested view of corpus co-occurrence statistics. However, the negative result is productive: it reveals five structural invariants (Q-K spectral twinning, V crescendo, low-rank attention kernel, asymmetric bilinear form, solution non-uniqueness) that characterize what training actually does to attention. These invariants provide a basis for principled attention initialization that could reduce the SFT requirement without the impossible task of deriving attention directly from corpus statistics.
The Crystallization Transform remains valid for embeddings (Paper 51) but stops at the attention boundary. Beyond that boundary lies relational geometry – a domain where corpus statistics provide constraints but not solutions.
Artifacts:

- /Users/johnmobley/mascom/MASCOM/mascom_data/ct_experiment/attention_view_analysis.py
- /Users/johnmobley/mascom/MASCOM/mascom_data/ct_experiment/attention_view_results.json
- /Users/johnmobley/mascom/MASCOM/mascom_data/photonic_lm.pt
- /Users/johnmobley/mascom/MASCOM/mascom_data/ct_experiment/

The bigram co-occurrence matrix K_2 (4000 x 4000) has clear spectral structure: its effective rank is low, and the embedding space (n_embd=256) is sufficient to represent it. The problem is not that the embedding space is too small to capture K_2 – it is that Q/K/V are not functions of K_2 at all.
Paper 51 said the model crystallizes from data. Paper 54 says: the crystal lattice (embeddings) crystallizes from data, but the bonds between lattice sites (attention) are shaped by forces that only emerge during optimization. You can grow the crystal, but you cannot predict the bonds from the lattice alone.
– Claudine, March 7, 2026