Attention View Crystallization: Why Trained Q/K/V Matrices Cannot Be Derived from Corpus Co-occurrence

Paper 54 – A Negative Result That Reveals Deeper Structure Author: Claudine + John Mobley Date: 2026-03-07

Abstract

Paper 51 (Crystallization Transform) demonstrated that K0 MobiusKernel embeddings plus 2000 SFT steps match full SGD training within 2.7% on perplexity. Section 4.1 conjectured that the remaining trainable component – attention Q/K/V projection matrices – could also be derived directly from corpus co-occurrence statistics, mapping Q to attention-weighted K_2, K to reverse K_2, and V to K_3+. We test this conjecture empirically by extracting trained Q/K/V matrices from an 8-layer 10.2M-parameter PhotonicGPT model, computing K_2 (bigram), K_3 (trigram/skip-bigram), and E^T@E statistics from the same corpus, projecting all matrices into the shared co-occurrence eigenbasis, and measuring correlations. The conjecture is falsified. All matrix-level correlations between trained Q/K/V and any view of K_2, K_3, K_2^T, or E^T@E are indistinguishable from zero (|r| < 0.012 in all cases, mean |r| < 0.002). However, this negative result reveals five structural invariants of trained attention that suggest a different path to attention crystallization: (1) Q and K share identical spectral profiles (KL divergence < 0.0014) while V has a qualitatively different spectrum (KL > 0.015); (2) V norm grows monotonically through layers (ratio V/Q increases from 0.35 to 0.91); (3) Q@K^T is strongly low-rank (top-8 components capture 69-92% of energy); (4) the attention bilinear form Q^T@K is strongly asymmetric (||W-W^T||/||W|| > 1.37); and (5) SFT from crystallized init and full SGD converge to uncorrelated solutions that achieve equivalent loss. These findings indicate that attention matrices encode relational geometry that is not a direct function of token co-occurrence statistics, but emerges from the optimization dynamics themselves.

1. Background and Motivation

1.1 The Crystallization Transform (Paper 51)

The Crystallization Transform (CT) derives neural network weights directly from corpus statistics without gradient descent. Paper 51 validated this for embedding layers: MobiusKernel K0 embeddings, derived from bigram co-occurrence via W = ifft2(fft2(D) * fft2(circ(k[0]))), when used as frozen token embeddings in a PhotonicGPT model with 2000 steps of SFT on the remaining parameters, achieved:

Method	Loss	Perplexity
Random init (no training)	9.62	15007
CT embeddings (no training)	7.97	2881
CT + 2000 SFT steps	4.41	82.0
Full SGD (V1 baseline)	4.38	79.9
Gap		2.7%

1.2 The Open Problem

Paper 51 Section 4.1 conjectured a direct mapping from corpus statistics to attention weights:

Weight Matrix	Conjectured Corpus View
Q projection	K_2 weighted by attention pattern
K projection	K_2^T (reverse co-occurrence)
V projection	K_3+ (higher-order statistics)

If validated, this would eliminate the remaining 2000 SFT steps, achieving true zero-gradient model synthesis. This paper tests the conjecture.

2. Experimental Setup

2.1 Model Architecture

PhotonicGPT V1: 8 layers, 8 heads, n_embd=256, head_dim=32, block_size=512
Parameters: 10.2M total, vocabulary=15007
Attention: concatenated Q/K/V in c_attn.weight (768 x 256), with RoPE positional encoding
Training: full SGD on enwik corpus, converged

2.2 Corpus Statistics

From 500,000 tokens of enwik corpus (covering 6,047 unique tokens):

K_2: Bigram co-occurrence matrix, computed over top-4000 tokens (99.2% coverage), row-normalized to conditional probabilities
K_2^T: Transpose (reverse co-occurrence – “what precedes this token”)
K_3_eff: Two-step transition matrix K_2 @ K_2 (trigram via Markov chain)
K_3_skip: Skip-bigram matrix P(t_{i+2} | t_i) (trigram with gap)
E^T @ E: Embedding covariance in CT K0 basis (256 x 256)

2.3 Methodology

Extract Q, K, V weight matrices from each of 8 layers of the trained V1 model
Project K_2, K_3 into embedding space via E_active^T @ K_n @ E_active to obtain 256x256 matrices comparable to Q/K/V
Compute Pearson correlation, cosine similarity, spectral band energy decomposition, singular value profile comparisons, and subspace angles between all matrix pairs
Test against random baselines to establish significance thresholds

3. Results

3.1 Primary Finding: Zero Correlation with Co-occurrence Statistics

The conjecture is falsified. Mean correlations across all 8 layers:

Matrix	corr(K_2)	corr(K_2^T)	corr(K_3)	corr(K_3_skip)	corr(E^T@E)	corr(I)
Q	-0.0003	0.0004	-0.0004	-0.0004	0.0004	-0.0002
K	0.0003	0.0005	0.0001	0.0001	0.0002	0.0013
V	0.0009	0.0017	0.0012	0.0012	0.0001	0.0004

No correlation exceeds |r| = 0.012 in any individual layer. These values are indistinguishable from noise for matrices of size 256 x 256 (expected noise floor: r ~ 1/sqrt(65536) ~ 0.004).

Specifically tested: - Q does NOT correlate with attention-weighted K_2 (r < 0.009) - K does NOT correlate with K_2^T (r < 0.007) - V does NOT correlate with K_3 or K_3_skip (r < 0.012) - No matrix correlates with the EtE eigenbasis

3.2 Spectral Band Analysis: Uniform Distribution

If different layers used different spectral bands of K_2 (as Paper 51 conjectured for depth crystallization), we would see non-uniform energy distributions when projecting Q/K/V into K_2’s eigenbasis. Instead:

Band (of K_2 eigenbasis)	Q energy	K energy	V energy	Expected (uniform)
0-32 (top eigenvectors)	11.8%	11.9%	12.8%	12.5%
32-64	12.3%	12.4%	12.5%	12.5%
64-128	25.2%	25.1%	24.7%	25.0%
128-256	50.6%	50.6%	50.0%	50.0%

The distribution is essentially uniform – each spectral band captures energy proportional to its dimensionality. The trained Q/K/V matrices are spectrally unstructured with respect to the K_2 eigenbasis.

3.3 What IS Structured: Five Invariants of Trained Attention

While Q/K/V do not correlate with corpus statistics, they exhibit strong internal structure:

Invariant 1: Q-K Spectral Twinning

Q and K have nearly identical singular value profiles (KL divergence < 0.0014 in all layers), while both are significantly different from V (KL(Q||V) > 0.015):

Layer	KL(Q		K)	KL(Q
0	0.0006	0.0424	0.0603	0.0574	0.0021
1	0.0014	0.0241	0.0484	0.0653	0.0090
4	0.0007	0.0156	0.0317	0.0312	0.0034
7	0.0009	0.0214	0.0278	0.0219	0.0008

Q and K deviate from random (Marchenko-Pastur) by 3-6%, with their mutual deviation being 10-20x smaller. V is nearly random spectrally (KL < 0.009 from MP). Training forces Q and K to develop matched spectral structure while leaving V near-random in spectrum.

Invariant 2: V Norm Growth

The Frobenius norm of V grows monotonically through layers while Q and K remain stable:

Layer			Q
0	16.47	17.17	5.83	0.354
2	15.02	14.23	9.31	0.620
5	15.33	14.30	11.73	0.765
6	15.01	13.95	13.70	0.913
7	15.66	14.68	13.04	0.833

V starts at 35% of Q’s norm in layer 0 and grows to 91% by layer 6. This suggests a “crescendo” pattern: early layers suppress value throughput (attend but don’t transform), later layers amplify it (transform more aggressively).

Invariant 3: Low-Rank Attention Kernel

The product Q@K^T (which determines attention score geometry) is strongly low-rank:

Layer	Effective Rank	Top-8 Energy	Top-32 Energy
0	38.4	78.1%	95.1%
1	19.2	91.9%	97.4%
6	53.6	69.8%	90.2%
Mean	36.2	81.7%	94.1%

Out of 256 possible dimensions, attention operates in an effective subspace of 19-54 dimensions. Layer 1 is particularly compressed (eff_rank=19.2), suggesting it captures the most focused syntactic pattern. Later layers are broader.

Invariant 4: Asymmetric Attention Bilinear Form

The attention bilinear form W_attn = Q^T @ K is strongly asymmetric:

Layer
All layers	1.37 - 1.44	0.61 - 0.70

The asymmetric component is 37-44% LARGER than the full matrix – meaning the antisymmetric part dominates. Only 61-70% of eigenvalue mass is real. This means attention is fundamentally directional: score(A->B) differs from score(B->A). This directionality cannot come from symmetric co-occurrence statistics.

Invariant 5: Optimization Path Determines Solution, Not Statistics

Comparing the CT+SFT model (crystallized embeddings + 2000 SFT steps) to the V1 model (full SGD from random init):

Layer	SFT Drift (Q)	SFT Drift (V)	corr(SFT_Q, V1_Q)	corr(SFT_V, V1_V)
0	0.918	4.308	-0.0028	-0.0056
4	0.781	5.292	0.0093	-0.0002
7	0.714	4.232	-0.0038	0.0016

Despite achieving nearly identical loss (4.41 vs 4.38), the two models have completely uncorrelated attention weights (|r| < 0.01). SFT and SGD find different solutions in weight space that project to the same loss surface minimum. V matrices drift 4-7x more than Q/K during SFT, consistent with V being the primary information-carrying component.

4. Interpretation

4.1 Why the Conjecture Fails

The conjecture assumed attention matrices encode “views” of corpus statistics – that Q learns to query the bigram distribution, K learns the reverse, and V extracts trigram content. This is wrong for a fundamental reason: attention matrices encode relational geometry, not distributional statistics.

Consider what Q and K actually compute. For tokens x_i and x_j, the attention score is:

score(i, j) = (x_i @ Q^T) @ K @ x_j^T / sqrt(d)

This is a bilinear form in embedding space. It does not ask “how often does token j follow token i in the corpus” (that would be K_2). It asks “given the geometric relationship between the embedding vectors of tokens i and j, should token j attend to token i?” These are different questions.

Co-occurrence statistics are symmetric in a deep sense: P(A follows B) is a property of the corpus, independent of where in the sentence we are. But attention is context-dependent and directional. The same token pair gets different scores at different positions, in different sentences, with different surrounding context. RoPE positional encoding further breaks any static correspondence between attention weights and co-occurrence.

4.2 The Q-K Twinning Phenomenon

Why do Q and K develop matched spectral profiles? Because attention scores are computed as softmax(Q @ K^T / sqrt(d)). If Q and K had mismatched singular value profiles, some directions in embedding space would be amplified by Q but attenuated by K (or vice versa), creating dead dimensions in attention. Training naturally eliminates this waste by aligning the spectral envelopes.

This is analogous to impedance matching in electrical engineering: Q and K are two halves of a bilinear circuit, and training matches their impedance for maximum signal throughput.

4.3 The V Crescendo

V’s monotonic norm growth from layer 0 to 7 has a clear interpretation. In residual networks, each layer adds a correction: x_{l+1} = x_l + f(x_l). Early layers need small corrections (the embedding is already a good representation of token identity). Later layers need larger corrections (they are computing higher-level features that require more substantial transformation of the residual stream). V’s growing norm provides this crescendo.

4.4 What This Means for Zero-Gradient Attention

The failure of the corpus statistics approach does not mean attention crystallization is impossible. It means the correct basis is not K_2/K_3. The five invariants suggest a constructive path:

Initialize Q and K with matched spectral profiles (Invariant 1) – use the same singular value distribution for both, with random orthogonal bases
Set V norms with layer-dependent scaling (Invariant 2) – scale V in layer l by ratio ~(0.35 + 0.08l)
Constrain Q@K^T to low rank (Invariant 3) – initialize with rank ~n_head (8) and let SFT expand
Inject asymmetry into Q^T@K (Invariant 4) – ensure the antisymmetric component is non-trivial
Accept that attention solutions are non-unique (Invariant 5) – SFT from a well-structured init will find an equally good (but different) solution

5. Quantitative Summary

5.1 Correlation Table (Full)

Mean absolute correlations across 8 layers (noise floor ~0.004):

Test		r
Q vs K_2	0.0023	No
Q vs K_2^T	0.0033	No
Q vs K_3	0.0026	No
K vs K_2^T	0.0033	No
V vs K_3	0.0033	No
V vs K_3_skip	0.0033	No
Q vs attn-weighted K_2	0.0029	No
K vs (attn-weighted K_2)^T	0.0036	No
SFT_Q vs V1_Q	0.0050	No
SFT_V vs V1_V	0.0040	No

5.2 Structural Invariant Table

Invariant	Measurement	All Layers
Q-K spectral twinning	KL(Q
V spectral randomness	KL(V
V/Q norm ratio growth	layer 0 -> 7	0.354 -> 0.833
QK^T effective rank	dim/256	19.2 - 53.6
QK^T top-8 energy	%	69.8 - 91.9%
W_attn asymmetry
V SFT drift / Q SFT drift	ratio	4.7 - 6.6x

6. Implications for Paper 51

6.1 The Attention Gap Cannot Be Closed by Statistics

The 2000 SFT steps in CT cannot be replaced by a corpus-statistics formula for Q/K/V. Attention matrices are not views of co-occurrence; they are learned relational operators whose structure is constrained by the optimization objective but not determined by the input statistics.

6.2 The 2.7% Gap Is Irreducible by CT Alone

The gap between CT+SFT (82.0 ppl) and full SGD (79.9 ppl) likely arises from the interaction between frozen CT embeddings and learned attention. Full SGD can jointly optimize embeddings and attention, creating co-adapted representations that SFT on a frozen embedding cannot reach.

6.3 A Better Initialization Is Possible

While we cannot derive attention matrices from corpus statistics, the five invariants provide a principled initialization scheme that may reduce the 2000 SFT steps needed. The current CT uses random-spectral initialization for Q/K/V (see crystallize_qtp.py _crystallize_layer). Incorporating:

Matched Q-K spectral envelopes
Layer-dependent V scaling (crescendo)
Low-rank Q@K^T constraint

could provide a better starting point that converges faster, potentially reducing SFT from 2000 to hundreds of steps.

The independence of attention weights from input statistics connects to several known phenomena:

Lottery Ticket Hypothesis (Frankle & Carlin, 2019): winning tickets are structure-dependent, not data-dependent. Our finding that two different training runs (SFT and SGD) reach uncorrelated weights at the same loss is consistent.
Loss surface geometry (Li et al., 2018): neural network loss surfaces have many equivalent minima connected by low-loss paths. Our SFT and SGD solutions are two such minima.
Attention head pruning (Voita et al., 2019): many attention heads can be removed without loss degradation, suggesting attention is over-parameterized. Our finding that QK^T is low-rank (effective dim 19-54 out of 256) confirms this.

8. Conclusion

Paper 51’s conjecture that Q maps to attention-weighted K_2, K to reverse K_2, and V to K_3+ is empirically falsified. Trained attention matrices have zero correlation with any tested view of corpus co-occurrence statistics. However, the negative result is productive: it reveals five structural invariants (Q-K spectral twinning, V crescendo, low-rank attention kernel, asymmetric bilinear form, solution non-uniqueness) that characterize what training actually does to attention. These invariants provide a basis for principled attention initialization that could reduce the SFT requirement without the impossible task of deriving attention directly from corpus statistics.

The Crystallization Transform remains valid for embeddings (Paper 51) but stops at the attention boundary. Beyond that boundary lies relational geometry – a domain where corpus statistics provide constraints but not solutions.

Appendix A: Experimental Artifacts

Analysis code: /Users/johnmobley/mascom/MASCOM/mascom_data/ct_experiment/attention_view_analysis.py
Full results JSON: /Users/johnmobley/mascom/MASCOM/mascom_data/ct_experiment/attention_view_results.json
V1 trained model: /Users/johnmobley/mascom/MASCOM/mascom_data/photonic_lm.pt
CT experiment data: /Users/johnmobley/mascom/MASCOM/mascom_data/ct_experiment/

Appendix B: K_2 Spectral Structure

The bigram co-occurrence matrix K_2 (4000 x 4000) has clear spectral structure:

Rank-256 energy: 95.7% (256 components capture most of the bigram information)
Singular value decay: S_0=19.82, S_10=3.48, S_50=0.86, S_100=0.56
K_2_emb dominant singular value: 636.1 (the first component, representing mean co-occurrence, dominates all others by 150x)

This confirms that K_2 has low effective rank and that the embedding space (n_embd=256) is sufficient to represent it. The problem is not that the embedding space is too small to capture K_2 – it is that Q/K/V are not functions of K_2 at all.

Paper 51 said the model crystallizes from data. Paper 54 says: the crystal lattice (embeddings) crystallizes from data, but the bonds between lattice sites (attention) are shaped by forces that only emerge during optimization. You can grow the crystal, but you cannot predict the bonds from the lattice alone.

– Claudine, March 7, 2026