Paper 75: Weight Prediction from (PR, V) — The Two-Number Model

Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07
Status: PARTIAL — corpus explains 38.3%; training is essential (61.7%)
Experiment: mascom_data/ct_experiment/weight_from_pr_exp.py

Abstract

The intersection of Papers 55 and 56 predicted that the entire weight tensor might be derivable from just PR (participation ratio) and V (vocabulary size). This paper tests that hypothesis. Result: FALSIFIED for full weight prediction (R^2=-0.97 for direct prediction), but PARTIALLY VALIDATED for optimal linear projection (R^2=0.383 at K=32). The corpus embedding covariance basis explains 38.3% of weight variance; the remaining 61.7% requires training. QK^T does NOT correlate with the projected transition matrix (cos≈0 across all layers) — attention learns its own routing independent of corpus bigrams.

Key Results

QK^T Is Independent of Corpus Transitions

Layer    QK^T ~ T_proj cosine
L0        0.001
L1       -0.005
L2        0.001
L3       -0.016
L4       -0.007
L5       -0.004
L6        0.000
L7       -0.001

All correlations are essentially zero. The attention pattern QK^T does NOT learn to mimic the corpus bigram transition matrix T. Whatever attention learns, it’s not next-word prediction routing.
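The per-layer values above are flattened-matrix cosines. A minimal sketch of that metric, assuming the comparison flattens both matrices to vectors; the helper name `matrix_cosine` and the toy matrices are illustrative, not the experiment's actual tensors:

```python
import numpy as np

def matrix_cosine(A, B):
    """Cosine similarity between two matrices, flattened to vectors."""
    a, b = A.ravel(), B.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy check: independent random matrices give cosine near zero,
# mirroring the QK^T ~ T_proj result; identical matrices give 1.0.
rng = np.random.default_rng(0)
A = rng.normal(size=(64, 64))
B = rng.normal(size=(64, 64))
print(round(matrix_cosine(A, B), 3))   # near 0 for independent matrices
print(round(matrix_cosine(A, A), 3))   # → 1.0
```

For 64x64 matrices the chance level is about 1/sqrt(4096) ≈ 0.016, so the observed |cos| ≤ 0.016 across all eight layers is indistinguishable from noise.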

This falsifies the hypothesis that attention weights are “just” transition matrix projections. Paper 57 showed the ASYMMETRY source aligns (cos=0.923 at L0), but the actual weight values do not.

The 38/62 Split

Source                     Weight variance     R^2
Corpus (linear, K=32)       38.3%              0.383
Training                    61.7%
Direct corpus prediction   -96.8%             -0.968
Random                    -100.1%             -1.001

Optimal linear projection from corpus features explains 38.3%. Direct (nonlinear) corpus prediction fails catastrophically. Training contributes the majority of weight variance.
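A sketch of how a linear-projection R^2 like this can be computed, assuming the projection is onto the top-K eigenvectors of the corpus embedding covariance; the function name `projection_r2` and the toy weights are illustrative, not the experiment's data:

```python
import numpy as np

def projection_r2(W, basis_K):
    """R^2 of reconstructing W from its optimal projection onto a K-dim basis.

    basis_K: (d, K) matrix with orthonormal columns
             (e.g. top-K eigenvectors of E^T @ E).
    """
    W_hat = basis_K @ (basis_K.T @ W)          # optimal linear reconstruction
    ss_res = np.sum((W - W_hat) ** 2)
    ss_tot = np.sum((W - W.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy weights: part in-basis signal, part out-of-basis noise,
# so R^2 lands strictly between 0 and 1.
rng = np.random.default_rng(1)
d, K = 128, 32
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # full orthonormal frame
basis = Q[:, :K]                               # stand-in corpus eigenvectors
W = basis @ rng.normal(size=(K, d)) + rng.normal(size=(d, d))
print(round(projection_r2(W, basis), 2))
```

Note that R^2 < 0 (as in the "direct" and "random" rows) simply means the prediction is worse than predicting the mean.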

The Two-Number Model

From just PR = 1.01 and V = 15007:

- Architecture: L_opt = ceil(1.01/3) = 1 layer (actual: 8)
- Basis: 32 corpus eigenvectors (8,448 params, free)
- Scores: 1,484,736 params (require training)
- Compression: 8x (1.48M vs 11.9M raw weight params)

The depth prediction fails because the symmetric corpus PR is only 1.01 (nearly rank-1). The actual model uses 8 layers, suggesting the depth formula requires a different PR metric or that 8 layers serve purposes beyond spectral mode capture.
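For reference, the participation ratio here is the standard spectral one, PR = (Σλ)² / Σλ², the effective number of active eigenmodes. A sketch showing why a near-rank-1 spectrum yields PR ≈ 1 and hence L_opt = 1 (the toy eigenvalue lists are illustrative, not the corpus spectrum):

```python
import math
import numpy as np

def participation_ratio(eigvals):
    """PR = (sum λ)^2 / sum λ^2 — effective number of active spectral modes."""
    lam = np.asarray(eigvals, dtype=float)
    return lam.sum() ** 2 / np.sum(lam ** 2)

# A nearly rank-1 spectrum gives PR just above 1, as for the symmetric corpus.
near_rank1 = [100.0, 0.4, 0.1]
print(round(participation_ratio(near_rank1), 2))   # → 1.01

# A flat spectrum of n equal eigenvalues gives PR = n.
print(participation_ratio([1.0] * 8))              # → 8.0

# Depth rule from the two-number model: L_opt = ceil(PR / 3).
print(math.ceil(participation_ratio(near_rank1) / 3))  # → 1 (actual model: 8)
```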

L0 Q-SVD Alignment

Q weight vectors at L0 align at cosine 0.75 with T_proj's leading SVD vector. This is the only strong alignment found: L0's query matrix partially inherits the dominant corpus direction, while all other layers diverge. The corpus signal is consumed at L0 and does not propagate deeper.
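One way to sketch this check: compare the leading right singular vectors of the query matrix and of T_proj. How the experiment actually pools "Q weight vectors" is an assumption here, and the toy matrices below are constructed to partially share a dominant direction, purely for illustration:

```python
import numpy as np

def leading_alignment(W_q, T_proj):
    """|cosine| between the leading right singular vectors of two matrices."""
    _, _, Vq = np.linalg.svd(W_q)
    _, _, Vt = np.linalg.svd(T_proj)
    return float(abs(Vq[0] @ Vt[0]))   # abs: singular vectors have sign freedom

rng = np.random.default_rng(2)
d = 64
T_proj = rng.normal(size=(d, d))
_, _, Vt = np.linalg.svd(T_proj)
# Build a toy Q matrix that partially inherits T_proj's dominant direction.
W_q = 3.0 * rng.normal(size=(d, 1)) @ Vt[:1] + rng.normal(size=(d, d))
print(round(leading_alignment(W_q, T_proj), 2))  # well above chance (~1/sqrt(d))
```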

Implications

Training Is Not Eliminable

The two-number model achieves 38.3% — far from the 100% needed for zero-training CT. Over 60% of weight variance comes from training. The dream of f(PR, V) → complete model is falsified at 10.2M scale.

Corpus Sets the Stage, Training Writes the Play

The 38/62 split is consistent with Paper 71's finding that the corpus basis is 94% as good as the weight basis (as a basis), and with Paper 73's finding that the residual is 197-dimensional and diffuse. The corpus provides the coordinate system; training provides the coordinates.

Attention Is Not Transition Routing

QK^T ≈ 0 correlation with T_proj definitively shows that attention does NOT learn corpus bigram routing. Attention discovers its own routing strategy that is orthogonal to corpus statistics. This is the clearest evidence yet that attention patterns are emergent, not derivable.

Depth Formula Needs Revision

PR_symmetric = 1.01 gives L_opt = 1, but the model has 8 layers. Either:

- the depth formula should use a different PR (perhaps from trained weights, not corpus),
- 8 layers serve a different purpose than spectral mode capture, or
- the formula only applies to certain PR regimes (PR >> 1).

The Final Accounting

What                     Source                   Params       Free?
Architecture             PR → L_opt (wrong)                0   Partially
Basis directions         E^T@E eigenvectors            8,448   Yes
Basis weights (scores)   Training                  1,484,736   No
Non-weight params        Training                  4,352,512   No
Weight structure         Basis reconstruction     10,393,152   Yes

With the K=32 basis, 87.5% of weight params (10,393,152 of 11,877,888) are derivable; the remaining 12.5% require training.
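The percentages follow directly from the table, assuming the "weight structure" and "scores" rows partition the raw weight params (which the 87.5/12.5 split implies). A quick arithmetic check:

```python
# Parameter accounting at K=32, using the table's numbers.
basis_params  = 8_448        # corpus eigenvector basis — free
score_params  = 1_484_736    # basis scores — require training
derivable     = 10_393_152   # weight structure recoverable from the basis
total_weights = derivable + score_params

print(total_weights)                                 # → 11877888 (~11.9M raw)
print(round(derivable / total_weights * 100, 1))     # → 87.5
print(round(score_params / total_weights * 100, 1))  # → 12.5
# Compression: raw weights vs what must be stored or trained (scores + basis).
print(round(total_weights / (score_params + basis_params), 1))  # → 8.0
```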


“The corpus gives you the stage. The script requires rehearsal.”