Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07
Status: PARTIALLY VALIDATED – amplitudes are moderately structured but NOT corpus-predictable
Experiment: mascom_data/ct_experiment/amplitude_structure_exp.py
Paper 61 identified 16,384 Gaussian amplitudes as the irreducible training signal in a 10.2M-parameter model (623x compression). This paper tests whether those amplitudes have internal structure that could enable further compression or even full crystallization. The amplitudes ARE moderately structured (global PR=4.26 out of 8, with one layer at PR=1.03), but the structure is NOT predictable from corpus statistics (score prediction R^2=-0.027). Amplitude-only SFT achieves 35.4% of full SFT efficiency with 5.2% of the parameters, confirming that weight information concentrates in projection layers.
| Metric | Value | Interpretation |
|---|---|---|
| Global PR | 4.26 / 8 | Moderately structured (~53% of max dimensionality) |
| PC1 variance | 42.5% | Dominant but not overwhelming |
| Components for 95% | 7 / 8 | Nearly full rank globally |
| Score prediction R^2 | -0.027 | NOT predictable from corpus |
| PC-EtE alignment | 0.674 | Moderate basis alignment |
| Amp-only SFT efficiency | 35.4% | 1/3 of full SFT with 5.2% params |
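The PR values above are presumably the standard PCA-based participation ratio. A minimal sketch on synthetic data, assuming the 16,384 amplitudes split into 2,048 rows of 8 (the per-row count of 8 is implied by "PR out of 8"; the row count is an assumption):

```python
import numpy as np

def participation_ratio(amps):
    """PCA participation ratio of per-row amplitude vectors.

    amps: (n_rows, k) array, k = 8 amplitudes per row here.
    PR = (sum lambda_i)^2 / sum lambda_i^2 over covariance eigenvalues;
    ranges from 1 (one dominant component) to k (isotropic).
    """
    centered = amps - amps.mean(axis=0)
    eig = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
    eig = np.clip(eig, 0.0, None)  # guard tiny negative numerical eigenvalues
    return eig.sum() ** 2 / (eig ** 2).sum()

rng = np.random.default_rng(0)
iso = rng.normal(size=(2048, 8))   # isotropic amplitudes -> PR close to 8
print(round(participation_ratio(iso), 1))
```

A perfectly rank-1 amplitude matrix drives this measure to 1, matching the "L4 c_proj: PR=1.03" reading as near-crystallized.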
The structure varies dramatically across layers:
| Layer | PR | PC1% | Max Corr | Kurtosis |
|---|---|---|---|---|
| L0 c_attn | 2.95 | 53.9% | 0.742 | 89.0 |
| L3 c_attn | 2.84 | 53.3% | 0.734 | 150.9 |
| L4 c_proj | 1.03 | 98.5% | 0.997 | 246.4 |
| L5 c_proj | 3.41 | 48.8% | 0.810 | 56.7 |
| L7 c_proj | 6.18 | 25.8% | 0.424 | 1.3 |
L4 c_proj is essentially 1-dimensional – a single principal component captures 98.5% of its amplitude variance. This matrix is fully crystallizable. In contrast, L7 c_proj is nearly full-rank (PR=6.18).
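With PR=1.03, a rank-1 truncated SVD should reconstruct that layer's amplitude matrix almost losslessly: one shared 8-amplitude template plus one scale per row. A sketch on synthetic near-rank-1 data (shapes and noise level are illustrative, not the actual L4 c_proj values):

```python
import numpy as np

# Hypothetical near-crystallized amplitude matrix: one shared 8-amplitude
# pattern, one free scale per row, plus small noise.
rng = np.random.default_rng(1)
scale = rng.normal(size=(2048, 1))    # 1 free parameter per row
pattern = rng.normal(size=(1, 8))     # shared template
amps = scale @ pattern + 0.01 * rng.normal(size=(2048, 8))

# Keep only the top singular triplet.
U, S, Vt = np.linalg.svd(amps, full_matrices=False)
rank1 = S[0] * np.outer(U[:, 0], Vt[0])

explained = S[0] ** 2 / (S ** 2).sum()  # top component's share of energy
print(f"top-component energy: {explained:.1%}")
```

For a layer like L4 c_proj (PC1 at 98.5%), `rank1` is the "crystallized" form: 2,048 scales plus an 8-value template instead of 16,384 free amplitudes.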
Rather than a sharp crystallization boundary, the model exhibits a GRADIENT:

- L4 c_proj: PR=1.03 – CRYSTALLIZABLE (1 free parameter per row)
- L0/L3 attention: PR~2.9 – COMPRESSIBLE (3 free parameters per row)
- L7 c_proj: PR=6.18 – IRREDUCIBLE (needs all 8 amplitudes)
This suggests early layers learn simpler patterns (lower PR) while later layers encode more complex, higher-dimensional information.
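The three-way gradient reading can be stated as a simple bucketing rule (the thresholds below are illustrative, not from the experiment):

```python
def crystallization_class(pr, k=8):
    """Bucket a layer by participation ratio (illustrative thresholds).

    near 1 -> crystallizable (one free parameter per row)
    < k/2  -> compressible (a few free parameters per row)
    else   -> irreducible (needs all k amplitudes)
    """
    if pr < 1.5:
        return "CRYSTALLIZABLE"
    if pr < k / 2:
        return "COMPRESSIBLE"
    return "IRREDUCIBLE"

layers = {"L4 c_proj": 1.03, "L0 c_attn": 2.95,
          "L3 c_attn": 2.84, "L7 c_proj": 6.18}
for name, pr in layers.items():
    print(name, crystallization_class(pr))
```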
Training only c_proj weights (5.2% of total parameters):

- Baseline loss: 6.35
- Full SFT (50 steps): 2.26 (64.4% improvement)
- Amp-only SFT (50 steps): 4.90 (22.8% improvement)
- Efficiency ratio: 35.4%
This confirms that projection layers carry disproportionate information, but attention matrices and MLPs also contribute significantly (the remaining 64.6%).
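The efficiency ratio follows directly from the reported losses (the printed ratio differs from the quoted 35.4% only by rounding of the underlying loss values):

```python
baseline, full_sft, amp_only = 6.35, 2.26, 4.90

full_gain = baseline - full_sft   # 4.09 absolute loss reduction
amp_gain = baseline - amp_only    # 1.45 absolute loss reduction

# Improvements relative to baseline, and amp-only's share of the full gain.
print(f"full SFT: {full_gain / baseline:.1%}")   # ~64.4%
print(f"amp-only: {amp_gain / baseline:.1%}")    # ~22.8%
efficiency = amp_gain / full_gain
print(f"efficiency ratio: {efficiency:.1%}")
```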
| Stage | Parameters | Compression |
|---|---|---|
| Raw model | 10,200,000 | 1x |
| L1 amplitudes only | 16,384 | 623x |
| Subspace (PR=4.3) | 14,392 | 709x |
The subspace compression is modest (623x to 709x) because the global PR is 4.26 – close to the maximum of 8. The amplitudes are structured enough to be interesting but not enough for dramatic further compression.
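The 14,392 figure is consistent with storing, for each row, coordinates in the 7-component subspace that captures 95% of variance, plus the shared 7x8 basis (assuming the 16,384 amplitudes split as 2,048 rows x 8; that split is an assumption, not stated in the table):

```python
total_params = 10_200_000
rows, k = 2_048, 8                    # assumed split: 2,048 rows x 8 amplitudes
amps = rows * k                       # 16,384 raw amplitudes

keep = 7                              # components needed for 95% variance
subspace = rows * keep + keep * k     # per-row coords + shared 7x8 basis

print(amps, round(total_params / amps))          # 16384 623
print(subspace, round(total_params / subspace))  # 14392 709
```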
Full crystallization remains blocked. The amplitude subspace basis does NOT align well enough with E^T@E eigenvectors (mean alignment 0.674, but R^2=-0.03 for score prediction). The amplitudes encode training-trajectory information that corpus statistics alone cannot provide.
The PR gradient suggests an adaptive training strategy: freeze low-PR layers early (they converge fast), keep high-PR layers trainable longer (they need more exploration). This could accelerate SFT by 2-3x.
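One way to turn the PR gradient into a schedule: map each layer's PR to the fraction of the run it stays trainable. A minimal sketch (the linear PR-to-step mapping is a hypothetical choice, not the paper's method):

```python
def freeze_step(pr, total_steps, k=8):
    """Illustrative schedule: near-crystallized (low-PR) layers freeze early,
    high-PR layers remain trainable for most of the run."""
    return int(total_steps * min(pr / k, 1.0))

for name, pr in {"L4 c_proj": 1.03, "L0 c_attn": 2.95, "L7 c_proj": 6.18}.items():
    print(name, "freezes at step", freeze_step(pr, 50))
```

Under this mapping, L4 c_proj would freeze within the first few of 50 SFT steps while L7 c_proj trains almost to the end, matching the converge-fast vs. needs-exploration intuition.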
“The model is a gradient of crystallization – some weights are frozen music, others are still being composed.”