Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07 Status: VALIDATED –
but via redundancy, not low dimensionality Experiment:
mascom_data/ct_experiment/l4_crystallization_exp.py
Paper 62 found L4 c_proj amplitudes have PR=1.03 in the L1 Gaussian basis. This paper tests whether L4 c_proj can be crystallized (frozen at a rank-1 approximation) without quality loss. The answer is YES – crystal-L4 SFT matches full SFT within 0.02% – but the mechanism is unexpected. The weight matrix itself has PR=84.52 (high-dimensional), yet replacing it with rank-1 or even RANDOM values has negligible impact on converged loss. L4 c_proj is crystallizable not because it is low-dimensional, but because it is REDUNDANT – the rest of the network compensates during SFT.
| Strategy | Final Loss (50 SFT steps) | vs Full SFT |
|---|---|---|
| Full SFT (all params) | 2.2266 | baseline |
| Frozen-L4 (trained values) | 2.2211 | -0.2% (better!) |
| Crystal-L4 (rank-1 + frozen) | 2.2261 | 0.0% |
| Random-L4 (random + frozen) | 2.1907 | -1.6% (best!) |
The weight matrix PR=84.52 means L4 c_proj is NOT low-dimensional. But three observations prove it is redundant:
This means L4 c_proj is in a “dead zone” – its specific values contribute nothing to the model’s learned function. The other 15 weight matrices carry all the information.
All 16 attention weight matrices have high PR (75-220). None are low-dimensional:
| Layer | c_attn PR | c_proj PR | c_proj PC1% |
|---|---|---|---|
| L0 | 202.7 | 145.3 | 12.0% |
| L3 | 198.6 | 75.8 | 21.7% |
| L4 | 196.8 | 84.5 | 22.4% |
| L6 | 205.4 | 86.6 | 29.6% |
| L7 | 205.9 | 94.9 | 36.6% |
Note: c_proj matrices consistently have lower PR than c_attn, suggesting projection layers are more compressible than attention mixing layers. L7 c_proj has the highest PC1 (36.6%) – most structured despite highest PR among c_proj.
Paper 62 found amplitude PR=1.03 for L4 c_proj. This paper finds weight PR=84.52. These are different measurements: - Amplitude PR: measures the dimensionality of L1 Gaussian fitting parameters across rows - Weight PR: measures the dimensionality of the actual weight values
The amplitude PR captures structure in the SHAPE of each row’s Gaussian mixture, while weight PR captures the actual numerical variation. A matrix can have highly stereotyped amplitude patterns (low amplitude PR) while still having high-dimensional raw values.
If L4 c_proj is redundant, other layers may be too. This suggests progressive layer freezing during SFT: start by freezing redundant layers, then progressively freeze more as the remaining layers compensate. The model may tolerate 25-50% of its weight matrices being frozen.
Not all 10.2M parameters contribute information. Some are “along for the ride” – maintained by the architecture but not used by the learned function. The effective parameter count is LOWER than the trainable count.
“The model carries weights that carry nothing. Not every parameter is a parameter.”