Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07 | Status: VALIDATED. 50% sparsity retains 52% efficiency; knee at 50%.
Experiment: mascom_data/ct_experiment/sparse_residual_sft_exp.py
Paper 73 showed that the training residual is sparse (kurtosis = 33.16). Paper 72 showed that amplitude-only SFT achieves 92% quality with 2.3% of parameters. This paper combines both: train only the largest-magnitude score coefficients. At 50% sparsity (1.14% of total params), efficiency is 52%, roughly linear with parameter count. At 25% (0.57%), efficiency drops to 23%. At 10% and below, training actually HURTS (negative improvement). The residual is sparse in VALUE but not in LOCATION: you cannot predict which scores need training without trying all of them.
| Fraction Trained | Params | % of Total | Improvement | Efficiency |
|---|---|---|---|---|
| 100% | 371,184 | 2.29% | 0.917 | 100% |
| 50% | 185,592 | 1.14% | 0.475 | 52% |
| 25% | 92,796 | 0.57% | 0.214 | 23% |
| 10% | 37,132 | 0.23% | -0.278 | -30% |
| 5% | 18,574 | 0.11% | -0.473 | -52% |
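The efficiency column is simply each configuration's improvement normalized by the 100% baseline (0.917). A quick sketch to recompute it from the table:

```python
# Reproduce the efficiency column: each configuration's improvement
# normalized by the dense (100% of scores trainable) baseline.
# Values are copied from the table above.
rows = [
    (1.00, 371_184,  0.917),
    (0.50, 185_592,  0.475),
    (0.25,  92_796,  0.214),
    (0.10,  37_132, -0.278),
    (0.05,  18_574, -0.473),
]
baseline = rows[0][2]
efficiencies = [improvement / baseline for _, _, improvement in rows]
for (frac, params, _), eff in zip(rows, efficiencies):
    print(f"{frac:5.0%}  {params:>8,}  {eff:6.0%}")
```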
Efficiency scales roughly linearly with the fraction of scores trained:
- 100% of scores -> 100% efficiency
- 50% of scores -> 52% efficiency (near-linear)
- 25% of scores -> 23% efficiency (near-linear)
Somewhere below 25%, training becomes destructive: by 10% trained, improvement is already negative. The trained top-magnitude scores shift while the frozen scores cannot adapt, creating a mismatch that hurts more than helps.
The sparsity mask selects scores by initial magnitude: large scores are trainable, small scores are frozen. But training needs to CHANGE small scores into large ones (and vice versa). By freezing small scores, we prevent the model from discovering that some initially-small directions become important.
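A minimal sketch of the masking scheme described above, under stated assumptions: the function name, the NumPy stand-in for the score tensor, and the explicit gradient masking are all illustrative, not the experiment's actual code. It selects trainable coefficients by initial magnitude and zeroes the gradient of everything else:

```python
import numpy as np

def magnitude_mask(scores: np.ndarray, fraction: float) -> np.ndarray:
    """Boolean mask keeping the top `fraction` of scores by initial |magnitude|."""
    k = max(1, int(round(fraction * scores.size)))
    threshold = np.partition(np.abs(scores).ravel(), -k)[-k]
    return np.abs(scores) >= threshold

rng = np.random.default_rng(0)
scores = rng.standard_normal(10_000)        # toy stand-in for the score coefficients
mask = magnitude_mask(scores, fraction=0.5)

# Frozen (initially small) scores receive zero gradient, so an
# initially-small direction can never grow, regardless of the data.
grad = rng.standard_normal(scores.size)
masked_grad = grad * mask
```

Note that the mask is fixed from the initial magnitudes, which is exactly the failure mode described: the selection cannot follow scores whose importance only emerges during training.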
The kurtosis=33 from Paper 73 describes the TRAINED residual, not the training trajectory. The sparsity emerges DURING training, not before. You can’t skip the journey by knowing the destination.
The optimal operating point is amplitude-only SFT (2.3% of params, 92% quality) WITHOUT further sparsification. Sparsifying scores gains nothing because the quality loss is linear with parameter reduction — there’s no “free” sparsity to exploit.
Paper 73’s kurtosis=33 is a property of the OUTCOME, not the PROCESS. The training signal starts dense and CONVERGES to sparse. This is classic in optimization: gradients are dense early, sparse late. The final sparse structure cannot be predicted from the initial state.
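To see why kurtosis measures sparsity-in-value, compare a dense Gaussian vector with one whose mass sits in a few large entries. The vectors below are synthetic stand-ins, not actual residuals, and the sketch uses excess kurtosis (0 for a Gaussian); whether Paper 73's 33.16 is raw or excess kurtosis is not stated here.

```python
import numpy as np

def excess_kurtosis(x: np.ndarray) -> float:
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    x = x - x.mean()
    return float((x**4).mean() / (x**2).mean() ** 2 - 3)

rng = np.random.default_rng(42)
dense = rng.standard_normal(200_000)            # early-training-like: dense signal
sparse = dense * (rng.random(200_000) < 0.02)   # late: ~2% of entries carry the signal

print(excess_kurtosis(dense))    # near 0
print(excess_kurtosis(sparse))   # large positive: heavy-tailed, i.e. sparse in value
```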
The 50% sparsity point (1.14% of total params, 52% efficiency) is useful if the training budget is extremely constrained. Each parameter at 50% sparsity is roughly as efficient as at 100% (52% efficiency from 50% of the scores): there is no per-parameter efficiency loss, just fewer parameters.
At 0.23% of total params (37K trainable scores), training goes negative. This is the empirical floor: below roughly 0.5% of total parameters, the model cannot learn from the amplitude-only signal. The smallest configuration that still learns is approximately 93K parameters (0.57%); the true floor lies somewhere between 0.23% and 0.57%.
“The residual is sparse at the end. But the path to sparse is dense.”