Paper 74: Sparse Residual SFT — Training Only the Needles

Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07
Status: VALIDATED — 50% sparsity retains 52% efficiency; knee at 50%
Experiment: mascom_data/ct_experiment/sparse_residual_sft_exp.py

Abstract

Paper 73 proved the training residual is sparse (kurtosis = 33.16). Paper 72 showed amplitude-only SFT achieves 92% quality with 2.3% of parameters. This paper combines both: train only the largest-magnitude score coefficients. At 50% sparsity (1.14% of total params), efficiency is 52%, roughly linear with parameter count. At 25% (0.57%), efficiency drops to 23%. At 10% and below, training actually HURTS (negative improvement). The residual is sparse in VALUE but not in LOCATION: you cannot predict which scores need training without trying all of them.

Key Results

Sparsity Sweep

Fraction   Trained Params   % of Total   Improvement   Efficiency
100%       371,184          2.29%        0.917         100%
50%        185,592          1.14%        0.475         52%
25%        92,796           0.57%        0.214         23%
10%        37,132           0.23%        -0.278        -30%
5%         18,574           0.11%        -0.473        -52%
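The mask construction behind this sweep can be sketched as follows. This is a minimal illustration of magnitude-based selection, not the actual code from sparse_residual_sft_exp.py; the function name `magnitude_mask` and the random scores are illustrative.

```python
import numpy as np

def magnitude_mask(scores: np.ndarray, fraction: float) -> np.ndarray:
    """Boolean mask, True for the top-`fraction` of scores by |value|."""
    k = max(1, int(round(fraction * scores.size)))
    # Threshold at the k-th largest absolute value.
    thresh = np.sort(np.abs(scores).ravel())[-k]
    return np.abs(scores) >= thresh

rng = np.random.default_rng(0)
scores = rng.standard_normal(371_184)   # total score count from the table
mask = magnitude_mask(scores, 0.50)
print(int(mask.sum()))                  # 185592 trainable scores at 50%
```

With continuous random values there are no ties, so the mask selects exactly half the scores, matching the 185,592 row in the table.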

The Knee at 50%

Efficiency scales roughly linearly with the fraction of scores trained:

- 100% of scores -> 100% efficiency
- 50% of scores -> 52% efficiency (near-linear)
- 25% of scores -> 23% efficiency (near-linear)

Below 25%, training turns destructive: by 10% sparsity the improvement is negative. The top-magnitude scores (which remain trainable) interfere with the frozen scores (which cannot adapt), creating a mismatch that hurts more than it helps.
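The near-linear scaling can be checked directly from the sweep table: efficiency is simply each run's improvement divided by the full-training improvement.

```python
# Improvement values copied from the sparsity sweep table.
sweep = {1.00: 0.917, 0.50: 0.475, 0.25: 0.214, 0.10: -0.278, 0.05: -0.473}
full = sweep[1.00]

for frac, imp in sorted(sweep.items(), reverse=True):
    eff = imp / full
    print(f"{frac:>4.0%} of scores trained -> {eff:+.0%} efficiency")
```

The ratios reproduce the Efficiency column (+52% at half the scores, +23% at a quarter), and the sign flip between 25% and 10% is where the knee turns into a cliff.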

Why Location-Sparse Fails

The sparsity mask selects scores by initial magnitude: large scores are trainable, small scores are frozen. But training needs to CHANGE small scores into large ones (and vice versa). By freezing small scores, we prevent the model from discovering that some initially-small directions become important.
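The failure mode can be shown with a toy masked update (the values and the `masked_sgd_step` helper are illustrative, not taken from the experiment): a gradient pointing at an initially-small score is silently discarded by the magnitude mask.

```python
import numpy as np

def masked_sgd_step(scores, grad, mask, lr=0.1):
    """SGD step that only updates scores where mask is True."""
    return scores - lr * grad * mask

scores = np.array([2.0, 0.01])    # one large score, one initially small one
mask = np.abs(scores) >= 1.0      # magnitude mask: only the large score trains
grad = np.array([0.0, -5.0])      # the loss wants to GROW the small score

updated = masked_sgd_step(scores, grad, mask)
print(updated)                    # the small score is stuck at 0.01
```

However strongly the loss pulls on the frozen coordinate, the update there is zero, which is exactly the "initially-small directions become important" case the mask cannot accommodate.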

The kurtosis=33 from Paper 73 describes the TRAINED residual, not the training trajectory. The sparsity emerges DURING training, not before. You can’t skip the journey by knowing the destination.

Implications

For Parameter Efficiency

The optimal operating point is amplitude-only SFT (2.3% of params, 92% quality) WITHOUT further sparsification. Sparsifying scores gains nothing because the quality loss is linear with parameter reduction — there’s no “free” sparsity to exploit.

For Understanding Sparsity

Paper 73’s kurtosis=33 is a property of the OUTCOME, not the PROCESS. The training signal starts dense and CONVERGES to sparse. This is classic in optimization: gradients are dense early, sparse late. The final sparse structure cannot be predicted from the initial state.

For Compressed Training

The 50% sparsity point (1.14% of total params, 52% efficiency) is useful if the training budget is extremely constrained. Each parameter at 50% sparsity is roughly as efficient as at 100% (52% efficiency for 50% of the scores), so there is essentially no per-parameter efficiency loss, just fewer parameters.
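The per-parameter claim follows from the table by dividing quality efficiency by the fraction of scores trained:

```python
# (quality efficiency) / (fraction of scores trained), from the sweep table.
points = {1.00: 1.00, 0.50: 0.52, 0.25: 0.23}
for frac, eff in points.items():
    print(f"{frac:.0%} of scores: {eff / frac:.2f}x per-param efficiency")
```

The ratio stays close to 1.0 (1.04x at 50%, 0.92x at 25%), which is what "no free sparsity" looks like: you pay for every parameter you drop.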

The Minimum Viable Training Signal

At 0.23% of total params (37K trainable scores), training goes negative. This is the empirical floor — below ~0.5% of total parameters, the model cannot learn from the amplitude-only signal. The true minimum viable SFT signal is approximately 93K parameters (0.57%).
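As a sanity check, the table's counts imply a full model of roughly 16.2M parameters (371,184 scores = 2.29% of total), which places the ~93K floor at the quoted 0.57%:

```python
# Implied model size and floor fraction, derived from the sweep table.
total_params = 371_184 / 0.0229   # 371,184 scores are 2.29% of the model
floor = 92_796                    # smallest trained-score count that helps
print(f"total = {total_params:,.0f} params, floor = {floor / total_params:.2%}")
```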


“The residual is sparse at the end. But the path to sparse is dense.”