Date: 2026-03-07
Author: MASCOM doScience track
Enabled by: Paper 81 (Dynamic K Scheduling)
Status: CONFIRMED with unexpected discovery
Paper 81 established that fixed K=32 beats dynamic K schedules at a fixed step count. This experiment asks: does K=32 still win when the total compute budget (param-steps) is held equal across all K values? We test K ∈ {2, 4, 8, 16, 32, 64}, each with a budget-matched step count. Discovery: the K-budget curve is non-monotonic. K=16 is a local maximum in final loss (the least efficient K at budget parity), K=8 beats K=16, and loss decreases monotonically again above K=16. At budget parity, K=64 wins marginally. The K=32 crossover over budget-matched K=2 occurs at only 20 steps (0.5× the standard budget) — far earlier than expected.
Budget anchor: K=32 × 40 steps × 1,536,992 params = 61.5M param-steps. For each K, steps_equiv = budget / trainable_params(K).

| K | Params | Steps | Total Param-Steps |
|---|---|---|---|
| 2 | 96,062 | 640 | 61.5M |
| 4 | 192,124 | 320 | 61.5M |
| 8 | 384,248 | 160 | 61.5M |
| 16 | 768,496 | 80 | 61.5M |
| 32 | 1,536,992 | 40 | 61.5M |
| 64 | 3,073,984 | 20 | 61.5M |
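The step counts in the table follow directly from the budget anchor. A minimal sketch, assuming trainable params scale linearly with K (params(K) = K × 48,031, as the table implies):

```python
# Budget-matched step counts: steps(K) = budget // trainable_params(K).
# Assumes trainable params scale linearly with K, i.e. params(K) = K * 48_031,
# consistent with the table above (K=2 -> 96,062 params).
PARAMS_PER_DIM = 48_031
BUDGET = 32 * PARAMS_PER_DIM * 40  # anchor: K=32 x 40 steps = 61,479,680

def budget_matched_steps(k: int, budget: int = BUDGET) -> int:
    """Steps that keep total param-steps equal across K."""
    return budget // (k * PARAMS_PER_DIM)

for k in (2, 4, 8, 16, 32, 64):
    print(k, budget_matched_steps(k))
```

Because each K here is a power of two dividing the anchor, the division is exact and every row lands on the same 61.5M param-steps.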
Results at budget parity:

| K | Steps | Final Loss | Note |
|---|---|---|---|
| 2 | 640 | 9.0486 | Worst |
| 4 | 320 | 8.8309 | |
| 8 | 160 | 8.6614 | Local minimum |
| 16 | 80 | 8.6786 | Local maximum (efficiency dip) |
| 32 | 40 | 8.6455 | Recovers |
| 64 | 20 | 8.6268 | Best |
Loss vs K (budget-matched):

    9.05 | *
    8.88 |     *
    8.66 |         *
    8.68 |             *   ← K=16 dip
    8.65 |                 *
    8.63 |                     *
         K=2   4   8   16  32  64
Key finding: K=16 is a local maximum in the budget-efficiency curve. K=8 (more steps) and K=32 (fewer steps, more dimensions) both outperform K=16 at budget parity.
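The extrema claim can be checked mechanically. A small sketch over the measured losses (values copied from the results table; `interior_extrema` is a helper name of this sketch, not project code):

```python
# Detect strict interior local extrema in the budget-matched loss curve.
ks =     [2, 4, 8, 16, 32, 64]
losses = [9.0486, 8.8309, 8.6614, 8.6786, 8.6455, 8.6268]

def interior_extrema(xs, ys):
    """Return (x, 'min'|'max') for each strict interior extremum of ys."""
    out = []
    for i in range(1, len(ys) - 1):
        if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]:
            out.append((xs[i], "min"))
        elif ys[i] > ys[i - 1] and ys[i] > ys[i + 1]:
            out.append((xs[i], "max"))
    return out

print(interior_extrema(ks, losses))  # -> [(8, 'min'), (16, 'max')]
```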
K=2 at scaled budgets:

| Budget (× standard) | Steps | Final Loss | vs K=32×40 |
|---|---|---|---|
| 1× | 640 | 9.0486 | worse |
| 2× | 1,280 | 8.6755 | slightly worse (+0.03) |
| 4× | 2,560 | 8.4536 | better (-0.19) |
| 8× | 5,120 | 8.3275 | much better (-0.32) |
K=2 with 4× the budget beats K=32 standard. K=2 ultimately converges better — it just needs proportionally more time.
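The per-doubling gains shrink but stay positive, consistent with ordinary convergence slowdown rather than a plateau. A quick check on the table's numbers:

```python
# Loss improvement per budget doubling for K=2 (1x -> 8x standard budget).
# Losses copied from the scaled-budget table above.
k2_losses = {1: 9.0486, 2: 8.6755, 4: 8.4536, 8: 8.3275}

mults = sorted(k2_losses)
deltas = [round(k2_losses[a] - k2_losses[b], 4)
          for a, b in zip(mults, mults[1:])]
print(deltas)  # -> [0.3731, 0.2219, 0.1261]
```

Each doubling buys roughly 60% of the previous doubling's improvement, so K=2 keeps gaining but ever more slowly.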
Step-matched crossover (each K=2 run uses 16× the K=32 steps to match param-steps):

| K=32 Steps | K=32 Loss | K=2 Equiv Steps | K=2 Loss | K=32 Wins? |
|---|---|---|---|---|
| 5 | 9.8509 | 80 | 9.7864 | No |
| 10 | 9.6645 | 160 | 9.6001 | No |
| 20 | 9.2946 | 320 | 9.3516 | Yes |
| 40 | 8.6455 | 640 | 9.0486 | Yes |
| 80 | 7.7454 | 1,280 | 8.6755 | Yes |
K=32 crossover at 20 steps — only half the standard 40-step budget. Below 20 steps, K=2 with proportionally more steps wins.
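Locating the crossover is a single scan over the table rows. A sketch with the measured (K=32 steps, K=32 loss, budget-matched K=2 loss) triples:

```python
# First K=32 step count at which K=32 beats budget-matched K=2.
# Triples copied from the crossover table above.
rows = [
    (5,  9.8509, 9.7864),
    (10, 9.6645, 9.6001),
    (20, 9.2946, 9.3516),
    (40, 8.6455, 9.0486),
    (80, 7.7454, 8.6755),
]

crossover = next(steps for steps, k32, k2 in rows if k32 < k2)
print(crossover)  # -> 20
```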
The non-monotonic curve has a specific cause: at K=16, the budget-matched step count (80) is too small to converge the enlarged parameter set, yet K=16 also lacks enough dimensions to compensate through raw expressivity.
Think of it as two competing regimes:

- Low-K regime (K=2..8): more steps → better gradient averaging → efficient convergence
- High-K regime (K=32..64): more dimensions → can express fine structure in fewer steps
K=16 sits at the inefficient boundary: enough dimensions to spend budget fast, but not enough to compensate with expressivity. It’s the “worst of both worlds” point.
This boundary appears to be near the effective rank of the gradient signal — once K exceeds the effective rank, additional dimensions carry gradient noise rather than signal, and the efficiency drops.
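One concrete way to operationalize effective rank is the spectral-entropy definition, erank = exp(H(p)), where the p_i are the normalized singular values of the gradient matrix. A stdlib-only sketch; the spectrum below is illustrative, not a measured gradient spectrum:

```python
import math

def effective_rank(singular_values):
    """Spectral-entropy effective rank: exp(H(p)) with p_i = s_i / sum(s)."""
    total = sum(singular_values)
    ps = [s / total for s in singular_values if s > 0]
    return math.exp(-sum(p * math.log(p) for p in ps))

# Hypothetical spectrum: ~12 strong gradient directions plus a weak noise tail.
spectrum = [1.0] * 12 + [0.01] * 52
print(effective_rank(spectrum))  # well below 32, despite 64 nonzero values
```

Under this definition, a K above the effective rank buys dimensions that mostly carry the noise tail, matching the efficiency-drop story above.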
For a fixed budget B (param-steps):

    optimal_K(B) = argmin_K loss(B / trainable_params(K), K)
                 ≈ 8   for B ≤ 30M param-steps
                 ≈ 32+ for B ≥ 60M param-steps
The crossover: K=32 becomes budget-efficient above B_crossover ≈ 20 × 1.5M = 30M param-steps.
K=2 at 8× budget (5,120 steps) achieves 8.3275 — better than K=32 at 40 steps (8.6455). This means K=2 is the long-run winner if you have unlimited compute.
This has a profound implication for CT-SFT scaling: as models get larger and compute becomes abundant, smaller K with more steps is better. The optimal K decreases as training budget increases. Rich training → prefer K=2.
Non-monotonic K-budget curve: K=16 is a local maximum in loss, i.e. a local minimum in budget efficiency. K=8 beats K=16 at budget parity. The curve has a clear "dead zone" around K=12–20.
K=32 crossover at 20 steps (not 40): only 0.5× standard budget is needed for K=32 to beat proportionally-scaled K=2.
K=2 wins at 4× budget: the high-efficiency of small K eventually dominates. Budget-rich training prefers small K.
K=64 is the current budget-matched optimum: marginally beats K=32 at equal param-steps, confirming that above K=32 the efficiency curve flattens.
if budget < 30M param-steps: use K=8 (efficient convergence)
elif budget < 100M param-steps: use K=64 (current sweet spot ~60M; K=32 close behind)
elif budget < 400M param-steps: use K=32 (expressivity wins)
else: use K=2 (more steps, better long-run)
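One way to make the rule executable. The thresholds come from the rule above; the 30–100M branch is assigned K=64 here because the findings name it the budget-matched optimum at the ~60M anchor — an interpretation of this sketch, not a tested prescription:

```python
def choose_k(budget_param_steps: float) -> int:
    """Budget-aware K selection following the decision rule above."""
    M = 1e6
    if budget_param_steps < 30 * M:
        return 8    # step-rich: efficient convergence
    if budget_param_steps < 100 * M:
        return 64   # budget-matched optimum near the ~60M anchor
    if budget_param_steps < 400 * M:
        return 32   # expressivity regime (extrapolated, untested range)
    return 2        # budget-rich: more steps win in the long run

print(choose_k(61.5e6))  # -> 64
```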
The K=16 dead zone implies an effective rank threshold in the gradient signal. Paper 83: Gradient Effective Rank in CT-SFT — measure the effective rank of the gradient matrix at each step, and show it matches the K=16 dip. If confirmed, we can select K dynamically by measuring gradient effective rank — a principled, automatic K selection rule.
Budget: 61,479,680 param-steps
K=2 640 steps: final=9.0486
K=4 320 steps: final=8.8309
K=8 160 steps: final=8.6614 ← local min
K=16 80 steps: final=8.6786 ← local max (dip)
K=32 40 steps: final=8.6455
K=64 20 steps: final=8.6268
K=2 at 2×budget (1280 steps): 8.6755
K=2 at 4×budget (2560 steps): 8.4536 (beats K=32×40)
K=2 at 8×budget (5120 steps): 8.3275
K=32 crossover: at 20 steps, K=32 (9.2946) < K=2 equiv (9.3516) — K=32 wins