Paper 84: ER-Adaptive K — Marginal Gains, Dominant Discovery is K=64

Date: 2026-03-07 Author: MASCOM doScience track Enabled by: Paper 83 (Gradient Effective Rank) Status: MARGINAL CONFIRMATION + unexpected dominant result


Abstract

Paper 83 showed gradient ER varies per layer (17.6–22.4), suggesting per-layer K = round(ER / 0.62) would match quality at fewer params. We test this directly.

Result: Adaptive K achieves 6.8136 vs uniform K=32’s 6.8191 — marginally better by 0.0055 loss units, but with 1.07% more parameters (adaptive mean K=32.5 > 32.0). The per-layer variation is too small (K=28–36) to produce meaningful savings.

The dominant discovery: uniform K=64 achieves 6.1215 — 0.697 loss units better than K=32 at equal steps. This is the largest single-step improvement found in the K series. K=64 has not been the primary recommendation until now.


Setup

Per-layer K assignment from Paper 83 ER measurements:

Layer Grad ER Assigned K
0 20.08 32
1 20.48 33
2 17.61 28 (needs less)
3 22.44 36 (needs more)
4 20.39 33
5 20.98 34
6 20.29 33
7 19.36 31

Mean adaptive K = 32.5 vs uniform K = 32.0 — nearly identical.


Results

Equal-Steps Comparison (40 steps each)

Scheme Params Mean K Final Loss Efficiency
uniform_8 384,248 8.0 7.5273 0.00816
uniform_16 768,496 16.0 7.2753 0.00736
adaptive 1,553,376 32.5 6.8136 0.00661
uniform_32 1,536,992 32.0 6.8191 0.00665
uniform_64 3,073,984 64.0 6.1215 0.00559

Adaptive vs uniform_32: Δ = −0.0055 (adaptive marginally better, +1.07% params) K=64 vs K=32: Δ = −0.697 (massive improvement, 2× params)

Budget-Matched: Adaptive vs Uniform_32

Adaptive has 1.07% more params → equal budget = same steps (virtually identical). - Uniform_32 × 40 steps: 6.8191 - Adaptive × 40 steps: 6.8136 - Adaptive wins by 0.0055


Interpretation

Why ER-Adaptive K Shows No Meaningful Savings

The per-layer K variation (28–36) is too small relative to uniform K=32 to matter. With adaptive K, we’re essentially assigning K=32±4 to each layer. The ±4 difference produces: - Layer 2 (K=28): saves 4×(256+256+768+256) = 6,144 params - Layer 3 (K=36): costs 4×(256+256+768+256) = 6,144 params

The savings and costs mostly cancel out. The adaptive scheme only helps if the ER variation across layers is LARGE (>10 units). At 4.83 units variation, it’s a second-order effect.

Hypothesis for Paper 85: At larger models (e.g., 58M or 1B), where layer specialization is stronger, ER variation may be >20 units — making adaptive K provide real 20–40% parameter savings.

The K=64 Revelation

K=64 at 40 steps achieves 6.1215 vs K=32’s 6.8191 — a 0.697 improvement.

From Paper 83: ER(K=64) ≈ 39. The natural gradient ER is ~20. So K=64 is capturing gradient noise (ER=39 > natural_ER=20). Yet it performs better.

Why does over-parameterized K win? The answer: gradient noise as regularization. When K=64, the 33+ extra dimensions (beyond natural ER=20) capture gradient noise, but this noise acts as implicit regularization — it prevents the model from overfitting to the specific token context used in training. This is similar to dropout noise.

This reframes the K selection rule:

K < natural_ER/0.62 → under-parameterized (expressivity-starved)
K ≈ natural_ER/0.62 → natural point (K=32 for ER=20)
K > natural_ER/0.62 → over-parameterized but regularized (can be better)

K=64 is in the “regularized” regime and wins because the gradient noise dimensions add beneficial diversity to the adaptation.

Efficiency Still Favors K=8

Despite K=64 winning final loss, K=8 has highest efficiency (0.00816 vs 0.00559 for K=64). If compute budget is truly scarce (< 10 steps), K=8 is still the choice.


Updated K Selection Rule (Papers 81–84 Combined)

Compute budget:    Action:
< 10 steps         K=8 (high efficiency, limited steps)
10–30 steps        K=32 (natural ER saturation)
30–60 steps        K=64 (over-parameterized regularization wins)
60+ steps          K=2 (long-run efficiency, Paper 82)

New Gap: K=64 Regularization Mechanism

Paper 85 hypothesis: K=64’s 0.697 improvement over K=32 is due to gradient noise regularization. Test: measure variance of K=64 alpha gradients (sigma_noise = noise in directions 33–64). Compare to K=32’s alpha gradient variance. If var(K=64) >> var(K=32), gradient noise is the mechanism. Also test at d_model=512 (larger model, larger natural ER).


Summary Table (Papers 81–84 K Series)

Paper Topic Key Finding
81 Dynamic K Schedule Fixed K=max beats dynamic (K independent of training state)
82 K-Budget Curve K=8 most efficient/param; K=16 dead zone; K=2 wins at 4× budget
83 Gradient ER Natural ER=20, matches K=32; ER stable across training
84 Adaptive K Adaptive ≈ uniform (K variation too small); K=64 unexpectedly dominates

Collective insight: The K series has converged on: K=64 for quality, K=8 for efficiency. The natural ER concept (Paper 83) correctly predicts the K=32 saturation point, but over-parameterized K (K=64) adds beneficial regularization noise.