Paper 86: True Unconstrained Gradient ER — Dynamic Gradient Dimensionality

Date: 2026-03-07
Author: MASCOM doScience track
Enabled by: Paper 85 (K=64 bottom dims are signal, not noise)
Status: CONFIRMED — true gradient ER is dynamic, not static


Abstract

Paper 83 measured “natural ER ≈ 20” through the K=32 LoRA basis. Paper 85 showed K=64 dims are real signal (not noise). This paper measures the TRUE unconstrained gradient ER directly from ∂L/∂W (no K-basis limitation).

Key findings:

1. True gradient ER is DYNAMIC: 5.55 → 19.17 → 9.33 across training steps.
2. At step 1, only ER=5.55 is needed — K=8 achieves 98% coverage.
3. At the peak (step 20), ER=19.17 — K=32 is barely adequate (K=32's expressible ER of 19.87 ≈ 19.17).
4. K=32's measured ER of ~20 in Paper 83 was NOT the true gradient ER — it was the ER of what K=32 CAN express, not what the gradient NEEDS.
5. Per-layer true ER at step 40: 7.74–9.88 (optimal K=12–16 per layer).

Reframing the K selection problem: K selection should match the PEAK true gradient ER across training, not the step-1 value. The peak is ~19 (mean) to ~34 (max) over steps 10–20, confirming K=32 as the natural match. K=64's advantage remains unexplained by ER alone — its mechanism is likely SVD basis coverage (hypothesis for Paper 87).


Setup

Full gradient measurement: create parameter copies of all 2D weight matrices with requires_grad=True, run SGD for N steps, capture ∂L/∂W at the final step.
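The measurement above can be sketched as follows. This is a minimal PyTorch illustration with a toy linear model and random data standing in for the actual model and loss (both are assumptions); the structure — a `requires_grad=True` parameter copy, an SGD loop, and a gradient capture at the final step — follows the setup described.

```python
# Sketch: capture the full unconstrained gradient dL/dW after N SGD steps.
# The 16x16 weight, the data, and the squared-error loss are toy stand-ins.
import torch

torch.manual_seed(0)
W = torch.randn(16, 16, requires_grad=True)  # parameter copy of a 2D weight matrix
x = torch.randn(8, 16)
y = torch.randn(8, 16)

opt = torch.optim.SGD([W], lr=0.1)
N = 5
grad = None
for step in range(N):
    loss = ((x @ W - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    if step == N - 1:
        grad = W.grad.detach().clone()  # dL/dW at the final step
    opt.step()
```

The captured `grad` is what the effective-rank analysis below operates on.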

Effective rank via the Roy–Vetterli formula: ER = exp(H(σ/||σ||₁)), where σ is the vector of singular values and H is Shannon entropy.
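The formula is small enough to state directly in code; this is a straightforward NumPy transcription (the `eps` guard against log(0) is an implementation detail, not part of the formula):

```python
# Effective rank (Roy & Vetterli): ER = exp(H(sigma / ||sigma||_1)),
# where H is the Shannon entropy of the L1-normalized singular values.
import numpy as np

def effective_rank(G: np.ndarray, eps: float = 1e-12) -> float:
    s = np.linalg.svd(G, compute_uv=False)
    p = s / (s.sum() + eps)            # L1-normalize the singular values
    H = -np.sum(p * np.log(p + eps))   # Shannon entropy in nats
    return float(np.exp(H))
```

Sanity checks: a rank-1 matrix has ER ≈ 1, and the n×n identity has ER = n.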

All SVDs are computed on CPU (MPS svdvals fails silently for large matrices — a critical methodological fix from an initial failed run).


Phase A: True Gradient ER Across Training Steps

Step   Loss     True ER (mean)   True ER (max)   K=32 covers   K=64 covers
1      7.8407   5.55             14.25           358.2%        704.3%
5      4.9717   11.07            31.36           181.8%        352.8%
10     4.6185   15.49            33.77           131.1%        252.2%
20     3.9913   19.17            34.91           106.3%        203.7%
40     3.6535   9.33             25.16           215.7%        418.7%

The gradient dimensionality grows and then shrinks:

- Steps 1–5: Low-dimensional gradients (ER=5–11). The model is far from the optimum; the gradient signal concentrates in a few principal directions.
- Steps 10–20: Peak complexity (ER=15–19). The model is fitting; gradients spread across more directions as different features require adjustment.
- Steps 40+: Simplification (ER≈9). The model has fit the main directions; the residual gradient is lower-dimensional (refinement phase).

This contradicts Paper 83’s conclusion: Paper 83 measured ER(K=32-projected gradient) ≈ 20 across all steps. That measurement was stable because it was measuring what K=32 CAN express, not what the true gradient needs. The true gradient is only ~20 at step 20 — not throughout.


Phase B: K-Saturation Curve at Step 1

K    K-Constrained ER   % of True ER (5.55)   Status
2    1.650              29.7%                 severely under-parameterized
4    2.995              54.0%                 under-parameterized
8    5.435              98.0%                 near-natural (95%+)
16   10.146             182.9%                over-parameterized
32   19.872             358.2%                massively over-parameterized
64   39.065             704.3%                massively over-parameterized

At step 1: K=8 is OPTIMAL — it achieves 98% of the true gradient ER with 4× fewer parameters than K=32. K=32 is massively over-parameterized at step 1.

K needed for 95% true ER coverage at step 1: K≈8
K needed for peak true ER (step 20, ER=19.17): K≈31


Phase C: Per-Layer True Gradient ER at Step 40

Layer   True ER   K=32 Covers   Optimal K
0       9.88      203.3%        K≈16
1       9.18      223.1%        K≈15
2       9.31      189.2%        K≈15
3       9.14      245.4%        K≈15
4       8.92      228.6%        K≈14
5       8.64      242.7%        K≈14
6       7.93      255.7%        K≈13
7       7.74      250.2%        K≈12

At step 40 (refinement phase), all layers need only K=12–16. K=32 is 2–2.5× over-parameterized at this stage. Earlier layers (L0–L2) need slightly more K than later layers (L6–L7).

Note: Paper 83’s per-layer K assignments (K=28–36) were based on K=32-constrained ER, not true ER. True optimal K at step 40 is K=12-16 (much smaller). This suggests Paper 84’s adaptive K should use TRUE ER, not K-constrained ER.


Interpretation

Why Paper 83’s ER=20 Was Not the True ER

Paper 83 computed: ER(G_approx) = ER(U_32 @ diag(∂L/∂alpha) @ Vh_32)

This measures the effective rank of the K=32-projected gradient, which is always ≤ K=32. It measured ~20 because the K=32 basis was being used nearly fully — 20 out of 32 basis vectors were receiving significant gradient signal. But it did NOT measure the true dimensionality of ∂L/∂W.

The true ∂L/∂W has fewer significant directions than 32 (true ER = 5–19 depending on step). K=32 was over-parameterized: it provided 32 directions to capture a gradient that only needed 5–19.
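A toy check of the argument above: the quantity Paper 83 computed, ER(U_K @ diag(g) @ Vh_K), equals the ER of the coefficient gradient g, because U_K and Vh_K are orthonormal. It therefore reports basis USAGE (always ≤ K), independent of the true gradient's dimensionality. The bases and coefficient gradient below are random stand-ins, not measured values.

```python
# Toy check: with orthonormal bases, ER(U_K @ diag(g) @ Vh_K) = ER(g).
# 20 strong coefficients out of K=32 gives ER ~ 20 regardless of dL/dW.
import numpy as np

rng = np.random.default_rng(0)
K = 32

def er(s, eps=1e-12):
    p = s / (s.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))

# Orthonormal K-dim bases (as U_32, Vh_32 from an SVD would be).
U = np.linalg.qr(rng.normal(size=(64, K)))[0]       # 64 x 32, orthonormal cols
Vh = np.linalg.qr(rng.normal(size=(64, K)))[0].T    # 32 x 64, orthonormal rows
# Coefficient gradient: 20 strong directions, 12 nearly dead.
g = np.concatenate([np.ones(20), np.full(12, 1e-3)])
M = U @ np.diag(g) @ Vh

print(er(np.linalg.svd(M, compute_uv=False)))  # ~ 20, the basis usage
```

The printed value tracks how many basis coefficients receive signal, not the rank of any underlying ∂L/∂W — which is exactly the measurement artifact described above.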

Dynamic Gradient ER Explains Paper 81’s Dynamic K Failure

Paper 81 found dynamic K schedules (ascending K or descending K) beat static K only for a brief period, and static K=max won overall. Now we understand why:

The true gradient ER evolves as: low (5–11) → peak (15–19) → lower (9). An ascending K schedule that increases K as training progresses would match this curve. Paper 81 tested ascending schedules like K: 2→4→8→16, which does match the increasing true ER. But those experiments failed because:

1. The K transitions destroyed accumulated alpha information.
2. The K increments were too coarse (doubling K vs the gradual increase needed).

Paper 81's failure was one of implementation, not concept. An ascending K schedule matched to true ER dynamics is theoretically sound but requires smooth K transitions. This is Paper 88's hypothesis.

The K=64 Paradox (Open Question for Paper 87)

K=64 is massively over-parameterized relative to true gradient ER (K=64 ER=39 vs true ER 5–19). Yet K=64 wins by 0.697 loss units over K=32. This cannot be explained by ER alone.

Hypothesis (Paper 87): K=64’s advantage is SVD basis COVERAGE, not ER. The SVD basis of size K includes vectors that better align with the true gradient direction. With K=64, the projection of ∂L/∂W onto the K-basis has higher cosine similarity to the full gradient than K=32’s projection. Even if the true gradient has ER=10, K=64 captures a more ACCURATE direction within the 10-dimensional subspace.

Test: compute cos_sim(∂L/∂W, proj_K(∂L/∂W)) for K=8, 16, 32, 64. If K=64 cos_sim > K=32 cos_sim even when true ER < 32, this is the mechanism.
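The proposed test can be sketched directly; here the gradient is a random stand-in for the measured ∂L/∂W (an assumption), and proj_K is the top-K SVD truncation:

```python
# Sketch of the Paper 87 test: Frobenius cosine similarity between a
# gradient and its top-K SVD projection, for increasing K.
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(128, 128))  # random stand-in for dL/dW

def cos_after_topk(G, K):
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    G_K = (U[:, :K] * s[:K]) @ Vh[:K]  # rank-K (top-K SVD) projection of G
    return float(np.sum(G * G_K) / (np.linalg.norm(G) * np.linalg.norm(G_K)))

for K in (8, 16, 32, 64):
    print(K, round(cos_after_topk(G, K), 3))  # similarity rises with K
```

For this projection the similarity is monotone in K (it equals the square root of the captured fraction of Σσᵢ²); the interesting question for Paper 87 is how large the K=32 → K=64 gap is for real gradients whose true ER is well below 32.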


Updated K Selection Framework (Papers 81–86)

Phase of training:         True ER needed:    Optimal K:
Step 1–5 (initial)         5–11               K=8  (K=8 achieves 98% at step 1)
Step 5–15 (rising)         11–16              K=16 (transition zone)
Step 15–25 (peak)          17–20              K=32 (matches natural peak)
Step 25–40 (refinement)    9–15               K=12 (over-parameterized with K=32)

IDEAL schedule: K=8 → K=32 → K=16 with smooth transitions (Paper 88)
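The schedule above can be written as a step-to-K mapping; the thresholds below follow the table but are illustrative, and the smooth-transition logic is deferred to Paper 88.

```python
# Illustrative step-to-K schedule matched to the measured true-ER phases.
# Thresholds are assumptions; transitions would need to be smooth (Paper 88).
def k_schedule(step: int) -> int:
    if step <= 5:    # initial phase: true ER 5-11
        return 8
    if step <= 15:   # rising phase: true ER 11-16
        return 16
    if step <= 25:   # peak phase: true ER 17-20
        return 32
    return 16        # refinement phase: true ER 9-15 (K=12-16 range)
```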

But: K=64 still dominates at 40 steps despite being over-parameterized. This suggests an additional mechanism (SVD coverage, Paper 87) beyond ER matching.


New Gaps

Paper 87: SVD basis coverage — cos_sim(true gradient, K-projection) vs K.
Paper 88: True-ER-matched dynamic K schedule — ascending K=8→32→16 matched to true ER dynamics.


Summary Table (Papers 81–86 K Series)

Paper   Topic                  Key Finding
81      Dynamic K Schedule     Fixed K beats tested dynamic schedules
82      K Budget Curve         K=16 dead zone; K=64 best quality
83      Gradient ER            K-constrained ER=20; concluded K=32 natural (REVISED)
84      Adaptive K             Adaptive ≈ uniform; K=64 dominates by 0.697
85      K64 Noise Mechanism    K=64 dims are signal; true ER≥39 (REVISED AGAIN)
86      True Gradient ER       True ER=5→19→9 (dynamic); K=8 optimal at step 1; Paper 83 measured an artifact

The K series has revealed that gradient dimensionality is a DYNAMIC quantity, not a static one. Early training needs K=8; mid-training needs K=32; late training needs K=12–16. K=64 wins via a still-unexplained mechanism (likely SVD basis coverage, Paper 87), not ER matching.