Paper 86: True Unconstrained Gradient ER — Dynamic Gradient Dimensionality

Date: 2026-03-07
Author: MASCOM doScience track
Enabled by: Paper 85 (K=64 bottom dims are signal, not noise)
Status: CONFIRMED — true gradient ER is dynamic, not static


Abstract

Paper 83 measured “natural ER ≈ 20” through the K=32 LoRA basis. Paper 85 showed K=64 dims are real signal (not noise). This paper measures the TRUE unconstrained gradient ER directly from ∂L/∂W (no K-basis limitation).

Key findings:

1. True gradient ER is DYNAMIC: 5.55 → 19.17 → 9.33 across training steps.
2. At step 1, only ER=5.55 is needed — K=8 achieves 98% coverage.
3. At the peak (step 20), ER=19.17 — K=32 is barely adequate (K=32's expressible ER of 19.87 ≈ 19.17).
4. K=32's measured ER of ~20 in Paper 83 was NOT the true gradient ER — it was the ER of what K=32 CAN express, not what the gradient NEEDS.
5. Per-layer true ER at step 40: 7.74–9.88 (optimal K=12–16 per layer).

Reframing the K selection problem: K selection should match the PEAK true gradient ER across training, not the step-1 value. The peak is ~19 (mean) to ~34 (max) over steps 10–20, confirming K=32 as the natural match. K=64's advantage remains unexplained by ER alone — its mechanism is likely SVD basis coverage (hypothesis for Paper 87).


Setup

Full gradient measurement: create parameter copies of all 2D weight matrices with requires_grad=True, run SGD for N steps, capture ∂L/∂W at the final step.
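The measurement above can be sketched as follows. This is a minimal PyTorch illustration with a toy linear model and random data standing in for the actual model and loss (both are assumptions); the structure — a `requires_grad=True` parameter copy, an SGD loop, and a gradient capture at the final step — follows the setup described.

```python
# Sketch: capture the full unconstrained gradient dL/dW after N SGD steps.
# The 16x16 weight, the data, and the squared-error loss are toy stand-ins.
import torch

torch.manual_seed(0)
W = torch.randn(16, 16, requires_grad=True)  # parameter copy of a 2D weight matrix
x = torch.randn(8, 16)
y = torch.randn(8, 16)

opt = torch.optim.SGD([W], lr=0.1)
N = 5
grad = None
for step in range(N):
    loss = ((x @ W - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    if step == N - 1:
        grad = W.grad.detach().clone()  # dL/dW at the final step
    opt.step()
```

The captured `grad` is what the effective-rank analysis below operates on.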

Effective rank via the Roy–Vetterli formula: ER = exp(H(σ/||σ||₁)), where σ is the vector of singular values and H is Shannon entropy.
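The formula is small enough to state directly in code; this is a straightforward NumPy transcription (the `eps` guard against log(0) is an implementation detail, not part of the formula):

```python
# Effective rank (Roy & Vetterli): ER = exp(H(sigma / ||sigma||_1)),
# where H is the Shannon entropy of the L1-normalized singular values.
import numpy as np

def effective_rank(G: np.ndarray, eps: float = 1e-12) -> float:
    s = np.linalg.svd(G, compute_uv=False)
    p = s / (s.sum() + eps)            # L1-normalize the singular values
    H = -np.sum(p * np.log(p + eps))   # Shannon entropy in nats
    return float(np.exp(H))
```

Sanity checks: a rank-1 matrix has ER ≈ 1, and the n×n identity has ER = n.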

All SVDs are computed on CPU (MPS svdvals fails silently for large matrices — a critical methodological fix from an initial failed run).


Phase A: True Gradient ER Across Training Steps

Step   Loss     True ER (mean)   True ER (max)   K=32 covers   K=64 covers
1      7.8407   5.55             14.25           358.2%        704.3%
5      4.9717   11.07            31.36           181.8%        352.8%
10     4.6185   15.49            33.77           131.1%        252.2%
20     3.9913   19.17            34.91           106.3%        203.7%
40     3.6535   9.33             25.16           215.7%        418.7%

The gradient dimensionality grows and then shrinks:

- Steps 1–5: Low-dimensional gradients (ER=5–11). The model is far from the optimum; the gradient signal concentrates in a few principal directions.
- Steps 10–20: Peak complexity (ER=15–19). The model is fitting; gradients spread across more directions as different features require adjustment.
- Steps 40+: Simplification (ER≈9). The model has fit the main directions; the residual gradient is lower-dimensional (refinement phase).

This contradicts Paper 83’s conclusion: Paper 83 measured ER(K=32-projected gradient) ≈ 20 across all steps. That measurement was stable because it was measuring what K=32 CAN express, not what the true gradient needs. The true gradient is only ~20 at step 20 — not throughout.


Phase B: K-Saturation Curve at Step 1

K    K-Constrained ER   % of True ER (5.55)   Status
2    1.650              29.7%                 severely under-parameterized
4    2.995              54.0%                 under-parameterized
8    5.435              98.0%                 near-natural (95%+)
16   10.146             182.9%                over-parameterized
32   19.872             358.2%                massively over-parameterized
64   39.065             704.3%                massively over-parameterized

At step 1: K=8 is OPTIMAL — it achieves 98% of the true gradient ER with 4× fewer parameters than K=32. K=32 is massively over-parameterized at step 1.

K needed for 95% true ER coverage at step 1: K≈8
K needed for peak true ER (step 20, ER=19.17): K≈31


Phase C: Per-Layer True Gradient ER at Step 40

Layer   True ER   K=32 Covers   Optimal K
0       9.88      203.3%        K≈16
1       9.18      223.1%        K≈15
2       9.31      189.2%        K≈15
3       9.14      245.4%        K≈15
4       8.92      228.6%        K≈14
5       8.64      242.7%        K≈14
6       7.93      255.7%        K≈13
7       7.74      250.2%        K≈12

At step 40 (refinement phase), all layers need only K=12–16. K=32 is 2–2.5× over-parameterized at this stage. Earlier layers (L0–L2) need slightly more K than later layers (L6–L7).

Note: Paper 83’s per-layer K assignments (K=28–36) were based on K=32-constrained ER, not true ER. True optimal K at step 40 is K=12-16 (much smaller). This suggests Paper 84’s adaptive K should use TRUE ER, not K-constrained ER.


Interpretation

Why Paper 83’s ER=20 Was Not the True ER

Paper 83 computed: ER(G_approx) = ER(U_32 @ diag(∂L/∂alpha) @ Vh_32)

This measures the effective rank of the K=32-projected gradient, which is always ≤ K=32. It measured ~20 because the K=32 basis was being used nearly fully — 20 out of 32 basis vectors were receiving significant gradient signal. But it did NOT measure the true dimensionality of ∂L/∂W.

The true ∂L/∂W has fewer significant directions than 32 (true ER = 5–19 depending on step). K=32 was over-parameterized: it provided 32 directions to capture a gradient that only needed 5–19.
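A toy check of the argument above: the quantity Paper 83 computed, ER(U_K @ diag(g) @ Vh_K), equals the ER of the coefficient gradient g, because U_K and Vh_K are orthonormal. It therefore reports basis USAGE (always ≤ K), independent of the true gradient's dimensionality. The bases and coefficient gradient below are random stand-ins, not measured values.

```python
# Toy check: with orthonormal bases, ER(U_K @ diag(g) @ Vh_K) = ER(g).
# 20 strong coefficients out of K=32 gives ER ~ 20 regardless of dL/dW.
import numpy as np

rng = np.random.default_rng(0)
K = 32

def er(s, eps=1e-12):
    p = s / (s.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))

# Orthonormal K-dim bases (as U_32, Vh_32 from an SVD would be).
U = np.linalg.qr(rng.normal(size=(64, K)))[0]       # 64 x 32, orthonormal cols
Vh = np.linalg.qr(rng.normal(size=(64, K)))[0].T    # 32 x 64, orthonormal rows
# Coefficient gradient: 20 strong directions, 12 nearly dead.
g = np.concatenate([np.ones(20), np.full(12, 1e-3)])
M = U @ np.diag(g) @ Vh

print(er(np.linalg.svd(M, compute_uv=False)))  # ~ 20, the basis usage
```

The printed value tracks how many basis coefficients receive signal, not the rank of any underlying ∂L/∂W — which is exactly the measurement artifact described above.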

Dynamic Gradient ER Explains Paper 81’s Dynamic K Failure

Paper 81 found dynamic K schedules (ascending K or descending K) beat static K only for a brief period, and static K=max won overall. Now we understand why:

The true gradient ER evolves as: low (5–11) → peak (15–19) → lower (9). An ascending K schedule that increases K as training progresses would match this curve. Paper 81 tested ascending schedules like K: 2→4→8→16, which does match the increasing true ER. But those experiments failed because:

1. The K transitions destroyed accumulated alpha information.
2. The K increments were too coarse (doubling K vs the gradual increase needed).

Paper 81's failure was one of implementation, not concept. An ascending K schedule matched to true ER dynamics is theoretically sound but requires smooth K transitions. This is Paper 88's hypothesis.

The K=64 Paradox (Open Question for Paper 87)

K=64 is massively over-parameterized relative to true gradient ER (K=64 ER=39 vs true ER 5–19). Yet K=64 wins by 0.697 loss units over K=32. This cannot be explained by ER alone.

Hypothesis (Paper 87): K=64’s advantage is SVD basis COVERAGE, not ER. The SVD basis of size K includes vectors that better align with the true gradient direction. With K=64, the projection of ∂L/∂W onto the K-basis has higher cosine similarity to the full gradient than K=32’s projection. Even if the true gradient has ER=10, K=64 captures a more ACCURATE direction within the 10-dimensional subspace.

Test: compute cos_sim(∂L/∂W, proj_K(∂L/∂W)) for K=8, 16, 32, 64. If K=64 cos_sim > K=32 cos_sim even when true ER < 32, this is the mechanism.
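The proposed test can be sketched directly; here the gradient is a random stand-in for the measured ∂L/∂W (an assumption), and proj_K is the top-K SVD truncation:

```python
# Sketch of the Paper 87 test: Frobenius cosine similarity between a
# gradient and its top-K SVD projection, for increasing K.
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(128, 128))  # random stand-in for dL/dW

def cos_after_topk(G, K):
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    G_K = (U[:, :K] * s[:K]) @ Vh[:K]  # rank-K (top-K SVD) projection of G
    return float(np.sum(G * G_K) / (np.linalg.norm(G) * np.linalg.norm(G_K)))

for K in (8, 16, 32, 64):
    print(K, round(cos_after_topk(G, K), 3))  # similarity rises with K
```

For this projection the similarity is monotone in K (it equals the square root of the captured fraction of Σσᵢ²); the interesting question for Paper 87 is how large the K=32 → K=64 gap is for real gradients whose true ER is well below 32.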


Updated K Selection Framework (Papers 81–86)

Phase of training:         True ER needed:    Optimal K:
Step 1–5 (initial)         5–11               K=8  (K=8 achieves 98% at step 1)
Step 5–15 (rising)         11–16              K=16 (transition zone)
Step 15–25 (peak)          17–20              K=32 (matches natural peak)
Step 25–40 (refinement)    9–15               K=12 (over-parameterized with K=32)

IDEAL schedule: K=8 → K=32 → K=16 with smooth transitions (Paper 88)
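The schedule above can be written as a step-to-K mapping; the thresholds below follow the table but are illustrative, and the smooth-transition logic is deferred to Paper 88.

```python
# Illustrative step-to-K schedule matched to the measured true-ER phases.
# Thresholds are assumptions; transitions would need to be smooth (Paper 88).
def k_schedule(step: int) -> int:
    if step <= 5:    # initial phase: true ER 5-11
        return 8
    if step <= 15:   # rising phase: true ER 11-16
        return 16
    if step <= 25:   # peak phase: true ER 17-20
        return 32
    return 16        # refinement phase: true ER 9-15 (K=12-16 range)
```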

But: K=64 still dominates at 40 steps despite being over-parameterized. This suggests an additional mechanism (SVD coverage, Paper 87) beyond ER matching.


New Gaps

Paper 87: SVD basis coverage — cos_sim(true gradient, K-projection) vs K.
Paper 88: True-ER-matched dynamic K schedule — ascending K=8→32→16 matched to true ER dynamics.


Summary Table (Papers 81–86 K Series)

Paper   Topic                  Key Finding
81      Dynamic K Schedule     Fixed K beats tested dynamic schedules
82      K Budget Curve         K=16 dead zone; K=64 best quality
83      Gradient ER            K-constrained ER=20; concluded K=32 natural (REVISED)
84      Adaptive K             Adaptive ≈ uniform; K=64 dominates by 0.697
85      K64 Noise Mechanism    K=64 dims are signal; true ER≥39 (REVISED AGAIN)
86      True Gradient ER       True ER=5→19→9 (dynamic); K=8 optimal at step 1; Paper 83 measured an artifact

The K series has revealed that gradient dimensionality is a DYNAMIC quantity, not a static one. Early training needs K=8; mid-training needs K=32; late training needs K=12–16. K=64 wins via a still-unexplained mechanism (likely SVD basis coverage, Paper 87), not ER matching.