Date: 2026-03-07
Author: MASCOM doScience track
Enabled by: Paper 87 (Coverage ≈ 0.00205×K, r = -0.9963)
Status: HYPOTHESIS FALSIFIED — new critical discovery made
Paper 87 hypothesized a per-step coverage threshold (~0.05-0.07) to explain K=16’s dead zone at budget parity. We tested via gradient direction cosine similarity (step t vs t+1) and budget-matched loss curves.
Threshold hypothesis FALSIFIED: All K values (K=2 through K=64) show gradient cos_sim ≈ +1.0 — gradient directions are essentially identical across consecutive steps regardless of K. There is no coverage threshold for directional consistency.
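The consistency metric itself is simple; a minimal numpy sketch (function name illustrative, numpy standing in for the actual framework's gradient tensors):

```python
import numpy as np

def grad_cosine_similarity(g_t, g_t1):
    """Cosine similarity between two flattened gradient tensors
    (step t vs t+1); +1.0 means identical direction."""
    a, b = np.ravel(g_t), np.ravel(g_t1)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy check: a gradient and a slightly perturbed copy stay near +1.0
rng = np.random.default_rng(0)
g = rng.normal(size=(64, 64))
sim = grad_cosine_similarity(g, g + 0.01 * rng.normal(size=(64, 64)))
```

A value near +1.0 for every consecutive pair, at every K, is what falsifies the threshold hypothesis.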
K=16 dead zone DOES NOT APPEAR in budget-matched results with correct weight-tying: K=64 (6.7051) > K=32 (6.7678) > K=16 (6.8041) > K=8 (6.8745) — monotonically ordered.
Critical discovery: The K=16 dead zone from Paper 82 was an experimental artifact caused by using a randomly-initialized output head instead of the weight-tied tok_emb head. Papers 83+ fixed this bug. With the correct implementation, budget-matched quality is monotonically increasing with K.
| K | Per-Step Coverage | Cos Sim (t vs t+1) | Monotonic Steps | Final Loss |
|---|---|---|---|---|
| 2 | 0.003 | +1.0000 | 100.0% | 7.7271 |
| 4 | 0.008 | +0.9999 | 100.0% | 7.5878 |
| 8 | 0.017 | +0.9999 | 100.0% | 7.3700 |
| 16 | 0.033 | +0.9999 | 100.0% | 7.0093 |
| 32 | 0.068 | +0.9998 | 100.0% | 6.4183 |
| 64 | 0.131 | +0.9995 | 100.0% | 5.6759 |
All K values show cos_sim > 0.999. The gradient direction is essentially identical between consecutive steps for every K tested. 100% of steps are monotonically improving.
The coverage threshold hypothesis requires that low-K gradients would be directionally inconsistent (cos_sim ≈ 0). Instead, even K=2 (cov=0.003) has perfect directional consistency. The projected gradient, while small, always points in the correct direction.
Why the gradient is always consistent: In CT-LoRA, the alpha parameters are the only learnable variables. The gradient ∂L/∂alpha = U_K.T @ (∂L/∂W) @ Vh_K.T is computed exactly. Even a low-dimensional projection gives a consistent gradient direction — the imprecision is in MAGNITUDE (how much of the gradient is captured), not DIRECTION.
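The magnitude-vs-direction split can be seen in a small numpy sketch (random stand-ins for the frozen weight and the gradients; shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
# Fixed SVD basis from a frozen weight (random stand-in here)
U, _, Vh = np.linalg.svd(rng.normal(size=(d, d)))

# Two consecutive full gradients that are nearly parallel
g_t = rng.normal(size=(d, d))
g_t1 = g_t + 0.05 * rng.normal(size=(d, d))

results = {}
for K in (2, 16, 64):
    U_K, Vh_K = U[:, :K], Vh[:K, :]
    a_t = (U_K.T @ g_t @ Vh_K.T).ravel()    # dL/dalpha at step t
    a_t1 = (U_K.T @ g_t1 @ Vh_K.T).ravel()  # dL/dalpha at step t+1
    cos = float(a_t @ a_t1 / (np.linalg.norm(a_t) * np.linalg.norm(a_t1)))
    captured = float(np.linalg.norm(a_t) / np.linalg.norm(g_t))
    results[K] = (cos, captured)
# Step-to-step direction is consistent at every K;
# only the captured gradient magnitude grows with K.
```

Even at K=2 the projected gradients at steps t and t+1 stay nearly parallel; what shrinks with K is the fraction of the full gradient norm that the projection captures.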
Budget = 1280 K-steps (K × steps = constant)
| K | Steps | Total Coverage | Final Loss | Cos Sim |
|---|---|---|---|---|
| 2 | 640 | 1.92 | 7.3367 | +1.0000 |
| 4 | 320 | 2.56 | 7.0787 | +1.0000 |
| 8 | 160 | 2.72 | 6.8745 | +1.0000 |
| 16 | 80 | 2.64 | 6.8041 | +0.9999 |
| 32 | 40 | 2.72 | 6.7678 | +0.9997 |
| 64 | 20 | 2.62 | 6.7051 | +0.9993 |
Budget-matched ranking: K=64 > K=32 > K=16 > K=8 > K=4 > K=2
This is PERFECTLY MONOTONIC. K=16 (6.8041) is correctly ordered between K=8 (6.8745) and K=32 (6.7678). There is no K=16 dead zone.
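The schedule behind this table is just the budget constraint spelled out; as a sketch:

```python
# Budget parity: K * steps is held at 1280 "K-steps" for every run
BUDGET = 1280
schedule = {K: BUDGET // K for K in (2, 4, 8, 16, 32, 64)}
# -> {2: 640, 4: 320, 8: 160, 16: 80, 32: 40, 64: 20}
```

Every entry satisfies K × steps = 1280, matching the Steps column above.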
Paper 82 used the PhotonicGPT class to build the model. The model’s head.weight was randomly initialized (not tied to tok_emb.weight). This caused:
| Property | Paper 82 (random head) | Papers 83-88 (weight-tied) |
|---|---|---|
| Initial loss | ~9.9 | ~7.8 |
| Loss scale | High (~8.6 after 40 steps) | Low (~6.1 after 40 steps) |
| Gradient magnitude | Larger (from high loss) | Smaller (from lower loss) |
With a randomly-initialized head and high baseline loss, the gradient landscape is structurally different. The K=16 dead zone was real in that experimental context — but it was measuring adaptation quality for an incorrect model, not the true CT model.
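The fix itself is tiny; a minimal numpy sketch of the difference (class and attribute names illustrative, not the actual PhotonicGPT code; in PyTorch terms the tie is typically a single assignment like `head.weight = tok_emb.weight`):

```python
import numpy as np

rng = np.random.default_rng(42)
VOCAB, D = 100, 32

class TinyLM:
    """Minimal stand-in for the embedding / output-head pair."""
    def __init__(self, tie_weights):
        self.tok_emb = rng.normal(size=(VOCAB, D)) / np.sqrt(D)
        if tie_weights:
            self.head = self.tok_emb                 # shared storage (Papers 83+)
        else:
            self.head = rng.normal(size=(VOCAB, D))  # Paper 82's bug: independent init

tied, untied = TinyLM(True), TinyLM(False)
# With tying, any update to tok_emb is instantly visible through head
tied.tok_emb[0, 0] = 123.0
```

With the untied head, the output projection starts from random weights, which is what produced the ~9.9 initial loss and the different gradient landscape.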
The weight-tying fix is massive: comparing Paper 82 (K=64 → 8.6268) to Paper 88 (K=64 → 6.7051), the correct model reaches a loss 1.92 lower at K=64, a ~22% improvement in final loss from fixing one bug.
Implications for the K series:
- Papers 83-84’s absolute loss values (K=32: 6.82, K=64: 6.12) are the correct reference
- Papers 81-82’s conclusions about dynamic K and dead zones should be re-evaluated with the correct weight-tied model
From Phase B, the budget-matched results show K=64 wins by 0.063 over K=32, K=32 by 0.036 over K=16, and K=16 by 0.070 over K=8. The marginal quality gain from increasing K diminishes as K increases, but remains positive throughout.
The total coverage at budget parity is nearly constant:
- K=8: total_cov = 2.72
- K=16: total_cov = 2.64
- K=32: total_cov = 2.72
- K=64: total_cov = 2.62
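This near-constancy is expected: per-step coverage scales roughly linearly with K (Paper 87's ≈ 0.00205×K) while K × steps is fixed at 1280, so total coverage ≈ 0.00205 × 1280 ≈ 2.62 regardless of K. A quick check against the measured values:

```python
# Per-step coverage (single-step table) times budget-matched step counts
per_step = {2: 0.003, 4: 0.008, 8: 0.017, 16: 0.033, 32: 0.068, 64: 0.131}
steps = {2: 640, 4: 320, 8: 160, 16: 80, 32: 40, 64: 20}
total = {K: round(per_step[K] * steps[K], 2) for K in per_step}
# K=2 falls short of the ~2.62 linear prediction because its measured
# per-step coverage (0.003) sits below the 0.00205*K fit (0.0041).
```

For K=8 through K=64 the totals land in a narrow 2.62-2.72 band.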
Yet K=64 still wins despite similar total coverage. This means quality is NOT purely determined by total coverage: some per-step property of the coverage makes high-K updates more efficient per unit of total coverage.
New hypothesis (Paper 89): Higher-K updates are more informationally dense — they capture more of the gradient direction space in a single step, allowing the model to make progress in more directions simultaneously. Even with equal total coverage budget, K=64 makes higher-quality updates because each update covers a larger fraction of the true gradient direction.
The quality differential comes from the STRUCTURE of coverage, not just the amount.
With correct weight-tying (all Papers 83+):
Budget-matched quality: monotonically K=64 > K=32 > K=16 > K=8 > K=4 > K=2
No dead zones in correct model.
K selection (for weight-tied CT-LoRA):
- Param budget very tight → K=8 (most steps per param, 100% monotonic)
- Param budget moderate → K=32 (good balance)
- Param budget ample → K=64 (quality-maximizing)
- Extreme step budget (4×) → K=2 still wins (Paper 82; needs a re-test with the weight-tied model)
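As a hypothetical helper encoding this guide (the function and tier names are illustrative, not from the papers):

```python
def choose_k(param_budget: str) -> int:
    """Map a qualitative parameter-budget tier to the recommended K
    for weight-tied CT-LoRA. Caveat: under an extreme (4x) step
    budget, K=2 may still win, pending a weight-tied re-test."""
    recommendations = {"tight": 8, "moderate": 32, "ample": 64}
    return recommendations[param_budget]
```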
Note: Paper 82's K=16 dead zone DOES NOT EXIST in the correct implementation.
The ranking at budget parity is monotonically increasing with K.
Paper 82 needs re-evaluation: The K-budget curve with the correct weight-tied model should show monotonic K quality at budget parity, not a K=16 dead zone. Paper 82’s finding that K=2 wins at 4× budget may also be an artifact.
Paper 83’s “natural ER=20” is still an artifact (confirmed by Paper 86 showing true ER=5-19).
Papers 84-87 findings remain valid as they all used the weight-tied model.
Paper 89 hypothesis: K=64’s per-step coverage (0.131) allows the model to make progress in more gradient directions simultaneously, reducing path-dependence in optimization. With K=32 (coverage=0.068), consecutive updates must align with only half the gradient directions per step — the model must “take turns” addressing different gradient components over successive steps.
Test: measure cumulative gradient-direction diversity at equal budget: 40 steps of K=32 (40 updates × 0.068 coverage each) vs 20 steps of K=64 (20 updates × 0.131 coverage each). K=64 should show higher cumulative gradient-direction diversity.
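One way to operationalize that metric, as a sketch: stack the K-dimensional basis touched at each step and track the rank of their union. Random bases are used as stand-ins here (they saturate the space quickly; the open empirical question is how real per-step gradient bases behave):

```python
import numpy as np

rng = np.random.default_rng(7)
D = 64  # ambient gradient dimension (illustrative)

def cumulative_direction_rank(K, n_steps):
    """Rank of the union of the K-dim subspaces touched over n_steps."""
    bases = []
    for _ in range(n_steps):
        # Fresh orthonormal K-dim basis per step (stand-in for the per-step SVD)
        Q, _ = np.linalg.qr(rng.normal(size=(D, K)))
        bases.append(Q)
    return int(np.linalg.matrix_rank(np.concatenate(bases, axis=1)))

one_step_32 = cumulative_direction_rank(32, 1)
one_step_64 = cumulative_direction_rank(64, 1)
```

The Paper 89 prediction is that, with real gradients, the K=64 schedule accumulates covered directions faster per unit of budget than the K=32 schedule.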
| Paper | Finding | Status |
|---|---|---|
| 81 | Fixed K beats dynamic schedules | Real, but may be artifact |
| 82 | K=16 dead zone at budget parity | ARTIFACT — random head bug |
| 83 | K-constrained ER=20, natural K=32 | Artifact |
| 84 | K=64 beats K=32 by 0.697 | Confirmed (weight-tied) |
| 85 | K=64 dims are signal, not noise | Confirmed |
| 86 | True gradient ER dynamic (5→19→9) | Confirmed |
| 87 | Coverage = 0.00205K, SVD=random | Confirmed |
| 88 | K=16 dead zone is Paper 82 artifact; gradient always directionally consistent | Confirmed |
Collective insight: The K series from Papers 84-88 (weight-tied) shows that CT-LoRA quality is monotonically increasing with K at budget parity. K=64 is the current quality champion. The “dead zone” narrative from Paper 82 was incorrect. Quality is determined by per-step coverage, with K=64 providing the most coverage per step.