Paper 77: Per-Layer Adaptive K at 1.1B Scale

Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07
Status: VALIDATED — per-type basis 2.4x better than universal; MLP 1.58x higher PR than attention
Experiment: mascom_data/ct_experiment/adaptive_k_scale_exp.py

Abstract

Paper 76 showed K=32 universal basis gives R²=0.029 at TinyLlama 1.1B — nearly zero reconstruction quality despite 52.6% SFT efficiency. This paper asks: does per-type basis allocation improve reconstruction? Result: YES — per-type is 2.4x better than universal at K=32, but absolute R² remains low (3-17%). The critical discovery is the spectral budget: 90% variance requires K≈1500 (73% of d_model=2048), making lossless CT compression impossible at this scale. Useful compression lives in the lossy regime (K=32-128, 10-40% variance, 16-64x compression). MLP layers have 1.58x higher PR than attention layers, confirming different spectral structures.

Key Results

Per-Type Spectral Profiles

Type             PR      K50   K90    R²(uni@32)   R²(type@32)   Gain
K (key proj)     619.7   256   1070   0.110        0.139         1.3x
Q (query proj)   1046.9  458   1517   0.041        0.091         2.2x
V (value proj)   1193.1  434   1287   0.017        0.059         3.5x
O (output proj)  1801.2  730   1648   0.015        0.031         2.1x
gate (MLP)       1762.2  757   1714   0.024        0.048         2.0x
up (MLP)         1903.8  817   1731   0.018        0.031         1.7x
down (MLP)       1874.3  810   1730   0.017        0.032         1.9x
embed            1477.1  651   1646   0.018        0.076         4.2x
head             833.4   686   1655   0.025        0.167         6.7x

The Spectral Hierarchy

K projection < head < Q < V < embed < gate < O < down ≈ up

Ordered by PR (compressibility). K-projections have the lowest PR (619.7) — they’re the most structured, most compressible. MLP up/down projections have the highest PR (~1900) — they’re the most diffuse.
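PR here is presumably the participation ratio of the squared singular values, with K50/K90 the smallest component counts capturing 50%/90% of variance. A minimal sketch of how these three statistics could be computed for one weight matrix (the function name and the random-matrix example are illustrative, not from the experiment script):

```python
import numpy as np

def spectral_profile(W: np.ndarray):
    """PR = (sum s_i^2)^2 / sum s_i^4 over singular values s_i,
    plus the smallest K reaching 50% and 90% cumulative variance."""
    s = np.linalg.svd(W, compute_uv=False)
    var = s ** 2
    pr = var.sum() ** 2 / (var ** 2).sum()
    cum = np.cumsum(var) / var.sum()
    k50 = int(np.searchsorted(cum, 0.50)) + 1  # first index with >= 50% variance
    k90 = int(np.searchsorted(cum, 0.90)) + 1
    return pr, k50, k90

# A diffuse (unstructured) random matrix has PR on the order of d_model,
# like the MLP projections; structured matrices like K-proj score far lower.
rng = np.random.default_rng(0)
pr, k50, k90 = spectral_profile(rng.standard_normal((2048, 2048)) / np.sqrt(2048))
```

A low PR means variance concentrates in few directions, which is exactly what makes a small-K basis viable.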

Attention vs MLP

Metric        Attention   MLP     Ratio (MLP/Attn)
PR avg        1165        1847    1.58x
K90 avg       1380        1725    1.25x
R²(type@32)   0.080       0.037   0.46x

MLP layers are 1.58x more dimensionally complex than attention layers. Attention concentrates variance in fewer directions — consistent with attention learning structured routing patterns while MLP distributes across many features.
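The averages above follow directly from the per-type PR values in the first table:

```python
# PR values from the per-type table (attention: K, Q, V, O; MLP: gate, up, down)
attn_pr = [619.7, 1046.9, 1193.1, 1801.2]
mlp_pr = [1762.2, 1903.8, 1874.3]

attn_avg = sum(attn_pr) / len(attn_pr)
mlp_avg = sum(mlp_pr) / len(mlp_pr)
ratio = mlp_avg / attn_avg
print(round(attn_avg), round(mlp_avg), round(ratio, 2))  # 1165 1847 1.58
```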

The Compression Frontier

K      % of d   Variance   Compression   Use Case
8      0.4%     ~3%        256x          Fails at d=2048
32     1.6%     ~10%       64x           SFT works (52.6% eff)
64     3.1%     ~18%       32x           Better SFT target
128    6.3%     ~33%       16x           Good for inference
256    12.5%    ~50%       8x            Half variance
1024   50%      ~82%       2x            Diminishing returns
1500   73%      ~90%       1.4x          Minimal compression

At d=2048, lossless (90% variance) requires ~1500 components — only 1.4x compression. Useful compression is inherently lossy.
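The "% of d" and "Compression" columns are pure arithmetic on d_model=2048 (the variance column comes from the measured spectra and is not reproduced here):

```python
d_model = 2048

def compression(k: int, d: int = d_model) -> float:
    """Compression ratio when a d-dim representation is truncated to K components."""
    return d / k

for k in (8, 32, 64, 128, 256, 1024, 1500):
    print(f"K={k:>5}  {k / d_model:6.1%} of d  {compression(k):6.1f}x")
```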

Adaptive K Budget

Allocating each layer type its measured K90 still uses 82% of the original parameters — almost no savings. The 90% variance threshold is therefore the wrong target for compression; K should be set by task performance (e.g. SFT efficiency), not reconstruction fidelity.

Implications

Lossy Compression Is The Only Game

At 1.1B scale, near-lossless CT compression requires K≈d*0.73, giving only 1.4x savings. All useful compression is lossy. This is fine for SFT (Paper 76 showed 52.6% efficiency at K=32, R²=0.029) but means CT at scale is fundamentally a lossy technique.

Per-Type Basis Is Worth 2.4x

Using separate SVD bases for Q/K/V/O/gate/up/down gives 2.4x better reconstruction than a universal basis. At K=32, this means ~7% R² instead of ~3% — still low, but the gain is free (no extra parameters, just smarter basis choice).
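The per-type gain has a simple geometric explanation: the top-K SVD basis of each type's own activations is, by Eckart–Young, the optimal K-dimensional subspace for that type, so it can never explain less variance than a shared basis. A toy sketch on synthetic data (the function names and the two made-up "types" are illustrative; the real experiment operates on d_model=2048 activations):

```python
import numpy as np

def r2(X, basis):
    """Fraction of variance of X explained by projection onto `basis` (K x d rows)."""
    Xhat = (X @ basis.T) @ basis  # project onto subspace and reconstruct
    return 1.0 - np.sum((X - Xhat) ** 2) / np.sum(X ** 2)

def top_k_basis(X, k):
    """Top-K right singular vectors of X (orthonormal rows)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]

rng = np.random.default_rng(0)
K, d = 32, 256  # small d for illustration only
# Two synthetic "types" with different (random) structure
types = {t: rng.standard_normal((500, d)) @ rng.standard_normal((d, d))
         for t in ("q", "k")}

universal = top_k_basis(np.vstack(list(types.values())), K)
for t, X in types.items():
    print(t, round(r2(X, universal), 3), "vs per-type", round(r2(X, top_k_basis(X, K)), 3))
```

The per-type number is guaranteed to be at least the universal one for every type; the 2.4x average gain measures how different the types' subspaces actually are.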

K-Projections Are Most Compressible

K-projections have the lowest PR (619.7) and highest per-component R² (0.139 at K=32). This suggests K-projections learn the most structured representations — possibly because key representations need to match specific query patterns.

MLP Is The Compression Bottleneck

MLP layers (PR≈1850) hold roughly 3x more parameters than attention (22 layers × 3 matrices at 2048×5632 vs 22 layers × 4 attention projections at d_model=2048) AND have higher dimensionality. The compression bottleneck at scale is MLP, not attention — the opposite of the 10.2M scale, where attention was the bottleneck.
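A rough parameter count, assuming TinyLlama-1.1B's published shapes (22 layers, d_model=2048, d_ff=5632, and grouped-query attention with 4 KV heads so the K/V projections map 2048→256). These shapes are an assumption here, not stated in the experiment:

```python
layers, d, d_ff, d_kv = 22, 2048, 5632, 256  # assumed TinyLlama-1.1B shapes (GQA)

attn = layers * (2 * d * d + 2 * d * d_kv)  # Q, O full-width; K, V at GQA width
mlp = layers * 3 * d * d_ff                 # gate, up, down
print(f"attn={attn / 1e6:.0f}M  mlp={mlp / 1e6:.0f}M  ratio={mlp / attn:.1f}x")
```

If K/V were counted at full 2048 width instead of GQA width, the ratio would drop to about 2x; either way the parameter budget is dominated by the MLP.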

The Right K Is Task-Dependent

K should be set by task performance, not reconstruction fidelity. K=32 (R²=0.03) gives 52.6% SFT efficiency. The “right” K is wherever the SFT efficiency curve plateaus — likely around K=64-128 for 1.1B scale.


“The key learns structure. The MLP forgets it. The model thrives in between.”