Paper 77: Per-Layer Adaptive K at 1.1B Scale

Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07
Status: VALIDATED — per-type basis 2.4x better than universal; MLP 1.58x higher PR than attention
Experiment: mascom_data/ct_experiment/adaptive_k_scale_exp.py

Abstract

Paper 76 showed K=32 universal basis gives R²=0.029 at TinyLlama 1.1B — nearly zero reconstruction quality despite 52.6% SFT efficiency. This paper asks: does per-type basis allocation improve reconstruction? Result: YES — per-type is 2.4x better than universal at K=32, but absolute R² remains low (3-17%). The critical discovery is the spectral budget: 90% variance requires K≈1500 (73% of d_model=2048), making lossless CT compression impossible at this scale. Useful compression lives in the lossy regime (K=32-128, 10-40% variance, 16-64x compression). MLP layers have 1.58x higher PR than attention layers, confirming different spectral structures.

Key Results

Per-Type Spectral Profiles

Type             PR      K50   K90    R²(uni@32)   R²(type@32)   Gain
K (key proj)     619.7   256   1070   0.110        0.139         1.3x
Q (query proj)   1046.9  458   1517   0.041        0.091         2.2x
V (value proj)   1193.1  434   1287   0.017        0.059         3.5x
O (output proj)  1801.2  730   1648   0.015        0.031         2.1x
gate (MLP)       1762.2  757   1714   0.024        0.048         2.0x
up (MLP)         1903.8  817   1731   0.018        0.031         1.7x
down (MLP)       1874.3  810   1730   0.017        0.032         1.9x
embed            1477.1  651   1646   0.018        0.076         4.2x
head             833.4   686   1655   0.025        0.167         6.7x

The Spectral Hierarchy

K projection < head < Q < V < embed < gate < O < down ≈ up

Ordered by PR (compressibility). K-projections have the lowest PR (619.7) — they’re the most structured, most compressible. MLP up/down projections have the highest PR (~1900) — they’re the most diffuse.
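PR here is presumably the participation ratio of the squared singular values, with K50/K90 the smallest component counts capturing 50%/90% of variance. A minimal sketch of how these three statistics could be computed for one weight matrix (the function name and the random-matrix example are illustrative, not from the experiment script):

```python
import numpy as np

def spectral_profile(W: np.ndarray):
    """PR = (sum s_i^2)^2 / sum s_i^4 over singular values s_i,
    plus the smallest K reaching 50% and 90% cumulative variance."""
    s = np.linalg.svd(W, compute_uv=False)
    var = s ** 2
    pr = var.sum() ** 2 / (var ** 2).sum()
    cum = np.cumsum(var) / var.sum()
    k50 = int(np.searchsorted(cum, 0.50)) + 1  # first index with >= 50% variance
    k90 = int(np.searchsorted(cum, 0.90)) + 1
    return pr, k50, k90

# A diffuse (unstructured) random matrix has PR on the order of d_model,
# like the MLP projections; structured matrices like K-proj score far lower.
rng = np.random.default_rng(0)
pr, k50, k90 = spectral_profile(rng.standard_normal((2048, 2048)) / np.sqrt(2048))
```

A low PR means variance concentrates in few directions, which is exactly what makes a small-K basis viable.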

Attention vs MLP

Metric        Attention   MLP     Ratio (MLP/Attn)
PR avg        1165        1847    1.58x
K90 avg       1380        1725    1.25x
R²(type@32)   0.080       0.037   0.46x

MLP layers are 1.58x more dimensionally complex than attention layers. Attention concentrates variance in fewer directions — consistent with attention learning structured routing patterns while MLP distributes across many features.
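The averages above follow directly from the per-type PR values in the first table:

```python
# PR values from the per-type table (attention: K, Q, V, O; MLP: gate, up, down)
attn_pr = [619.7, 1046.9, 1193.1, 1801.2]
mlp_pr = [1762.2, 1903.8, 1874.3]

attn_avg = sum(attn_pr) / len(attn_pr)
mlp_avg = sum(mlp_pr) / len(mlp_pr)
ratio = mlp_avg / attn_avg
print(round(attn_avg), round(mlp_avg), round(ratio, 2))  # 1165 1847 1.58
```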

The Compression Frontier

K      % of d   Variance   Compression   Use Case
8      0.4%     ~3%        256x          Fails at d=2048
32     1.6%     ~10%       64x           SFT works (52.6% eff)
64     3.1%     ~18%       32x           Better SFT target
128    6.3%     ~33%       16x           Good for inference
256    12.5%    ~50%       8x            Half variance
1024   50%      ~82%       2x            Diminishing returns
1500   73%      ~90%       1.4x          Minimal compression

At d=2048, lossless (90% variance) requires ~1500 components — only 1.4x compression. Useful compression is inherently lossy.
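The "% of d" and "Compression" columns are pure arithmetic on d_model=2048 (the variance column comes from the measured spectra and is not reproduced here):

```python
d_model = 2048

def compression(k: int, d: int = d_model) -> float:
    """Compression ratio when a d-dim representation is truncated to K components."""
    return d / k

for k in (8, 32, 64, 128, 256, 1024, 1500):
    print(f"K={k:>5}  {k / d_model:6.1%} of d  {compression(k):6.1f}x")
```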

Adaptive K Budget

Allocating each layer type its measured K90 still uses 82% of the original parameters — almost no savings. The 90% variance threshold is therefore the wrong target for compression; K should be set by task performance (e.g. SFT efficiency), not reconstruction fidelity.

Implications

Lossy Compression Is The Only Game

At 1.1B scale, near-lossless CT compression requires K≈d*0.73, giving only 1.4x savings. All useful compression is lossy. This is fine for SFT (Paper 76 showed 52.6% efficiency at K=32, R²=0.029) but means CT at scale is fundamentally a lossy technique.

Per-Type Basis Is Worth 2.4x

Using separate SVD bases for Q/K/V/O/gate/up/down gives 2.4x better reconstruction than a universal basis. At K=32, this means ~7% R² instead of ~3% — still low, but the gain is free (no extra parameters, just smarter basis choice).
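The per-type gain has a simple geometric explanation: the top-K SVD basis of each type's own activations is, by Eckart–Young, the optimal K-dimensional subspace for that type, so it can never explain less variance than a shared basis. A toy sketch on synthetic data (the function names and the two made-up "types" are illustrative; the real experiment operates on d_model=2048 activations):

```python
import numpy as np

def r2(X, basis):
    """Fraction of variance of X explained by projection onto `basis` (K x d rows)."""
    Xhat = (X @ basis.T) @ basis  # project onto subspace and reconstruct
    return 1.0 - np.sum((X - Xhat) ** 2) / np.sum(X ** 2)

def top_k_basis(X, k):
    """Top-K right singular vectors of X (orthonormal rows)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]

rng = np.random.default_rng(0)
K, d = 32, 256  # small d for illustration only
# Two synthetic "types" with different (random) structure
types = {t: rng.standard_normal((500, d)) @ rng.standard_normal((d, d))
         for t in ("q", "k")}

universal = top_k_basis(np.vstack(list(types.values())), K)
for t, X in types.items():
    print(t, round(r2(X, universal), 3), "vs per-type", round(r2(X, top_k_basis(X, K)), 3))
```

The per-type number is guaranteed to be at least the universal one for every type; the 2.4x average gain measures how different the types' subspaces actually are.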

K-Projections Are Most Compressible

K-projections have the lowest PR (619.7) and highest per-component R² (0.139 at K=32). This suggests K-projections learn the most structured representations — possibly because key representations need to match specific query patterns.

MLP Is The Compression Bottleneck

MLP layers (PR≈1850) hold roughly 3x more parameters than attention (22 layers × 3 matrices at 2048×5632 vs 22 layers × 4 attention projections at d_model=2048) AND have higher dimensionality. The compression bottleneck at scale is MLP, not attention — the opposite of the 10.2M scale, where attention was the bottleneck.
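A rough parameter count, assuming TinyLlama-1.1B's published shapes (22 layers, d_model=2048, d_ff=5632, and grouped-query attention with 4 KV heads so the K/V projections map 2048→256). These shapes are an assumption here, not stated in the experiment:

```python
layers, d, d_ff, d_kv = 22, 2048, 5632, 256  # assumed TinyLlama-1.1B shapes (GQA)

attn = layers * (2 * d * d + 2 * d * d_kv)  # Q, O full-width; K, V at GQA width
mlp = layers * 3 * d * d_ff                 # gate, up, down
print(f"attn={attn / 1e6:.0f}M  mlp={mlp / 1e6:.0f}M  ratio={mlp / attn:.1f}x")
```

If K/V were counted at full 2048 width instead of GQA width, the ratio would drop to about 2x; either way the parameter budget is dominated by the MLP.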

The Right K Is Task-Dependent

K should be set by task performance, not reconstruction fidelity. K=32 (R²=0.03) gives 52.6% SFT efficiency. The “right” K is wherever the SFT efficiency curve plateaus — likely around K=64-128 for 1.1B scale.


“The key learns structure. The MLP forgets it. The model thrives in between.”