Authors: John Mobley & MASCOM PhotonicMind
Date: 2026-03-07
Status: VALIDATED — per-type basis 2.4x better than universal; MLP 1.58x higher PR than attention
Experiment: mascom_data/ct_experiment/adaptive_k_scale_exp.py
Paper 76 showed that a K=32 universal basis gives R²=0.029 at TinyLlama 1.1B — nearly zero reconstruction quality despite 52.6% SFT efficiency. This paper asks: does per-type basis allocation improve reconstruction? Result: YES — per-type is 2.4x better than universal at K=32, but absolute R² remains low (3-17%).

The critical discovery is the spectral budget: capturing 90% of variance requires K≈1500 (73% of d_model=2048), making near-lossless CT compression impossible at this scale. Useful compression lives in the lossy regime (K=32-128, 10-40% variance, 16-64x compression). MLP layers have 1.58x higher PR than attention layers, confirming different spectral structures.
| Type | PR | K50 | K90 | R² (universal, K=32) | R² (per-type, K=32) | Gain |
|---|---|---|---|---|---|---|
| K (key proj) | 619.7 | 256 | 1070 | 0.110 | 0.139 | 1.3x |
| Q (query proj) | 1046.9 | 458 | 1517 | 0.041 | 0.091 | 2.2x |
| V (value proj) | 1193.1 | 434 | 1287 | 0.017 | 0.059 | 3.5x |
| O (output proj) | 1801.2 | 730 | 1648 | 0.015 | 0.031 | 2.1x |
| gate (MLP) | 1762.2 | 757 | 1714 | 0.024 | 0.048 | 2.0x |
| up (MLP) | 1903.8 | 817 | 1731 | 0.018 | 0.031 | 1.7x |
| down (MLP) | 1874.3 | 810 | 1730 | 0.017 | 0.032 | 1.9x |
| embed | 1477.1 | 651 | 1646 | 0.018 | 0.076 | 4.2x |
| head | 833.4 | 686 | 1655 | 0.025 | 0.167 | 6.7x |
K projection < head < Q < V < embed < gate < O < down ≈ up
Ordered by PR (compressibility). K-projections have the lowest PR (619.7) — they’re the most structured, most compressible. MLP up/down projections have the highest PR (~1900) — they’re the most diffuse.
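A minimal sketch of how PR and the K50/K90 columns can be computed from a weight matrix's singular values. This is a reconstruction from the standard definitions, not the experiment script; `spectral_stats` and the synthetic matrices are illustrative assumptions:

```python
import numpy as np

def spectral_stats(W, thresholds=(0.5, 0.9)):
    """PR and K-at-threshold from a weight matrix's singular-value spectrum.

    PR = (sum s_i^2)^2 / sum s_i^4 -- the effective number of significant
    directions. K at threshold t = smallest K whose top-K singular values
    capture a fraction t of total variance (the K50/K90 columns above).
    """
    s = np.linalg.svd(W, compute_uv=False)
    var = s ** 2
    pr = var.sum() ** 2 / (var ** 2).sum()
    cum = np.cumsum(var) / var.sum()
    return pr, {t: int(np.searchsorted(cum, t) + 1) for t in thresholds}

# A near-low-rank ("structured", K-proj-like) matrix vs an isotropic
# ("diffuse", MLP-like) one: PR separates them cleanly.
rng = np.random.default_rng(0)
d = 256
structured = (rng.standard_normal((d, 8)) @ rng.standard_normal((8, d))
              + 0.01 * rng.standard_normal((d, d)))
diffuse = rng.standard_normal((d, d))
pr_s, _ = spectral_stats(structured)
pr_d, _ = spectral_stats(diffuse)
```

On these toy inputs the near-rank-8 matrix lands at PR near its effective rank, while the isotropic Gaussian lands at roughly d/2, mirroring the K-proj vs MLP gap in the table.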
| Metric | Attention | MLP | Ratio (MLP/Attn) |
|---|---|---|---|
| PR avg | 1165 | 1847 | 1.58x |
| K90 avg | 1380 | 1725 | 1.25x |
| R²(type@32) | 0.080 | 0.037 | 0.46x |
MLP layers are 1.58x more dimensionally complex than attention layers. Attention concentrates variance in fewer directions — consistent with attention learning structured routing patterns while MLP distributes across many features.
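The attention and MLP rows above are plain averages over the per-type PR table; a quick check of the arithmetic:

```python
# PR values copied from the per-type table above.
pr = {"K": 619.7, "Q": 1046.9, "V": 1193.1, "O": 1801.2,
      "gate": 1762.2, "up": 1903.8, "down": 1874.3}
attn = sum(pr[t] for t in ("K", "Q", "V", "O")) / 4
mlp = sum(pr[t] for t in ("gate", "up", "down")) / 3
print(round(attn), round(mlp), round(mlp / attn, 2))  # 1165 1847 1.58
```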
| K | % of d | Variance | Compression | Use Case |
|---|---|---|---|---|
| 8 | 0.4% | ~3% | 256x | Fails at d=2048 |
| 32 | 1.6% | ~10% | 64x | SFT works (52.6% eff) |
| 64 | 3.1% | ~18% | 32x | Better SFT target |
| 128 | 6.3% | ~33% | 16x | Good for inference |
| 256 | 12.5% | ~50% | 8x | Half variance |
| 1024 | 50% | ~82% | 2x | Diminishing returns |
| 1500 | 73% | ~90% | 1.4x | Minimal compression |
At d=2048, near-lossless reconstruction (90% variance) requires ~1500 components — only 1.4x compression. Useful compression is inherently lossy.
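The "% of d" and compression columns are pure arithmetic on d_model = 2048: at rank K, the compression factor is simply d_model/K (ignoring storage of the shared basis itself). A sketch:

```python
# Reproduce the spectral-budget table's arithmetic columns.
d_model = 2048
rows = {k: (100 * k / d_model, d_model / k)
        for k in (8, 32, 64, 128, 256, 1024, 1500)}
for k, (pct, comp) in rows.items():
    print(f"K={k:4d}  {pct:5.1f}% of d  {comp:6.1f}x compression")
```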
Adaptive K90 uses 82% of the original parameters — almost no savings. This means the 90% variance threshold is the wrong criterion for compression; K should be set by task performance (e.g. SFT efficiency), not reconstruction fidelity.
At 1.1B scale, near-lossless CT compression requires K≈d*0.73, giving only 1.4x savings. All useful compression is lossy. This is fine for SFT (Paper 76 showed 52.6% efficiency at K=32, R²=0.029) but means CT at scale is fundamentally a lossy technique.
Using separate SVD bases for Q/K/V/O/gate/up/down gives 2.4x better reconstruction than a universal basis. At K=32, this means ~7% R² instead of ~3% — still low, but the gain is free (no extra parameters, just smarter basis choice).
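A toy sketch of the per-type vs universal comparison, using synthetic matrices and a simple left-singular-vector projection. `top_k_basis`, the shapes, and the noise level are illustrative assumptions, not the experiment code:

```python
import numpy as np

def r2(W, W_hat):
    """Reconstruction quality: 1 - residual variance / total variance."""
    return 1.0 - ((W - W_hat) ** 2).sum() / (W ** 2).sum()

def top_k_basis(M, k):
    """Top-k left singular vectors of M: the rank-k projection basis."""
    return np.linalg.svd(M, full_matrices=False)[0][:, :k]

rng = np.random.default_rng(1)
d, k = 128, 4
# Two toy weight "types" whose dominant 4-dim subspaces differ (as Q vs K do)
types = [rng.standard_normal((d, 4)) @ rng.standard_normal((4, d))
         + 0.3 * rng.standard_normal((d, d)) for _ in range(2)]

# Universal: one basis fit to all types stacked side by side
U_uni = top_k_basis(np.hstack(types), k)
r2_uni = float(np.mean([r2(W, U_uni @ (U_uni.T @ W)) for W in types]))

# Per-type: a separate basis per weight type, at the same K
per_type = []
for W in types:
    U = top_k_basis(W, k)
    per_type.append(r2(W, U @ (U.T @ W)))
r2_typ = float(np.mean(per_type))
```

Per-type wins at equal K because each type's top-K directions are optimal for its own spectrum (Eckart-Young), while a universal basis must split its K budget across subspaces that do not align.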
K-projections have the lowest PR (619.7) and highest per-component R² (0.139 at K=32). This suggests K-projections learn the most structured representations — possibly because key representations need to match specific query patterns.
MLP layers (PR≈1850) use ~2x more parameters than attention (22 layers × 3 matrices of size 2048×5632 vs 22 × 4 of size 2048×2048) AND have higher effective dimensionality. The compression bottleneck at scale is MLP, not attention. This is the opposite of the 10.2M scale, where attention was the bottleneck.
K should be set by task performance, not reconstruction fidelity. K=32 (R²=0.03) gives 52.6% SFT efficiency. The “right” K is wherever the SFT efficiency curve plateaus — likely around K=64-128 for 1.1B scale.
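One way to operationalize "set K by task performance" is to scan K and stop where the efficiency curve flattens. A hypothetical sketch: only the K=32 point (0.526) is a measured value from Paper 76; the rest of the curve is invented for illustration, shaped to plateau where the text predicts.

```python
def knee(ks, eff, min_gain=0.05):
    """Smallest K after which efficiency improves by less than min_gain."""
    for k, e0, e1 in zip(ks, eff, eff[1:]):
        if e1 - e0 < min_gain:
            return k
    return ks[-1]

ks  = [8, 32, 64, 128, 256]
eff = [0.21, 0.526, 0.61, 0.64, 0.65]  # hypothetical except K=32 -> 0.526
print(knee(ks, eff))  # 64
```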
“The key learns structure. The MLP forgets it. The model thrives in between.”