Paper 64: Progressive Layer Freezing

Authors: John Mobley & MASCOM PhotonicMind Date: 2026-03-07 Status: VALIDATED – individual redundancy confirmed, collective freezing fails Experiment: mascom_data/ct_experiment/progressive_freezing_exp.py

Abstract

Paper 63 showed L4 c_proj can be frozen with zero quality loss. This paper tests how many layers can be frozen simultaneously. Individual c_proj freezing costs < 1% per layer, but ALL c_proj frozen together costs 5.3%. Freezing a full layer costs 10.6%. The model exhibits HIGH INDIVIDUAL REDUNDANCY but LOW COLLECTIVE SUBSTITUTABILITY – each layer’s contribution is small but unique, and compensations don’t stack.

Key Results

Individual c_proj Freezing (1 at a time)

Layer Loss Delta Verdict
L1 -0.02% Redundant
L4 -0.20% Redundant
L6 -0.32% Redundant
L7 -0.29% Redundant
L0 +0.60% Near-redundant
L3 +0.89% Near-redundant

All individual c_proj freezes are < 1%. Five of eight actually IMPROVE performance (negative delta), suggesting slight overfitting in those projection layers.

Progressive Freezing

What’s Frozen Frozen % Loss Delta
1 c_proj 0.6% < 1%
ALL c_proj 5.2% +5.3%
c_proj + c_attn 20.7% +15.2%
All attention 20.7% +15.9%
Attn + MLP 62.1% +79.3%
Everything except embeddings 62.2% +81.2%
Everything except lm_head 100% +165.8%

Layer-by-Layer Progressive Freeze

Layers Frozen Loss Delta
0 2.22 -0.6%
1 2.47 +10.6%
2 2.77 +23.8%
4 3.14 +40.4%
8 4.02 +79.9%

Breaking point: freezing even 1 complete layer exceeds 5% degradation.

The Individual-vs-Collective Paradox

Each c_proj contributes < 1% individually, but ALL together contribute 5.3%. This means:

This is analogous to how removing any single letter from the alphabet barely hurts communication, but removing 8 letters makes text unreadable.

Implications

For Training Efficiency

For Effective Parameters

For the Crystallization Boundary

Individual weight matrices CAN be crystallized (Paper 63), but the model needs most of its matrices trainable. The crystallization boundary is: - Per-matrix: any single matrix can be frozen (proven) - Collective: must train >= 95% of matrices (proven) - Sweet spot: freeze 3-4 c_proj layers (verified: L1, L4, L6, L7 = best candidates)


“Each brick can be removed and the wall stands. Remove half the bricks and the wall falls.”