Spectral Deconvolution Predicts Converged Transformer Embeddings

Understanding what neural language models learn remains a central question in deep learning. For static embedding methods, Levy & Goldberg (2014) provided a landmark answer: Word2Vec with skip-gram and negative sampling implicitly factorizes the shifted pointwise mutual information (PMI) matrix. This result, with over 8,000 citations, connected neural word embeddings to classical distributional semantics.

We ask the analogous question for transformers: what do weight-tied transformer embeddings converge to?

The MobiusKernel theory (Mobley, 2026a) establishes that for weight-tied transformers:

where W = tok_emb @ tok_emb^T is the embedding similarity matrix, D = log(1 + cooccurrence) is the log-cooccurrence matrix (computable from the corpus in a single pass), and K is a near-circulant kernel matrix. The theory proves K is data-determined (cross-seed cosine 0.974) and nearly perfectly circulant (score 0.93–0.95), meaning K is fully specified by its first row k₀. The open question was: what is k₀?

We answer this question. The kernel k₀ is the first row of the spectral deconvolution of D^ε by D:

where (Q, λ) is the eigendecomposition of D and ε is any small positive constant. This formula achieves 0.967 cosine similarity with ground-truth kernels from trained models.

1.1 Significance

2. Background and Related Work

Word2Vec and PMI. Levy & Goldberg (2014) proved that skip-gram with negative sampling (SGNS) implicitly factorizes M_ij = PMI(w_i, w_j) - log(k), where k is the number of negative samples. Subsequent work (Li et al., 2015; Arora et al., 2016) connected this to random walk models on text.

GloVe. Pennington et al. (2014) explicitly optimized embeddings to factorize log(cooccurrence), making the connection to corpus statistics direct. Our D = log(1 + cooccurrence) is the smoothed version of this matrix.

Transformer embedding analysis. Ethayarajh (2019) showed that contextual representations in BERT occupy a narrow cone in embedding space, suggesting anisotropy. Bis et al. (2021) found that static token embeddings in transformers differ systematically from Word2Vec-style embeddings. Our result explains why: transformers drive their embeddings toward orthogonality, a fundamentally different convergence target.

Weight tying. Press & Wolf (2017) showed that tying input and output embeddings improves language model perplexity. Our analysis applies specifically to this architecture, where W = E E^T is well-defined.

Circulant structure in neural networks. Cheng et al. (2015) used circulant projections for model compression. We observe circulant structure emerging naturally in the kernel K, without architectural constraints.

3. Experimental Setup

3.1 Model Architecture

We trained MiniGPT models — small weight-tied transformers with: - 2–3 layers, 2–4 attention heads - Vocabulary sizes V ∈ {50, 100, 150, 300} - Embedding dimensions d ∈ {16, 32, 48, 64, 96} - RoPE positional encoding, dropout 0.1 - Weight tying: tok_emb = lm_head

3.2 Training Protocol

Models were trained on tokenized English Wikipedia text using AdamW (lr=3e-4, weight decay 0.01) for 20–500 epochs. We verified convergence by monitoring both cross-entropy loss and the circulant score of K (§3.3).

3.3 Kernel Extraction and Validation

For each trained model: 1. Extract W = tok_emb @ tok_emb^T 2. Compute D = log(1 + cooccurrence) from the training corpus 3. Compute K_true = IFFT2(FFT2(W) / FFT2(D)) 4. Measure circulant score: median cosine similarity between K[i,:] and roll(K[0,:], i) across all rows 5. Extract k₀_true = K_true[0,:]

3.4 Candidate Functions

Table 1. Candidate k₀ derivation functions and their cosine similarity with ground truth on converged models (V=100, d=64, 200 epochs).

4. The Discovery: Spectral Deconvolution

4.1 D^α with α → 0

ID	Formula	cos(k₀_pred, k₀_true)
H1	Spectral whitening: IFFT(1/√\|FFT(d₀)\|)	0.220
H2	Row energy: ‖D_i‖₂	0.085
H3	Log frequency: -log(diag(D))	0.102
H4	Deconvolution prior: D^(α-1)	see §4
H5	PMI kernel: IFFT2(FFT2(PPMI)/FFT2(D))	0.204
H6	Spectral gap: SVD truncation	0.171
H7	FFT low-pass filter	0.065
H8	Mean row profile	0.143
H9	Residual D - low_rank(D)	0.092
H10	Eigenvalue autocorrelation	0.078
H11	SVD-PPMI / D at various ranks	0.186
H12	Power-law D^α (α ∈ [0.001, 1.0])	0.967
H_cI	Identity deconvolution: d · IFFT2(1/FFT2(D))	0.247
H_logD	Log-matrix deconvolution	-0.845

The decisive breakthrough came from H12 (power-law transformation). Scanning α from 1.0 down to 0.001 revealed that cosine similarity with ground truth increases monotonically as α approaches zero:

Table 2. Cosine similarity as a function of α for D^α deconvolution (V=100, d=64, seed=0, 200 epochs).

4.2 Disambiguating from the Identity

Since D^α → I as α → 0, one might suspect the kernel is simply K = D^{-1} (i.e., W = I). This hypothesis is incorrect. Direct comparison:

The D^ε formula produces a k₀ vector with 40× larger variance, pointing in a substantially different direction from pure identity deconvolution.

4.3 Taylor Expansion Reveals the Active Signal

α	cos(k₀_pred, k₀_true)
1.000	0.189
0.500	0.412
0.100	0.801
0.050	0.831
0.010	0.965
0.005	0.966
0.003	0.967
0.001	0.966

Table 3. Decomposition of the D^ε formula into identity and log-matrix components.

Component	cos with k₀_true
Pure D^{-1} (identity deconv)	0.247
Pure logm(D) deconv	-0.845
I + 0.003 · logm(D) combined	0.967

The log-matrix deconvolution term is anti-correlated with k₀_true. The identity term provides a weak positive baseline. The combined formula works because it takes the identity baseline and subtracts a small fraction of the log-matrix structure — performing 99.7% spectral inversion while retaining 0.3% of the original cooccurrence geometry.

4.4 The Eigenvalue Interpretation

D eigenvalue λ	K eigenvalue λ^(-0.997)	Effect
100 (strong cooccurrence)	0.0102	98.9% suppressed
10	0.1026	89.7% suppressed
1 (neutral)	1.0000	Unchanged
0.1 (weak cooccurrence)	9.7412	9.7× amplified
0.01 (rare)	94.816	94.8× amplified

The kernel nearly inverts the spectrum of D. Strong cooccurrence signals (the-of, is-a) are suppressed to near-zero. Rare cooccurrence patterns are amplified by two orders of magnitude. The result is near-complete spectral whitening of the language statistics.

5. Robustness

5.1 Stability Across Random Seeds

5.2 ε-Insensitivity

Seed	Final Loss	Circulant Score	cos(D^ε, k₀_true)	Best ε
0	1.401	0.928	0.953	0.003
1	1.412	0.937	0.968	0.002
2	1.396	0.922	0.955	0.004
3	1.399	0.947	0.978	0.011
4	1.411	0.948	0.973	0.005
Mean	1.404	0.936	0.965 ± 0.010

Table 5. Mutual cosine similarity between k₀ vectors derived at different ε values.

Any ε ∈ [0.001, 0.01] yields essentially the same k₀ direction (cos > 0.99).

6. Relationship to Levy & Goldberg (2014)

Our result concerns a different architecture (weight-tied transformers) and reveals a different convergence target. Direct comparison:

ε_A	ε_B	cos(k₀_A, k₀_B)
0.001	0.003	0.9998
0.001	0.010	0.9955
0.001	0.020	0.9802
0.003	0.010	0.9973
0.003	0.020	0.9841

Property	Word2Vec (SGNS)	Weight-Tied Transformers
Architecture	Bilinear (w · c)	Self-attention + FFN
Converges to	Shifted PMI matrix	D^ε spectral deconvolution
Embedding geometry	Encodes similarity	Orthogonalizes (W → cI)
PMI cosine	1.0 (by construction)	0.204 (poor fit)
D^ε cosine	Not applicable	0.967
Semantic structure	In embeddings	In attention layers

The key difference: Word2Vec embeddings encode co-occurrence relationships. Transformer embeddings erase them, delegating semantic structure entirely to the attention mechanism. This explains why transformers generalize better than static embeddings — their token representations are decontextualized, with all context provided dynamically by attention.

7. Implications and Applications

7.1 Training-Free Embedding Initialization

D = log(1 + cooccurrence(C, window=5))
k₀ = derive_k0(D, epsilon=0.003)
W = reconstruct_W(D, k₀)  # W = D conv circ(k₀)
E = svd_sqrt(W)            # E such that E @ E^T ≈ W
# Initialize tok_emb = lm_head = E

This initializes the embedding layer at its approximate convergence point. For a model with V=50,000 and d=768, the embedding layer contains 38.4M parameters (comparable to GPT-2 small’s total 124M). Skipping convergence of this layer could substantially accelerate training.

7.2 Convergence Diagnostic

The cosine similarity between a checkpoint’s empirical k₀ and the analytically derived k₀ measures convergence without validation data:

cos(k₀_empirical, k₀_derived)	Training Stage
-0.55	Early (embeddings still random)
0.00	Pre-convergence
0.50	Mid-training
0.90	Near convergence
0.95+	Converged

Our production PhotonicGPT (V=15,007, d=256, epoch 3) has cos = -0.548, confirming it is in early training. This provides a testable prediction: as training continues, this value will climb toward 0.95+.

7.3 The Embedding Orthogonality Principle

The dominant term is a scaled identity. Token embeddings become nearly orthogonal. All semantic relationships between tokens are captured exclusively by the attention and feedforward layers, not by embedding geometry. This is the Embedding Orthogonality Principle: for weight-tied transformers, embeddings converge to maximally distinct representations.

8. Limitations

9. Conclusion

We have derived a closed-form function that predicts the converged kernel of weight-tied transformer embeddings directly from corpus statistics:

This formula achieves 0.967 cosine similarity with ground truth, is robust across random seeds and ε values, and reveals that transformers perform near-complete spectral inversion of language co-occurrence statistics. The result extends the Levy & Goldberg (2014) connection between neural embeddings and corpus statistics from Word2Vec to transformers, revealing a fundamentally different convergence target: spectral deconvolution rather than PMI factorization.

The Embedding Orthogonality Principle — that weight-tied transformer embeddings orthogonalize at convergence — provides a new lens for understanding why transformers outperform static embeddings: by driving their token representations toward orthogonality, they force all semantic computation into attention layers, where it can be context-dependent.

References

Arora, S., Li, Y., Liang, Y., Ma, T., & Risteski, A. (2016). A latent variable model approach to PMI-based word embeddings. TACL, 4, 385–399.

Bis, D., Podkorytov, M., & Liu, X. (2021). Too much in common: Shifting of embeddings in transformer language models and its implications. NAACL.

Cheng, Y., Yu, F., Feris, R., Kumar, S., Choudhary, A., & Chang, S. (2015). An exploration of parameter redundancy in deep networks with circulant projections. ICCV.

Ethayarajh, K. (2019). How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. EMNLP.

Levy, O. & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. NeurIPS.

Mobley, J. A. (2026a). The MobiusKernel: Training-free weight derivation via circular convolution. MobCorp Research Technical Report.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. EMNLP.

Press, O. & Wolf, L. (2017). Using the output embedding to improve language models. EACL.

Appendix A: Complete Evidence Chain

Appendix B: Implementation

Step	Claim	Evidence
1	W = D conv K is exact	Correlation = 1.0000, MSE < 10^{-6}
2	K is circulant	Score 0.93–0.95 on converged models
3	k₀ is data-determined	Cross-seed cosine = 0.974
4	k₀ = D^(ε-1) spectral deconv	cos = 0.967 ± 0.010 (5 seeds)
5	ε-insensitive	Cross-ε cos > 0.98 for ε ∈ [0.001, 0.02]
6	logm(D) deconv is active signal	40× magnitude vs identity term
7	Eigenvalues ∝ λ^(-0.997)	Near-complete spectral inversion

import numpy as np

def derive_k0(D, epsilon=0.003):
    """Derive converged kernel k0 from corpus cooccurrence matrix.

    Args:
        D: V x V matrix, D = log(1 + cooccurrence)
        epsilon: Small positive constant (default 0.003, insensitive)

    Returns:
        k0: V-length vector, first row of circulant kernel K
    """
    eigenvalues, eigenvectors = np.linalg.eigh(D)
    eigenvalues = np.maximum(eigenvalues, 1e-15)
    k_eigenvalues = eigenvalues ** (epsilon - 1.0)
    K = eigenvectors @ np.diag(k_eigenvalues) @ eigenvectors.T
    return K[0]


def derive_k0_fft(D, epsilon=0.003):
    """FFT-based k0 derivation (O(V^2 log V), preferred for large V)."""
    eigenvalues, eigenvectors = np.linalg.eigh(D)
    eigenvalues = np.maximum(eigenvalues, 1e-15)
    D_eps = eigenvectors @ np.diag(eigenvalues ** epsilon) @ eigenvectors.T
    D_fft = np.fft.fft2(D)
    D_fft_safe = np.where(np.abs(D_fft) > 1e-10, D_fft, 1e-10)
    K = np.real(np.fft.ifft2(np.fft.fft2(D_eps) / D_fft_safe))
    return K[0]


def training_free_embeddings(token_ids, vocab_size, n_embd,
                              window=5, epsilon=0.003):
    """Complete pipeline: corpus -> embedding matrix.

    Returns E such that E @ E^T approximates the converged W.
    """
    # Step 1: Cooccurrence
    D = np.zeros((vocab_size, vocab_size))
    for i in range(len(token_ids)):
        for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
            if i != j:
                D[token_ids[i], token_ids[j]] += 1
    D = np.log1p(D)

    # Step 2: Kernel
    k0 = derive_k0(D, epsilon)

    # Step 3: W = D conv circ(k0)
    K = np.zeros_like(D)
    for i in range(vocab_size):
        K[i] = np.roll(k0, i)
    W = np.real(np.fft.ifft2(np.fft.fft2(D) * np.fft.fft2(K)))

    # Step 4: E via truncated SVD
    W = (W + W.T) / 2  # Ensure symmetric
    eigenvalues, eigenvectors = np.linalg.eigh(W)
    eigenvalues = np.maximum(eigenvalues, 0)
    # Take top n_embd components
    idx = np.argsort(eigenvalues)[::-1][:n_embd]
    E = eigenvectors[:, idx] * np.sqrt(eigenvalues[idx])

    return E, k0, D

1. Introduction