John Alexander Mobley MobCorp Research — March 2026
Understanding what neural language models learn remains a central question in deep learning. For static embedding methods, Levy & Goldberg (2014) provided a landmark answer: Word2Vec with skip-gram and negative sampling implicitly factorizes the shifted pointwise mutual information (PMI) matrix. This result, with over 8,000 citations, connected neural word embeddings to classical distributional semantics.
We ask the analogous question for transformers: what do weight-tied transformer embeddings converge to?
The MobiusKernel theory (Mobley, 2026a) establishes that for weight-tied transformers:
W = IFFT2( FFT2(D) · FFT2(K) )
where W = tok_emb @ tok_emb^T is the embedding similarity matrix, D = log(1 + cooccurrence) is the log-cooccurrence matrix (computable from the corpus in a single pass), and K is a near-circulant kernel matrix. The theory proves K is data-determined (cross-seed cosine 0.974) and nearly perfectly circulant (score 0.93–0.95), meaning K is fully specified by its first row k₀. The open question was: what is k₀?
We answer this question. The kernel k₀ is the first row of the spectral deconvolution of D^ε by D:
k₀ = first_row( Q · diag(λ^(ε-1)) · Q^T )
where (Q, λ) is the eigendecomposition of D and ε is any small positive constant. This formula achieves 0.967 cosine similarity with ground-truth kernels from trained models.
This result has three immediate implications:
Training-free weight initialization. Given a corpus, one can compute the converged embedding weights analytically, potentially replacing billions of gradient steps for the embedding layer (30–60% of parameters in small models).
Convergence diagnostic. The cosine similarity between a checkpoint’s extracted k₀ and the derived k₀ provides a principled measure of how close a model is to convergence, without a held-out validation set.
The Embedding Orthogonality Principle. At convergence, weight-tied transformer embeddings orthogonalize (W ≈ cI). All semantic relationships are captured by attention and feedforward layers, not by embedding geometry. This is fundamentally different from Word2Vec/GloVe, where the embedding space itself encodes similarity.
Word2Vec and PMI. Levy & Goldberg (2014) proved that skip-gram with negative sampling (SGNS) implicitly factorizes M_ij = PMI(w_i, w_j) - log(k), where k is the number of negative samples. Subsequent work (Li et al., 2015; Arora et al., 2016) connected this to random walk models on text.
GloVe. Pennington et al. (2014) explicitly optimized embeddings to factorize log(cooccurrence), making the connection to corpus statistics direct. Our D = log(1 + cooccurrence) is the smoothed version of this matrix.
Transformer embedding analysis. Ethayarajh (2019) showed that contextual representations in BERT occupy a narrow cone in embedding space, suggesting anisotropy. Bis et al. (2021) found that static token embeddings in transformers differ systematically from Word2Vec-style embeddings. Our result explains why: transformers drive their embeddings toward orthogonality, a fundamentally different convergence target.
Weight tying. Press & Wolf (2017) showed that tying input and output embeddings improves language model perplexity. Our analysis applies specifically to this architecture, where W = E E^T is well-defined.
Circulant structure in neural networks. Cheng et al. (2015) used circulant projections for model compression. We observe circulant structure emerging naturally in the kernel K, without architectural constraints.
We trained MiniGPT models — small weight-tied transformers with: - 2–3 layers, 2–4 attention heads - Vocabulary sizes V ∈ {50, 100, 150, 300} - Embedding dimensions d ∈ {16, 32, 48, 64, 96} - RoPE positional encoding, dropout 0.1 - Weight tying: tok_emb = lm_head
Models were trained on tokenized English Wikipedia text using AdamW (lr=3e-4, weight decay 0.01) for 20–500 epochs. We verified convergence by monitoring both cross-entropy loss and the circulant score of K (§3.3).
For each trained model: 1. Extract W = tok_emb @ tok_emb^T 2. Compute D = log(1 + cooccurrence) from the training corpus 3. Compute K_true = IFFT2(FFT2(W) / FFT2(D)) 4. Measure circulant score: median cosine similarity between K[i,:] and roll(K[0,:], i) across all rows 5. Extract k₀_true = K_true[0,:]
Models with circulant score < 0.85 were excluded as insufficiently converged.
We evaluated 14 candidate functions f() for deriving k₀ from D (Table 1).
Table 1. Candidate k₀ derivation functions and their cosine similarity with ground truth on converged models (V=100, d=64, 200 epochs).
| ID | Formula | cos(k₀_pred, k₀_true) |
|---|---|---|
| H1 | Spectral whitening: IFFT(1/√|FFT(d₀)|) | 0.220 |
| H2 | Row energy: ‖D_i‖₂ | 0.085 |
| H3 | Log frequency: -log(diag(D)) | 0.102 |
| H4 | Deconvolution prior: D^(α-1) | see §4 |
| H5 | PMI kernel: IFFT2(FFT2(PPMI)/FFT2(D)) | 0.204 |
| H6 | Spectral gap: SVD truncation | 0.171 |
| H7 | FFT low-pass filter | 0.065 |
| H8 | Mean row profile | 0.143 |
| H9 | Residual D - low_rank(D) | 0.092 |
| H10 | Eigenvalue autocorrelation | 0.078 |
| H11 | SVD-PPMI / D at various ranks | 0.186 |
| H12 | Power-law D^α (α ∈ [0.001, 1.0]) | 0.967 |
| H_cI | Identity deconvolution: d · IFFT2(1/FFT2(D)) | 0.247 |
| H_logD | Log-matrix deconvolution | -0.845 |
The decisive breakthrough came from H12 (power-law transformation). Scanning α from 1.0 down to 0.001 revealed that cosine similarity with ground truth increases monotonically as α approaches zero:
Table 2. Cosine similarity as a function of α for D^α deconvolution (V=100, d=64, seed=0, 200 epochs).
| α | cos(k₀_pred, k₀_true) |
|---|---|
| 1.000 | 0.189 |
| 0.500 | 0.412 |
| 0.100 | 0.801 |
| 0.050 | 0.831 |
| 0.010 | 0.965 |
| 0.005 | 0.966 |
| 0.003 | 0.967 |
| 0.001 | 0.966 |
Since D^α → I as α → 0, one might suspect the kernel is simply K = D^{-1} (i.e., W = I). This hypothesis is incorrect. Direct comparison:
k₀ from D^{-1} (identity deconv): std = 0.0007, cos = 0.247
k₀ from D^{0.003}: std = 0.0278, cos = 0.967
cosine between the two k₀ vectors: 0.247
The D^ε formula produces a k₀ vector with 40× larger variance, pointing in a substantially different direction from pure identity deconvolution.
Taylor expanding D^ε = exp(ε · logm(D)):
D^ε = I + ε · logm(D) + O(ε²)
Deconvolving by D:
K = IFFT2(FFT2(D^ε) / FFT2(D)) = IFFT2(1/FFT2(D)) + ε · IFFT2(FFT2(logm(D))/FFT2(D)) + O(ε²)
Testing each term separately:
Table 3. Decomposition of the D^ε formula into identity and log-matrix components.
| Component | cos with k₀_true |
|---|---|
| Pure D^{-1} (identity deconv) | 0.247 |
| Pure logm(D) deconv | -0.845 |
| I + 0.003 · logm(D) combined | 0.967 |
The log-matrix deconvolution term is anti-correlated with k₀_true. The identity term provides a weak positive baseline. The combined formula works because it takes the identity baseline and subtracts a small fraction of the log-matrix structure — performing 99.7% spectral inversion while retaining 0.3% of the original cooccurrence geometry.
In the eigenspace of D = Q Λ Q^T, the kernel eigenvalues are:
K_eigenvalues = Λ^(ε-1) = Λ^(-0.997)
| D eigenvalue λ | K eigenvalue λ^(-0.997) | Effect |
|---|---|---|
| 100 (strong cooccurrence) | 0.0102 | 98.9% suppressed |
| 10 | 0.1026 | 89.7% suppressed |
| 1 (neutral) | 1.0000 | Unchanged |
| 0.1 (weak cooccurrence) | 9.7412 | 9.7× amplified |
| 0.01 (rare) | 94.816 | 94.8× amplified |
The kernel nearly inverts the spectrum of D. Strong cooccurrence signals (the-of, is-a) are suppressed to near-zero. Rare cooccurrence patterns are amplified by two orders of magnitude. The result is near-complete spectral whitening of the language statistics.
Five independent training runs, V=100, d=64, 200 epochs:
Table 4. Cross-seed consistency of the D^ε formula.
| Seed | Final Loss | Circulant Score | cos(D^ε, k₀_true) | Best ε |
|---|---|---|---|---|
| 0 | 1.401 | 0.928 | 0.953 | 0.003 |
| 1 | 1.412 | 0.937 | 0.968 | 0.002 |
| 2 | 1.396 | 0.922 | 0.955 | 0.004 |
| 3 | 1.399 | 0.947 | 0.978 | 0.011 |
| 4 | 1.411 | 0.948 | 0.973 | 0.005 |
| Mean | 1.404 | 0.936 | 0.965 ± 0.010 |
The direction of k₀ is stable across a wide range of ε values:
Table 5. Mutual cosine similarity between k₀ vectors derived at different ε values.
| ε_A | ε_B | cos(k₀_A, k₀_B) |
|---|---|---|
| 0.001 | 0.003 | 0.9998 |
| 0.001 | 0.010 | 0.9955 |
| 0.001 | 0.020 | 0.9802 |
| 0.003 | 0.010 | 0.9973 |
| 0.003 | 0.020 | 0.9841 |
Any ε ∈ [0.001, 0.01] yields essentially the same k₀ direction (cos > 0.99).
Levy & Goldberg proved Word2Vec (SGNS) implicitly optimizes:
w_i · w_j = PMI(i,j) - log(k)
Our result concerns a different architecture (weight-tied transformers) and reveals a different convergence target. Direct comparison:
| Property | Word2Vec (SGNS) | Weight-Tied Transformers |
|---|---|---|
| Architecture | Bilinear (w · c) | Self-attention + FFN |
| Converges to | Shifted PMI matrix | D^ε spectral deconvolution |
| Embedding geometry | Encodes similarity | Orthogonalizes (W → cI) |
| PMI cosine | 1.0 (by construction) | 0.204 (poor fit) |
| D^ε cosine | Not applicable | 0.967 |
| Semantic structure | In embeddings | In attention layers |
The key difference: Word2Vec embeddings encode co-occurrence relationships. Transformer embeddings erase them, delegating semantic structure entirely to the attention mechanism. This explains why transformers generalize better than static embeddings — their token representations are decontextualized, with all context provided dynamically by attention.
Given corpus C and vocabulary V:
D = log(1 + cooccurrence(C, window=5))
k₀ = derive_k0(D, epsilon=0.003)
W = reconstruct_W(D, k₀) # W = D conv circ(k₀)
E = svd_sqrt(W) # E such that E @ E^T ≈ W
# Initialize tok_emb = lm_head = EThis initializes the embedding layer at its approximate convergence point. For a model with V=50,000 and d=768, the embedding layer contains 38.4M parameters (comparable to GPT-2 small’s total 124M). Skipping convergence of this layer could substantially accelerate training.
The cosine similarity between a checkpoint’s empirical k₀ and the analytically derived k₀ measures convergence without validation data:
| cos(k₀_empirical, k₀_derived) | Training Stage |
|---|---|
| -0.55 | Early (embeddings still random) |
| 0.00 | Pre-convergence |
| 0.50 | Mid-training |
| 0.90 | Near convergence |
| 0.95+ | Converged |
Our production PhotonicGPT (V=15,007, d=256, epoch 3) has cos = -0.548, confirming it is in early training. This provides a testable prediction: as training continues, this value will climb toward 0.95+.
At convergence, weight-tied transformers satisfy:
W = E E^T ≈ c · I + ε · D^{-1} · logm(D)
The dominant term is a scaled identity. Token embeddings become nearly orthogonal. All semantic relationships between tokens are captured exclusively by the attention and feedforward layers, not by embedding geometry. This is the Embedding Orthogonality Principle: for weight-tied transformers, embeddings converge to maximally distinct representations.
Scale. Our experiments use V ≤ 300. Production language models have V = 32K–256K. While the mathematical framework is scale-independent, numerical verification at production scale requires training models to full convergence at large V, which is computationally expensive.
Weight tying. The analysis assumes W = E E^T is well-defined (requires tok_emb = lm_head). Models without weight tying may have different convergence targets.
The ε parameter. While ε-insensitivity is empirically strong, we lack an analytical derivation of the optimal ε from model hyperparameters (learning rate, weight decay, initialization scale). This is an open theoretical question.
The 3.3% cosine gap. At cos = 0.967, the formula explains 96.7% of the kernel variance. The residual may encode training dynamics (learning rate schedule, batch composition) or finite-sample effects.
We have derived a closed-form function that predicts the converged kernel of weight-tied transformer embeddings directly from corpus statistics:
k₀ = first_row( IFFT2( FFT2(D^ε) / FFT2(D) ) )
This formula achieves 0.967 cosine similarity with ground truth, is robust across random seeds and ε values, and reveals that transformers perform near-complete spectral inversion of language co-occurrence statistics. The result extends the Levy & Goldberg (2014) connection between neural embeddings and corpus statistics from Word2Vec to transformers, revealing a fundamentally different convergence target: spectral deconvolution rather than PMI factorization.
The Embedding Orthogonality Principle — that weight-tied transformer embeddings orthogonalize at convergence — provides a new lens for understanding why transformers outperform static embeddings: by driving their token representations toward orthogonality, they force all semantic computation into attention layers, where it can be context-dependent.
Arora, S., Li, Y., Liang, Y., Ma, T., & Risteski, A. (2016). A latent variable model approach to PMI-based word embeddings. TACL, 4, 385–399.
Bis, D., Podkorytov, M., & Liu, X. (2021). Too much in common: Shifting of embeddings in transformer language models and its implications. NAACL.
Cheng, Y., Yu, F., Feris, R., Kumar, S., Choudhary, A., & Chang, S. (2015). An exploration of parameter redundancy in deep networks with circulant projections. ICCV.
Ethayarajh, K. (2019). How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. EMNLP.
Levy, O. & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. NeurIPS.
Mobley, J. A. (2026a). The MobiusKernel: Training-free weight derivation via circular convolution. MobCorp Research Technical Report.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. EMNLP.
Press, O. & Wolf, L. (2017). Using the output embedding to improve language models. EACL.
| Step | Claim | Evidence |
|---|---|---|
| 1 | W = D conv K is exact | Correlation = 1.0000, MSE < 10^{-6} |
| 2 | K is circulant | Score 0.93–0.95 on converged models |
| 3 | k₀ is data-determined | Cross-seed cosine = 0.974 |
| 4 | k₀ = D^(ε-1) spectral deconv | cos = 0.967 ± 0.010 (5 seeds) |
| 5 | ε-insensitive | Cross-ε cos > 0.98 for ε ∈ [0.001, 0.02] |
| 6 | logm(D) deconv is active signal | 40× magnitude vs identity term |
| 7 | Eigenvalues ∝ λ^(-0.997) | Near-complete spectral inversion |
import numpy as np
def derive_k0(D, epsilon=0.003):
"""Derive converged kernel k0 from corpus cooccurrence matrix.
Args:
D: V x V matrix, D = log(1 + cooccurrence)
epsilon: Small positive constant (default 0.003, insensitive)
Returns:
k0: V-length vector, first row of circulant kernel K
"""
eigenvalues, eigenvectors = np.linalg.eigh(D)
eigenvalues = np.maximum(eigenvalues, 1e-15)
k_eigenvalues = eigenvalues ** (epsilon - 1.0)
K = eigenvectors @ np.diag(k_eigenvalues) @ eigenvectors.T
return K[0]
def derive_k0_fft(D, epsilon=0.003):
"""FFT-based k0 derivation (O(V^2 log V), preferred for large V)."""
eigenvalues, eigenvectors = np.linalg.eigh(D)
eigenvalues = np.maximum(eigenvalues, 1e-15)
D_eps = eigenvectors @ np.diag(eigenvalues ** epsilon) @ eigenvectors.T
D_fft = np.fft.fft2(D)
D_fft_safe = np.where(np.abs(D_fft) > 1e-10, D_fft, 1e-10)
K = np.real(np.fft.ifft2(np.fft.fft2(D_eps) / D_fft_safe))
return K[0]
def training_free_embeddings(token_ids, vocab_size, n_embd,
window=5, epsilon=0.003):
"""Complete pipeline: corpus -> embedding matrix.
Returns E such that E @ E^T approximates the converged W.
"""
# Step 1: Cooccurrence
D = np.zeros((vocab_size, vocab_size))
for i in range(len(token_ids)):
for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
if i != j:
D[token_ids[i], token_ids[j]] += 1
D = np.log1p(D)
# Step 2: Kernel
k0 = derive_k0(D, epsilon)
# Step 3: W = D conv circ(k0)
K = np.zeros_like(D)
for i in range(vocab_size):
K[i] = np.roll(k0, i)
W = np.real(np.fft.ifft2(np.fft.fft2(D) * np.fft.fft2(K)))
# Step 4: E via truncated SVD
W = (W + W.T) / 2 # Ensure symmetric
eigenvalues, eigenvectors = np.linalg.eigh(W)
eigenvalues = np.maximum(eigenvalues, 0)
# Take top n_embd components
idx = np.argsort(eigenvalues)[::-1][:n_embd]
E = eigenvectors[:, idx] * np.sqrt(eigenvalues[idx])
return E, k0, DAll experiments can be reproduced with:
python3 test_k0_derivation.py --verify --vocab=100 --epochs=200Source code: test_k0_derivation.py (751 lines, MIT
license)