TL;DR
A small-scale reproduction of DeepSeek's manifold-constrained Hyper-Connections (mHC) shows that unconstrained mixing matrices can amplify signals, compounding across layers into unstable training at scale. Constraining those matrices to be doubly stochastic via the Sinkhorn–Knopp procedure stabilizes signal magnitude but can impose a performance cost at small model sizes.
What happened
The author reproduced DeepSeek’s Hyper-Connections (HC) idea on a GPT-2–style model of roughly 10M parameters trained on TinyShakespeare. HC expands the single residual stream into n parallel streams mixed by three learned matrices (H_res, H_pre, H_post). Unconstrained mixing matrices can amplify signals; in the reproduction HC showed early amplification (peaking near 9.2× at certain depths) and dramatic seed-dependent variance. DeepSeek reported much larger amplification at scale (peaks around 3,000× at 27B parameters). To prevent explosion, DeepSeek projects the learned matrices onto the manifold of doubly stochastic matrices using the Sinkhorn–Knopp algorithm (iterated row/column normalization after exponentiation). In the 10M experiments, the manifold-constrained variant (mHC) held Amax at 1.0 across runs while HC achieved lower mean validation loss but with much higher variance and amplification. The author also swept depth (6–24 layers) and observed nonmonotonic effects: depth 20 yielded best validation loss, while depth 24 regressed due to a width bottleneck.
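To make the mixing concrete, here is a minimal PyTorch sketch of an unconstrained Hyper-Connection wrapper around a transformer sublayer. The class, shapes, and parameter initialization are illustrative assumptions rather than the author's or DeepSeek's code; the point is only how H_pre collapses the n streams into a single layer input, H_res mixes the streams on the residual path, and H_post spreads the layer output back across the streams.

```python
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    """Illustrative sketch: n parallel residual streams mixed by three learned matrices."""

    def __init__(self, n_streams: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer                                  # e.g. an attention or MLP block
        self.H_res = nn.Parameter(torch.eye(n_streams))           # stream-to-stream residual mixing
        self.H_pre = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))   # collapse streams into the layer input
        self.H_post = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # distribute the layer output back

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (batch, seq, n_streams, dim)
        x_in = torch.einsum("bsnd,n->bsd", streams, self.H_pre)        # weighted combination of streams
        y = self.sublayer(x_in)                                        # ordinary transformer computation
        residual = torch.einsum("bsnd,nm->bsmd", streams, self.H_res)  # unconstrained mixing: can amplify
        return residual + self.H_post.view(1, 1, -1, 1) * y.unsqueeze(2)
```

Because nothing constrains H_res, repeated application across layers can grow the stream magnitudes, which is exactly the amplification the experiments track with Amax.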
Why it matters
- Unconstrained mixing matrices compound small amplifications into catastrophic signal explosion at large parameter scale, creating a scaling risk for more expressive residual designs.
- A principled constraint (doubly stochastic projection) enforces signal conservation and eliminates large, seed-dependent variance across runs.
- There is a trade-off between expressivity and robustness: HC can give better small-scale loss but is fragile; mHC adds stability that appears necessary at high parameter counts.
- Residual connections act like a conservation law in deep nets—preserving or bounding signal magnitude matters for both training stability and reproducibility.
Key facts
- Reproduction used a GPT-2–style architecture of ~10M parameters on TinyShakespeare (~1M characters), trained for 5,000 steps with AdamW (β1=0.9, β2=0.95), weight decay 0.1, and cosine LR decay on Apple M-series hardware (MPS backend); see the training-setup sketch after this list.
- HC mixes streams with three learned matrices: H_res (residual path), H_pre (pre-layer mix), H_post (post-layer distribution).
- Unconstrained HC showed amplification up to ~9.2× in the 10M reproduction; DeepSeek reported Amax peaks around 3,000× at 27B parameters.
- mHC constrains the mixing matrices to be doubly stochastic via the Sinkhorn–Knopp procedure (exponentiate entries to make them positive, then alternately normalize rows and columns), so the mixing acts as weighted averaging and cannot amplify signals; see the Sinkhorn sketch after this list.
- Only H_res required the full Sinkhorn doubly stochastic treatment; H_pre and H_post were bounded via a sigmoid in the reproduction.
- Across three seeds at depth 24: HC reached validation loss 0.884 ± 0.033 with maximum Amax 6.77 ± 0.60; mHC reached 1.116 ± 0.012 with maximum Amax held at exactly 1.00 ± 0.00.
- Depth sweep (6–24 layers) held parameters roughly constant at ~11M: validation loss improved up to depth 20 (0.85), then regressed at depth 24 (0.93) because the fixed budget forced the model width down to 192, creating a bottleneck.
- Seed values used for variation tests were 42, 123, and 456.
- Sinkhorn projections are differentiable; gradients backpropagate through the normalization iterations (the author found ~20 iterations sufficient).
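A minimal sketch of the Sinkhorn–Knopp projection described above, assuming a plain PyTorch implementation with the ~20 iterations the author reports; the function name and signature are mine, not the author's.

```python
import torch

def sinkhorn_project(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Project a square matrix of logits onto the doubly stochastic manifold.

    Exponentiate to make every entry positive, then alternately normalize
    rows and columns. Every step is differentiable, so gradients flow back
    into the underlying parameters.
    """
    M = torch.exp(logits)
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)  # rows sum to 1
        M = M / M.sum(dim=0, keepdim=True)  # columns sum to 1
    return M

H_res = sinkhorn_project(torch.randn(4, 4, requires_grad=True))
print(H_res.sum(dim=1))  # each row sums to ~1
print(H_res.sum(dim=0))  # each column sums to ~1
```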
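And, for completeness, a sketch of the reported training setup; the learning rate and the stand-in model are assumptions, since the source lists only the optimizer hyperparameters, weight decay, schedule, and step count.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 64)  # stand-in for the ~10M-parameter GPT-2-style model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                 # assumed; the source does not report the peak learning rate
    betas=(0.9, 0.95),       # reported
    weight_decay=0.1,        # reported
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5_000)  # cosine decay over 5,000 steps
```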
What to watch next
- Part 2 of the series will scale experiments toward 1B parameters on A100s using the C4 dataset to probe where instability emerges.
- Author indicates code will be released alongside Part 2.
- Whether mHC’s stability-preserving constraint retains competitive performance on large, real-world datasets and downstream tasks is not confirmed in the source.
Quick glossary
- Residual connection: A shortcut that adds a layer’s input to its output (x + F(x)), helping gradients flow and enabling very deep networks.
- Hyper-Connection (HC): A residual design that routes information across multiple parallel streams using learned mixing matrices instead of a single identity shortcut.
- Doubly stochastic matrix: A nonnegative square matrix whose rows and columns each sum to one; it performs weighted averaging and cannot amplify vector norms.
- Sinkhorn–Knopp algorithm: An iterative procedure that alternately normalizes rows and columns (after making entries positive) to project a matrix onto the doubly stochastic manifold.
- Amax: The maximum of the absolute row and column sums of a mixing matrix, used in the experiments to gauge how much the matrix can amplify signals.
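To pin down the Amax entry, here is a small sketch of how the metric could be computed; reading "absolute row and column sums" as sums of absolute entries is my assumption about the exact definition.

```python
import torch

def amax(H: torch.Tensor) -> torch.Tensor:
    """Maximum absolute row/column sum of a mixing matrix (assumed reading of Amax)."""
    row_sums = H.abs().sum(dim=1)   # per-row sum of absolute entries
    col_sums = H.abs().sum(dim=0)   # per-column sum of absolute entries
    return torch.maximum(row_sums.max(), col_sums.max())

# For a doubly stochastic matrix every row and column sums to 1, so Amax == 1;
# an unconstrained matrix can push this well above 1 (up to ~9.2x in the reproduction).
```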
Reader FAQ
What is mHC?
mHC is Hyper-Connections with a manifold constraint: mixing matrices are projected to be doubly stochastic (via Sinkhorn) so they route and blend streams without amplifying signals.
Does mHC hurt performance?
At the reproduced 10M parameter scale, mHC had higher mean validation loss than unconstrained HC but produced far lower variance and eliminated amplification; the effect at large scale is the subject of Part 2.
Why apply Sinkhorn only to H_res?
The source notes that H_res compounds across layers and therefore needs the full doubly stochastic projection, while the input/output mixers (H_pre, H_post) are simply bounded with a sigmoid in the reproduction.
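A hypothetical parametrization matching that split; the names and shapes are illustrative, and sinkhorn_project mirrors the sketch shown earlier.

```python
import torch

def sinkhorn_project(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    # Same projection as sketched earlier: exponentiate, then alternate row/column normalization.
    M = torch.exp(logits)
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)
        M = M / M.sum(dim=0, keepdim=True)
    return M

n = 4
H_res = sinkhorn_project(torch.randn(n, n, requires_grad=True))   # compounds across layers: needs the full projection
H_pre = torch.sigmoid(torch.randn(n, requires_grad=True))          # applied once per layer: bounding to (0, 1) suffices
H_post = torch.sigmoid(torch.randn(n, requires_grad=True))
```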
Is code available to reproduce these results?
The source states code will be released with Part 2.

Sources
- Reproducing DeepSeek's mHC: When Residual Connections Explode
- mHC: Manifold-Constrained Hyper-Connections
- DeepSeek's mHC: Manifold-Constrained Hyper-Connections