TL;DR
A small-scale reproduction of DeepSeek's manifold-constrained Hyper-Connections (mHC) shows that unconstrained mixing matrices can amplify signals, compounding across layers into unstable training at scale. Constraining those matrices to be doubly stochastic via the Sinkhorn–Knopp procedure stabilizes signal magnitude but can impose a performance cost at small model sizes.
What happened
The author reproduced DeepSeek’s Hyper-Connections (HC) idea on a GPT-2–style model of roughly 10M parameters trained on TinyShakespeare. HC expands the single residual stream into n parallel streams mixed by three learned matrices (H_res, H_pre, H_post). Unconstrained mixing matrices can amplify signals; in the reproduction HC showed early amplification (peaking near 9.2× at certain depths) and dramatic seed-dependent variance. DeepSeek reported much larger amplification at scale (peaks around 3,000× at 27B parameters). To prevent explosion, DeepSeek projects the learned matrices onto the manifold of doubly stochastic matrices using the Sinkhorn–Knopp algorithm (iterated row/column normalization after exponentiation). In the 10M experiments, the manifold-constrained variant (mHC) held Amax at 1.0 across runs while HC achieved lower mean validation loss but with much higher variance and amplification. The author also swept depth (6–24 layers) and observed nonmonotonic effects: depth 20 yielded best validation loss, while depth 24 regressed due to a width bottleneck.
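To make the mixing concrete, here is a minimal PyTorch sketch of an unconstrained Hyper-Connection wrapper around a transformer sublayer. The class, shapes, and parameter initialization are illustrative assumptions rather than the author's or DeepSeek's code; the point is only how H_pre collapses the n streams into a single layer input, H_res mixes the streams on the residual path, and H_post spreads the layer output back across the streams.

```python
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    """Illustrative sketch: n parallel residual streams mixed by three learned matrices."""

    def __init__(self, n_streams: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer                                  # e.g. an attention or MLP block
        self.H_res = nn.Parameter(torch.eye(n_streams))           # stream-to-stream residual mixing
        self.H_pre = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))   # collapse streams into the layer input
        self.H_post = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # distribute the layer output back

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (batch, seq, n_streams, dim)
        x_in = torch.einsum("bsnd,n->bsd", streams, self.H_pre)        # weighted combination of streams
        y = self.sublayer(x_in)                                        # ordinary transformer computation
        residual = torch.einsum("bsnd,nm->bsmd", streams, self.H_res)  # unconstrained mixing: can amplify
        return residual + self.H_post.view(1, 1, -1, 1) * y.unsqueeze(2)
```

Because nothing constrains H_res, repeated application across layers can grow the stream magnitudes, which is exactly the amplification the experiments track with Amax.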
Why it matters
- Unconstrained mixing matrices compound small amplifications into catastrophic signal explosion at large parameter scale, creating a scaling risk for more expressive residual designs.
- A principled constraint (doubly stochastic projection) enforces signal conservation and eliminates large, seed-dependent variance across runs.
- There is a trade-off between expressivity and robustness: HC can give better small-scale loss but is fragile; mHC adds stability that appears necessary at high parameter counts.
- Residual connections act like a conservation law in deep nets—preserving or bounding signal magnitude matters for both training stability and reproducibility.
Key facts
- Reproduction used a GPT-2–style architecture of ~10M parameters on TinyShakespeare (~1M characters), trained for 5,000 steps with AdamW (β1=0.9, β2=0.95), weight decay 0.1, and cosine LR decay on Apple M-series hardware (MPS backend); see the training-setup sketch after this list.
- HC mixes streams with three learned matrices: H_res (residual path), H_pre (pre-layer mix), H_post (post-layer distribution).
- Unconstrained HC showed amplification up to ~9.2× in the 10M reproduction; DeepSeek reported Amax peaks around 3,000× at 27B parameters.
- mHC constrains the mixing matrices to be doubly stochastic via the Sinkhorn–Knopp procedure (exponentiate entries to make them positive, then alternately normalize rows and columns), so the mixing acts as weighted averaging and cannot amplify signals; see the Sinkhorn sketch after this list.
- Only H_res required the full Sinkhorn doubly stochastic treatment; H_pre and H_post were bounded via a sigmoid in the reproduction.
- Across three seeds at depth 24: HC reached validation loss 0.884 ± 0.033 with maximum Amax 6.77 ± 0.60; mHC reached 1.116 ± 0.012 with maximum Amax held at exactly 1.00 ± 0.00.
- Depth sweep (6–24 layers) held parameters roughly constant at ~11M: validation loss improved up to depth 20 (0.85), then regressed at depth 24 (0.93) because the fixed budget forced the model width down to 192, creating a bottleneck.
- Seed values used for variation tests were 42, 123, and 456.
- Sinkhorn projections are differentiable; gradients backpropagate through the normalization iterations (the author found ~20 iterations sufficient).
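A minimal sketch of the Sinkhorn–Knopp projection described above, assuming a plain PyTorch implementation with the ~20 iterations the author reports; the function name and signature are mine, not the author's.

```python
import torch

def sinkhorn_project(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Project a square matrix of logits onto the doubly stochastic manifold.

    Exponentiate to make every entry positive, then alternately normalize
    rows and columns. Every step is differentiable, so gradients flow back
    into the underlying parameters.
    """
    M = torch.exp(logits)
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)  # rows sum to 1
        M = M / M.sum(dim=0, keepdim=True)  # columns sum to 1
    return M

H_res = sinkhorn_project(torch.randn(4, 4, requires_grad=True))
print(H_res.sum(dim=1))  # each row sums to ~1
print(H_res.sum(dim=0))  # each column sums to ~1
```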
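And, for completeness, a sketch of the reported training setup; the learning rate and the stand-in model are assumptions, since the source lists only the optimizer hyperparameters, weight decay, schedule, and step count.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 64)  # stand-in for the ~10M-parameter GPT-2-style model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                 # assumed; the source does not report the peak learning rate
    betas=(0.9, 0.95),       # reported
    weight_decay=0.1,        # reported
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5_000)  # cosine decay over 5,000 steps
```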
What to watch next
- Part 2 of the series will scale experiments toward 1B parameters on A100s using the C4 dataset to probe where instability emerges.
- Author indicates code will be released alongside Part 2.
- Whether mHC’s stability-preserving constraint retains competitive performance on large, real-world datasets and downstream tasks is not confirmed in the source.
Quick glossary
- Residual connection: A shortcut that adds a layer’s input to its output (x + F(x)), helping gradients flow and enabling very deep networks.
- Hyper-Connection (HC): A residual design that routes information across multiple parallel streams using learned mixing matrices instead of a single identity shortcut.
- Doubly stochastic matrix: A nonnegative square matrix whose rows and columns each sum to one; it performs weighted averaging and cannot amplify vector norms.
- Sinkhorn–Knopp algorithm: An iterative procedure that alternately normalizes rows and columns (after making entries positive) to project a matrix onto the doubly stochastic manifold.
- Amax: The maximum of the absolute row and column sums of a mixing matrix, used in the experiments to gauge how much the matrix can amplify signals.
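To pin down the Amax entry, here is a small sketch of how the metric could be computed; reading "absolute row and column sums" as sums of absolute entries is my assumption about the exact definition.

```python
import torch

def amax(H: torch.Tensor) -> torch.Tensor:
    """Maximum absolute row/column sum of a mixing matrix (assumed reading of Amax)."""
    row_sums = H.abs().sum(dim=1)   # per-row sum of absolute entries
    col_sums = H.abs().sum(dim=0)   # per-column sum of absolute entries
    return torch.maximum(row_sums.max(), col_sums.max())

# For a doubly stochastic matrix every row and column sums to 1, so Amax == 1;
# an unconstrained matrix can push this well above 1 (up to ~9.2x in the reproduction).
```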
Reader FAQ
What is mHC?
mHC is Hyper-Connections with a manifold constraint: mixing matrices are projected to be doubly stochastic (via Sinkhorn) so they route and blend streams without amplifying signals.
Does mHC hurt performance?
At the reproduced 10M parameter scale, mHC had higher mean validation loss than unconstrained HC but produced far lower variance and eliminated amplification; the effect at large scale is the subject of Part 2.
Why apply Sinkhorn only to H_res?
The source notes that H_res compounds across layers and therefore needs the full doubly stochastic projection, while the input/output mixers (H_pre, H_post) are simply bounded with a sigmoid in the reproduction.
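A hypothetical parametrization matching that split; the names and shapes are illustrative, and sinkhorn_project mirrors the sketch shown earlier.

```python
import torch

def sinkhorn_project(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    # Same projection as sketched earlier: exponentiate, then alternate row/column normalization.
    M = torch.exp(logits)
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)
        M = M / M.sum(dim=0, keepdim=True)
    return M

n = 4
H_res = sinkhorn_project(torch.randn(n, n, requires_grad=True))   # compounds across layers: needs the full projection
H_pre = torch.sigmoid(torch.randn(n, requires_grad=True))          # applied once per layer: bounding to (0, 1) suffices
H_post = torch.sigmoid(torch.randn(n, requires_grad=True))
```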
Is code available to reproduce these results?
The source states code will be released with Part 2.

Sources
- Reproducing DeepSeek's mHC: When Residual Connections Explode
- mHC: Manifold-Constrained Hyper-Connections
- DeepSeek's mHC: Manifold-Constrained Hyper-Connections