
DeepSeek mHC: How a 1967 Technique Is Reinventing Neural Networks

Thiago Victorino

Every language model we use today, from GPT and Claude to Gemini and Llama, depends on a technique introduced in 2015: residual connections. When Microsoft Research published the ResNet paper, it tamed the vanishing gradient problem and made networks with hundreds of layers trainable. Since then, the technique has become invisible through sheer ubiquity.

In December 2025, DeepSeek published a paper that may represent the next evolution of this fundamental mechanism. The most fascinating part: the solution comes from a 1967 mathematical theorem.

The Problem: Parallel Residual Connections Explode

In 2024, ByteDance researchers published the concept of Hyper-Connections (HC), expanding the single residual flow of traditional networks into multiple parallel flows. The idea was to allow data to flow through trainable paths between layers, accelerating convergence by up to 1.8 times.

The problem emerged at scale. When tested on 27-billion-parameter models, the signal was amplified by a factor of more than 3,000 as it traversed the network, and training diverged catastrophically.

The concept worked, but without mathematical constraints, it was unusable at production scale.
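
To see why this happens, here is a minimal numpy sketch of unconstrained parallel residual mixing (this is not ByteDance's implementation; the stream count, depth, and the random near-identity matrices standing in for trainable, unconstrained mixing weights are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

n_streams, width, depth = 4, 64, 60   # illustrative sizes, not the paper's
x = rng.standard_normal((n_streams, width))
initial_norm = np.linalg.norm(x)

for _ in range(depth):
    # Hyper-Connections widen the single residual stream into n parallel
    # streams that a trainable matrix re-mixes at every layer. A random
    # near-identity matrix stands in for "trainable but unconstrained".
    H = np.eye(n_streams) + 0.2 * rng.standard_normal((n_streams, n_streams))
    x = H @ x                         # mix the parallel residual streams

# With nothing forcing H to preserve scale, the norm typically grows by
# orders of magnitude over a few dozen layers.
print(np.linalg.norm(x) / initial_norm)
```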

The Solution: Doubly Stochastic Matrices

DeepSeek responded with mHC (Manifold-Constrained Hyper-Connections). The core idea is to project the connection matrices between layers onto a mathematical space called the Birkhoff polytope: the set of all doubly stochastic matrices.

A doubly stochastic matrix has three properties:

  1. Row sums = 1: the total signal leaving each layer is conserved
  2. Column sums = 1: the total signal received by each layer is conserved
  3. Non-negative values: eliminates destructive signal cancellations

With these constraints, signal magnitude can neither grow nor shrink as information traverses the network — regardless of depth.
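
A quick numeric check of that claim (the 4×4 mixing matrix and the signal vector below are hand-picked for illustration, not values from the paper):

```python
import numpy as np

# A doubly stochastic matrix: non-negative, every row and every column sums to 1.
M = np.array([
    [0.5, 0.2, 0.2, 0.1],
    [0.2, 0.5, 0.1, 0.2],
    [0.2, 0.1, 0.5, 0.2],
    [0.1, 0.2, 0.2, 0.5],
])
assert np.allclose(M.sum(axis=0), 1.0) and np.allclose(M.sum(axis=1), 1.0)

x = np.array([3.0, -1.0, 4.0, 2.0])   # signal carried by four parallel streams
y = M @ x

print(x.sum(), y.sum())            # column sums = 1: the total signal is conserved (8.0 and 8.0)
print(abs(x).max(), abs(y).max())  # row sums = 1 plus non-negativity: no entry can exceed the input maximum
```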

To enforce this constraint during training, DeepSeek used the Sinkhorn-Knopp algorithm, published in 1967 by Richard Sinkhorn and Paul Knopp in the Pacific Journal of Mathematics. The algorithm alternates row and column normalization iteratively until converging to a doubly stochastic matrix. Since the connection matrices are small (typically 4×4), computational overhead is minimal.
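
For intuition, here is a minimal numpy sketch of that alternating normalization (a pedagogical sketch, not DeepSeek's training code; the iteration count and the random positive input are illustrative assumptions):

```python
import numpy as np

def sinkhorn_knopp(A, n_iters=50):
    """Alternately normalize rows and columns of a strictly positive matrix
    until it is (approximately) doubly stochastic."""
    M = np.asarray(A, dtype=float)
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)   # make every row sum to 1
        M = M / M.sum(axis=0, keepdims=True)   # make every column sum to 1
    return M

rng = np.random.default_rng(0)
raw = np.exp(rng.standard_normal((4, 4)))      # exp keeps the entries strictly positive
M = sinkhorn_knopp(raw)

print(M.sum(axis=1))   # ~[1. 1. 1. 1.]
print(M.sum(axis=0))   # ~[1. 1. 1. 1.]
```

Because the matrices are tiny, a handful of these alternating passes already lands very close to doubly stochastic, which is consistent with the minimal overhead described above.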

Concrete Results

On a 27-billion parameter model based on the DeepSeek-V3 architecture, mHC demonstrated consistent gains over both the baseline (standard residual connections) and ByteDance’s HC:

  • BBH (reasoning): baseline 43.8 → HC 48.9 → mHC 51.0
  • DROP (reading): baseline 39.2 → HC 43.1 → mHC 45.7
  • GSM8K (math): baseline 62.4 → HC 68.1 → mHC 71.5

The additional training cost was approximately 6.7%. For models where training costs tens of millions of dollars, this is not negligible. But the gain in reasoning capability without requiring additional data may justify the investment.

AWS’s Response

AWS positioned itself as the first cloud provider to make DeepSeek-R1 available as a managed model on Amazon Bedrock. The strategy reflects three observations shared by Andy Jassy during AWS re:Invent:

Compute cost matters. DeepSeek-R1 offers competitive performance at $2.19 per million tokens — three to four times cheaper than Western alternatives.

Model diversity is essential. When builders have freedom to choose, they use different models for different tasks. Amazon Bedrock positions itself as a unified marketplace.

Enterprise security is a differentiator. Open-source models raise privacy and compliance concerns. Amazon Bedrock Guardrails offers sensitive information filtering and customizable controls.

The DeepSeek Platform

mHC does not exist in isolation. It is part of an ecosystem of innovations that positioned DeepSeek as a benchmark for efficiency:

  • Mixture of Experts (MoE): 671 billion total parameters, with only 37 billion activated per token (see the routing sketch after this list)
  • Multi-head Latent Attention (MLA): compresses KV cache into lower-dimensional space
  • Multi-Token Prediction (MTP): predicts multiple sequential tokens with 85-90% acceptance rate
  • DualPipe: innovative parallelism that overlaps computation and communication
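
To make the routing idea in the first item concrete, here is a toy top-k gating sketch (the expert count, hidden size, and top-k value are illustrative assumptions, not DeepSeek-V3's actual configuration): only the experts selected for a token are evaluated, which is how a model can hold 671 billion parameters yet activate only a fraction of them per token.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k, d = 16, 2, 8      # toy sizes, not DeepSeek-V3's real configuration
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy expert "FFNs"
router = rng.standard_normal((d, n_experts))

def moe_layer(x):
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                          # pick the top-k experts for this token
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen experts only
    # Only the selected experts run; the parameters of the other 14 stay idle.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

token = rng.standard_normal(d)
print(moe_layer(token).shape)       # (8,) computed with 2 of 16 experts
```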

The full model (DeepSeek-V3) was trained with 2.78 million H800 GPU hours on 14.8 trillion tokens.

Necessary Caveats

Before making decisions based on mHC, consider:

Limited test scale. Published results cover 3B, 9B, and 27B parameter models. Performance at 70B+ has not been publicly demonstrated.

Adoption depends on reproduction. Independent labs still need to validate the results on their own architectures. Follow-up work such as “mHC-lite” (arXiv:2601.05732) is already seeking to simplify the implementation.

6.7% is not always negligible. For training runs costing $100 million or more, this represents millions of additional dollars.

Geopolitical context. DeepSeek models are open-source but originated in China. Companies in regulated sectors should evaluate compliance when considering direct use via DeepSeek’s API.

The Deeper Lesson

mHC demonstrates a recurring pattern: the most impactful solutions in technology frequently come from rediscovering and recontextualizing existing knowledge.

  • 1967: Sinkhorn and Knopp publish theory on doubly stochastic matrices
  • 2015: He et al. create residual connections (ResNet)
  • 2024: ByteDance expands to parallel residual flows
  • 2025: DeepSeek combines HC + Sinkhorn-Knopp = mHC

Innovation lies in synthesis, not invention. Attention recontextualized the alignment mechanisms of machine translation; the Transformer (2017) recontextualized self-attention. mHC recontextualizes matrix theory from 1967.

The strategic implication is clear: investing in mathematical foundations and teams with theoretical depth may generate more returns than chasing the latest architecture.

What to Do with This Information

For technology leaders: monitor which 2026 models implement mHC. Reassess build vs. buy decisions — smaller models with mHC may match larger models without it. Use platforms that allow swapping models without rewriting applications.

For technical teams: study the original paper (arXiv:2512.24880). Experiment with DeepSeek-R1-Distill on Amazon Bedrock. Invest in fundamentals of linear algebra and convex optimization; they are increasingly relevant. Track variants like mHC-lite (arXiv:2601.05732).


References:

  • DeepSeek AI. “mHC: Manifold-Constrained Hyper-Connections.” arXiv:2512.24880 (2025)
  • Zhu et al. “Hyper-Connections.” arXiv:2409.19606 (2024); ICLR 2025
  • He et al. “Deep Residual Learning for Image Recognition.” arXiv:1512.03385 (2015)
  • DeepSeek AI. “DeepSeek-V3 Technical Report.” arXiv:2412.19437 (2024)
  • Sinkhorn, R. & Knopp, P. “Concerning Nonnegative Matrices and Doubly Stochastic Matrices.” Pacific Journal of Mathematics, 21(2), 343-348 (1967)
  • AWS. “DeepSeek-R1 models now available on AWS.” aws.amazon.com/blogs/aws/deepseek-r1-models-now-available-on-aws/

If this resonates, let's talk

We help companies implement AI without losing control.

Schedule a Conversation