On January 1, 2026, the DeepSeek research group published a technical report that may well set the direction of neural network architecture development for the coming years. The paper, titled "mHC: Manifold-Constrained Hyper-Connections", presents a new architecture that addresses a fundamental problem of scaling transformers: gradient instability as the connectivity between layers grows more complex.
Rather than extensively scaling up computing power, DeepSeek offers an algorithmic solution based on strict mathematical constraints.
From Residual to Hyper-Connections: In search of coherence
Modern large language models (LLMs) are built on the Transformer architecture with residual connections, a mechanism that lets the signal pass through the network's layers while bypassing nonlinear transformations, which makes deep networks easier to train. However, this structure is linear and restricts the flow of information between distant layers.
DeepSeek researchers have proposed the concept of Hyper-Connections (HC). The idea is to create a dense network of dynamic connections between layers, in which each layer has access to information from many previous levels rather than only its immediate neighbor. In theory, this significantly increases the expressive power of the model.
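As an illustration, here is a minimal PyTorch sketch contrasting a standard residual block with a naive hyper-connection-style block that mixes the hidden states of all previous layers. The class names, the single learned coefficient per previous layer, and the feed-forward sub-layer are assumptions made for this example, not the formulation from the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual connection: the layer's output is added to its input."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(x)  # a single, linear skip path from the previous layer

class NaiveHyperConnectionBlock(nn.Module):
    """Hypothetical naive hyper-connection block: the block's input is a learned,
    unconstrained mixture of the hidden states of all previous layers."""
    def __init__(self, d_model: int, n_prev: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # one unconstrained mixing coefficient per previous layer
        self.mix = nn.Parameter(torch.randn(n_prev))

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # weighted sum over the whole history of hidden states, not just the last one
        mixed = sum(w * h for w, h in zip(self.mix, history))
        return mixed + self.ffn(mixed)
```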
Problem: Signal Explosion
However, introducing hyper-connections ran into a serious obstacle during training. When testing a naive HC implementation, the researchers observed exponential growth of the signal variance.
As the signal passed through the layers of the network, the amplitude of activations and gradients grew uncontrollably; in the experiments the amplification reached 3000x. This leads to two critical problems:
- Training instability: the model's weights are updated too aggressively, which leads to loss divergence.
- Numerical instability: values exceed the representable range of the floating-point formats (fp16/bf16).
The model became virtually untrainable.
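The effect is easy to reproduce numerically. The toy example below stacks random non-negative mixing matrices, standing in for unconstrained hyper-connection weights, and tracks the total signal mass; the stream count, depth, and weight distribution are arbitrary choices for illustration, not values from the paper.

```python
import torch

torch.manual_seed(0)
n_streams, depth = 4, 12            # illustrative values only, not from the paper

amplitude = torch.ones(n_streams)   # one unit of signal "mass" per stream
for _ in range(depth):
    # naive hyper-connections: arbitrary non-negative mixing weights at every layer
    mix = torch.rand(n_streams, n_streams)
    amplitude = mix @ amplitude

# each layer multiplies the total mass by roughly its average row sum (about 2 here),
# so the signal grows exponentially with depth instead of staying bounded
print(f"amplification after {depth} layers: ~{amplitude.sum().item() / n_streams:.0f}x")
```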
Solution: Manifold-Constrained Hyper-Connections (mHC)
To curb the explosion without giving up the advantages of hyper-connections, the authors applied an approach grounded in manifold theory. This is how the mHC architecture was born.
The key mathematical stabilization tool is the doubly stochastic matrix.
How does it work?
In linear algebra, a doubly stochastic matrix is a square matrix of non-negative real numbers in which:
- the sum of the elements in each row equals 1;
- the sum of the elements in each column equals 1.
DeepSeek applied this property to the weight matrices that control the hyper-connections. Instead of arbitrary layer-mixing coefficients, the system forcibly projects the weights onto the so-called **Birkhoff polytope**, the space of all doubly stochastic matrices.
This mathematical constraint guarantees the preservation of the signal norm. Since the sum of the weights is strictly normalized in both dimensions (input and output), the signal energy is not dissipated (vanishing gradients) and is not amplified (exploding gradients). The signal is simply redistributed in the multidimensional feature space.
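A common way to enforce such a constraint in practice is Sinkhorn-Knopp normalization, which alternately rescales the rows and columns of a non-negative matrix until both sum to one. The sketch below illustrates only the general idea; the function name, iteration count, and matrix size are assumptions, and the paper's actual projection may differ.

```python
import torch

def sinkhorn_project(weights: torch.Tensor, n_iter: int = 30) -> torch.Tensor:
    """Approximately project a non-negative matrix onto the Birkhoff polytope
    (doubly stochastic matrices) by alternately normalizing rows and columns
    (Sinkhorn-Knopp). A sketch of the technique, not DeepSeek's actual code."""
    M = weights.clamp_min(1e-9)             # keep entries strictly positive
    for _ in range(n_iter):
        M = M / M.sum(dim=1, keepdim=True)  # make every row sum to 1
        M = M / M.sum(dim=0, keepdim=True)  # make every column sum to 1
    return M

torch.manual_seed(0)
raw = torch.rand(4, 4)         # arbitrary "learned" mixing weights
M = sinkhorn_project(raw)

print(M.sum(dim=1))            # ~[1, 1, 1, 1]: row sums
print(M.sum(dim=0))            # ~[1, 1, 1, 1]: column sums

# Because the columns sum to 1, mixing preserves the total signal mass:
x = torch.ones(4)
print((M @ x).sum().item())    # ~4.0, the same as x.sum()
```

Since every doubly stochastic matrix is a convex combination of permutation matrices (Birkhoff's theorem), mixing with such weights only reshuffles and averages the streams rather than amplifying them.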
Main results
In Section 5 of the paper, the authors present empirical evidence of the method's effectiveness by training models with up to 27 billion parameters (27B). The comparison covers three architectures: Baseline (a standard transformer), HC (standard hyper-connections), and mHC.
Learning Stability Analysis (Figure 5)
Figure 5 of the paper shows the dynamics of the loss curve and the gradient norms during training of the 27B model.
- Standard HC (light blue line): shows clear instability. The plot contains sharp spikes in the loss function, especially noticeable around step 12,000. This indicates a risk of model divergence as the gradients behave erratically: despite its higher theoretical capacity, training such a model carries substantial risk.
- DeepSeek mHC (blue line): the training curve is completely smooth and monotonically decreasing. It is practically indistinguishable from the Baseline reference in terms of stability, yet achieves a lower loss. No spikes or anomalies were recorded throughout training.
- Summary of the plot: mHC successfully suppresses the signal amplification, keeping the gain within 1.6x, whereas for plain HC it reached 3000x.
Benchmark performance (Table 4)
The "Table 4" shows the results of testing the trained model 27B on popular academic benchmarks. Interestingly, Standard HC, despite its instability, also shows an increase over Baseline, but mHC surpasses both architectures, providing both quality and stability.
Here is a part of the results from the table given in the paper:
| Benchmark | Task (shots) | Baseline (27B) | Standard HC (27B) | DeepSeek mHC (27B) | Gain (mHC vs Baseline) |
|---|---|---|---|---|---|
| **MMLU** | Knowledge (5-shot) | 59.0 | 61.2 | 63.4 | +4.4 |
| **GSM8K** | Mathematics (8-shot) | 46.7 | 50.1 | 53.8 | +7.1 |
| **BBH** | Reasoning (3-shot) | 43.8 | 48.9 | 51.0 | +7.2 |
| **DROP** | Reading comprehension (3-shot) | 47.0 | 51.6 | 53.9 | +6.9 |
Note: the Standard HC data shows that hyper-connections are effective on their own, but mHC extracts the most from them without sacrificing the stability of training.
Key findings from the data:
- Solving complex problems: mHC shows the largest gains on tasks that require multi-step reasoning, such as GSM8K (mathematics) and BBH (algorithmic tasks). This supports the hypothesis that denser connectivity between layers helps the model build longer logical chains.
- Loss reduction: the final loss of mHC was 0.021 lower than Baseline's, a significant improvement for models of this scale.
Conclusion
DeepSeek's work demonstrates a shift from empirical parameter tuning ("alchemy") to rigorous mathematical architecture design. The use of doubly stochastic matrices makes it possible to build significantly deeper and more densely connected networks, bypassing limitations that were previously considered insurmountable without a radical increase in hardware.
