Huawei's openPangu 2.0 Is Now Open-Source — and It Was Trained Without a Single NVIDIA GPU
If you followed HDC 2026, watched Richard Yu open-source Pangu, or are weighing openPangu 2.0 against DeepSeek for 512K context and compliance-ready deployment, this article is anchored on the June 30 Flash launch. It covers the event timeline, the seven-component open-source roadmap, the mHC/ModAttn architecture, Ascend hardware metrics, competitor comparison matrices, ModelArts/GitCode deployment paths, and a five-step Runbook.
Contents
- 1. Three selection pain points
- 2. Event background and timeline
- 3. Pro vs Flash specifications
- 4. Seven-component full-stack open source
- 5. Architecture deep dive
- 6. Ascend hardware and training breakthrough
- 7. Competitor comparison and selection matrix
- 8. Access and deployment guide
- 9. Strategic significance and license
- 10. Five-step Runbook
- 11. Citable technical facts
- 12. Conclusion
1. Three Selection Pain Points: Open-Source Depth, Hardware Lock-In, and Context Length
- "Open source" is not always full-stack open. Most frontier models release weights and inference code only—pre-training, post-training, and custom training operators stay closed. You cannot reproduce the training pipeline or run domain-specific continued pre-training.
- Hardware binding and compliance. DeepSeek, Qwen, Kimi, and Llama were all trained on NVIDIA hardware. Under US export controls, teams that need a frontier model trained without any NVIDIA GPU currently have one option: openPangu 2.0.
- Context window drives use cases. Full contracts, large codebases, and marathon chat histories often exceed 128K. Both openPangu 2.0 variants ship a unified 512K window—roughly the text of eight full-length novels in one pass.
2. Event Background and Timeline: HDC 2026 to GitCode Launch
| Date | Event |
|---|---|
| 2026-06-12 | Huawei Developer Conference (HDC 2026), Dongguan Songshan Lake—Richard Yu keynote officially launches openPangu 2.0 |
| 2026-06-30 | openPangu-2.0-Flash weights, base inference code, and training/inference operators go open source on GitCode |
| 2026-07 (planned) | openPangu-2.0-Pro weights and inference code release |
| H2 2026 (planned) | Pre-training code, post-training code (SFT/RLHF), and additional training operators roll out |
At HDC 2026, Richard Yu said: "In the dictionary of my remaining life, there is no second place—only first. We will go from number one in China to number one in the world."
3. Two Versions for Different Scenarios
| | Pro | Flash |
|---|---|---|
| Total parameters | 505B | 92B |
| Active parameters | 18B | 6B |
| Sparsity ratio | ~28:1 | ~15:1 |
| Context window | 512K | 512K |
| Release status | July (planned) | June 30 (live) |
Flash: 92B total parameters with only 6B active, so per-token compute is close to that of a 6B dense model while the model draws on a 92B knowledge pool. Single Ascend 910B card inference is supported; community estimates suggest systems with ~96GB of unified memory may also work.
Pro: 505B total, 18B active—built for extreme long-document workloads. The 512K window can ingest full contracts, large repositories, and extended conversation history in one shot.
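The sparsity ratios in the table follow directly from the parameter counts. A quick check, as a minimal sketch that uses only the figures quoted above (the function name is illustrative):

```python
def sparsity_ratio(total_params_b: float, active_params_b: float) -> float:
    """Total-to-active parameter ratio, rounded to one decimal place."""
    return round(total_params_b / active_params_b, 1)

# Parameter counts in billions, taken from the specification table.
pro_ratio = sparsity_ratio(505, 18)
flash_ratio = sparsity_ratio(92, 6)
print(f"Pro ~{pro_ratio}:1, Flash ~{flash_ratio}:1")
```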
4. Seven-Component Full-Stack Open Source: Why the Release Matters
Most open LLMs ship weights + inference code only. openPangu 2.0 plans to open seven major components:
- Model architecture (structure definition) — ✅ released
- Model weights (Flash live June 30; Pro planned July)
- Technical report — ✅ released with weights
- Inference code + training/inference operators — ✅ released
- Pre-training code — 📋 H2 2026
- Post-training code (SFT/RLHF) — 📋 H2 2026
- Training operators (Ascend high-performance custom kernels) — 📋 H2 2026
The last three are extremely rare at this MoE scale—enabling true full-stack open source. Researchers can reproduce training; enterprises can run vertical continued pre-training.
5. Architecture Deep Dive
openPangu 2.0 uses a MoE (Mixture of Experts) design. Key techniques include:
- mHC (Multi-Head Combinatorial) routing: improves expert routing efficiency and reduces load imbalance
- Muon optimizer: a momentum optimizer that orthogonalizes weight-matrix updates via Newton-Schulz iteration, for more stable large-scale training
- ModAttn (Modular Attention): modular attention blocks tuned for 512K ultra-long context
- DSA+SWA ultra-sparse attention (Flash only): extreme sparsity ratio to cut inference compute
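For readers new to MoE, the sketch below shows generic top-k expert gating. It is not openPangu's mHC routing, whose details are not described here. The expert count, the logits, and `k=2` are hypothetical values chosen for illustration.

```python
import math

def top_k_route(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Generic top-k MoE gating (illustrative, not the mHC scheme):
    pick the k highest-scoring experts and renormalize their
    softmax weights so they sum to 1."""
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}

# Four hypothetical experts; only two are activated for this token.
weights = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(weights)
```

Only the selected experts run for a given token, which is why active parameters (6B or 18B) are far smaller than total parameters.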
Developer ecosystem and software stack
- CANN (Huawei's compute stack, CUDA-class) + torch_npu (PyTorch adapter)
- Standard PyTorch code switches to Ascend via `import torch_npu`
- Deployment surfaces: Huawei Cloud ModelArts (API), GitCode Ascend Tribe (self-hosted), HarmonyOS native integration
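A minimal sketch of what that switch can look like in practice, assuming a machine that may or may not have PyTorch, torch_npu, and CANN installed. The helper name `select_device` is hypothetical; the code falls back to CPU when the Ascend stack is absent.

```python
import importlib.util

def select_device() -> str:
    """Return 'npu:0' when torch_npu is installed and an NPU is visible, else 'cpu'."""
    has_torch = importlib.util.find_spec("torch") is not None
    has_torch_npu = importlib.util.find_spec("torch_npu") is not None
    if not (has_torch and has_torch_npu):
        return "cpu"
    import torch
    import torch_npu  # noqa: F401  (importing registers the 'npu' device with PyTorch)
    return "npu:0" if torch.npu.is_available() else "cpu"

device = select_device()
print(f"Running on: {device}")
```

Model and tensor code can then call `.to(device)` as usual, so the same script runs on an Ascend host or a development machine.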
6. The First "No NVIDIA" Frontier Model: Ascend Hardware Adaptation
openPangu 2.0 is the first frontier-scale model fully trained on non-NVIDIA hardware—end to end on Huawei Ascend 910B NPUs, with no A100/H100 in the loop.
| Metric | Data |
|---|---|
| Single-card throughput (Ascend) | 2× mainstream open-source models |
| Super-node training efficiency | +30% |
| 512K long-sequence training throughput | +50% |
| Train/inference consistency | >99% (a long-standing MoE pain point) |
| Inference latency | 1.2× better than comparable industry models |
| On-device 30B embedded model | 50% faster inference, 20% less memory; runs offline on Kirin chips |
| Flash-Int8 quantization | W4A8, 40% memory reduction, <10% accuracy loss |
7. Competitor Comparison and Selection Matrix
Head-to-head parameters
| Model | Total params | Active params | Context | Training hardware | Openness |
|---|---|---|---|---|---|
| openPangu 2.0 Pro | 505B | 18B | 512K | Ascend NPU | Full stack (7 components) |
| openPangu 2.0 Flash | 92B | 6B | 512K | Ascend NPU | Full stack (7 components) |
| DeepSeek V4 Pro | 1.6T | ~200B | 128K | NVIDIA | Weights + inference |
| Qwen 3.7 Max | ~400B+ | varies | 128K | NVIDIA | Weights + inference + partial training |
| Kimi K2.7 | 1T | 32B | 256K | NVIDIA | Weights + inference |
| Llama 4 405B | 405B | — | 128K | NVIDIA | Weights + inference |
Capability matrix by scenario
| Scenario | Recommendation | Why |
|---|---|---|
| Code generation / complex reasoning | DeepSeek V4 Pro | ~200B active parameters, current performance leader |
| Agent / multi-tool orchestration | Kimi K2.7 | Mature MCP ecosystem |
| Ultra-long documents (>256K tokens) | openPangu 2.0 Pro | 512K context is the clear choice |
| Domestic compliance / sovereign AI | openPangu 2.0 | Only frontier model trained on purely domestic hardware |
| Ascend / Huawei Cloud deployment | openPangu 2.0 | Native optimization, 2× throughput |
| On-device / mobile deployment | Embedded 30B | Local inference on Kirin chips |
| Low-cost local inference | Flash | 6B active; community estimates suggest ~96GB of unified memory may suffice |
Note: Independent third-party benchmarks are still in progress; the capability assessments above partly reflect architectural inference and will be updated when results publish.
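The context column of the head-to-head table can be turned into a simple pre-filter. The sketch below assumes "K" means 1,024 tokens and reserves a hypothetical 4,096 tokens for the model's output; both are assumptions, not figures from any vendor.

```python
# Context windows in K tokens, copied from the head-to-head table.
CONTEXT_WINDOW_K = {
    "openPangu 2.0 Pro": 512,
    "openPangu 2.0 Flash": 512,
    "DeepSeek V4 Pro": 128,
    "Qwen 3.7 Max": 128,
    "Kimi K2.7": 256,
    "Llama 4 405B": 128,
}

def models_that_fit(prompt_tokens: int, reserve_for_output: int = 4096) -> list[str]:
    """List models whose context window holds the prompt plus an output budget."""
    needed = prompt_tokens + reserve_for_output
    return [name for name, k in CONTEXT_WINDOW_K.items() if k * 1024 >= needed]

# A 300,000-token contract bundle exceeds every 128K and 256K window.
long_doc_candidates = models_that_fit(300_000)
print(long_doc_candidates)
```

Context length is only a first filter; quality on long inputs still needs to be checked once independent benchmarks are available.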
8. Access and Deployment: ModelArts API and GitCode Self-Hosting
Option 1: Huawei Cloud ModelArts API (simplest)
- Create a Huawei Cloud account
- Open ModelArts → AI Gallery → search "openPangu 2.0"
- Subscribe to Flash or Pro and obtain the API endpoint
Option 2: GitCode self-deployment
Repository hub: gitcode.com/org/ascend-tribe
- openPangu-2.0-Flash: Flash weights
- openPangu-2.0-Flash-Int8: quantized build (40% less memory)
- openPangu-2.0-Infer: inference source
- openPangu-2.0-Op: Ascend high-performance operators
Hardware requirements (reference)
| Version | Recommended hardware | Minimum config |
|---|---|---|
| Flash (6B active) | Single Ascend 910B | ~96GB unified memory |
| Flash-Int8 | Single Atlas A2 | ~48GB VRAM |
| Pro (18B active) | 4+ Ascend 910B cards | Multi-card cluster (validate after July weight release) |
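As a rough sanity check on these figures, the sketch below estimates weight-only memory from the total parameter count. It assumes every weight is stored at the same precision and ignores KV cache, activations, and runtime overhead, so real requirements will be higher. An MoE model must hold all experts in memory even though only a few are active per token.

```python
def weight_memory_gb(total_params: int, bits_per_weight: int) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return total_params * bits_per_weight / 8 / 1e9

FLASH_PARAMS = 92_000_000_000   # 92B total, from the specification table
PRO_PARAMS = 505_000_000_000    # 505B total

for label, params in [("Flash", FLASH_PARAMS), ("Pro", PRO_PARAMS)]:
    for bits in (16, 8, 4):
        print(f"{label} @ {bits}-bit: {weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions, Flash needs about 92 GB at 8 bits and 46 GB at 4 bits, which sits just under the ~96GB and ~48GB minimums above. The table itself does not state which precision each row assumes.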
9. Strategic Significance, HarmonyOS Agent, and openPangu License
- Geopolitics: With A100/H100 restrictions on China, openPangu 2.0 proves frontier-scale training without NVIDIA is feasible
- Full-stack open-source value: reproducible research, enterprise continued pre-training, lower barrier to the Ascend ecosystem
- HarmonyOS Agent foundation: HarmonyOS 7 enters the Agent era; HarmonyOS Agent Framework 2.0 reports >90% success on complex tasks; on-device 30B runs offline
- openPangu License: commercial use allowed, royalty-free, non-exclusive (see GitCode repos for exact terms)
10. Five-Step Getting Started Runbook
Step 1 — Define scenario and version
Ultra-long documents → Pro; low-cost API → Flash; compliance → either version; on-device → Embedded 30B.
Step 2 — Choose access path
No hardware: Huawei Cloud ModelArts API. Have Ascend: download weights from GitCode and self-host.
Step 3 — Configure the Ascend software stack
Install CANN and the torch_npu adapter, then confirm that PyTorch detects the NPU before loading weights.
Step 4 — Run inference or call the API
Flash runs on a single card via inference.py; the quantized path uses Flash-Int8; Pro runs multi-card via distributed_inference.py.
Step 5 — Track the open-source roadmap and benchmark updates
Watch GitCode Ascend Tribe; update deployment notes when Pro lands in July; replace architectural inference once third-party scores publish.
11. Citable Technical Facts
- openPangu 2.0 Pro: 505B total / 18B active; Flash: 92B / 6B; both ship 512K context.
- First frontier-scale model trained and open-sourced on non-NVIDIA hardware; training stack is Ascend 910B.
- Ascend single-card throughput is 2× mainstream open models; train/inference consistency >99%; 512K long-sequence training throughput +50%.
- Planned release of seven major components, including pre-training, post-training code, and training operators—rare at this MoE scale.
12. Conclusion: Not an All-Round Champion, but Irreplaceable on Key Axes
DeepSeek V4 Pro still leads on code generation and hard reasoning, but openPangu 2.0 is nearly unmatched on 512K ultra-long context, sovereign domestic training, 2× Ascend-native throughput, full-stack open source, and HarmonyOS on-device integration. Flash weights went live June 30—right in the news cycle.
If you wire openPangu APIs from a laptop or generic Linux VPS, orchestrate HarmonyOS Agents, or run a multi-model gateway, long-running production setups often hit lid-close disconnects, missing Apple toolchains, and ops overhead. For 7×24 stable Agent workloads, OpenClaw gateways, and native iOS/macOS toolchains, renting a VPSMAC M4 Mac cloud node is the lower-friction path—swap models as the open ecosystem evolves while keeping a native macOS runtime stable.
Some benchmark figures in this article are architectural estimates; we will update when independent third-party results publish. Published: July 1, 2026.