DeepSeek V4-Pro and Huawei Ascend 950DT were co-designed from the ground up, cutting inference costs 75%

· by Pondero Newsdesk

The short version

A SemiAnalysis trace-level analysis published in June 2026 found that DeepSeek V4-Pro and Huawei's Ascend 950DT accelerator were co-designed rather than adapted after the fact, cutting per-million-token costs to 0.20 yuan and raising new questions about the reach of US semiconductor export controls.

DeepSeek released V4-Pro as an API preview on April 24, 2026. A subsequent trace-level analysis by Wall Street research firm SemiAnalysis, reported by Pandaily in June 2026, found the model was built alongside Huawei's Ascend 950DT accelerator from the start, not ported to it afterward.

What

DeepSeek's V4 preview release notes confirm V4-Pro carries 1.6 trillion total parameters with 49 billion active, and V4-Flash carries 284 billion total with 13 billion active. Both launched April 24 with 1-million-token context windows and two operational modes: standard (Instant Mode) and extended reasoning (Expert Mode).
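For a sense of scale, the share of parameters that are active in each model can be worked out from the counts above. The following is a minimal Python sketch; the function name `active_fraction` is illustrative, and only the parameter counts come from the release notes.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Return the active parameters as a fraction of the total."""
    return active_params / total_params

# Parameter counts as stated in the V4 preview release notes.
v4_pro = active_fraction(1.6e12, 49e9)    # 1.6 trillion total, 49 billion active
v4_flash = active_fraction(284e9, 13e9)   # 284 billion total, 13 billion active

print(f"V4-Pro:   {v4_pro:.1%} of parameters active")
print(f"V4-Flash: {v4_flash:.1%} of parameters active")
```

On these figures, roughly 3% of V4-Pro's parameters and under 5% of V4-Flash's are active at once.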

The co-design finding came from SemiAnalysis, per the Pandaily report published June 15. Huawei's CANN 8.5 software stack and the Ascend 950DT's dual-die unified memory architecture were built with DeepSeek's inference patterns in mind. The 950DT's MC-squared technology merges communication primitives and compute into single kernels, removing the data transfer bottleneck that traditionally limits inference on non-NVIDIA hardware. The practical result: only CUDA and CANN had full day-zero support for V4 inference; AMD ROCm managed one to two tokens per second, and NVIDIA TRT-LLM hit a silent memory corruption bug that took weeks to diagnose.

Per the same Pandaily report, the co-design allowed DeepSeek to price V4-Pro API calls at 0.20 yuan per million tokens, roughly 50 times cheaper than comparable Anthropic offerings by DeepSeek's own comparison. The report describes this as a 75% cost reduction relative to V4-Flash pricing, passed directly to API customers. DeepSeek's token-traffic share on the Vercel AI Gateway climbed from under 1% to 17% in May 2026, per the same Pandaily analysis, surpassing OpenAI to reach third place on that platform.
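The pricing figures above imply some reference prices that the report does not state directly. The sketch below is a back-of-envelope calculation only: the function names are illustrative, and the derived baseline and comparison prices are inferences from the reported 75% and 50x figures, not reported numbers.

```python
PRICE_YUAN_PER_MILLION = 0.20  # reported V4-Pro API price
REPORTED_REDUCTION = 0.75      # reported cost reduction
REPORTED_MULTIPLE = 50         # "roughly 50 times cheaper", per DeepSeek's comparison

def implied_baseline(price: float, reduction: float) -> float:
    """Price before a fractional reduction that produces `price`."""
    return price / (1 - reduction)

def request_cost(tokens: int, price_per_million: float = PRICE_YUAN_PER_MILLION) -> float:
    """Cost in yuan of processing `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million

# Derived values (assumptions, not figures from the report).
baseline = implied_baseline(PRICE_YUAN_PER_MILLION, REPORTED_REDUCTION)
comparison = PRICE_YUAN_PER_MILLION * REPORTED_MULTIPLE

print(f"Implied pre-reduction price: {baseline:.2f} yuan per million tokens")
print(f"Implied comparison price:    {comparison:.2f} yuan per million tokens")
print(f"One full 1M-token context:   {request_cost(1_000_000):.2f} yuan")
```

Read this way, a 75% cut lands at 0.20 yuan from an implied 0.80 yuan, and the 50x comparison implies a rival price of about 10 yuan per million tokens.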

Why it matters

The standard assumption among Western chip analysts was that Chinese AI labs develop models on NVIDIA hardware first, then adapt for domestic chips. The SemiAnalysis finding inverts that assumption for DeepSeek V4, at least at the inference layer. If the co-design extended to training as well, the efficacy of US semiconductor export controls targeting NVIDIA A100 and H100 access becomes a harder question to answer from the outside.

The hardware side is moving quickly. TrendForce reported on June 8 that Huawei Cloud accelerated Ascend 950DT deployment from Q4 2026 to August, citing DeepSeek V4.2 demand as a key reason. ByteDance locked in half of Ascend 950 production capacity, with Alibaba and Tencent also placing orders for tens of thousands of units, per the Pandaily report. China Mobile purchased 776 Ascend node sets totaling 6,208 accelerators. That level of domestic commitment suggests the Ascend stack is being treated as a production-grade platform, not a fallback.
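The China Mobile figures can be checked for internal consistency. This short Python sketch uses only the two reported numbers; the variable names are illustrative.

```python
NODE_SETS = 776        # reported Ascend node sets purchased by China Mobile
ACCELERATORS = 6_208   # reported total accelerators

# divmod returns the whole accelerators per node set and any leftover.
per_node, remainder = divmod(ACCELERATORS, NODE_SETS)

print(f"{per_node} accelerators per node set (remainder {remainder})")
```

The totals divide evenly into eight accelerators per node set.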

What to watch next

DeepSeek V4.1 is reportedly scheduled for June 2026, with V4.2 potentially arriving in August to coincide with Ascend 950DT's Huawei Cloud launch, per TrendForce's June 8 report. Whether the US Commerce Department tightens export controls in response to Ascend's demonstrated viability as a frontier inference substrate is the broader policy question likely to follow.

Sources