DeepSeek V4-Pro and MiMo 2.5 Pro Cost 34x Less Than GPT-5.5. Here's What You Actually Give Up.

June 2, 2026

Two back-to-back price cuts from China's top AI labs just permanently changed the economics of building with frontier models. DeepSeek locked in its 75% V4-Pro discount. Xiaomi followed days later, slashing MiMo 2.5 Pro API costs by up to 99% on cached inputs. Both models now run at $0.87 per million output tokens — while GPT-5.5 charges $30 and Claude Opus 4.8 charges $25. The coding performance gap is smaller than you think. Here's the math, the benchmarks, and what you'd actually lose by switching.

The pricing table that should make American labs nervous

Let's start with the raw numbers. This is per-million-token API pricing as of June 2, 2026 — what developers actually pay when their app, agent, or pipeline hits a model:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Cached Input	1M in + 1M out
OpenAI GPT-5.5	$5.00	$30.00	—	$35.00
Anthropic Claude Opus 4.8	$5.00	$25.00	—	$30.00
Google Gemini 2.5 Pro	$1.25	$10.00	—	$11.25
DeepSeek V4-Pro	$0.435	$0.87	$0.0036	$1.305
Xiaomi MiMo V2.5 Pro	$0.435	$0.87	$0.0036	$1.305

At a simple 1M-in, 1M-out comparison, DeepSeek V4-Pro and MiMo 2.5 Pro are 26.8x cheaper than GPT-5.5 and 23x cheaper than Claude Opus 4.8. On output alone — where most agent costs actually land — the gap is 34.5x against GPT-5.5. And when your system prompts and document contexts hit the cache (which they will, constantly, in production), cached input drops to $0.0036 per million tokens. That's effectively free.
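The ratios above are easy to recheck. Here is a minimal sketch that recomputes them from the table and shows what a cache hit rate does to the bill. The `request_cost` helper and its `cached_fraction` parameter are illustrative names, not part of any vendor SDK, and models without a listed cached rate are assumed to bill cached input at the full input price.

```python
# Per-million-token API prices (USD), copied from the table above.
PRICES = {
    "GPT-5.5": {"input": 5.00, "output": 30.00},
    "Claude Opus 4.8": {"input": 5.00, "output": 25.00},
    "Gemini 2.5 Pro": {"input": 1.25, "output": 10.00},
    "DeepSeek V4-Pro": {"input": 0.435, "output": 0.87, "cached": 0.0036},
    "MiMo 2.5 Pro": {"input": 0.435, "output": 0.87, "cached": 0.0036},
}


def request_cost(model, input_tokens, output_tokens, cached_fraction=0.0):
    """USD cost of a workload; cached_fraction is the share of input served from cache."""
    p = PRICES[model]
    # Assumption: no listed cached rate means cached input costs the full input rate.
    cached_rate = p.get("cached", p["input"])
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    total = fresh * p["input"] + cached * cached_rate + output_tokens * p["output"]
    return total / 1_000_000


def ratio(expensive, cheap, **workload):
    """How many times more the expensive model costs for the same workload."""
    return request_cost(expensive, **workload) / request_cost(cheap, **workload)


one_each = dict(input_tokens=1_000_000, output_tokens=1_000_000)
output_only = dict(input_tokens=0, output_tokens=1_000_000)

print(round(ratio("GPT-5.5", "DeepSeek V4-Pro", **one_each), 1))          # 26.8
print(round(ratio("Claude Opus 4.8", "DeepSeek V4-Pro", **one_each), 1))  # 23.0
print(round(ratio("GPT-5.5", "DeepSeek V4-Pro", **output_only), 1))       # 34.5
# Same 1M-in, 1M-out workload with 90% of the input hitting the cache:
print(round(request_cost("DeepSeek V4-Pro", cached_fraction=0.9, **one_each), 4))
```

Note that with these prices the output side dominates once most of the input is cached, so cache hit rate matters far less to the total than output volume does.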

These numbers aren't promotional. DeepSeek's 75% discount on V4-Pro became permanent on May 22. Xiaomi cut MiMo 2.5 Pro prices on May 26, bringing it to parity with DeepSeek. Fuli Luo, head of Xiaomi's MiMo team and a former core DeepSeek developer who co-built DeepSeek-V2, published the technical explanation: "Operating at these newly reduced API prices, our production inference engine is running at near full capacity, and we can still essentially break even."

This isn't venture-subsidized dumping. These prices reflect architectural efficiency — both models use aggressive KV cache compression that dramatically reduces the compute cost per token. DeepSeek V4-Pro's KV cache at one million tokens of context is 10% the size of its predecessor's. Single-token inference runs at 27% of the previous compute cost. Xiaomi's hierarchical KV cache optimization achieves similar gains through a different mechanism. The cost reductions are real, and they're permanent.
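To get a feel for why KV cache size drives serving cost, here is a rough sketch using the standard formula for an uncompressed attention cache. The layer count, head count, and head dimension below are made-up placeholders, not DeepSeek's or Xiaomi's real configuration, and the 10% figure is simply applied to that hypothetical baseline.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Uncompressed KV cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical dimensions for illustration only (16-bit values, 1M-token context).
baseline = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128, seq_len=1_000_000)
compressed = baseline // 10  # the article's "10% the size" figure applied to this baseline

print(f"baseline:   {baseline / 1e9:.1f} GB per 1M-token sequence")
print(f"compressed: {compressed / 1e9:.1f} GB per 1M-token sequence")
```

Under these assumptions a single long-context request shrinks from hundreds of gigabytes of cache to tens, which is the difference between one sequence per multi-GPU node and several.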

DeepSeek V4-Pro: the open-source whale that won't go away

DeepSeek V4-Pro is a 1.6 trillion parameter Mixture-of-Experts model released under the MIT License. It shipped on April 24, 2026 — 484 days after V3, and timed to land the same week as GPT-5.5 and Claude Opus 4.7. The timing was not subtle.

The benchmark that matters most for builders: SWE-bench Verified, which measures real GitHub issue resolution. DeepSeek V4-Pro scores 80.6%. Claude Opus 4.6 scores 80.8%. That's a 0.2-point gap, which is effectively identical real-world coding capability. The price difference for that 0.2 points? About 29x on output tokens at the $25 Opus 4.8 rate in the table above, and 34x against GPT-5.5.

On broader benchmarks, the picture is more mixed but still impressive for something this cheap:

SWE-bench Pro: 55.4% (vs GPT-5.5's 58.6%, Claude Opus 4.7's 64.3%)
Terminal-Bench 2.0: 67.9% (vs GPT-5.5's 82.7%, Claude Opus 4.7's 69.4%)
BrowseComp (web agent): 83.4% — beats Claude Opus 4.7's 79.3%, nearly matches GPT-5.5's 84.4%
GPQA Diamond (graduate-level reasoning): 90.1% (vs GPT-5.5's 93.6%, Claude Opus 4.7's 94.2%)
Humanity's Last Exam (with tools): 48.2% (vs GPT-5.5's 52.2%, Claude Opus 4.7's 54.7%)
MCP Atlas (tool orchestration): 73.6% (vs GPT-5.5's 75.3%, Claude Opus 4.7's 79.1%)

The pattern is consistent: DeepSeek V4-Pro trails the best premium model by 4 to 9 percentage points on the hardest reasoning and long-horizon coding benchmarks (and by nearly 15 on Terminal-Bench 2.0), but on SWE-bench Verified and BrowseComp the gap nearly disappears. For the majority of production workloads (code generation, bug fixing, web agents, tool orchestration) the model is competitive with systems that cost 23–34x more.
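If you want to see the spread for yourself, the gaps can be recomputed from the scores listed above. This is a small sketch; `gap_to_best` is an illustrative helper name, and the numbers are exactly the ones quoted in this article.

```python
# Benchmark scores (percent) as quoted in the list above.
SCORES = {
    "SWE-bench Pro": {"DeepSeek V4-Pro": 55.4, "GPT-5.5": 58.6, "Claude Opus 4.7": 64.3},
    "Terminal-Bench 2.0": {"DeepSeek V4-Pro": 67.9, "GPT-5.5": 82.7, "Claude Opus 4.7": 69.4},
    "BrowseComp": {"DeepSeek V4-Pro": 83.4, "GPT-5.5": 84.4, "Claude Opus 4.7": 79.3},
    "GPQA Diamond": {"DeepSeek V4-Pro": 90.1, "GPT-5.5": 93.6, "Claude Opus 4.7": 94.2},
    "Humanity's Last Exam (with tools)": {"DeepSeek V4-Pro": 48.2, "GPT-5.5": 52.2, "Claude Opus 4.7": 54.7},
    "MCP Atlas": {"DeepSeek V4-Pro": 73.6, "GPT-5.5": 75.3, "Claude Opus 4.7": 79.1},
}


def gap_to_best(benchmark, model="DeepSeek V4-Pro"):
    """Percentage points by which `model` trails its best rival (negative if it leads)."""
    scores = SCORES[benchmark]
    best_rival = max(score for name, score in scores.items() if name != model)
    return round(best_rival - scores[model], 1)


# Print benchmarks from the narrowest gap to the widest.
for name in sorted(SCORES, key=gap_to_best):
    print(f"{name:36s} {gap_to_best(name):5.1f} pts behind the best rival")
```

Sorted this way, BrowseComp sits at the narrow end and Terminal-Bench 2.0 at the wide end, with the reasoning benchmarks in between.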

Artificial Analysis now ranks DeepSeek V4-Pro as the top model globally for intelligence-per-dollar after the permanent price cut.

Xiaomi MiMo 2.5 Pro: the multimodal dark horse

If DeepSeek is the budget coding workhorse, Xiaomi MiMo 2.5 Pro is the multimodal specialist that happens to code well too. It's a 1.22 trillion parameter MoE model (activating 42B parameters per token) released under the MIT License. Unlike DeepSeek V4-Pro, it handles text, images, video, and audio natively in a single model.

Xiaomi is an unlikely AI contender. The company known for smartphones and electric vehicles committed $8.7 billion over three years to AI, announced by CEO Lei Jun in March. The release cadence suggests the money is already moving: MiMo V2-Flash in December 2025, V2-Pro in March 2026, and now V2.5 in late April. Three major model releases in under five months.

The benchmark story:

SWE-bench Pro: 57.2% — ahead of DeepSeek V4-Pro's 55.4%, competitive with GPT-5.5's 58.6%
ClawEval (agentic tasks): 63.8% success rate, consuming only ~70K tokens per trajectory — 40–60% fewer tokens than Claude Opus 4.6 or GPT-5.4 for comparable results
GDPVal-AA (Elo): 1,581 — ahead of Kimi K2.6 and GLM 5.1
Humanity's Last Exam: 48.0% — competitive with DeepSeek but trailing GPT-5.4's 58.7%
Multimodal benchmarks: on par with GPT-5.4 and Gemini 3.1 Pro

Where MiMo 2.5 Pro genuinely stands out is token efficiency. It uses 42% fewer tokens than Kimi K2.6 at equivalent benchmark scores. For agentic "claw" tasks — automated workflows where the model makes hundreds or thousands of sequential tool calls — MiMo 2.5 Pro is designed to sustain coherence across extremely long sessions. Xiaomi demonstrated it autonomously building a complete SysY compiler in Rust with a perfect score on hidden test suites, and a full-featured video editor through 1,868 sequential tool calls.

On the token plan side, Xiaomi's billing refresh is aggressive: the $100 Max plan now gets you 82 billion tokens, up from 1.6 billion. That's a roughly 51x increase in effective token allowance.
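For a sense of scale, here is the plan arithmetic as a quick sketch. It assumes you actually consume the full allowance and that the plan counts input and output tokens alike, neither of which is confirmed here; `usd_per_million` is an illustrative helper.

```python
PLAN_PRICE_USD = 100
OLD_ALLOWANCE = 1_600_000_000   # tokens on the $100 Max plan before the refresh
NEW_ALLOWANCE = 82_000_000_000  # tokens on the $100 Max plan after the refresh


def usd_per_million(price_usd, tokens):
    """Effective price per million tokens if the whole allowance is used."""
    return price_usd / (tokens / 1_000_000)


increase = NEW_ALLOWANCE / OLD_ALLOWANCE

print(f"allowance increase: {increase:.2f}x")
print(f"old effective rate: ${usd_per_million(PLAN_PRICE_USD, OLD_ALLOWANCE):.4f} per 1M tokens")
print(f"new effective rate: ${usd_per_million(PLAN_PRICE_USD, NEW_ALLOWANCE):.4f} per 1M tokens")
```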

Both models carry a 1 million token context window, matching the frontier standard. Both are MIT-licensed, meaning you can download, modify, and deploy them on your own infrastructure with zero API costs.

What you actually lose when you switch

This is the section that matters. The price difference is real. So is the performance gap. Here's what you're trading:

Where the gap is small (switch without much pain)

Code generation and bug fixing. SWE-bench Verified at 80.6% vs 80.8% is functionally identical for most teams. If your agents are writing and fixing code, you won't notice the difference on routine work.
Web browsing and information retrieval. DeepSeek V4-Pro's 83.4% on BrowseComp actually beats Claude Opus 4.7. If your agents are scraping, summarizing, or navigating the web, you're giving up nothing.
Tool orchestration. MCP Atlas scores cluster within about 5.5 points across all models. For API-calling agents and tool-chain pipelines, the cheaper models keep up.
Multimodal tasks. MiMo 2.5 Pro is genuinely competitive with GPT-5.4 and Gemini 3.1 Pro on image, video, and audio understanding — and you're not paying a multimodal premium.

Where the gap is real (stay with premium if this is your core workload)

Hardest reasoning problems. On Humanity's Last Exam with tools, DeepSeek scores 48.2% vs Claude Opus 4.7's 54.7% — a meaningful 6.5-point gap. If your product depends on solving graduate-level math, physics, or biology problems, the premium models are still better.
Long-horizon autonomous coding. DeepSeek's 55.4% on SWE-bench Pro vs Claude Opus 4.7's 64.3% is a 9-point gap. For agents that independently navigate large codebases and complete multi-file refactors without human guidance, Claude Opus is still the stronger choice. Though MiMo 2.5 Pro at 57.2% narrows this somewhat.
Terminal-heavy agentic work. GPT-5.5's 82.7% on Terminal-Bench 2.0 is in a different league from DeepSeek's 67.9%. If your agents live in the terminal, OpenAI's model is objectively better.
Safety-critical decisions. Anthropic's alignment assessment found Opus 4.8 reaches "new highs" on prosocial behavior with misalignment rates "comparable to Claude Mythos Preview." If your use case involves medical, legal, or financial decisions where errors have real consequences, the alignment gap matters.

The strategy: don't pick one. Route.

Here's the thing nobody says out loud: you don't have to choose. The smartest teams I've talked to are running model routers — lightweight middleware that sends routine tasks to DeepSeek or MiMo and escalates hard problems to Claude Opus or GPT-5.5. The cost math is compelling:

80% of requests go to the $0.87 tier
15% hit Gemini 2.5 Pro at $10
5% need Claude Opus 4.8 at $25

Your blended output cost drops from $25–30 per million tokens to roughly $3.45 per million tokens (0.80 × $0.87 + 0.15 × $10 + 0.05 × $25). That's an 86–89% cost reduction with near-zero quality loss on the majority of requests.
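The blended figure can be recomputed directly from the mix above: with the listed shares and prices it works out to about $3.45 per million output tokens. Here is a minimal router sketch under loud assumptions: the `difficulty` label is hypothetical and would come from your own classifier or heuristics, each tier's share of requests is treated as its share of output tokens, and `Tier`, `route`, and `blended_output_cost` are illustrative names rather than any real library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    model: str
    output_usd_per_million: float
    share: float  # fraction of requests, assumed equal to fraction of output tokens


# The 80/15/5 mix and output prices from the list above.
TIERS = {
    "routine": Tier("DeepSeek V4-Pro or MiMo 2.5 Pro", 0.87, 0.80),
    "medium": Tier("Gemini 2.5 Pro", 10.00, 0.15),
    "hard": Tier("Claude Opus 4.8", 25.00, 0.05),
}


def route(difficulty: str) -> Tier:
    """Pick a tier from a difficulty label; unknown labels escalate to the strongest tier."""
    return TIERS.get(difficulty, TIERS["hard"])


def blended_output_cost(tiers=TIERS) -> float:
    """Share-weighted output price in USD per million tokens."""
    total_share = sum(t.share for t in tiers.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(t.share * t.output_usd_per_million for t in tiers.values())


blended = blended_output_cost()
print(f"blended output cost: ${blended:.3f} per 1M tokens")
for baseline in (25.00, 30.00):
    print(f"saving vs ${baseline:.0f}: {1 - blended / baseline:.1%}")
print(route("routine").model)
print(route("something-unrecognized").model)  # falls back to the premium tier
```

Escalating unknown labels to the premium tier is a deliberate choice in this sketch: a misrouted hard problem costs you quality, while a misrouted easy one only costs you a few cents.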

This is what open-source model availability actually enables. DeepSeek and Xiaomi both ship MIT-licensed weights. You can run them on your own GPUs — no API dependency, no rate limits, no surprise price hikes. MiniMax M2.7 ($0.30/$1.20) and Kimi K2.5 ($0.60/$2.50) offer similar economics. Four Chinese frontier models shipped in a 12-day window in early May, all under one-third of Opus 4.7's per-token cost. The supply side is only getting more competitive.

The American labs are betting on capability, not cost

OpenAI doubled GPT-5.5's output price to $30 per million tokens at launch. Anthropic kept Opus 4.7's rate card flat but shipped a new tokenizer that can produce up to 35% more tokens for the same input text, so your bill goes up even though the "price" didn't. Google's Gemini 2.5 Pro at $1.25/$10 is the closest American model to competitive pricing, and it's still about 11.5x more expensive than DeepSeek V4-Pro on output (8.6x on a 1M-in, 1M-out basis).
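The tokenizer effect is worth putting a number on. A rough sketch, assuming Opus 4.7 shares the $25 output rate listed for Opus 4.8 in the table (this article doesn't state Opus 4.7's rate) and taking the 35% figure as the worst case; `effective_price` is an illustrative helper.

```python
def effective_price(list_price_usd_per_million, token_inflation):
    """Price per million old-tokenizer tokens once the same text yields more tokens."""
    return list_price_usd_per_million * (1.0 + token_inflation)


OPUS_OUTPUT_RATE = 25.00     # assumption: same as the Opus 4.8 rate in the table
WORST_CASE_INFLATION = 0.35  # "up to 35% more tokens for the same input text"

for inflation in (0.0, 0.10, 0.20, WORST_CASE_INFLATION):
    price = effective_price(OPUS_OUTPUT_RATE, inflation)
    print(f"{inflation:>4.0%} more tokens -> ${price:.2f} per 1M old-tokenizer tokens")
```

At the worst case, a flat $25 rate card behaves like $33.75 for the same text, which is more than GPT-5.5's $30 list price.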

The strategy is clear: American labs are betting that enterprises will pay a premium for slightly better reasoning, stronger alignment guarantees, and the safety of a US-based vendor. DeepSeek and Xiaomi are betting that for the vast majority of workloads, "good enough at 1/30th the price" beats "slightly better at 30x the cost."

I think the Chinese labs are right about the direction, even if the premium models still hold the high ground on the hardest problems. The pricing pressure isn't a temporary promotion — it's a structural shift driven by architecture. When your KV cache is 10% the size of last year's model, your costs drop whether you want them to or not. And once those efficiency gains hit the API price, they don't go back up.

For most builders, the question isn't whether to switch. It's which workloads to switch first.

The token bill you pay today is someone else's architectural decision from two years ago. DeepSeek and Xiaomi just made a different one. Your CFO is about to notice.