Models

China’s open-weights surge: four frontier-class models in twelve days

In April, four Chinese labs — Z.ai, MiniMax, Moonshot, and DeepSeek — released open-weights models that land near the Western frontier on agentic engineering benchmarks, at less than a third of the inference cost of the closed models they are chasing. For builders, the “self-hosted is a downgrade” assumption no longer holds.

The four models, briefly

DeepSeek V4

The strongest all-rounder of the four. Mixture-of-Experts, strong on coding and math reasoning, with a permissive license that allows commercial use. The community has already produced quantized variants that run on a single high-end consumer GPU at usable speed.
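
If you want to try one of those quantized variants locally, the sketch below loads a GGUF file with llama-cpp-python. The file name is hypothetical; use whichever community quantization you actually downloaded. This is an illustrative assumption, not an official recipe.

```python
# Sketch: loading a community GGUF quantization of DeepSeek V4 with
# llama-cpp-python. The file name is hypothetical -- substitute the
# quant you actually downloaded, and check its license first.
from llama_cpp import Llama

llm = Llama(
    model_path="./deepseek-v4-Q4_K_M.gguf",  # hypothetical quantized file
    n_gpu_layers=-1,  # offload every layer to the GPU; lower if VRAM is tight
    n_ctx=32768,      # context window; raise if memory allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```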

Z.ai GLM-5.1

The best of the four for tool-use and structured output, in our testing. Notable for unusually clean function calling and a long-context retrieval profile that holds up past 200k tokens. If you are building agents that call APIs, this is the one to evaluate first.
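
To see the function-calling behavior for yourself, the sketch below sends a tool-enabled request through an OpenAI-compatible client pointed at a self-hosted server (vLLM and similar servers expose this interface). The base_url, model name, and get_ticket tool are placeholder assumptions.

```python
# Sketch: exercising GLM-5.1's function calling through an OpenAI-compatible
# endpoint. The endpoint, model name, and tool are illustrative, not official.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket",  # hypothetical tool
        "description": "Fetch a support ticket by ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-5.1",  # whatever name your server registers
    messages=[{"role": "user", "content": "Pull up ticket 4821 for me."}],
    tools=tools,
)

# A model with clean function calling should return a structured tool call
# here rather than free text describing one.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```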

Moonshot Kimi K2.6

The strongest of the four on Chinese-language tasks, and the closest thing to a “Claude-style” conversational model in the open-weights tier. Context window up to a million tokens. Slower than the others on routine queries but unusually patient on complex inputs.

MiniMax M2.7

The video and multimodal specialist. Four of the top five video models by Elo are now Chinese-built; M2.7 is the open-weights entry. Useful for teams who need on-prem video understanding without sending frames to a hosted API.
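
A minimal on-prem pipeline might look like the sketch below: sample frames with OpenCV and pass them to a locally served M2.7, assuming your server accepts the OpenAI-compatible vision message format. The endpoint and model name are placeholders.

```python
# Sketch: on-prem video QA against a locally served M2.7, assuming the
# server speaks the OpenAI-compatible vision format. Endpoint and model
# name are placeholders.
import base64
import cv2
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def sample_frames(path: str, every_n: int = 30) -> list[str]:
    """Grab every Nth frame and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok2, buf = cv2.imencode(".jpg", frame)
            if ok2:
                frames.append(base64.b64encode(buf.tobytes()).decode())
        i += 1
    cap.release()
    return frames

content = [{"type": "text", "text": "What happens in this clip?"}]
for b64 in sample_frames("clip.mp4")[:8]:  # cap the frame count
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

resp = client.chat.completions.create(
    model="minimax-m2.7",  # placeholder name
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)
```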

Why this matters for builders outside China

Three concrete shifts:

  1. Cost. Frontier-class inference at less than a third of the price, with quantized variants that run on a single high-end consumer GPU.
  2. Control. With permissive licenses and competitive quality, self-hosting is no longer a downgrade; you can run commercial workloads on your own hardware.
  3. Data boundaries. Workloads that could never leave your network, like on-prem video understanding, can now use near-frontier models without sending anything to a hosted API.

The risks people downplay

License terms vary across the four and have changed at least once each in the last six months; read the actual license file before committing. Supply-chain provenance for weights is harder to verify than for closed models; teams in regulated industries should verify downloaded weights against the checksums and signatures published by official sources. And although these models match the frontier on standard benchmarks, the long tail of unusual prompts is still where the closed models pull ahead.
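
Checksum verification is a few lines of standard-library Python. The sketch below assumes the official release page publishes a SHA-256 digest per weights file; the path and digest shown are placeholders.

```python
# Sketch: verifying a downloaded weights file against a checksum published
# by the official source. The path and expected digest are placeholders.
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

expected = "replace-with-the-published-digest"  # from the official release page
actual = sha256_of("model-00001-of-00009.safetensors")
if actual != expected:
    raise SystemExit(f"hash mismatch: got {actual}")
print("weights verified")
```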

How to evaluate one this week

  1. Pull a representative sample of your real workload: at least 100 prompts, each paired with the answer you wish you had gotten.
  2. Run them through your current model and the open-weights candidate. Score the outputs blind, with model identities hidden.
  3. Look at the disagreements, not the averages. The averages will be close. The interesting question is: when one model wins, why?
  4. If the open-weights model wins or ties on more than 80% of the workload, route those cases to it and keep the frontier model for the rest (a minimal harness for steps 2 and 3 is sketched below).
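
Here is a minimal harness for steps 2 and 3, assuming both models sit behind one OpenAI-compatible endpoint and your workload lives in a JSONL file of prompt/reference pairs. The model names, endpoint, and naive containment scorer are all placeholder assumptions; swap in a real rubric or a human judge before trusting the numbers.

```python
# Sketch of steps 2 and 3: run both models over the sample, score each
# output against the reference, and collect the disagreements.
# Endpoint, model names, and the scoring rule are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODELS = ["current-frontier-model", "open-weights-candidate"]  # placeholders

def answer(model: str, prompt: str) -> str:
    r = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return r.choices[0].message.content

def score(output: str, reference: str) -> int:
    # Naive containment check, a stand-in for a real rubric. For human
    # scoring, present outputs in random order with model names hidden.
    return int(reference.lower() in output.lower())

with open("workload.jsonl") as f:  # one {"prompt": ..., "reference": ...} per line
    cases = [json.loads(line) for line in f]

wins = {m: 0 for m in MODELS}
disagreements = []
for case in cases:
    scores = {m: score(answer(m, case["prompt"]), case["reference"]) for m in MODELS}
    for m, s in scores.items():
        wins[m] += s
    if len(set(scores.values())) > 1:  # the models disagree; worth a human look
        disagreements.append((case["prompt"], scores))

for m in MODELS:
    print(f"{m}: {wins[m]}/{len(cases)} scored correct")
print(f"{len(disagreements)} disagreements to review by hand")
```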