Xiaomi just rolled out a turbocharged version of its flagship AI model. The company says it runs a trillion-parameter model at over 1,000 tokens per second, all on regular cloud GPUs. No expensive, custom chips needed.
Breakneck Speed
The new MiMo-V2.5-Pro-UltraSpeed sustains about 1,000 tokens per second and sometimes hits 1,200. Compare that to today’s top models: GPT-5.5 sits around 68 tokens per second, while Claude Opus 4.6 does 71. That makes Xiaomi’s model roughly 14 to 15 times faster, which is wild, especially considering it doesn’t rely on specialized hardware like Google’s TPUs or NVIDIA’s Blackwell chips.
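To make the comparison concrete, here is a small sketch that recomputes the speedup from the throughput figures quoted above. The 2,000-token response length is an illustrative assumption, not a number from Xiaomi.

```python
# Throughput figures quoted in the article, in tokens per second.
ULTRASPEED_TPS = 1000
BASELINES = {"GPT-5.5": 68, "Claude Opus 4.6": 71}

# Assumed response length, chosen only for illustration.
RESPONSE_TOKENS = 2000


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Return how many times faster fast_tps is than slow_tps."""
    return fast_tps / slow_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Return the time needed to generate `tokens` at a steady `tps`."""
    return tokens / tps


ratios = {name: speedup(ULTRASPEED_TPS, tps) for name, tps in BASELINES.items()}

for name, tps in BASELINES.items():
    wait = seconds_for(RESPONSE_TOKENS, tps)
    print(f"{name}: {ratios[name]:.1f}x slower, {wait:.1f} s per response")

print(f"UltraSpeed: {seconds_for(RESPONSE_TOKENS, ULTRASPEED_TPS):.1f} s per response")
```

At these rates a response that takes about half a minute on the slower models would arrive in about two seconds.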
How Did They Pull It Off?
Three main tricks make UltraSpeed tick:
- FP4 expert quantization: stores the expert weights in 4-bit floating point, which Xiaomi says comes without losing accuracy
- DFlash speculative decoding: drafts several tokens ahead at once, then has the full model check them
- TileRT runtime optimization: runtime-level tuning done with help from the folks at TileRT
Both the FP4-DFlash checkpoint (on Hugging Face) and TileRT modules (on GitHub) are open-source, so if you want to kick the tires yourself, you can.
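For readers new to speculative decoding, the general idea can be sketched with two toy models. This is a generic greedy draft-and-verify loop, not Xiaomi's DFlash algorithm, whose details the announcement does not describe. Both `target_model` and `draft_model` are hypothetical stand-ins invented for the example.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token


def target_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large, slow model."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    """Hypothetical cheap drafter: wrong whenever len(ctx) is a multiple of 4."""
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per new token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (tokens, number of verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(out) < goal:
        # 1. The drafter proposes k tokens one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal. A real system scores all
        #    positions in one batched forward pass; this loop simulates it.
        rounds += 1
        accepted: List[int] = []
        for token in proposal:
            expected = target(out + accepted)
            accepted.append(expected)  # keep the target's token either way
            if token != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], rounds


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 20)
fast, rounds = speculative_decode(target_model, draft_model, prompt, 20)
print(fast == baseline, rounds)
```

The output matches plain greedy decoding token for token, but the large model is consulted in far fewer rounds than there are new tokens. That gap is where the speedup comes from when the drafter is cheap and usually right.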
Under the Hood
UltraSpeed is basically MiMo-V2.5-Pro on rocket fuel. Here’s what the base model brings:
- 1.02 trillion parameters in total, with 42 billion active per token (that’s a Mixture-of-Experts setup)
- 1 million-token context window
- Hybrid-attention model, including Multi-Token Prediction
Normally, MiMo-V2.5-Pro runs at 60-80 tokens/second on heavy tasks, while the lighter MiMo-V2.5 base paces along at 100-150 tokens/second. UltraSpeed blows right past them both.
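A quick back-of-the-envelope sketch shows why the Mixture-of-Experts layout and 4-bit weights matter. The parameter counts come from the list above. The memory estimate assumes every weight is stored at the stated bit width and ignores scale factors, activations, and the KV cache, so treat it as a rough bound rather than a measured figure.

```python
TOTAL_PARAMS = 1.02e12   # total parameters, from the article
ACTIVE_PARAMS = 42e9     # parameters active per token, from the article


def weight_gigabytes(params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_param / 8 / 1e9


# Share of the model that does work for any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Assumption: all weights stored at one precision. The article only says
# the experts are quantized to FP4, so the real footprint will differ.
fp16_gb = weight_gigabytes(TOTAL_PARAMS, 16)
fp4_gb = weight_gigabytes(TOTAL_PARAMS, 4)

print(f"Active per token: {active_fraction:.1%}")
print(f"Weights at 16 bits: about {fp16_gb:,.0f} GB")
print(f"Weights at 4 bits:  about {fp4_gb:,.0f} GB")
```

Only about 4% of the parameters are touched per token, and dropping from 16 to 4 bits cuts weight storage to a quarter, which reduces how much data has to move through GPU memory for each generated token.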
Benchmark Scores
On real tasks, especially coding and agent jobs, MiMo-V2.5-Pro keeps pace with the very top models out there:
- SWE-bench Pro: 57.2% (Claude Opus 4.6 scores 57.3%, GPT-5.4 gets 57.7%)
- Terminal-Bench 2.0: 68.4 (Claude Opus 4.6 reaches 65.4)
- ClawEval: 64% pass rate, right in line with the competition
On ClawEval, it gets there while burning through 40–60% fewer tokens than Claude Opus 4.6, Gemini 3.1 Pro, or GPT-5.4.
Showcase Projects
Xiaomi put its model to the test with some challenging, autonomous builds:
- SysY Compiler in Rust: Built an entire compiler from scratch in just over 4 hours and passed all 233 test cases from Peking University
- Video Editor App: Generated 8,192 lines of code across 11.5 hours and 1,868 tool calls to build a full desktop application
- Analog Circuit Design: Optimized an FVF-LDO regulator in about an hour, hitting all six performance targets in a closed loop
The Trade-offs
Here’s where things get tricky:
- UltraSpeed costs about three times as much as the standard model: around $1.29 per million input tokens and $2.61 per million output tokens.
- There’s a limited public trial running June 9–23, 2026. Access is by application, and enterprises get top priority.
- Usage has tight restrictions: a maximum of 10 queued requests per account, 30-minute sessions, and a cutoff after 5 minutes of idling.
- UltraSpeed won’t work with Xiaomi’s Token Plan and there’s no special API pricing for the US or UK (at least not yet).
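To see what the quoted prices mean per request, here is a minimal cost sketch. The per-million-token rates are the ones listed above. The request size is an illustrative assumption, and the standard-model figure simply divides by three based on the "about three times" claim, so it is an approximation rather than a published price.

```python
# UltraSpeed rates quoted in the article, in USD per million tokens.
INPUT_USD_PER_M = 1.29
OUTPUT_USD_PER_M = 2.61

# "About three times" the standard model, per the article. Approximate.
ULTRASPEED_MULTIPLIER = 3.0


def request_cost(
    input_tokens: int,
    output_tokens: int,
    in_rate: float = INPUT_USD_PER_M,
    out_rate: float = OUTPUT_USD_PER_M,
) -> float:
    """Return the USD cost of one request at the given per-million rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate


# Hypothetical agent-style request: large context in, moderate output back.
ultra = request_cost(200_000, 50_000)
standard_estimate = ultra / ULTRASPEED_MULTIPLIER

print(f"UltraSpeed:          ${ultra:.4f}")
print(f"Standard (estimate): ${standard_estimate:.4f}")
```

For this assumed request the premium works out to roughly a quarter of a dollar, so whether it is worth paying depends on how much the lower latency is worth to the application.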
What People Are Watching
All these blazing-fast numbers? So far they’re Xiaomi’s own claims, and nobody independent has verified them yet. Still, since the open-source checkpoint is up, expect outside testers to jump on it fast.
One thing to watch: while the model crushes it in coding tasks, its performance in free-form conversation dips a bit. So, we’ll have to see how it holds up in everyday, real-world apps.
For anyone chasing ultra-low-latency AI in real-time use cases such as fraud detection, high-frequency trading, or instant translation, UltraSpeed could be a game changer, all without the headache of tracking down specialized chips.






