Xiaomi Unveils MiMo-V2.5-Pro-UltraSpeed: AI Model Hits 1,000 Tokens Per Second on Standard GPUs

Xiaomi just rolled out a turbocharged version of its flagship AI model that, by the company’s account, runs a trillion-parameter model at over 1,000 tokens per second on regular cloud GPUs. No expensive, custom chips needed.

Breakneck Speed

The new MiMo-V2.5-Pro-UltraSpeed clocks a steady 1,000 tokens per second and sometimes hits 1,200. Compare that to today’s top models: GPT-5.5 sticks around 68 tokens per second, while Claude Opus 4.6 does 71. That makes Xiaomi’s model roughly 14 to 15 times faster, which is wild, especially considering that it reportedly doesn’t rely on specialized hardware like Google’s TPUs or NVIDIA’s Blackwell chips.
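
For a quick sanity check on that multiple, the arithmetic can be sketched in a few lines of Python. The throughput figures are the ones quoted above; the 2,000-token answer length is a made-up example, and the timing ignores prompt processing and network overhead.

```python
# Throughput figures quoted in the article, in tokens per second.
ULTRASPEED_TPS = 1_000
BASELINES_TPS = {"GPT-5.5": 68, "Claude Opus 4.6": 71}


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Return how many times faster fast_tps is than slow_tps."""
    return fast_tps / slow_tps


def seconds_to_generate(num_tokens: int, tps: float) -> float:
    """Time to emit num_tokens at a constant decode rate (ignores prefill)."""
    return num_tokens / tps


if __name__ == "__main__":
    for name, tps in BASELINES_TPS.items():
        ratio = speedup(ULTRASPEED_TPS, tps)
        print(f"UltraSpeed vs {name}: {ratio:.1f}x")
    # Hypothetical 2,000-token answer, chosen only for illustration.
    for label, tps in [("UltraSpeed", ULTRASPEED_TPS), ("GPT-5.5", 68)]:
        print(f"{label}: {seconds_to_generate(2_000, tps):.1f} s for 2,000 tokens")
```

On the quoted numbers, the ratio works out to about 14.7x against GPT-5.5 and about 14.1x against Claude Opus 4.6.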

How Did They Pull It Off?

Three main tricks make UltraSpeed tick:

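  • FP4 expert quantization: stores the expert weights in 4-bit floating point, shrinking memory traffic, reportedly without losing accuracy
  • DFlash speculative decoding: drafts several tokens ahead at once, then has the full model verify them in a single pass
  • TileRT runtime optimization: low-level runtime tuning done with help from the folks at TileRT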
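
To see why speculative decoding saves time, here is a toy sketch of the generic greedy variant. It is not DFlash itself, whose internals the article does not describe. The two "models" are hypothetical stand-in functions over integer tokens: a cheap draft guesses `k` tokens, the expensive target checks them, and every correct guess is a token the target did not have to produce one step at a time.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(seq: List[int]) -> int:
    """Toy stand-in for the large model: a fixed function of the last token."""
    return (seq[-1] * 3 + 1) % 10


def draft_next(seq: List[int]) -> int:
    """Toy stand-in for the draft model: wrong whenever the last token is 3."""
    return 9 if seq[-1] == 3 else target_next(seq)


def greedy_decode(prompt: List[int], num_new: int, target: NextToken) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(num_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    prompt: List[int], num_new: int, k: int, target: NextToken, draft: NextToken
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, verification passes)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < num_new:
        # 1. The draft model cheaply proposes k tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. One verification pass. A real target model scores all k positions
        #    in a single batched forward pass; the loop here only simulates it.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            accepted.append(target(seq + accepted))  # bonus token: all guesses held
        seq.extend(accepted)
    return seq[: len(prompt) + num_new], passes


if __name__ == "__main__":
    out, passes = speculative_decode([1], 12, 4, target_next, draft_next)
    print(out == greedy_decode([1], 12, target_next), passes)
```

The output matches plain greedy decoding token for token, but in this toy run the target is consulted in 4 verification passes instead of 12 sequential steps. How much a real system gains depends on how often the draft's guesses are accepted.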

Both the FP4-DFlash checkpoint (on Hugging Face) and TileRT modules (on GitHub) are open-source, so if you want to kick the tires yourself, you can.
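
For a feel of what FP4 quantization does to a weight, here is a minimal sketch. It assumes the common E2M1 4-bit layout, whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4 and 6, plus one shared scale per block of weights. Whether Xiaomi’s checkpoint uses exactly this layout, block size, or rounding rule is an assumption, not something the article states.

```python
from typing import List, Tuple

# Magnitudes representable in the E2M1 4-bit float layout (sign stored separately).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(weights: List[float]) -> Tuple[float, List[float]]:
    """Round one block of weights to the FP4 grid; returns (scale, grid values)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest weight onto the largest FP4 magnitude (6.0).
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for w in weights:
        scaled = abs(w) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - scaled))
        codes.append(nearest if w >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale: float, codes: List[float]) -> List[float]:
    """Recover approximate weights from the shared scale and FP4 grid values."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    block = [0.75, -0.1875, 0.375, -0.6]  # hypothetical weights for illustration
    scale, codes = quantize_block(block)
    print(scale, codes)
    print(dequantize_block(scale, codes))
```

In this example three of the four weights land exactly on the grid, while -0.6 is rounded to -0.5. That rounding error is the price of storing each weight in 4 bits, and keeping it from hurting accuracy is the hard part.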

Under the Hood

UltraSpeed is basically MiMo-V2.5-Pro on rocket fuel. Here’s what the base model brings:

  • 1.02 trillion parameters in total, with 42 billion active for any given token (that’s a Mixture-of-Experts setup)
  • 1 million-token context window
  • Hybrid-attention architecture, plus Multi-Token Prediction
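
Those parameter counts explain why quantization matters so much here. A rough back-of-the-envelope sketch follows. It assumes every weight is stored at the same bit width and ignores scale factors, activations, and the KV cache, so the FP4 figure is a lower bound (the article says only the experts are quantized to FP4).

```python
# Parameter counts quoted in the article.
TOTAL_PARAMS = 1.02e12
ACTIVE_PARAMS = 42e9


def active_fraction(total: float, active: float) -> float:
    """Share of the parameters that take part in computing one token."""
    return active / total


def weight_terabytes(num_params: float, bits_per_param: int) -> float:
    """Storage for the weights alone, in terabytes (1 TB = 1e12 bytes)."""
    return num_params * bits_per_param / 8 / 1e12


if __name__ == "__main__":
    print(f"Active per token: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_terabytes(TOTAL_PARAMS, bits):.2f} TB")
```

Under these assumptions, only about 4.1% of the parameters are active per token, and the weights shrink from roughly 2.04 TB at 16 bits to roughly 0.51 TB at 4 bits.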

Normally, MiMo-V2.5-Pro runs at 60-80 tokens/second for heavy tasks, while the lighter MiMo-V2.5 base paces along at 100-150 tokens/second. At 1,000 tokens/second, UltraSpeed is roughly 12 to 17 times faster than the former and 7 to 10 times faster than the latter.

Benchmark Scores

On real tasks, especially coding or agent jobs, MiMo-V2.5-Pro keeps pace with the very top models out there:

  • SWE-bench Pro: 57.2% (Claude Opus 4.6 scores 57.3%, GPT-5.4 gets 57.7%)
  • Terminal-Bench 2.0: 68.4 (Claude Opus 4.6 reaches 65.4)
  • ClawEval: 64% pass rate, right in line with the competition

And it reaches that ClawEval score while using 40–60% fewer tokens than Claude Opus 4.6, Gemini 3.1 Pro, or GPT-5.4.

Showcase Projects

Xiaomi put its model to the test with some challenging, autonomous builds:

  • SysY Compiler in Rust: Built an entire compiler from scratch in just over 4 hours and passed all 233 test cases from Peking University
  • Video Editor App: Built a full desktop application, generating 8,192 lines of code over 11.5 hours and 1,868 tool calls
  • Analog Circuit Design: Optimized an FVF-LDO regulator in about an hour, hitting all six performance targets in a closed loop

The Trade-offs

Here’s where things get tricky:

  • UltraSpeed costs about 3 times as much as the standard model: around $1.29 per million input tokens and $2.61 per million output tokens.
  • There’s a limited public trial running June 9–23, 2026. Access requires an application, and enterprises get top priority.
  • Daily usage has tight restrictions: at most 10 queued requests per account, 30-minute sessions, and a cutoff after 5 minutes of idling.
  • UltraSpeed won’t work with Xiaomi’s Token Plan, and there’s no special API pricing for the US or UK (at least not yet).
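
To put the pricing and session limits in perspective, here is a small estimator. The prices and the 30-minute session cap come from the list above; the example job size is hypothetical, and the per-session ceiling assumes a single stream decoding at a sustained 1,000 tokens per second with no pauses, which real workloads won’t hit.

```python
# Prices quoted in the article, in US dollars per million tokens.
INPUT_USD_PER_M = 1.29
OUTPUT_USD_PER_M = 2.61
SESSION_MINUTES = 30


def job_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one job at the quoted UltraSpeed prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def max_output_tokens_per_session(tps: float, minutes: int = SESSION_MINUTES) -> int:
    """Upper bound on tokens one stream can emit in a session at a steady rate."""
    return int(tps * minutes * 60)


if __name__ == "__main__":
    # Hypothetical job: 200,000 input tokens and 50,000 output tokens.
    print(f"Example job: ${job_cost(200_000, 50_000):.4f}")
    ceiling = max_output_tokens_per_session(1_000)
    print(f"Session ceiling: {ceiling:,} tokens, ${job_cost(0, ceiling):.2f} in output")
```

The example job comes to about $0.39, and a full 30-minute session at top speed would cap out at 1.8 million output tokens, or roughly $4.70 in output charges.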

What People Are Watching

All these blazing-fast numbers? They’re Xiaomi’s own claims so far, and nobody independent has verified them yet. Still, since the open-source checkpoint is up, expect outside testers to jump on it fast.

One thing to watch: while the model crushes it in coding tasks, its performance in free-form conversation dips a bit. So, we’ll have to see how it holds up in everyday, real-world apps.

Still, for anyone chasing ultra-low-latency AI in real-time use cases (think fraud detection, high-frequency trading, or instant translation), UltraSpeed could be a game changer, all without the headache of tracking down specialized chips.
