AI Platform — June 10, 2026
What does inference cost and what platform do I build on?
- pricing
- china
- deepseek
- tencent-cloud
- xiaomi
- routing
The read
Token price is the new kWh, and the platform you ship on determines how fast you reach production. The Jevons paradox says falling inference cost drives more loops and heavier agents, so track the pricing, routing, and infrastructure moves that change what builders can afford to run.
What moved
- DeepSeek V4 pricing triggers China-wide AI API price war: Tencent Cloud cuts DeepSeek-V4 hosting 97.5%, Xiaomi cuts MiMo-V2.5 99% (South China Morning Post)
  DeepSeek’s aggressive V4 API pricing has forced Chinese rivals to slash costs. Tencent Cloud cut prices on its hosted DeepSeek-V4 series by ~97.5%, effective June 2, fully matching DeepSeek’s official rates with no cloud premium. Xiaomi cut MiMo-V2.5 API pricing by up to 99%, which pushed it to #6 on OpenRouter with 1.7 trillion tokens/week processed (up >999% week-over-week). MiniMax instead launched a hybrid token + subscription model for M3, with subscription tiers from $7.24 to $69.28/month.
  Builder angle: If you route any workload to Chinese-hosted open models, per-token costs for MiMo-V2.5 just dropped by up to 99%, and Tencent Cloud’s DeepSeek-V4 hosting now matches DeepSeek’s direct API price. Both are now viable cheap-tier options in cost-based routing tables.
- Google’s GKE Inference Gateway cuts time-to-first-token 92.8% via prefix-cache-aware routing (Google Cloud Blog)
  An independent Principled Technologies benchmark compared GKE Inference Gateway against a managed-Kubernetes baseline serving Llama 3.1 8B on identical 8x A100 hardware. The gateway uses prefix-cache-aware routing to send each request to the pod already holding the matching KV-cache prefix instead of assigning pods round-robin. It delivered 15.7% higher output token throughput (7,169 vs 6,042 tok/s), 92.8% lower time-to-first-token (188ms vs 2,625ms), and 62.6% lower inter-token latency (30.2ms vs 81ms). Snap reports 75-80% prefix cache hit rates running this in production via the open-source llm-d stack.
  Builder angle: Routing on KV-cache locality instead of round-robin is a direct GPU-count and latency lever for RAG and long-system-prompt workloads: the same throughput on fewer accelerators, with TTFT improvements that matter for interactive agents.
- Xiaomi’s MiMo-V2.5-Pro-UltraSpeed hits 1,000+ tok/s on a 1T-parameter model via FP4 + DFlash speculative decoding: 3x price for 10x output (Xiaomi MiMo Blog)
  Xiaomi and TileRT combined selective FP4 quantization on MoE expert weights with ‘DFlash’ block-level speculative decoding (avg. accepted-token length 6.30 in coding, 5.56 in math, 4.29 in agent tasks) to decode a 1-trillion-parameter model at over 1,000 tok/s (up to ~1,200) on a single standard 8-GPU node, about 10x the standard MiMo-V2.5-Pro throughput. The UltraSpeed API tier is priced at 3x the standard MiMo-V2.5-Pro rate and is available via a limited application-based trial running June 9-23, 2026.
  Builder angle: A 3x price for ~10x tokens/sec changes the cost-per-completed-task math for latency-bound agentic and trading workloads previously bottlenecked on trillion-parameter decode speed. It is worth benchmarking against smaller models during the trial window.
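To make the cost-based routing table from the DeepSeek / Tencent Cloud / Xiaomi item concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Route` fields, the tier labels, the provider identifiers, and above all the per-million-token prices, which are placeholders and not published rates. Swap in the real price sheets before drawing any conclusion from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str            # hypothetical provider identifier
    model: str
    usd_per_m_input: float   # PLACEHOLDER price, not a published rate
    usd_per_m_output: float  # PLACEHOLDER price, not a published rate
    tier: str                # hypothetical label: "cheap" or "premium"


def request_cost(route, input_tokens, output_tokens):
    """Estimated USD cost of one request on a given route."""
    total = (input_tokens * route.usd_per_m_input
             + output_tokens * route.usd_per_m_output)
    return total / 1_000_000


def cheapest_route(routes, input_tokens, output_tokens, allowed_tiers):
    """Pick the lowest-cost route among the tiers the workload may use."""
    candidates = [r for r in routes if r.tier in allowed_tiers]
    if not candidates:
        raise ValueError("no route matches the allowed tiers")
    return min(candidates,
               key=lambda r: request_cost(r, input_tokens, output_tokens))


# Illustrative table only. Equal prices on the two DeepSeek-V4 rows mirror the
# "no cloud premium" claim; the numbers themselves are made up.
ROUTES = [
    Route("deepseek-direct", "DeepSeek-V4", 1.0, 2.0, "cheap"),
    Route("tencent-cloud", "DeepSeek-V4", 1.0, 2.0, "cheap"),
    Route("xiaomi", "MiMo-V2.5", 0.5, 1.5, "cheap"),
    Route("premium-provider", "frontier-model", 10.0, 30.0, "premium"),
]
```

With these placeholder numbers, `cheapest_route(ROUTES, 1000, 500, {"cheap"})` selects the `xiaomi` row. The practical point is that a price cut only changes routing once the table is refreshed, so the price columns are worth loading from config, not hard-coding.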
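The routing idea in the GKE Inference Gateway item can be illustrated with a toy router. This is not Google's or llm-d's implementation. It is a sketch that assumes a fixed-length character prefix can stand in for a KV-cache prefix, and it ignores pod load, cache eviction, and capacity. The class and method names are hypothetical.

```python
import hashlib
from itertools import cycle


class PrefixAwareRouter:
    """Toy router: send prompts that share a prefix to the same pod."""

    def __init__(self, pods, prefix_chars=64):
        self.pods = list(pods)
        self.prefix_chars = prefix_chars   # assumed stand-in for a KV prefix
        self._fallback = cycle(self.pods)  # round-robin for unseen prefixes
        self._owner = {}                   # prefix hash -> pod holding it
        self.requests = 0
        self.hits = 0

    def _prefix_key(self, prompt):
        prefix = prompt[: self.prefix_chars]
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def route(self, prompt):
        """Return the pod that should serve this prompt."""
        self.requests += 1
        key = self._prefix_key(prompt)
        pod = self._owner.get(key)
        if pod is not None:
            self.hits += 1                 # prefix is already warm on this pod
            return pod
        pod = next(self._fallback)         # cold prefix: fall back to round-robin
        self._owner[key] = pod
        return pod

    def hit_rate(self):
        """Fraction of requests routed to a pod that already held the prefix."""
        return self.hits / self.requests if self.requests else 0.0
```

Two requests that share a long system prompt land on the same pod, so the second one can reuse that pod's cached prefix, whereas plain round-robin would spread them across pods. `hit_rate()` is the toy analogue of the prefix cache hit rate Snap reports.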
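The 3x-price-for-~10x-speed tradeoff in the UltraSpeed item is easy to work through numerically. The sketch below assumes a placeholder standard price of $1 per million output tokens and a standard decode speed of 100 tok/s, inferred only from "about 10x" relative to 1,000 tok/s. Neither is a published figure, and the calculation ignores input-token cost, queueing, and time-to-first-token.

```python
def task_profile(output_tokens, usd_per_m_output, tokens_per_sec):
    """Return (cost in USD, decode time in seconds) for one task."""
    cost = output_tokens * usd_per_m_output / 1_000_000
    seconds = output_tokens / tokens_per_sec
    return cost, seconds


def usd_per_second_saved(standard, fast):
    """Extra dollars paid for each second of decode time removed."""
    extra_cost = fast[0] - standard[0]
    seconds_saved = standard[1] - fast[1]
    if seconds_saved <= 0:
        raise ValueError("the fast tier must be faster than the standard tier")
    return extra_cost / seconds_saved


STANDARD_PRICE = 1.0     # PLACEHOLDER: USD per million output tokens
STANDARD_SPEED = 100.0   # ASSUMED: tok/s, derived from "about 10x"
PRICE_MULTIPLIER = 3.0   # from the item: UltraSpeed is priced at 3x
ULTRA_SPEED = 1000.0     # from the item: over 1,000 tok/s

TASK_TOKENS = 20_000     # hypothetical agent task size

standard = task_profile(TASK_TOKENS, STANDARD_PRICE, STANDARD_SPEED)
ultra = task_profile(TASK_TOKENS, STANDARD_PRICE * PRICE_MULTIPLIER, ULTRA_SPEED)
premium_per_second = usd_per_second_saved(standard, ultra)
```

Under these assumptions the 20,000-token task costs three times as much on the fast tier but finishes decoding in 20 seconds instead of 200. `premium_per_second` turns that into a single number to compare against what a second of latency is worth to the workload.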
Also tracking
- Salesforce ships Agentforce Mobile SDK to GA and opens ADL Connect API beta for scriptable RAG data libraries — source — RAG grounding data for Salesforce agents can now be created, uploaded, and promoted through a scriptable REST API as part of CI/CD, while the same agent ships into native iOS/Android/React Native apps via a GA SDK.
- Vercel ships Drives for Sandbox in private beta for persistent storage across disposable AI agent sandboxes — source — Removes the re-provisioning cost of disposable agent sandboxes by letting builders persist a coding agent’s workspace (repo clone, deps, build cache) across runs instead of rebuilding it from scratch each time.