AI Platform — June 9, 2026

What does inference cost and what platform do I build on?

8 Jun, 2026

routing
latency
throughput
cerebras
kimi
benchmarks

The read

Token price is the new kWh, and the platform you ship on determines how fast you reach production. By the Jevons paradox, falling inference cost drives more loops and heavier agents, so track the pricing, routing, and infrastructure moves that change what builders can afford to run.

What moved

Cerebras positions Kimi K2.6 at 981 tok/s output, 5.4× faster than Gemini 3.5 Flash with half the TTFT (Cerebras Blog)
Cerebras published an Artificial Analysis-backed benchmark showing that Kimi K2.6 on Cerebras hardware achieves 981 output tok/s versus Gemini 3.5 Flash's 181 tok/s (5.4×), a TTFT of 452ms vs 960ms, and end-to-end task completion in 5.6s vs 17.5s. Artificial Analysis quality scores are comparable (53.9 vs 55.3). The post emphasizes that Kimi K2.6 is open-weight and fine-tunable, in contrast to Gemini's closed API.
Builder angle: Latency-sensitive or high-throughput workloads have a concrete routing signal: send open-weight Kimi K2.6 traffic to Cerebras to hit sub-500ms TTFT and roughly 1,000 tok/s at quality similar to Gemini 3.5 Flash.
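The routing signal above can be turned into a rough estimate. The sketch below is a minimal illustration, assuming that response latency is approximately TTFT plus output tokens divided by output throughput. The endpoint labels, the `Endpoint` class, and the `pick_endpoint` helper are hypothetical names for illustration, not real model IDs or any provider's API. Only the TTFT and tok/s figures come from the benchmark cited above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    name: str                # hypothetical label, not a real model ID
    ttft_s: float            # time to first token, in seconds
    output_tok_per_s: float  # sustained output throughput


def estimated_latency_s(ep: Endpoint, output_tokens: int) -> float:
    """Assumed model: latency ~= TTFT + output_tokens / throughput."""
    return ep.ttft_s + output_tokens / ep.output_tok_per_s


def pick_endpoint(
    endpoints: Sequence[Endpoint],
    output_tokens: int,
    ttft_budget_s: Optional[float] = None,
) -> Endpoint:
    """Return the endpoint with the lowest estimated latency,
    optionally dropping any endpoint whose TTFT exceeds the budget."""
    candidates = [
        ep for ep in endpoints
        if ttft_budget_s is None or ep.ttft_s <= ttft_budget_s
    ]
    if not candidates:
        raise ValueError("no endpoint meets the TTFT budget")
    return min(candidates, key=lambda ep: estimated_latency_s(ep, output_tokens))


# Figures from the benchmark summarized above.
ENDPOINTS = [
    Endpoint("kimi-k2.6@cerebras", ttft_s=0.452, output_tok_per_s=981.0),
    Endpoint("gemini-3.5-flash", ttft_s=0.960, output_tok_per_s=181.0),
]

if __name__ == "__main__":
    for ep in ENDPOINTS:
        print(f"{ep.name}: {estimated_latency_s(ep, 500):.2f}s for 500 output tokens")
    print("route to:", pick_endpoint(ENDPOINTS, 500, ttft_budget_s=0.5).name)
```

This estimate ignores queueing, network overhead, and any reasoning tokens generated before the visible answer, so it will not reproduce the 5.6s vs 17.5s end-to-end figures. Treat it as a first-pass filter before measuring on your own traffic.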
Google Gemini 2.0 Flash permanently shut down June 1; builders must migrate to Gemini 3.x, in most cases at substantially higher prices (Google AI Developer Docs)
Google's pricing page (updated 2026-06-02) confirms that gemini-2.0-flash-001 and gemini-2.0-flash-lite-001 were shut down on June 1, 2026, with no further access. Migration options: Gemini 3.5 Flash at $1.50/$9.00 per 1M input/output tokens (Standard tier), or the budget Gemini 3.1 Flash-Lite at $0.25/$1.50. Google also introduced three inference tiers: Flex (lower cost, batch-speed SLA), Standard, and Priority (80% premium). A new context caching fee structure runs from $0.025 to $8.10 per 1M tokens per hour, depending on model.
Builder angle: Any production call to a gemini-2.0-flash model ID now hits a dead endpoint. Migrating to Gemini 3.1 Flash-Lite preserves budget (pricing comparable to 2.0 Flash), while Gemini 3.5 Flash output is about 6× pricier than Flash-Lite ($9.00 vs $1.50 per 1M tokens), so evaluate whether the quality uplift justifies it.
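To make the migration trade-off concrete, here is a minimal cost sketch using only the Standard-tier prices quoted above. The function name and the `PRICES_PER_1M` table are hypothetical. The sketch also assumes that the 80% Priority premium applies uniformly to input and output tokens, which the summary above does not confirm. Flex pricing and context caching fees are left out because no per-model figures are given here.

```python
# Standard-tier prices in USD per 1M tokens: (input, output), as quoted above.
PRICES_PER_1M = {
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
}

# Assumption: the 80% Priority premium is a flat multiplier on the whole bill.
PRIORITY_MULTIPLIER = 1.80


def monthly_cost_usd(
    model: str,
    input_tokens: int,
    output_tokens: int,
    priority: bool = False,
) -> float:
    """Estimate token spend for one month of traffic on a given model."""
    in_price, out_price = PRICES_PER_1M[model]
    cost = (input_tokens / 1_000_000) * in_price \
        + (output_tokens / 1_000_000) * out_price
    return cost * PRIORITY_MULTIPLIER if priority else cost


if __name__ == "__main__":
    # Hypothetical workload: 200M input and 50M output tokens per month.
    for model in PRICES_PER_1M:
        cost = monthly_cost_usd(model, 200_000_000, 50_000_000)
        print(f"{model}: ${cost:,.2f}/month")
```

For that hypothetical workload the sketch gives $750 per month on Gemini 3.5 Flash versus $125 on Gemini 3.1 Flash-Lite. Because both the input and the output price differ by the same factor, the 6× gap holds for any input/output mix.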
Databricks rolls out Instructed-Retriever-1 to all customers: FP8 + speculative decoding cuts search latency 3× (Databricks Blog)
Databricks' Knowledge Assistant now uses Instructed-Retriever-1, a MoE model served with FP8 quantization (NVIDIA ModelOpt, zero measured quality degradation) and speculative decoding, which contributes a speed-up of more than 30%. Production results: 3× faster search, 2× faster answer generation, TTFT of about 2s, and end-to-end latency below 10s. The model handles query generation and reranking in parallel via test-time scaling.
Builder angle: This is the first production-validated data point for FP8 and speculative decoding combined on a MoE serving stack, and a cost/latency template builders can reference when evaluating similar optimizations for their own self-hosted or Databricks-hosted inference pipelines.

Also tracking

Albireo (arXiv): 2× throughput and 54% lower energy vs vLLM by eliminating non-scalable scheduling overhead. Research paper (June 1, 2026) reporting 2× throughput, 48% lower latency, and 54% lower energy than vLLM on the same hardware. No deployable release yet; tracking as a pre-deployment signal for self-hosted inference operators.