AI Platform — June 7, 2026
What does inference cost and what platform do I build on?
- routing
- vllm
- agentic
- saar
- open-source
- latency
The read
Token price is the new kWh, and the platform you ship on determines how fast you reach production. Jevons says falling inference cost drives more loops and heavier agents — track pricing, routing, and ship infrastructure moves that change what builders can afford to run.
What moved
-
vLLM Semantic Router v0.3 Themis ships SAAR stateful routing with RouterArena #1 ranking at $0.11/1K queries — vLLM Blog vLLM Semantic Router v0.3 Themis ships Session-Aware Agentic Routing (SAAR) as a production-ready feature that locks multi-turn agent sessions to a specific model during active tool loops and provider-state continuations, resetting only at safe idle or drift boundaries. The release ranks #1 on RouterArena with a 75.4 weighted score at a $0.11/1K queries cost point, adds 18 new signal families (PII detection, jailbreak, complexity, embedding, etc.), introduces a canonical v0.3 YAML config replacing fragmented layouts, and extends hardware support to AMD ROCm and Intel OpenVINO alongside NVIDIA. Builder angle: Builders running multi-turn agents can now delegate model-continuity logic to vLLM SR—SAAR prevents mid-session model switches during tool loops without custom routing code, while prefix-cache-aware switch pricing keeps costs visible.
-
DigitalOcean Inference Gateway ships prefix-aware routing live, with cached-token pricing coming soon — DigitalOcean Blog DigitalOcean’s Serverless Inference now routes requests to GPU instances already holding a shared system-prompt prefix in KV cache. At 1M daily requests where 70% share a common prefix, prefix-aware routing recovers ~34 GPU-hours/day; at 10M requests, ~340 GPU-hours/day—up to 4x effective compute cost reduction per request for prefix-heavy workloads. vLLM runtime optimizations on AMD Instinct MI325X and NVIDIA Hopper GPUs back the gains. Cached-token pricing (lower per-token cost on cache hits) is announced as launching on Serverless Inference within the next few weeks. Builder angle: Builders with high shared-system-prompt traffic on DigitalOcean Serverless Inference get immediate cache-hit routing; upcoming cached-token pricing will translate cache hits into direct per-token cost savings.
-
DeepSeek makes V4 Pro 75% discount permanent, undercutting GPT-5 and Claude Opus at $0.87/M output tokens — The Next Web DeepSeek locked its promotional 75% price cut on V4 Pro permanently on May 24, after initially scheduling it to expire May 31. New rates: $0.003625 input / $0.87 output per million tokens (down from $0.0145–$3.48). At these rates, V4 Pro with 1M-token context undercuts OpenAI GPT-5 ($2.50/$10 per M), Anthropic Claude Opus 4.7 ($5/$25), and Google Gemini 3.5 Flash ($0.15/$0.60 output). Cache-hit input pricing can drop further to $0.0036/M. Builder angle: The permanent cut makes DeepSeek V4 Pro a durable low-cost tier in inference routing tables—builders targeting sub-$1/M output with long-context (1M tokens) and frontier-class reasoning now have a persistent option rather than a promotional window.
Also tracking
- OpenAI updates GPT-Rosalind with GPT-5.5 tool use and 31% token efficiency gains on life-science benchmarks — source — Domain-specific life-sciences model update; 31% fewer tokens than GPT-5.5 on GeneBench with higher accuracy—cost signal for builders in biotech/pharma verticals but no general API pricing change.