# Builder's Daily — AI Platform

> Rolling 14-day signal for beat `ai-platform`. Ephemeral context — not evergreen corpus.
> Author: Amit Kumar Agrawal | https://artificialcuriositylabs.ai
> Generated: 2026-06-10
> Human index: https://artificialcuriositylabs.ai/daily/ai-platform/
> RSS: https://artificialcuriositylabs.ai/daily/ai-platform/rss.xml

---

# AI Platform — June 10, 2026

**URL:** https://artificialcuriositylabs.ai/daily/ai-platform/2026-06-10/
**Beat:** ai-platform
**Date:** 2026-06-10
**Topics:** pricing, china, deepseek, tencent-cloud, xiaomi, routing
**Summary:** DeepSeek V4 pricing triggers China-wide AI API price war — Tencent Cloud cuts DeepSeek-V4 hosting 97.5%, Xiaomi cuts MiMo-V2.5 99%; Google's GKE Inferen…

## The read

Token price is the new kWh, and the platform you ship on determines how fast you reach production. Jevons says falling inference cost drives more loops and heavier agents — track pricing, routing, and ship infrastructure moves that change what builders can afford to run.

## What moved

- **DeepSeek V4 pricing triggers China-wide AI API price war — Tencent Cloud cuts DeepSeek-V4 hosting 97.5%, Xiaomi cuts MiMo-V2.5 99%** — [South China Morning Post](https://www.scmp.com/tech/article/3356138/deepseek-v4-forces-rivals-slash-prices-rattling-chinas-cloud-providers)
  DeepSeek's aggressive V4 API pricing has forced Chinese rivals to slash costs. Tencent Cloud cut its hosted DeepSeek-V4 series prices by ~97.5% to fully match DeepSeek's official rates with no cloud premium (effective June 2). Xiaomi cut MiMo-V2.5 API pricing by up to 99%, which pushed it to #6 on OpenRouter with 1.7 trillion tokens/week processed (up >999% week-over-week). MiniMax instead launched a hybrid token + subscription model for M3, with subscription tiers from $7.24 to $69.28/month.
  **Builder angle:** If you route any workload to Chinese-hosted open models, per-token costs for MiMo-V2.5 just dropped ~99% and Tencent Cloud's DeepSeek-V4 hosting now matches DeepSeek's direct API price — both are now viable cheap-tier options in cost-based routing tables.
- **Google's GKE Inference Gateway cuts time-to-first-token 92.8% via prefix-cache-aware routing** — [Google Cloud Blog](https://cloud.google.com/blog/products/containers-kubernetes/gke-inference-gateway-prefix-caching-accelerates-ai-inference)
  An independent Principled Technologies benchmark found GKE Inference Gateway — which caches KV-cache prefixes and routes requests to the pod already holding the matching prefix instead of round-robin — delivered 15.7% higher output token throughput (7,169 vs 6,042 tok/s), 92.8% lower time-to-first-token (188ms vs 2,625ms), and 62.6% lower inter-token latency (30.2ms vs 81ms) versus a managed-Kubernetes baseline serving Llama 3.1 8B on identical 8x A100 hardware. Snap reports 75-80% prefix cache hit rates running this in production via the open-source llm-d stack.
  **Builder angle:** Routing on KV-cache locality instead of round-robin is a direct GPU-count and latency lever for RAG/long-system-prompt workloads — same throughput on fewer accelerators, with TTFT improvements that matter for interactive agents.
- **Xiaomi's MiMo-V2.5-Pro-UltraSpeed hits 1,000+ tok/s on a 1T-parameter model via FP4 + DFlash speculative decoding — 3x price for 10x output** — [Xiaomi MiMo Blog](https://mimo.xiaomi.com/blog/mimo-tilert-1000tps)
  Xiaomi and TileRT combined selective FP4 quantization on MoE expert weights with 'DFlash' block-level speculative decoding (avg. accepted-token length 6.30 in coding, 5.56 in math, 4.29 in agent tasks) to decode a 1-trillion-parameter model at over 1,000 tok/s (up to ~1,200) on a single standard 8-GPU node — about 10x the standard MiMo-V2.5-Pro throughput.
  The UltraSpeed API tier is priced at 3x the standard MiMo-V2.5-Pro rate, available via a limited application-based trial running June 9-23, 2026.
  **Builder angle:** A 3x price for ~10x tokens/sec changes the cost-per-completed-task math for latency-bound agentic and trading workloads previously bottlenecked on trillion-parameter decode speed — worth benchmarking against smaller models during the trial window.

## Also tracking

- **Salesforce ships Agentforce Mobile SDK to GA and opens ADL Connect API beta for scriptable RAG data libraries** — [source](https://developer.salesforce.com/blogs/2026/06/the-salesforce-developers-guide-to-the-summer-26-release) — RAG grounding data for Salesforce agents can now be created, uploaded, and promoted through a scriptable REST API as part of CI/CD, while the same agent ships into native iOS/Android/React Native apps via a GA SDK.
- **Vercel ships Drives for Sandbox in private beta for persistent storage across disposable AI agent sandboxes** — [source](https://vercel.com/changelog/drives-for-vercel-sandbox-in-private-beta) — Removes the re-provisioning cost of disposable agent sandboxes by letting builders persist a coding agent's workspace (repo clone, deps, build cache) across runs instead of rebuilding it from scratch each time.

---

# AI Platform — June 9, 2026

**URL:** https://artificialcuriositylabs.ai/daily/ai-platform/2026-06-09/
**Beat:** ai-platform
**Date:** 2026-06-09
**Topics:** routing, latency, throughput, cerebras, kimi, benchmarks
**Summary:** Cerebras positions Kimi K2.6 at 981 tok/s output — 5.4× faster than Gemini 3.5 Flash with half the TTFT; Google Gemini 2.0 Flash permanently shut down J…

## The read

Token price is the new kWh, and the platform you ship on determines how fast you reach production. Jevons says falling inference cost drives more loops and heavier agents — track pricing, routing, and ship infrastructure moves that change what builders can afford to run.
## What moved

- **Cerebras positions Kimi K2.6 at 981 tok/s output — 5.4× faster than Gemini 3.5 Flash with half the TTFT** — [Cerebras Blog](https://www.cerebras.ai/blog/which-is-faster-gemini-3-5-flash-or-kimi-k2-6-on-cerebras)
  Cerebras published an Artificial Analysis-backed benchmark showing Kimi K2.6 on Cerebras hardware achieves 981 output tok/s versus Gemini 3.5 Flash's 181 tok/s (5.4×), TTFT of 452ms vs 960ms, and end-to-end task completion of 5.6s vs 17.5s. Artificial Analysis quality scores are comparable (53.9 vs 55.3). The post emphasizes that Kimi K2.6 is open-weight and fine-tunable, in contrast to Gemini's closed API.
  **Builder angle:** Builders with latency-sensitive or high-throughput workloads have a concrete routing signal: send open-weight Kimi K2.6 traffic to Cerebras to hit sub-500ms TTFT and ~1000 tok/s at similar quality to Gemini 3.5 Flash.
- **Google Gemini 2.0 Flash permanently shut down June 1 — builders must migrate to Gemini 3.x at substantially higher prices** — [Google AI Developer Docs](https://ai.google.dev/gemini-api/docs/pricing)
  Google's pricing page (updated 2026-06-02) confirms gemini-2.0-flash-001 and gemini-2.0-flash-lite-001 were shut down June 1, 2026 with no further access. Migration options: Gemini 3.5 Flash at $1.50/$9.00 per 1M input/output tokens (Standard tier), or the budget Gemini 3.1 Flash-Lite at $0.25/$1.50. Google also introduced three inference tiers, Flex (lower cost, batch-speed SLA), Standard, and Priority (80% premium), plus a new context caching fee structure at $0.025–$8.10/1M tokens/hour depending on model.
  **Builder angle:** Any production call to a gemini-2.0-flash model ID is now a dead endpoint. Migrating to Gemini 3.1 Flash-Lite preserves budget (comparable pricing to 2.0 Flash), while Gemini 3.5 Flash costs ~6× more than Flash-Lite ($1.50/$9.00 vs $0.25/$1.50 per 1M input/output tokens), so evaluate whether the quality uplift justifies it.
- **Databricks rolls out Instructed-Retriever-1 to all customers: FP8 + speculative decoding cuts search latency 3×** — [Databricks Blog](https://www.databricks.com/blog/3x-faster-search-parallel-test-time-scaling-instructed-retriever-1)
  Databricks' Knowledge Assistant now uses Instructed-Retriever-1, a MoE model served with FP8 quantization (NVIDIA ModelOpt, zero measured quality degradation) and speculative decoding contributing a 30%+ speed-up. Production results: 3× faster search, 2× faster answer generation, TTFT ~2s, E2E latency below 10s. The model handles both query generation and reranking in parallel via test-time scaling.
  **Builder angle:** First production-validated data point for FP8 + speculative decoding combined on a MoE serving stack — a cost/latency template builders can reference when evaluating similar optimizations for their own self-hosted or Databricks-hosted inference pipelines.

## Also tracking

- **Albireo (arXiv): 2× throughput and 54% lower energy vs vLLM by eliminating non-scalable scheduling overhead** — [source](https://arxiv.org/abs/2606.01927) — Research paper (June 1, 2026) reporting 2× throughput, 48% lower latency, and 54% lower energy vs vLLM on the same hardware; no deployable release yet, so track it as a pre-deployment signal for self-hosted inference operators.

---

# AI Platform — June 8, 2026

**URL:** https://artificialcuriositylabs.ai/daily/ai-platform/2026-06-08/
**Beat:** ai-platform
**Date:** 2026-06-08
**Topics:** prefix-caching, kv-cache, routing, pricing, billing, agent-sdk
**Summary:** DigitalOcean ships prefix-aware routing and incoming cached-token pricing, claims up to 4x lower effective compute cost; Anthropic moves Claude Agent SD…

## The read

Token price is the new kWh, and the platform you ship on determines how fast you reach production. Jevons says falling inference cost drives more loops and heavier agents — track pricing, routing, and ship infrastructure moves that change what builders can afford to run.
## What moved

- **DigitalOcean ships prefix-aware routing and incoming cached-token pricing, claims up to 4x lower effective compute cost** — [DigitalOcean Blog](https://www.digitalocean.com/blog/reduce-llm-inference-costs-prefix-caching)
  DigitalOcean's Inference Gateway now routes requests to the GPU replica already holding a matching KV-cache prefix instead of round-robin, lifting cache hit rates from ~25% to 75%+ on shared-prefix workloads. The post (June 2) says this can cut effective compute cost up to 4x per request on identical hardware and cites '34 GPU-hours saved every single day' at 1M requests/day with 70% prompt overlap; it also previews cached-token pricing that bills cache hits at a discount instead of full recompute rates.
  **Builder angle:** Apps with high prompt overlap (system prompts, RAG templates, tool schemas) can cut inference spend by routing through cache-aware gateways now and moving to discounted cached-token pricing once it ships, instead of recomputing identical prefixes on every call.
- **Anthropic moves Claude Agent SDK to separate credit-pool billing on June 15, ending subscription coverage for automated workloads** — [Anthropic Support (Claude Help Center)](https://support.claude.com/en/articles/15036540-use-the-claude-agent-sdk-with-your-claude-plan)
  Starting June 15, 2026, Agent SDK usage (the Python/TypeScript SDK, headless `claude -p`, the Claude Code GitHub Actions integration, and third-party apps authenticated via the Agent SDK) moves off standard subscription limits onto a dedicated monthly credit pool billed at API rates: $20/mo for Pro, $100 for Max 5x, $200 for Max 20x, and $20/$100 for Team Standard/Premium seats. Usage beyond the credit either flows to pay-as-you-go rates (if enabled) or halts until the next billing cycle. Interactive Claude Code, web chat, and app usage are unaffected.
  **Builder angle:** Teams running Claude in CI/CD, cron jobs, or background agents via the Agent SDK need a separate budget line starting June 15 — flat-rate subscriptions stop covering automated/headless usage and overage either costs API rates or stops the agent.
- **Study finds reasoning-model list prices mislead on real cost — up to 28x reversal between cheaper-listed and actually-cheaper models** — [arXiv](https://arxiv.org/abs/2603.23971)
  Comparing reasoning-model pairs across tasks, researchers found that in 32% of model-pair comparisons the model with the lower listed per-token price actually cost more in total — by as much as 28x — because thinking-token consumption varies wildly (one model used up to 900% more reasoning tokens than another on the same query). Concrete case cited: Gemini 3 Flash lists 80% cheaper than GPT-5.4 but costs 38% more overall once thinking-token volume is counted.
  **Builder angle:** Don't pick a reasoning model on its per-token sticker price — measure actual thinking-token consumption per task, since a 'cheaper' model can quietly cost multiples more once reasoning overhead is counted.

## Also tracking

- **Cloudflare Agents SDK v0.14.0 adds Agent Skills, chat messengers, scheduled tasks, and durable Think Workflows** — [source](https://developers.cloudflare.com/changelog/post/2026-06-02-agents-sdk-v0140/) — Adds a declarative scheduling DSL and durable Workflow-backed reasoning steps to the Agents SDK, letting builders move recurring/long-running agent logic out of custom cron and state-management code and into Cloudflare's managed runtime.
- **Microsoft ships @azure/functions-skills, an npx-installable agent toolkit for the new Azure Functions serverless agents runtime** — [source](https://devblogs.microsoft.com/azure-sdk/introducing-azure-functions-skills-ai-era-workspace/) — Gives builders a single CLI to scaffold, validate, and deploy event-driven AI agents onto Azure's serverless runtime with identity-based defaults baked in, rather than hand-wiring Functions + MCP + agent config separately.

---

# AI Platform — June 7, 2026

**URL:** https://artificialcuriositylabs.ai/daily/ai-platform/2026-06-07/
**Beat:** ai-platform
**Date:** 2026-06-07
**Topics:** routing, vllm, agentic, saar, open-source, latency
**Summary:** vLLM Semantic Router v0.3 Themis ships SAAR stateful routing with RouterArena #1 ranking at $0.11/1K queries; DigitalOcean Inference Gateway ships prefi…

## The read

Token price is the new kWh, and the platform you ship on determines how fast you reach production. Jevons says falling inference cost drives more loops and heavier agents — track pricing, routing, and ship infrastructure moves that change what builders can afford to run.

## What moved

- **vLLM Semantic Router v0.3 Themis ships SAAR stateful routing with RouterArena #1 ranking at $0.11/1K queries** — [vLLM Blog](https://vllm.ai/blog/2026-06-05-v0.3-vllm-sr-themis-release)
  vLLM Semantic Router v0.3 Themis ships Session-Aware Agentic Routing (SAAR) as a production-ready feature that locks multi-turn agent sessions to a specific model during active tool loops and provider-state continuations, resetting only at safe idle or drift boundaries. The release ranks #1 on RouterArena with a 75.4 weighted score at a $0.11/1K queries cost point, adds 18 new signal families (PII detection, jailbreak, complexity, embedding, etc.), introduces a canonical v0.3 YAML config replacing fragmented layouts, and extends hardware support to AMD ROCm and Intel OpenVINO alongside NVIDIA.
  **Builder angle:** Builders running multi-turn agents can now delegate model-continuity logic to vLLM SR: SAAR prevents mid-session model switches during tool loops without custom routing code, while prefix-cache-aware switch pricing keeps costs visible.
- **DigitalOcean Inference Gateway ships prefix-aware routing live, with cached-token pricing coming soon** — [DigitalOcean Blog](https://www.digitalocean.com/blog/reduce-llm-inference-costs-prefix-caching)
  DigitalOcean's Serverless Inference now routes requests to GPU instances already holding a shared system-prompt prefix in KV cache. At 1M daily requests where 70% share a common prefix, prefix-aware routing recovers ~34 GPU-hours/day; at 10M requests, ~340 GPU-hours/day, for up to a 4x reduction in effective compute cost per request on prefix-heavy workloads. vLLM runtime optimizations on AMD Instinct MI325X and NVIDIA Hopper GPUs back the gains. Cached-token pricing (lower per-token cost on cache hits) is announced as launching on Serverless Inference within the next few weeks.
  **Builder angle:** Builders with high shared-system-prompt traffic on DigitalOcean Serverless Inference get immediate cache-hit routing; upcoming cached-token pricing will translate cache hits into direct per-token cost savings.
- **DeepSeek makes V4 Pro 75% discount permanent, undercutting GPT-5 and Claude Opus at $0.87/M output tokens** — [The Next Web](https://thenextweb.com/news/deepseek-v4-pro-75-percent-price-cut-permanent)
  DeepSeek locked its promotional 75% price cut on V4 Pro permanently on May 24, after initially scheduling it to expire May 31. New rates: $0.003625 input / $0.87 output per million tokens (down from $0.0145 / $3.48). At these rates, V4 Pro with 1M-token context undercuts OpenAI GPT-5 ($2.50/$10 per M), Anthropic Claude Opus 4.7 ($5/$25), and Google Gemini 3.5 Flash ($0.15/$0.60 output). Cache-hit input pricing can drop further to $0.0036/M.
  **Builder angle:** The permanent cut makes DeepSeek V4 Pro a durable low-cost tier in inference routing tables—builders targeting sub-$1/M output with long-context (1M tokens) and frontier-class reasoning now have a persistent option rather than a promotional window.

## Also tracking

- **OpenAI updates GPT-Rosalind with GPT-5.5 tool use and 31% token efficiency gains on life-science benchmarks** — [source](https://openai.com/index/introducing-new-capabilities-to-gpt-rosalind) — Domain-specific life-sciences model update; 31% fewer tokens than GPT-5.5 on GeneBench with higher accuracy—cost signal for builders in biotech/pharma verticals but no general API pricing change.

---

# AI Platform — June 6, 2026

**URL:** https://artificialcuriositylabs.ai/daily/ai-platform/2026-06-06/
**Beat:** ai-platform
**Date:** 2026-06-06
**Topics:** prefix-caching, routing, vllm, cost-optimization, pricing, github-copilot
**Summary:** DigitalOcean Inference Gateway ships prefix-aware routing with 75%+ cache hit rates; GitHub Copilot switches all plans to usage-based AI Credits billing…

## The read

Token price is the new kWh, and the platform you ship on determines how fast you reach production. Jevons says falling inference cost drives more loops and heavier agents — track pricing, routing, and ship infrastructure moves that change what builders can afford to run.

## What moved

- **DigitalOcean Inference Gateway ships prefix-aware routing with 75%+ cache hit rates** — [DigitalOcean Blog](https://www.digitalocean.com/blog/reduce-llm-inference-costs-prefix-caching)
  DigitalOcean's Inference Gateway (June 2, 2026) routes requests to vLLM pods most likely to hold matching KV-cache prefix blocks, using sha256_cbor_64bit block hashes and combined prefix-cache plus GPU-utilization scorers.
  On shared-prefix workloads, cache hit rates rise from roughly 25% under round-robin to 75%+, cutting effective compute cost by up to 4x on identical hardware; prefix caching with cached-token pricing is rolling out to Serverless Inference in coming weeks.
  **Builder angle:** Multi-replica inference fleets can cut redundant prefill spend by routing to cache-warm pods instead of adding GPUs, especially for agent loops with fixed system prompts.
- **GitHub Copilot switches all plans to usage-based AI Credits billing** — [GitHub Changelog](https://github.blog/changelog/2026-06-01-updates-to-github-copilot-billing-and-plans/)
  As of June 1, 2026, all Copilot plans bill by GitHub AI Credits consumed (each credit equals $0.01 of value) instead of premium request units. Included monthly allowances are 1,500 credits on Pro ($10), 7,000 on Pro+ ($39), and 20,000 on Max ($100); overages require an additional spending budget. Copilot code review now also consumes GitHub Actions minutes alongside AI Credits.
  **Builder angle:** Copilot cost is now token-metered like API inference, so agentic and review-heavy workflows need credit budgets and plan-tier math before defaulting to premium models.
- **DeepSeek makes V4 Pro 75% API price cut permanent at $0.87 per million output tokens** — [The Next Web](https://thenextweb.com/news/deepseek-v4-pro-75-percent-price-cut-permanent)
  DeepSeek locked in its promotional 75% discount on V4 Pro API pricing ahead of the scheduled May 31 expiry, setting permanent rates of $0.003625 input / $0.87 output per million tokens (down from $0.0145 / $3.48). The model supports a 1M-token context window at the lower price, undercutting GPT-5, Claude Opus 4.7, and Gemini Flash tiers on per-token output cost.
  **Builder angle:** Long-context and high-volume workloads have a materially cheaper frontier-tier option, so builders should model routing simple tasks to DeepSeek while weighing compliance and latency tradeoffs.
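
The prefix-aware routing described in the DigitalOcean item can be illustrated with a small sketch. This is not DigitalOcean's or vLLM's implementation: the block size, the scorer weights, the `Pod` structure, and the function names are all hypothetical, and JSON stands in for the CBOR encoding that the `sha256_cbor_64bit` name implies so the example stays stdlib-only. It shows the general idea of chain-hashing token blocks, counting how many leading blocks a pod already holds, and blending that with a GPU-utilization score.

```python
import hashlib
import json
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # assumed tokens per KV-cache block (illustrative only)


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the entire prefix up to it."""
    hashes = []
    parent = 0
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = json.dumps([parent, block]).encode()  # JSON as a stand-in for CBOR
        parent = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")  # keep 64 bits
        hashes.append(parent)
    return hashes


@dataclass
class Pod:
    name: str
    gpu_utilization: float  # 0.0 (idle) to 1.0 (saturated)
    cached: set = field(default_factory=set)  # block hashes assumed resident in KV cache


def matched_prefix_blocks(pod, hashes):
    """Count leading blocks the pod already holds; stop at the first miss."""
    count = 0
    for h in hashes:
        if h not in pod.cached:
            break
        count += 1
    return count


def score(pod, hashes, w_prefix=0.7, w_util=0.3):
    """Blend prefix-cache hit ratio with spare GPU capacity (weights are assumptions)."""
    hit_ratio = matched_prefix_blocks(pod, hashes) / len(hashes) if hashes else 0.0
    return w_prefix * hit_ratio + w_util * (1.0 - pod.gpu_utilization)


def pick_pod(pods, token_ids):
    """Route to the best-scoring pod and record the blocks it will hold after prefill."""
    hashes = block_hashes(token_ids)
    best = max(pods, key=lambda p: score(p, hashes))
    best.cached.update(hashes)
    return best
```

With two pods and a shared 64-token system prompt, the first request goes to the less-utilized pod, and later requests with the same prefix keep landing there even when it is busier, which is the cache-hit behavior the post describes. A real gateway would also need cache eviction tracking, which this sketch omits.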
## Also tracking

- **Vercel Sandbox Drives add persistent attachable storage for agent workspaces** — [source](https://vercel.com/changelog) — Agent sandboxes can retain cloned repos, dependencies, and build artifacts across disposable runs instead of cold-starting every session.
- **skills.sh API launches with Vercel OIDC auth for querying 600k+ open-source skills** — [source](https://vercel.com/changelog/the-skills-sh-api-is-now-available) — Deployed apps on Vercel can discover and audit agent skills programmatically without storing long-lived API secrets.
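
The Copilot plan-tier math mentioned in this issue can be worked through with a short sketch. The credit value and the included allowances come from the changelog summary above. The assumption that overage credits are billed at the same $0.01 each is mine, since the summary only says overages need an additional spending budget. The `Plan` class and function names are illustrative, and amounts are kept in integer cents to avoid floating-point drift.

```python
from dataclasses import dataclass

CREDIT_VALUE_CENTS = 1  # from the summary: one AI Credit equals $0.01


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_price_cents: int
    included_credits: int


# Allowances and prices as quoted in the June 6 item.
PLANS = [
    Plan("Pro", 1_000, 1_500),
    Plan("Pro+", 3_900, 7_000),
    Plan("Max", 10_000, 20_000),
]


def included_value_cents(plan):
    """Dollar value (in cents) of the credits bundled with a plan."""
    return plan.included_credits * CREDIT_VALUE_CENTS


def monthly_cost_cents(plan, credits_used):
    """Plan price plus overage, ASSUMING overage credits also cost $0.01 each."""
    overage = max(0, credits_used - plan.included_credits)
    return plan.monthly_price_cents + overage * CREDIT_VALUE_CENTS


def cheapest_plan(credits_used, plans=PLANS):
    """Pick the plan with the lowest total monthly cost for a given credit burn."""
    return min(plans, key=lambda p: monthly_cost_cents(p, credits_used))
```

Under that overage assumption, a workflow burning about 5,000 credits a month is cheaper on Pro+ than on Pro with overage, while light usage near 2,000 credits still favors Pro. Actions minutes consumed by code review are not modeled here.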