AI Platform — June 6, 2026

What does inference cost and what platform do I build on?

5 Jun, 2026

prefix-caching
routing
vllm
cost-optimization
pricing
github-copilot

The read

Token price is the new kWh, and the platform you ship on determines how fast you reach production. Jevons says falling inference cost drives more loops and heavier agents — track pricing, routing, and ship infrastructure moves that change what builders can afford to run.

What moved

DigitalOcean Inference Gateway ships prefix-aware routing with 75%+ cache hit rates — DigitalOcean Blog DigitalOcean’s Inference Gateway (June 2, 2026) routes requests to vLLM pods most likely to hold matching KV-cache prefix blocks, using sha256_cbor_64bit block hashes and combined prefix-cache plus GPU-utilization scorers. On shared-prefix workloads, cache hit rates rise from roughly 25% under round-robin to 75%+, cutting effective compute cost by up to 4x on identical hardware; prefix caching with cached-token pricing is rolling out to Serverless Inference in coming weeks. Builder angle: Multi-replica inference fleets can cut redundant prefill spend by routing to cache-warm pods instead of adding GPUs—especially for agent loops with fixed system prompts.
GitHub Copilot switches all plans to usage-based AI Credits billing — GitHub Changelog As of June 1, 2026, all Copilot plans bill by GitHub AI Credits consumed (each credit equals $0.01 of value) instead of premium request units. Included monthly allowances are 1,500 credits on Pro ($10), 7,000 on Pro+ ($39), and 20,000 on Max ($100); overages require an additional spending budget. Copilot code review now also consumes GitHub Actions minutes alongside AI Credits. Builder angle: Copilot cost is now token-metered like API inference—agentic and review-heavy workflows need credit budgets and plan-tier math before defaulting to premium models.
DeepSeek makes V4 Pro 75% API price cut permanent at $0.87 per million output tokens — The Next Web DeepSeek locked in a promotional 75% discount on V4 Pro API pricing after a May 31 expiry date, setting permanent rates from $0.003625 to $0.87 per million tokens (down from $0.0145 to $3.48). The model supports a 1M-token context window at the lower price, undercutting GPT-5, Claude Opus 4.7, and Gemini Flash tiers on per-token output cost. Builder angle: Long-context and high-volume workloads have a materially cheaper frontier-tier option—builders should model routing simple tasks to DeepSeek while weighing compliance and latency tradeoffs.

Also tracking

Vercel Sandbox Drives add persistent attachable storage for agent workspaces — source — Agent sandboxes can retain cloned repos, dependencies, and build artifacts across disposable runs instead of cold-starting every session.
skills.sh API launches with Vercel OIDC auth for querying 600k+ open-source skills — source — Deployed apps on Vercel can discover and audit agent skills programmatically without storing long-lived API secrets.