
Builder's Daily / Agent Reliability

Agent Reliability — June 10, 2026

Is my agent's data fresh, behavior observable, and safe to run?

The read

You cannot run what you cannot see. Grounding is institutional context encoded in retrieval; observability is electricity-metering for agent loops; security moves from policy decks to runtime guardrails. Reliability is the stack between “it works in demo” and “it runs in prod.”

What moved

  • Perplexity launches ‘Search as Code’: agents write Python to compose retrieval, rerank, and dedup primitives directly (Perplexity Research). Perplexity replaced its sequential function-calling search loop with ‘Search as Code’ (SaC): models generate task-specific Python that runs in sandboxes and calls an Agentic Search SDK exposing atomic primitives (retrieval, ranking, filtering, deduplication). On a CVE-advisory task this cut token usage by 85% (from 288.7K to 42.9K tokens). SaC scored +29% on DSQA and +45% on a new WANDR benchmark, with medium-reasoning SaC beating all non-SaC systems at under $1/task. It is rolling out now in Perplexity Computer and the Agent API. Builder angle: composable, code-level retrieval/rerank/dedup primitives replace fixed search endpoints, enabling per-task retrieval strategies at a fraction of the token cost of loop-based agentic search.

  • LlamaParse adds word/line/cell-level bounding boxes for audit-grade citation grounding (LlamaIndex Blog). LlamaParse now supports an opt-in output_options.granular_bboxes parameter that returns word-, line-, or cell-level coordinates instead of coarse layout-level boxes. Coordinates are attached only to text explicitly present on the page (not to inferred values or AI summaries), enabling exact-location citations for dense documents such as financial filings and tables. Available across paid tiers, with Agentic Plus adding extra verification passes. Builder angle: RAG pipelines can now ground a citation to a specific word or table cell rather than highlighting a whole page or paragraph, closing a gap for compliance and financial-document agents that need audit-grade provenance.

  • Arize: Microsoft’s open trust stack makes OpenInference the shared trace contract linking ASSERT evals, ACS runtime controls, and Phoenix/Arize AX (Arize Blog). At Build 2026 Microsoft introduced ASSERT, an MIT-licensed, spec-driven agent evaluation and regression-testing framework that turns behavior specs into test cases and graded traces, and the Agent Control Specification (ACS), a portable runtime-guardrail standard with checkpoints at input, LLM call, state, tool execution, and output. Both standardize on OpenInference, the OpenTelemetry-for-AI standard Arize created (33+ framework integrations, two-line instrumentation): ASSERT reads OpenInference spans as judge evidence, ACS emits its control decisions as spans, and the same trace stream feeds Phoenix or Arize AX for production monitoring. Builder angle: one OpenInference instrumentation pass now feeds CI eval gates (ASSERT), runtime guardrails (ACS), and production observability (Phoenix/Arize AX) without separate re-instrumentation per tool.
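
To make the Search as Code item above more concrete, here is a minimal sketch of what a task-specific script composing atomic search primitives might look like. Everything in it is an assumption for illustration: the `Doc` type, the in-memory `CORPUS`, and the functions `retrieve`, `filter_docs`, `dedup`, `rerank`, and `cve_advisory_search` are hypothetical stand-ins, not Perplexity's Agentic Search SDK, whose actual API is not described in the source. The `token_reduction` helper simply rechecks the reported 288.7K to 42.9K figure.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class Doc:
    url: str
    text: str
    score: float = 0.0


# Toy in-memory corpus standing in for a real search backend (hypothetical data).
CORPUS = [
    Doc("https://example.com/e", "Advisory for libfoo overflow in parser"),
    Doc("https://example.com/a", "CVE advisory: buffer overflow in libfoo 1.2"),
    Doc("https://example.com/b", "CVE advisory: buffer overflow in libfoo 1.2"),  # mirror of /a
    Doc("https://example.com/c", "Release notes for libfoo 1.3"),
    Doc("https://example.com/d", "CVE advisory: libbar privilege escalation"),
]


def _terms(text: str) -> set[str]:
    """Lowercased whitespace tokens; deliberately naive."""
    return set(text.lower().split())


def retrieve(query: str, corpus: Iterable[Doc]) -> list[Doc]:
    """Retrieval primitive: keep docs sharing at least one term with the query."""
    q = _terms(query)
    return [d for d in corpus if q & _terms(d.text)]


def filter_docs(docs: Iterable[Doc], keep: Callable[[Doc], bool]) -> list[Doc]:
    """Filtering primitive: keep docs for which the predicate is true."""
    return [d for d in docs if keep(d)]


def dedup(docs: Iterable[Doc]) -> list[Doc]:
    """Deduplication primitive: drop docs whose normalized text was already seen."""
    seen: set[str] = set()
    unique: list[Doc] = []
    for d in docs:
        key = " ".join(d.text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


def rerank(docs: Iterable[Doc], query: str) -> list[Doc]:
    """Ranking primitive: score by query-term overlap, highest first."""
    q = _terms(query)
    scored = [replace(d, score=float(len(q & _terms(d.text)))) for d in docs]
    return sorted(scored, key=lambda d: d.score, reverse=True)


def cve_advisory_search(query: str) -> list[Doc]:
    """The kind of task-specific composition a model might generate for one task."""
    hits = retrieve(query, CORPUS)
    hits = filter_docs(hits, lambda d: "advisory" in d.text.lower())
    hits = dedup(hits)
    return rerank(hits, query)


def token_reduction(before: float, after: float) -> float:
    """Fractional reduction in token usage."""
    return (before - after) / before


if __name__ == "__main__":
    for doc in cve_advisory_search("libfoo buffer overflow"):
        print(doc.score, doc.url)
    print(f"{token_reduction(288.7, 42.9):.1%}")
```

The point of the pattern is that the whole pipeline runs in one sandboxed script, so intermediate results stay in program variables instead of being passed back through the model's context on every step. That is one plausible reading of where the reported token savings come from, not a claim confirmed by the source.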

Also tracking

  • Sedai launches autonomous AI Agent Optimization platform with real-time per-team/per-model cost attribution and AI-judge-based routing (source). Drop-in layer for per-team/per-model token-cost attribution and automated cost-aware model routing across providers without re-instrumenting agent code.
  • Zscaler launches AI Broker, AI Access Graph, and Endpoint AI Security to govern agent identity and MCP/A2A traffic (source). Gives a concrete pattern for scoping which MCP/A2A tools an agent can reach per identity and for tracking data lineage in real time: a deployable access-control and audit layer for agent fleets.