Builder's Daily / Agent Reliability
Agent Reliability — June 10, 2026
Is my agent's data fresh, behavior observable, and safe to run?
- retrieval-architecture
- agentic-search
- reranking
- cost
- ingestion
- parsing
The read
You cannot run what you cannot see. Grounding is institutional context encoded in retrieval; observability is electricity-metering for agent loops; security moves from policy decks to runtime guardrails. Reliability is the stack between “it works in demo” and “it runs in prod.”
What moved
- Perplexity launches ‘Search as Code’: agents write Python to compose retrieval, rerank, and dedup primitives directly (Perplexity Research). Perplexity replaced its sequential function-calling search loop with ‘Search as Code’ (SaC): models generate task-specific Python that runs in sandboxes and calls an Agentic Search SDK exposing atomic primitives (retrieval, ranking, filtering, deduplication). On a CVE-advisory task this cut token usage by 85% (288.7K to 42.9K tokens), and SaC scored +29% on DSQA and +45% on a new WANDR benchmark, with medium-reasoning SaC beating all non-SaC systems at under $1/task. It is rolling out now in Perplexity Computer and the Agent API. Builder angle: composable, code-level retrieval/rerank/dedup primitives replace fixed search endpoints, enabling per-task retrieval strategies at a fraction of the token cost of loop-based agentic search.
- LlamaParse adds word/line/cell-level bounding boxes for audit-grade citation grounding (LlamaIndex Blog). LlamaParse now supports an opt-in output_options.granular_bboxes parameter that returns word-, line-, or cell-level coordinates instead of coarse layout-level boxes. The system applies coordinates only to text explicitly present on the page (not inferred values or AI summaries), enabling exact-location citations for dense documents like financial filings and tables. Available across paid tiers, with Agentic Plus adding extra verification passes. Builder angle: RAG pipelines can now ground citations to a specific word or table cell rather than highlighting a whole page or paragraph, closing a gap for compliance and financial-document agents that need audit-grade provenance.
- Arize: Microsoft’s open trust stack makes OpenInference the shared trace contract linking ASSERT evals, ACS runtime controls, and Phoenix/Arize AX (Arize Blog). At Build 2026 Microsoft introduced ASSERT (an MIT-licensed, spec-driven agent evaluation and regression-testing framework that turns behavior specs into test cases and graded traces) and the Agent Control Specification (ACS), a portable runtime-guardrail standard with checkpoints at input, LLM call, state, tool execution, and output. Both standardize on OpenInference, the OpenTelemetry-for-AI standard Arize created (33+ framework integrations, two-line instrumentation): ASSERT reads OpenInference spans as judge evidence, ACS emits its control decisions as spans, and the same trace stream feeds Phoenix or Arize AX for production monitoring. Builder angle: one OpenInference instrumentation pass now feeds CI eval gates (ASSERT), runtime guardrails (ACS), and production observability (Phoenix/Arize AX) without separate re-instrumentation per tool.
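The ‘Search as Code’ item above describes generated Python that composes retrieval, ranking, filtering, and deduplication primitives. Below is a minimal sketch of that composition style, assuming a toy in-memory corpus. The names retrieve, dedup, rerank, and search are hypothetical and are not taken from Perplexity's Agentic Search SDK, whose API the source does not show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    url: str
    text: str


# Toy corpus standing in for a search backend (hypothetical data).
CORPUS = [
    Doc("https://example.com/a", "CVE advisory for libfoo buffer overflow patch"),
    Doc("https://example.com/b", "CVE advisory for libfoo buffer overflow patch"),
    Doc("https://example.com/c", "libfoo release notes and changelog"),
    Doc("https://example.com/d", "gardening tips for spring"),
]


def _terms(s):
    """Lowercased whitespace tokens of a string, as a set."""
    return set(s.lower().split())


def retrieve(query, corpus):
    """Return docs that share at least one term with the query."""
    q = _terms(query)
    return [d for d in corpus if q & _terms(d.text)]


def dedup(docs):
    """Drop docs whose text repeats an earlier doc, keeping the first seen."""
    seen = set()
    out = []
    for d in docs:
        if d.text not in seen:
            seen.add(d.text)
            out.append(d)
    return out


def rerank(query, docs, top_k):
    """Order docs by number of overlapping query terms, highest first."""
    q = _terms(query)
    ranked = sorted(docs, key=lambda d: len(q & _terms(d.text)), reverse=True)
    return ranked[:top_k]


def search(query, top_k=2):
    """One task-specific pipeline: retrieve, then dedup, then rerank."""
    return rerank(query, dedup(retrieve(query, CORPUS)), top_k)


results = search("libfoo CVE advisory")
print([d.url for d in results])
```

Because each primitive is an ordinary function, a different task can reorder or skip steps (for example, rerank before dedup) without another round trip through the model. As a check on the reported saving, (288.7 - 42.9) / 288.7 is about 0.851, which matches the 85% figure.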
Also tracking
- Sedai launches autonomous AI Agent Optimization platform with real-time per-team/per-model cost attribution and AI-judge-based routing — source — Drop-in layer for per-team/per-model token-cost attribution and automated cost-aware model routing across providers without re-instrumenting agent code.
- Zscaler launches AI Broker, AI Access Graph, and Endpoint AI Security to govern agent identity and MCP/A2A traffic — source — Gives a concrete pattern for scoping which MCP/A2A tools an agent can reach per identity and tracking data lineage in real time — a deployable access-control and audit layer for agent fleets.
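Per-team and per-model cost attribution, as in the Sedai item above, comes down to grouping token usage by (team, model) and multiplying by a price. Below is a minimal sketch of that bookkeeping. The prices, model names, and record layout are made up for illustration and do not reflect Sedai's product or any provider's actual rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens (illustrative only).
PRICE_PER_1K = {"model-small": 0.5, "model-large": 2.0}

# Hypothetical usage records: (team, model, tokens).
USAGE = [
    ("search", "model-small", 4000),
    ("search", "model-large", 1000),
    ("billing", "model-large", 3000),
    ("search", "model-small", 2000),
]


def attribute_costs(usage, prices):
    """Sum cost in USD for each (team, model) pair."""
    totals = defaultdict(float)
    for team, model, tokens in usage:
        totals[(team, model)] += tokens / 1000 * prices[model]
    return dict(totals)


costs = attribute_costs(USAGE, PRICE_PER_1K)
for (team, model), usd in sorted(costs.items()):
    print(f"{team:8s} {model:12s} ${usd:.2f}")
```

Keying totals on the (team, model) pair keeps one table that can be rolled up either way, by team for chargeback or by model for routing decisions.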