Builder's Daily / Observability & Evals
Observability & Evals — June 6, 2026
How do I trace, evaluate, and monitor agents in production?
- tracing
- evaluation
- opentelemetry
- multi-turn
- microsoft
- agentcore
The read
You cannot run what you cannot see. As agent loops get cheaper and longer, traces, evals, and cost attribution become the equivalent of electricity metering for software: necessary, not optional, and increasingly a layer of human judgment over raw metrics.
What moved
- Microsoft Foundry extends tracing and evals to any agent framework at Build 2026 (Microsoft Foundry Blog)
  Foundry observability (tracing and evals, GA for hosted agents) now reaches LangChain, LangGraph, OpenAI SDK, Microsoft Agent Framework, and custom stacks via OpenTelemetry. Build 2026 adds multi-turn evaluation, context-specific rubric evaluators, intelligent trace sampling for production, trace replay and visualization, traces-to-dataset for offline regression, AZD inline dev observability, user simulation for edge-case pressure tests, agent optimizer (private preview), and ROI dashboards tying task completion and cost efficiency to trace-level drill-down.
  Builder angle: Point your existing OTel exporter at Foundry to get multi-turn evals, rubric scoring, and production trace sampling without swapping orchestration frameworks.
- Amazon Bedrock AgentCore ships Lambda code-based evaluators for CI gates and online monitoring (AWS Machine Learning Blog)
  AgentCore Evaluations now accepts custom Lambda evaluators registered at the TRACE, TOOL_CALL, or SESSION level. Evaluators receive OTel span payloads and return PASS/FAIL labels plus optional scores, with results written to CloudWatch Logs and Bedrock-AgentCore/Evaluations metrics. The same evaluator ID runs on demand for dev iteration, as a CI/CD deployment gate, and in online evaluation with 0.01–100% session sampling. The sample covers schema validation, numerical drift checks, workflow-order enforcement, and Comprehend PII detection alongside LLM-as-a-Judge evaluators.
  Builder angle: Encode deterministic agent contracts (tool schemas, workflow order, PII rules) as Lambda evaluators that block deploys in CI and alarm in production on the same evaluator ID.
- Langfuse adds GitHub Actions experiment gates and deterministic code evaluators (Langfuse Changelog)
  langfuse/experiment-action@v1.0.0 runs versioned dataset experiments in GitHub Actions, posts pass/regress/fail status on pull requests, and fails workflows when scores miss thresholds. It was released May 28 alongside code evaluators: Python or TypeScript evaluate functions, written in the Langfuse UI, that score live observations or experiment runs for JSON parseability, schema validation, exact match, and required tool arguments. They run without network egress and return native scores for dashboards and Score Analytics.
  Builder angle: Wire experiment-action to a regression dataset and code evaluators so agent PRs fail on deterministic contract breaks before semantic judge drift shows up in production.
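The deterministic contracts named in the AgentCore and Langfuse items above (required tool arguments, workflow order, a score threshold that fails the build) can be sketched without any vendor SDK. This is a minimal illustration only: the tool-call dictionary shape, the `EvalResult` type, and the functions `check_required_args`, `check_workflow_order`, and `gate` are all hypothetical. They are not the AgentCore Lambda event format or the Langfuse evaluator signature.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    label: str    # "PASS" or "FAIL"
    score: float  # 1.0 for pass, 0.0 for fail


def check_required_args(tool_calls, required):
    """Fail if any tool call lacks an argument its (hypothetical) contract requires."""
    for call in tool_calls:
        missing = set(required.get(call["tool"], ())) - set(call.get("args", {}))
        if missing:
            return EvalResult("required_args", "FAIL", 0.0)
    return EvalResult("required_args", "PASS", 1.0)


def check_workflow_order(tool_calls, expected_order):
    """Pass only if the expected tools appear, in order, among the calls made."""
    remaining = iter([call["tool"] for call in tool_calls])
    # `step in remaining` consumes the iterator, so each step must come after the last.
    ok = all(step in remaining for step in expected_order)
    return EvalResult("workflow_order", "PASS" if ok else "FAIL", 1.0 if ok else 0.0)


def gate(results, threshold=1.0):
    """Return a process exit code: 0 if the mean score meets the threshold, else 1."""
    if not results:
        return 1  # treat "nothing was evaluated" as a failure
    mean = sum(r.score for r in results) / len(results)
    return 0 if mean >= threshold else 1


# Hypothetical trace: the agent looked up an order, then issued a refund.
trace = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "issue_refund", "args": {"order_id": "A1", "amount": 20}},
]
required = {"lookup_order": ["order_id"], "issue_refund": ["order_id", "amount"]}

results = [
    check_required_args(trace, required),
    check_workflow_order(trace, ["lookup_order", "issue_refund"]),
]
exit_code = gate(results)
```

In a CI job, a script like this would end with `sys.exit(exit_code)` so that a broken contract fails the workflow. The same functions could be run over sampled production traces and raise an alert instead of blocking a deploy.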
Also tracking
- Boomi May 2026 release streams Claude Code agent OTel traces into Agent Control Tower — source — Standardizes external Claude Code agents on OpenTelemetry so execution traces, performance metrics, and cost land alongside Agentstudio agents in one control plane.
- Baidu unveils Agent Monitor observability suite at Create 2026 — source — China’s deployment-first agent stack now includes a named monitor for agent behavior and cost tracking alongside AgentBuilder 3.0 and Ernie Agent Runtime.
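Cost attribution of the kind these control planes describe comes down to grouping token usage by agent across spans. A rough sketch, assuming spans are available as plain dictionaries carrying the OpenTelemetry GenAI semantic-convention attributes `gen_ai.agent.name`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, and `gen_ai.usage.output_tokens`. The price table, model names, agent names, and the `cost_by_agent` function are invented for illustration and do not reflect any vendor's pricing or API.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens: (input, output). Not real pricing.
PRICES = {"model-small": (0.001, 0.002), "model-large": (0.01, 0.03)}


def cost_by_agent(spans, prices=PRICES):
    """Sum the estimated LLM cost per agent from span attributes."""
    totals = defaultdict(float)
    for span in spans:
        attrs = span.get("attributes", {})
        model = attrs.get("gen_ai.request.model")
        if model not in prices:
            continue  # skip non-LLM spans and models without a known price
        in_price, out_price = prices[model]
        cost = (
            attrs.get("gen_ai.usage.input_tokens", 0) / 1000 * in_price
            + attrs.get("gen_ai.usage.output_tokens", 0) / 1000 * out_price
        )
        totals[attrs.get("gen_ai.agent.name", "unknown")] += cost
    return dict(totals)


# Hypothetical spans from two agents, plus one tool span with no model attribute.
spans = [
    {"attributes": {"gen_ai.agent.name": "planner", "gen_ai.request.model": "model-large",
                    "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 1000}},
    {"attributes": {"gen_ai.agent.name": "coder", "gen_ai.request.model": "model-small",
                    "gen_ai.usage.input_tokens": 2000, "gen_ai.usage.output_tokens": 1000}},
    {"attributes": {"tool.name": "search"}},
]
costs = cost_by_agent(spans)
```

Spans without a priced model are skipped rather than guessed at, so the totals understate cost when the price table is incomplete. Whether that is the right default depends on how the numbers are used.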