Building an AI Agent Evaluation Dashboard: Notes on Artificial Intelligence

Patrick Donahue · Levelbrook Consulting

The leap from single-shot Large Language Model (LLM) completions to autonomous agents is profound. An agent doesn't just answer a question; it perceives, reasons, and acts within an environment to achieve a goal. This shift introduces a significant engineering challenge: how do we reliably test something that is non-deterministic, stateful, and capable of complex, multi-step behavior? Standard unit tests and benchmarks fall short. We need something more akin to an avionics testbed—a system for running repeatable scenarios and deeply inspecting the results.

This is the domain of agent evaluation, and it's a fascinating problem space. It's not just about pass/fail. We need to measure efficiency (cost, latency, tokens), robustness (error handling), and correctness (did the agent achieve the goal in the intended way?). This requires building a dedicated platform for configuring, running, and reviewing agent performance in replayable environments. It's a problem that sits at the intersection of data engineering, application development, and user experience.
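One way to make these three dimensions concrete is a per-run summary folded from the recorded steps. The sketch below is illustrative only: StepRecord, RunSummary, summarize_run, and the per-1k-token prices are all hypothetical names and parameters, not part of any library, and real pricing would come from your model provider.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class StepRecord:
    """One recorded agent step (hypothetical shape, for illustration only)."""
    latency_s: float       # wall-clock time spent on this step
    input_tokens: int
    output_tokens: int
    errored: bool = False  # True if a tool call or model call failed


@dataclass(frozen=True)
class RunSummary:
    total_latency_s: float  # efficiency
    total_tokens: int       # efficiency
    cost_usd: float         # efficiency
    error_rate: float       # robustness
    goal_achieved: bool     # correctness, decided by a checker or a reviewer


def summarize_run(
    steps: Iterable[StepRecord],
    goal_achieved: bool,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> RunSummary:
    """Fold a run's steps into efficiency, robustness, and correctness figures."""
    steps = list(steps)
    tokens_in = sum(s.input_tokens for s in steps)
    tokens_out = sum(s.output_tokens for s in steps)
    cost = tokens_in / 1000 * usd_per_1k_input + tokens_out / 1000 * usd_per_1k_output
    errors = sum(1 for s in steps if s.errored)
    return RunSummary(
        total_latency_s=sum(s.latency_s for s in steps),
        total_tokens=tokens_in + tokens_out,
        cost_usd=round(cost, 6),
        error_rate=errors / len(steps) if steps else 0.0,
        goal_achieved=goal_achieved,
    )
```

Keeping goal_achieved as an input rather than computing it here is deliberate: correctness is usually judged by a separate checker or a human, while the other figures fall out of the trace mechanically.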

Architecting the Evaluation Harness

Let's sketch out a pragmatic architecture for such a system. The goal is a web-based interface where an engineer can define a test scenario, configure an agent to run against it, execute the test, and then review a detailed, step-by-step replay of the agent's actions.

Our stack will be Python for the backend agent execution (leveraging its rich AI/ML ecosystem), a modern React frontend for the interactive dashboard, and PostgreSQL as our structured data store. Communication will be handled over a combination of REST and Server-Sent Events (SSE) for real-time updates.

Data Model

A solid data model in PostgreSQL is the foundation. The core entities are the Environment, the Agent, and the Evaluation Run which ties them together.

Environments: Defines the initial state of a test scenario. This could be a virtual filesystem, a set of mocked API endpoints, or a user prompt. Storing the configuration as JSONB provides flexibility.
Agents: Stores the agent's configuration—the underlying model (e.g., claude-3-opus-20240229), the system prompt, and the definitions of available tools.
EvaluationRuns: The record of a specific agent being run in a specific environment. It tracks status (pending, running, completed) and high-level outcomes.
RunSteps: This is the heart of the replay. It’s an append-only log of every thought, action, and observation an agent makes during a run. A step_type enum and a payload JSONB column capture the rich, structured data of the agent's trace.
Metrics: A table to store quantitative results for each run—cost, latency, token counts, and any custom correctness scores.

-- Simplified SQL schema
CREATE TABLE "EvaluationRuns" (
    "id" UUID PRIMARY KEY,
    "agent_id" UUID,
    "environment_id" UUID,
    "status" TEXT NOT NULL,
    "created_at" TIMESTAMPTZ NOT NULL
);

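CREATE TABLE "RunSteps" (
    "id" BIGSERIAL PRIMARY KEY,
    "run_id" UUID NOT NULL REFERENCES "EvaluationRuns"("id"),
    "step_index" INTEGER NOT NULL, -- position of the step within its run
    "step_type" TEXT NOT NULL CHECK ("step_type" IN ('thought', 'action', 'observation')),
    "payload" JSONB,
    "timestamp" TIMESTAMPTZ NOT NULL,
    UNIQUE ("run_id", "step_index") -- one row per position keeps replay order unambiguous
);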
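To show how the append-only log might be fed from the Python side, here is a minimal in-memory sketch that produces rows shaped like the RunSteps columns. TraceRecorder is a hypothetical helper, not a library class; a real worker would issue INSERT statements instead of appending to a list.

```python
import json
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

STEP_TYPES = ("thought", "action", "observation")


@dataclass
class TraceRecorder:
    """Append-only, in-memory stand-in for the RunSteps table (illustrative)."""
    run_id: uuid.UUID
    _rows: list[dict[str, Any]] = field(default_factory=list)

    def append(self, step_type: str, payload: dict[str, Any]) -> dict[str, Any]:
        if step_type not in STEP_TYPES:
            raise ValueError(f"unknown step_type: {step_type!r}")
        row = {
            "run_id": str(self.run_id),
            "step_index": len(self._rows),   # dense, zero-based ordering
            "step_type": step_type,
            "payload": json.dumps(payload),  # would be stored in the JSONB column
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        self._rows.append(row)
        return row

    def replay(self) -> list[dict[str, Any]]:
        """Return copies of the steps in execution order."""
        return [dict(r) for r in self._rows]
```

Because rows are only ever appended and step_index is derived from the current length, the replay order is fixed at write time and never depends on timestamps.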

Backend and Real-Time UX

The backend needs to be split. A web server (e.g., FastAPI) handles standard CRUD operations via a REST API for managing agents and environments. The actual execution of an EvaluationRun, however, is a long-running, resource-intensive task that should be offloaded to a separate worker process through a task queue such as Celery, with Redis as the message broker.

This architecture ensures the web interface remains responsive. When a user kicks off an evaluation, the API enqueues a job and immediately returns a run_id. The frontend can then subscribe to a real-time stream for that run.
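The enqueue-and-return flow can be sketched with nothing but the standard library. This is an assumption-laden stand-in: queue.Queue and a thread take the place of Celery and Redis, and the runs dictionary takes the place of the EvaluationRuns table. The function names start_evaluation and worker are hypothetical.

```python
import queue
import threading
import uuid

# In-memory stand-ins for the job queue and the EvaluationRuns table.
jobs: queue.Queue = queue.Queue()
runs: dict[str, dict[str, str]] = {}


def start_evaluation(agent_id: str, environment_id: str) -> str:
    """What the API handler would do: record the run, enqueue it, return at once."""
    run_id = str(uuid.uuid4())
    runs[run_id] = {
        "agent_id": agent_id,
        "environment_id": environment_id,
        "status": "pending",
    }
    jobs.put(run_id)
    return run_id


def worker() -> None:
    """What the worker process would do: pull runs off the queue and execute them."""
    while True:
        run_id = jobs.get()
        if run_id is None:  # sentinel value: shut the worker down
            jobs.task_done()
            break
        runs[run_id]["status"] = "running"
        # ... the agent loop would execute here ...
        runs[run_id]["status"] = "completed"
        jobs.task_done()
```

The caller gets a run_id back before any agent code has run, which is the property that keeps the web tier responsive; the status field moves through pending, running, and completed on the worker's side.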

For real-time updates, Server-Sent Events (SSE) are an excellent fit. They are simpler than WebSockets and perfect for the one-way push of RunStep data from server to client. In my work on Ruby on Rails systems, I've seen the power of similar technologies like Turbo Streams for creating live UIs, and the pattern translates directly. The Python worker, as it executes each step of the agent's logic, can publish events to a channel (e.g., a Redis Pub/Sub channel) which the web server then forwards to the client over an SSE connection. This is much like the thinking behind my open-source ai_stream gem, which implements the Vercel AI SDK streaming protocol for Ruby—standardizing data streaming is key to building these responsive interfaces.
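On the wire, an SSE message is just a few text lines followed by a blank line. The helper below is a minimal sketch of that framing; format_sse and the run_step event name are hypothetical choices, and the transport (how the web server holds the connection open and reads from the pub/sub channel) is left out.

```python
import json
from typing import Any, Optional


def format_sse(
    data: dict[str, Any],
    event: Optional[str] = None,
    event_id: Optional[int] = None,
) -> str:
    """Serialize one message in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps emits no raw newlines by default, so the payload fits one data line.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"  # the blank line terminates the message


frame = format_sse({"step_index": 0, "step_type": "thought"}, event="run_step", event_id=0)
```

Using the step's index as the event id is one reasonable choice: browsers send the last id they saw in the Last-Event-ID header when reconnecting, so the server could resume the stream from the next step.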

Where Things Break at Scale

This architecture is robust, but scaling introduces challenges. The RunSteps table can grow enormous, necessitating partitioning strategies in PostgreSQL. Concurrently running hundreds of evaluations requires a scalable pool of containerized workers and careful management of the rate limits on any third-party APIs the agents call. Most critically, agent execution environments must be strictly sandboxed: running each evaluation in its own ephemeral Docker container is non-negotiable for security and reproducibility.
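For the rate-limit concern, a token bucket is one common technique. The sketch below is a single-process version with an injectable clock; TokenBucket is a hypothetical class written for this illustration, and a limit shared across many workers would need the bucket state in a shared store rather than in process memory.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A worker would call try_acquire() before each outbound API request and back off or requeue the step when it returns False.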

Pragmatic Tradeoffs and the Human in the Loop

As a senior engineer, I know that the perfect technical solution is rarely the right one. Pragmatism is key.

Automated Metrics vs. Human Judgment

Purely automated correctness metrics are brittle. An agent tasked with "writing a summary to summary.txt" might create the file, passing a simple file-existence check, but fill it with nonsense. The ultimate arbiter of quality for complex tasks is often a human.
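The summary.txt case is easy to reproduce. In this sketch, exists_check and content_check are hypothetical checkers, and the required terms are made-up placeholders; the point is only that the first check passes on gibberish while a slightly stricter one does not.

```python
import tempfile
from pathlib import Path


def exists_check(workdir: Path) -> bool:
    """Brittle: passes as soon as the file is present."""
    return (workdir / "summary.txt").is_file()


def content_check(workdir: Path, required_terms: list[str], min_words: int = 5) -> bool:
    """Slightly stricter: the file must be long enough and mention key terms."""
    path = workdir / "summary.txt"
    if not path.is_file():
        return False
    text = path.read_text(encoding="utf-8").lower()
    if len(text.split()) < min_words:
        return False
    return all(term.lower() in text for term in required_terms)


with tempfile.TemporaryDirectory() as d:
    workdir = Path(d)
    (workdir / "summary.txt").write_text("asdf qwer zxcv", encoding="utf-8")
    print(exists_check(workdir))                      # True
    print(content_check(workdir, ["revenue", "q3"]))  # False
```

Even the stricter check is still a heuristic: a file that contains the right keywords in a meaningless order would pass it, which is why a human review step remains valuable.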

Therefore, the UI is not just a log viewer; it's a review tool. A crucial feature is allowing a human expert to review a completed run and apply a qualitative score or tag issues. This human-generated data is often the most valuable metric you can collect. The system should make it trivial to "promote" an interesting failed or successful run into a new "golden" environment for future regression testing.
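What "promoting" a run means in data terms is a design decision; one possible shape is sketched below. Everything here is an assumption made for illustration: the promote_to_golden function, the golden key, and the verdict and tags fields of the review are hypothetical, not an established schema.

```python
import copy
from typing import Any


def promote_to_golden(
    environment_config: dict[str, Any],
    run_id: str,
    review: dict[str, Any],
) -> dict[str, Any]:
    """Build a new environment config that pins a reviewed run as a regression case."""
    golden = copy.deepcopy(environment_config)  # never mutate the source environment
    golden["golden"] = {
        "source_run_id": run_id,
        "expected_outcome": review["verdict"],  # e.g. "pass" or "fail"
        "reviewer_tags": sorted(review.get("tags", [])),
    }
    return golden
```

Copying the configuration rather than editing it in place keeps the original environment usable for ad hoc runs, while the new record carries the human verdict that later regression runs can be compared against.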

API Design: REST vs. GraphQL

For the dashboard's core CRUD functionality, REST is simple and effective. However, for the detailed replay view, which needs to fetch a run along with all its steps, metrics, and related agent/environment configurations, a single GraphQL query could be more efficient than orchestrating multiple REST calls. A pragmatic approach is to start with well-structured REST endpoints and only introduce GraphQL if the data fetching logic on the frontend becomes overly complex.

A Closing Reflection

Building an evaluation dashboard is more than just a data-shuffling exercise. It's about crafting an instrument for observing and understanding a new kind of software behavior. The tight feedback loop this tooling enables—run, observe, refine, repeat—is what will ultimately bridge the gap between today's impressive but brittle agent demos and the reliable, production-grade AI systems of tomorrow. It's the essential, unglamorous work of building the CI/CD pipeline for the age of artificial intelligence.