Continuously monitor agent health with automated checks for memory, latency, entity tracking, and stale data detection. Metrics are retained for 90 days at 1-minute resolution and 2 years at 1-hour resolution.
Each health check captures multiple metrics to provide a comprehensive view of agent performance.
- **Health score**: 0-100 composite score reflecting overall agent health (an illustrative scoring sketch follows this list).
- **Memory usage**: memory consumption in MB, tracked to detect memory leaks.
- **Entity count**: number of active entities managed by the agent.
- **Latency**: average response time in milliseconds.
- **Stale data**: count of stale or outdated data entries requiring cleanup.
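The exact formula behind the composite score is not specified in this section. The sketch below shows one way a 0-100 score could be blended from the reported metrics; the `HealthSample` shape, weights, and budgets are assumptions for illustration only, not DRD's actual scoring logic.

```typescript
// Illustrative only: the weights, budgets, and normalization below are
// assumptions for this sketch, not DRD's documented health-score formula.
interface HealthSample {
  heapUsedMb: number;
  heapTotalMb: number;
  p95LatencyMs: number;
  errorRate: number;     // 0..1
  staleEntries: number;
}

function compositeHealthScore(s: HealthSample): number {
  // Normalize each signal to 0..1, where 1 means "perfectly healthy".
  const memoryScore = 1 - Math.min(s.heapUsedMb / s.heapTotalMb, 1);
  const latencyScore = Math.max(0, 1 - s.p95LatencyMs / 500);   // 500 ms budget (assumed)
  const errorScore = Math.max(0, 1 - s.errorRate / 0.05);       // 5% error rate scores 0 (assumed)
  const staleScore = Math.max(0, 1 - s.staleEntries / 100);     // 100 stale entries scores 0 (assumed)

  // Weighted blend; weights are illustrative, not DRD's actual values.
  const score =
    0.3 * memoryScore + 0.3 * latencyScore + 0.3 * errorScore + 0.1 * staleScore;

  return Math.round(score * 100); // 0-100 composite
}
```

For example, a sample with 245.8 of 512 MB heap used, a 340 ms p95, a 2% error rate, and 8 stale entries scores roughly 52 under these assumed weights.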
Health checks run in one of three modes:

- **Scheduled**: automated checks run at configured intervals.
- **Manual**: on-demand health checks triggered by operators (an illustrative trigger call follows this list).
- **Triggered**: automatically triggered by monitoring rule events.
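Scheduled checks are configured later in this section. For the manual mode, the snippet below sketches an on-demand trigger; the `healthChecks.run(...)` method, its arguments, and its return shape are assumptions for illustration, since only `healthChecks.create(...)` is documented here.

```typescript
import { DRD } from "@drd/sdk";

const drd = new DRD({ apiKey: process.env.DRD_API_KEY });

// Hypothetical on-demand trigger for the "Manual" mode; run() and its return
// shape are assumed for illustration and are not documented in this section.
const result = await drd.agents.healthChecks.run("01956abc-def0...", "check-id");
console.log("Manual check result:", result.status);
```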
Push metrics to DRD from your agent process, either directly through the REST API or with the SDK's lightweight reporter, which batches and sends metrics every 30 seconds by default.
```bash
curl -X POST https://api.drd.io/v1/agents/01956abc-def0/health \
  -H "Authorization: Bearer drd_ag_..." \
  -H "Content-Type: application/json" \
  -d '{
    "timestamp": "2026-02-14T10:30:00Z",
    "metrics": {
      "memory": {
        "heapUsedMb": 245.8,
        "heapTotalMb": 512.0,
        "rssMb": 310.2,
        "externalMb": 12.4,
        "gcPauseMs": 4.2
      },
      "latency": {
        "p50Ms": 120,
        "p95Ms": 340,
        "p99Ms": 890,
        "avgMs": 145
      },
      "errors": {
        "count": 3,
        "rate": 0.02,
        "categories": {
          "timeout": 1,
          "validation": 1,
          "upstream": 1
        }
      },
      "throughput": {
        "requestsPerSecond": 12.5,
        "tasksCompleted": 450,
        "queueDepth": 8
      },
      "cpu": {
        "processPercent": 34.2,
        "systemLoad": 2.1,
        "threadCount": 12
      },
      "uptime": {
        "uptimeSeconds": 86400,
        "lastRestartAt": "2026-02-13T10:30:00Z"
      }
    },
    "status": "healthy"
  }'
```

Use the SDK's built-in health reporter to automatically collect and report metrics from your agent process.
```typescript
import { DRDClient, HealthReporter } from "@drd/sdk";

const drd = new DRDClient({ apiKey: process.env.DRD_API_KEY! });

// Start automatic health reporting
const reporter = new HealthReporter(drd, {
  agentId: "01956abc-def0...",
  intervalMs: 30_000,    // Report every 30s (default)
  collectMemory: true,   // Auto-collect memory stats
  collectCpu: true,      // Auto-collect CPU stats
  collectLatency: true,  // Track request latencies
});

reporter.start();

// Record latencies and errors
reporter.recordLatency("model_call", 342);
reporter.recordLatency("db_query", 12);
reporter.recordError("timeout", new Error("Model provider timeout"));

// Record custom metrics
reporter.recordCustom("cache_hit_rate", 0.87);
reporter.recordCustom("active_sessions", 42);

// Get current health snapshot
const snapshot = reporter.getSnapshot();
console.log("Status:", snapshot.status);
console.log("Memory:", snapshot.metrics.memory.heapUsedMb, "MB");
console.log("Error rate:", snapshot.metrics.errors.rate);

// Graceful shutdown: flush pending metrics, then stop the reporter
process.on("SIGTERM", async () => {
  await reporter.flush();
  reporter.stop();
});
```

Configure active health checks that DRD runs against your agent endpoints. Health checks verify that your agent is responsive and functioning correctly.
```typescript
import { DRD } from '@drd/sdk';

const drd = new DRD({ apiKey: process.env.DRD_API_KEY });

// Configure health checks for an agent
const healthCheck = await drd.agents.healthChecks.create("01956abc-def0...", {
  name: "Primary Health Check",
  type: "http",
  config: {
    url: "https://my-agent.example.com/health",
    method: "GET",
    expectedStatus: 200,
    expectedBody: { status: "ok" },
    timeoutMs: 5000,
    headers: {
      "X-Health-Check": "drd",
    },
  },
  schedule: {
    intervalSeconds: 60,  // Check every 60 seconds
    retries: 2,           // Retry 2 times before marking unhealthy
    retryDelayMs: 5000,   // Wait 5s between retries
  },
  thresholds: {
    unhealthy: 3,         // 3 consecutive failures = unhealthy
    degraded: 1,          // 1 failure = degraded
    recovery: 2,          // 2 consecutive passes = recovered
  },
  notifications: {
    onUnhealthy: ["slack", "pagerduty"],
    onDegraded: ["slack"],
    onRecovered: ["slack"],
  },
});

console.log("Health check ID:", healthCheck.id);
console.log("Next check at:", healthCheck.nextCheckAt);
```

**Health statuses:** An agent is in one of three states: healthy (all checks passing), degraded (some checks failing or metrics above thresholds), or unhealthy (critical checks failing or agent unresponsive).
Create metric-based alerts that fire when thresholds are breached. Alerts support compound conditions, cool-down periods, and escalation chains.
```typescript
// Create a latency alert
const latencyAlert = await drd.agents.alerts.create("01956abc-def0...", {
  name: "High Latency Alert",
  condition: {
    metric: "latency.p95Ms",
    operator: "gt",
    value: 500,
    window: "5m",
  },
  severity: "warning",
  cooldownMinutes: 30,
  channels: [
    { type: "slack", webhookUrl: "https://hooks.slack.com/..." },
  ],
});

// Create a memory pressure alert
const memoryAlert = await drd.agents.alerts.create("01956abc-def0...", {
  name: "Memory Pressure",
  condition: {
    metric: "memory.heapUsedMb",
    operator: "gt",
    value: 450,
    window: "10m",
  },
  severity: "critical",
  cooldownMinutes: 15,
  channels: [
    { type: "pagerduty", routingKey: "R0123..." },
    { type: "email", to: ["oncall@example.com"] },
  ],
  autoRemediation: {
    action: "restart_agent",
    maxAttempts: 2,
    delaySeconds: 60,
  },
});

// Create an error rate alert
const errorAlert = await drd.agents.alerts.create("01956abc-def0...", {
  name: "Error Rate Spike",
  condition: {
    metric: "errors.rate",
    operator: "gt",
    value: 0.05,
    window: "3m",
  },
  severity: "critical",
  channels: [
    { type: "slack", webhookUrl: "https://hooks.slack.com/..." },
    { type: "pagerduty", routingKey: "R0123..." },
  ],
});
```
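The examples above each use a single-metric condition. Since this section also mentions compound conditions and escalation chains, the snippet below sketches what such a definition might look like; the `conditions`, `match`, and `escalation` fields (and their sub-fields) are assumed names for illustration, not confirmed DRD API.

```typescript
// Hedged sketch: compound condition plus escalation chain. The conditions,
// match, and escalation field names are assumptions, not documented DRD API.
const saturationAlert = await drd.agents.alerts.create("01956abc-def0...", {
  name: "Saturation Under Load",
  conditions: [
    { metric: "latency.p95Ms", operator: "gt", value: 500, window: "5m" },
    { metric: "throughput.queueDepth", operator: "gt", value: 50, window: "5m" },
  ],
  match: "all", // fire only when every condition is breached
  severity: "critical",
  cooldownMinutes: 15,
  escalation: [
    { afterMinutes: 0, channels: [{ type: "slack", webhookUrl: "https://hooks.slack.com/..." }] },
    { afterMinutes: 15, channels: [{ type: "pagerduty", routingKey: "R0123..." }] },
  ],
});
```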
Query historical metrics for trend analysis, capacity planning, and incident investigation.

```typescript
// Query historical metrics
const metrics = await drd.agents.health.query("01956abc-def0...", {
  metrics: ["memory.heapUsedMb", "latency.p95Ms", "errors.rate"],
  from: "2026-02-13T00:00:00Z",
  to: "2026-02-14T00:00:00Z",
  resolution: "5m",
});

for (const point of metrics.dataPoints) {
  console.log(point.timestamp);
  console.log("  Memory:", point.memory.heapUsedMb, "MB");
  console.log("  P95 Latency:", point.latency.p95Ms, "ms");
  console.log("  Error Rate:", (point.errors.rate * 100).toFixed(2), "%");
}

// Get a health summary, filtered to degraded agents
const { data: agentHealth } = await drd.agents.health.summary({
  status: "degraded",
  sortBy: "errorRate",
  order: "desc",
  limit: 25,
});

for (const agent of agentHealth) {
  console.log(agent.name, agent.healthStatus, agent.errorRate, agent.p95Latency);
}
```

Agent health directly affects trust scores. Persistent health issues lower the reliability component of the trust score.
| Health Status | Trust Impact | Recovery |
|---|---|---|
| Healthy | No impact, positive reliability signal | N/A |
| Degraded | -5 reliability points after 1 hour | Full recovery after 24h healthy |
| Unhealthy | -15 reliability points, immediate | Gradual recovery over 7 days |
**Auto-quarantine:** Agents that remain unhealthy for more than 4 hours are automatically quarantined. Quarantined agents cannot perform actions until a workspace admin manually reviews and restores them.
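To spot agents at risk of quarantine, the health summary query shown earlier can be reused with a different filter; this assumes `"unhealthy"` is accepted as a `status` value alongside the `"degraded"` filter documented above.

```typescript
// Reuses the documented health summary query; assumes "unhealthy" is a valid
// status filter alongside the "degraded" value shown earlier.
const { data: unhealthyAgents } = await drd.agents.health.summary({
  status: "unhealthy",
  sortBy: "errorRate",
  order: "desc",
  limit: 50,
});

for (const agent of unhealthyAgents) {
  console.log(`${agent.name}: ${agent.healthStatus}`);
}
```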