Continuously monitor agent health with automated checks for memory, latency, entity tracking, and stale data detection. Metrics are retained for 90 days at 1-minute resolution and 2 years at 1-hour resolution.
Each health check captures multiple metrics to provide a comprehensive view of agent performance.
- **Health score**: 0-100 composite score reflecting overall agent health (an illustrative scoring sketch follows this list).
- **Memory usage**: memory consumption in MB, tracked to detect memory leaks.
- **Entity count**: number of active entities managed by the agent.
- **Latency**: average response time in milliseconds.
- **Stale data**: count of stale or outdated data entries requiring cleanup.
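The exact formula behind the composite score is not specified in this section. The sketch below shows one way a 0-100 score could be blended from the reported metrics; the `HealthSample` shape, weights, and budgets are assumptions for illustration only, not DRD's actual scoring logic.

```typescript
// Illustrative only: the weights, budgets, and normalization below are
// assumptions for this sketch, not DRD's documented health-score formula.
interface HealthSample {
  heapUsedMb: number;
  heapTotalMb: number;
  p95LatencyMs: number;
  errorRate: number;     // 0..1
  staleEntries: number;
}

function compositeHealthScore(s: HealthSample): number {
  // Normalize each signal to 0..1, where 1 means "perfectly healthy".
  const memoryScore = 1 - Math.min(s.heapUsedMb / s.heapTotalMb, 1);
  const latencyScore = Math.max(0, 1 - s.p95LatencyMs / 500);   // 500 ms budget (assumed)
  const errorScore = Math.max(0, 1 - s.errorRate / 0.05);       // 5% error rate scores 0 (assumed)
  const staleScore = Math.max(0, 1 - s.staleEntries / 100);     // 100 stale entries scores 0 (assumed)

  // Weighted blend; weights are illustrative, not DRD's actual values.
  const score =
    0.3 * memoryScore + 0.3 * latencyScore + 0.3 * errorScore + 0.1 * staleScore;

  return Math.round(score * 100); // 0-100 composite
}
```

For example, a sample with 245.8 of 512 MB heap used, a 340 ms p95, a 2% error rate, and 8 stale entries scores roughly 52 under these assumed weights.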
Health checks run in one of three modes:

- **Scheduled**: automated checks run at configured intervals.
- **Manual**: on-demand health checks triggered by operators (an illustrative trigger call follows this list).
- **Triggered**: automatically triggered by monitoring rule events.
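Scheduled checks are configured later in this section. For the manual mode, the snippet below sketches an on-demand trigger; the `healthChecks.run(...)` method, its arguments, and its return shape are assumptions for illustration, since only `healthChecks.create(...)` is documented here.

```typescript
import { DRD } from "@drd/sdk";

const drd = new DRD({ apiKey: process.env.DRD_API_KEY });

// Hypothetical on-demand trigger for the "Manual" mode; run() and its return
// shape are assumed for illustration and are not documented in this section.
const result = await drd.agents.healthChecks.run("01956abc-def0...", "check-id");
console.log("Manual check result:", result.status);
```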
Push metrics to DRD from your agent process, either directly through the REST API or with the SDK's lightweight reporter, which batches and sends metrics every 30 seconds by default.
```bash
curl -X POST https://api.drd.io/v1/agents/01956abc-def0/health \
  -H "Authorization: Bearer drd_ag_..." \
  -H "Content-Type: application/json" \
  -d '{
    "timestamp": "2026-02-14T10:30:00Z",
    "metrics": {
      "memory": {
        "heapUsedMb": 245.8,
        "heapTotalMb": 512.0,
        "rssMb": 310.2,
        "externalMb": 12.4,
        "gcPauseMs": 4.2
      },
      "latency": {
        "p50Ms": 120,
        "p95Ms": 340,
        "p99Ms": 890,
        "avgMs": 145
      },
      "errors": {
        "count": 3,
        "rate": 0.02,
        "categories": {
          "timeout": 1,
          "validation": 1,
          "upstream": 1
        }
      },
      "throughput": {
        "requestsPerSecond": 12.5,
        "tasksCompleted": 450,
        "queueDepth": 8
      },
      "cpu": {
        "processPercent": 34.2,
        "systemLoad": 2.1,
        "threadCount": 12
      },
      "uptime": {
        "uptimeSeconds": 86400,
        "lastRestartAt": "2026-02-13T10:30:00Z"
      }
    },
    "status": "healthy"
  }'
```

Use the SDK's built-in health reporter to automatically collect and report metrics from your agent process.
```typescript
import { DRDClient, HealthReporter } from "@drd/sdk";

const drd = new DRDClient({ apiKey: process.env.DRD_API_KEY! });

// Start automatic health reporting
const reporter = new HealthReporter(drd, {
  agentId: "01956abc-def0...",
  intervalMs: 30_000,    // Report every 30s (default)
  collectMemory: true,   // Auto-collect memory stats
  collectCpu: true,      // Auto-collect CPU stats
  collectLatency: true,  // Track request latencies
});

reporter.start();

// Record latencies and errors
reporter.recordLatency("model_call", 342);
reporter.recordLatency("db_query", 12);
reporter.recordError("timeout", new Error("Model provider timeout"));

// Record custom metrics
reporter.recordCustom("cache_hit_rate", 0.87);
reporter.recordCustom("active_sessions", 42);

// Get current health snapshot
const snapshot = reporter.getSnapshot();
console.log("Status:", snapshot.status);
console.log("Memory:", snapshot.metrics.memory.heapUsedMb, "MB");
console.log("Error rate:", snapshot.metrics.errors.rate);

// Graceful shutdown: flush pending metrics, then stop the reporter
process.on("SIGTERM", async () => {
  await reporter.flush();
  reporter.stop();
});
```

Configure active health checks that DRD runs against your agent endpoints. Health checks verify that your agent is responsive and functioning correctly.
```typescript
import { DRD } from '@drd/sdk';

const drd = new DRD({ apiKey: process.env.DRD_API_KEY });

// Configure health checks for an agent
const healthCheck = await drd.agents.healthChecks.create("01956abc-def0...", {
  name: "Primary Health Check",
  type: "http",
  config: {
    url: "https://my-agent.example.com/health",
    method: "GET",
    expectedStatus: 200,
    expectedBody: { status: "ok" },
    timeoutMs: 5000,
    headers: {
      "X-Health-Check": "drd",
    },
  },
  schedule: {
    intervalSeconds: 60,  // Check every 60 seconds
    retries: 2,           // Retry 2 times before marking unhealthy
    retryDelayMs: 5000,   // Wait 5s between retries
  },
  thresholds: {
    unhealthy: 3,         // 3 consecutive failures = unhealthy
    degraded: 1,          // 1 failure = degraded
    recovery: 2,          // 2 consecutive passes = recovered
  },
  notifications: {
    onUnhealthy: ["slack", "pagerduty"],
    onDegraded: ["slack"],
    onRecovered: ["slack"],
  },
});

console.log("Health check ID:", healthCheck.id);
console.log("Next check at:", healthCheck.nextCheckAt);
```

**Health statuses:** An agent is in one of three states: healthy (all checks passing), degraded (some checks failing or metrics above thresholds), or unhealthy (critical checks failing or agent unresponsive).
Create metric-based alerts that fire when thresholds are breached. Alerts support compound conditions, cool-down periods, and escalation chains.
```typescript
// Create a latency alert
const latencyAlert = await drd.agents.alerts.create("01956abc-def0...", {
  name: "High Latency Alert",
  condition: {
    metric: "latency.p95Ms",
    operator: "gt",
    value: 500,
    window: "5m",
  },
  severity: "warning",
  cooldownMinutes: 30,
  channels: [
    { type: "slack", webhookUrl: "https://hooks.slack.com/..." },
  ],
});

// Create a memory pressure alert
const memoryAlert = await drd.agents.alerts.create("01956abc-def0...", {
  name: "Memory Pressure",
  condition: {
    metric: "memory.heapUsedMb",
    operator: "gt",
    value: 450,
    window: "10m",
  },
  severity: "critical",
  cooldownMinutes: 15,
  channels: [
    { type: "pagerduty", routingKey: "R0123..." },
    { type: "email", to: ["oncall@example.com"] },
  ],
  autoRemediation: {
    action: "restart_agent",
    maxAttempts: 2,
    delaySeconds: 60,
  },
});

// Create an error rate alert
const errorAlert = await drd.agents.alerts.create("01956abc-def0...", {
  name: "Error Rate Spike",
  condition: {
    metric: "errors.rate",
    operator: "gt",
    value: 0.05,
    window: "3m",
  },
  severity: "critical",
  channels: [
    { type: "slack", webhookUrl: "https://hooks.slack.com/..." },
    { type: "pagerduty", routingKey: "R0123..." },
  ],
});
```
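The examples above each use a single-metric condition. Since this section also mentions compound conditions and escalation chains, the snippet below sketches what such a definition might look like; the `conditions`, `match`, and `escalation` fields (and their sub-fields) are assumed names for illustration, not confirmed DRD API.

```typescript
// Hedged sketch: compound condition plus escalation chain. The conditions,
// match, and escalation field names are assumptions, not documented DRD API.
const saturationAlert = await drd.agents.alerts.create("01956abc-def0...", {
  name: "Saturation Under Load",
  conditions: [
    { metric: "latency.p95Ms", operator: "gt", value: 500, window: "5m" },
    { metric: "throughput.queueDepth", operator: "gt", value: 50, window: "5m" },
  ],
  match: "all", // fire only when every condition is breached
  severity: "critical",
  cooldownMinutes: 15,
  escalation: [
    { afterMinutes: 0, channels: [{ type: "slack", webhookUrl: "https://hooks.slack.com/..." }] },
    { afterMinutes: 15, channels: [{ type: "pagerduty", routingKey: "R0123..." }] },
  ],
});
```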
Query historical metrics for trend analysis, capacity planning, and incident investigation.

```typescript
// Query historical metrics
const metrics = await drd.agents.health.query("01956abc-def0...", {
  metrics: ["memory.heapUsedMb", "latency.p95Ms", "errors.rate"],
  from: "2026-02-13T00:00:00Z",
  to: "2026-02-14T00:00:00Z",
  resolution: "5m",
});

for (const point of metrics.dataPoints) {
  console.log(point.timestamp);
  console.log("  Memory:", point.memory.heapUsedMb, "MB");
  console.log("  P95 Latency:", point.latency.p95Ms, "ms");
  console.log("  Error Rate:", (point.errors.rate * 100).toFixed(2), "%");
}

// Get a health summary, filtered to degraded agents
const { data: agentHealth } = await drd.agents.health.summary({
  status: "degraded",
  sortBy: "errorRate",
  order: "desc",
  limit: 25,
});

for (const agent of agentHealth) {
  console.log(agent.name, agent.healthStatus, agent.errorRate, agent.p95Latency);
}
```

Agent health directly affects trust scores. Persistent health issues lower the reliability component of the trust score.
| Health Status | Trust Impact | Recovery |
|---|---|---|
| Healthy | No impact, positive reliability signal | N/A |
| Degraded | -5 reliability points after 1 hour | Full recovery after 24h healthy |
| Unhealthy | -15 reliability points, immediate | Gradual recovery over 7 days |
**Auto-quarantine:** Agents that remain unhealthy for more than 4 hours are automatically quarantined. Quarantined agents cannot perform actions until a workspace admin manually reviews and restores them.
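To spot agents at risk of quarantine, the health summary query shown earlier can be reused with a different filter; this assumes `"unhealthy"` is accepted as a `status` value alongside the `"degraded"` filter documented above.

```typescript
// Reuses the documented health summary query; assumes "unhealthy" is a valid
// status filter alongside the "degraded" value shown earlier.
const { data: unhealthyAgents } = await drd.agents.health.summary({
  status: "unhealthy",
  sortBy: "errorRate",
  order: "desc",
  limit: 50,
});

for (const agent of unhealthyAgents) {
  console.log(`${agent.name}: ${agent.healthStatus}`);
}
```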