Infrastructure

System Health

Monitor the health of every infrastructure component. Track response times, manage incidents, and view uptime history for your DRD deployment.

Monitored Components

Database

PostgreSQL connection health, query latency, and connection pool status.

API

REST and tRPC endpoint response times, error rates, and throughput.

Queue

Job queue depth, processing rate, and worker health.

Cache

Cache hit rates, eviction rates, and memory utilization.

Storage

File storage availability, upload/download latency, and capacity.

Integration

Third-party service connectivity and webhook delivery health.

Status Indicators

Status	Description	Action
operational	Component is functioning normally within SLA parameters	None
degraded	Component experiencing higher latency or partial errors but still functional	Investigate
partial_outage	Component partially unavailable; some requests failing	Alert team
major_outage	Component fully unavailable; all requests failing	Incident created
maintenance	Component undergoing scheduled maintenance	Check ETA
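The status values in the table can be modeled as a closed set in client code. The sketch below copies the five statuses and their actions from the table. The `STATUS_RANK` ordering and the `worstStatus` helper are assumptions made for illustration and are not part of the DRD API.

```typescript
// Status values and actions copied from the Status Indicators table.
type ComponentStatus =
  | "operational"
  | "degraded"
  | "partial_outage"
  | "major_outage"
  | "maintenance";

const STATUS_ACTION: Record<ComponentStatus, string> = {
  operational: "None",
  degraded: "Investigate",
  partial_outage: "Alert team",
  major_outage: "Incident created",
  maintenance: "Check ETA",
};

// Hypothetical severity ordering (an assumption, not documented by DRD):
// a higher number means a worse state.
const STATUS_RANK: Record<ComponentStatus, number> = {
  operational: 0,
  maintenance: 1,
  degraded: 2,
  partial_outage: 3,
  major_outage: 4,
};

// Return the worst status in a list; an empty list counts as operational.
function worstStatus(statuses: ComponentStatus[]): ComponentStatus {
  return statuses.reduce<ComponentStatus>(
    (worst, s) => (STATUS_RANK[s] > STATUS_RANK[worst] ? s : worst),
    "operational",
  );
}

// Usage: roll several component statuses up into one action.
const overall = worstStatus(["operational", "degraded", "maintenance"]);
const action = STATUS_ACTION[overall];
```

Using `Record<ComponentStatus, string>` means the compiler flags any status that is missing an action.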

Incident Tracking

DRD automatically creates an incident when a component's status changes. Each incident moves through a lifecycle from detection to resolution and keeps a full timeline.

Detected

Monitoring detected an anomaly. Incident created and on-call team notified within 60 seconds.

Investigating

Engineering team is actively investigating. Status page updated. Affected customers notified.

Identified

Root cause identified. Fix in progress. Estimated time to resolution published.

Monitoring

Fix deployed and being monitored. Component back to operational but under close watch.

Resolved

Incident fully resolved. Post-mortem scheduled. Customer notification sent with summary.
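The five stages above form an ordered lifecycle, which can be checked in client code. This is a minimal sketch: the lowercase stage names follow the `timeline` entries in the sample incident response on this page, and the rule that an incident may skip stages forward but never move backward is an assumption inferred from that sample (it goes from `investigating` straight to `resolved`). `canTransition` and `isValidTimeline` are hypothetical helpers.

```typescript
type IncidentStage =
  | "detected"
  | "investigating"
  | "identified"
  | "monitoring"
  | "resolved";

// Lifecycle order as listed in the Incident Tracking section.
const STAGE_ORDER: IncidentStage[] = [
  "detected",
  "investigating",
  "identified",
  "monitoring",
  "resolved",
];

// Assumption: only forward moves are allowed, and stages may be skipped.
function canTransition(from: IncidentStage, to: IncidentStage): boolean {
  return STAGE_ORDER.indexOf(to) > STAGE_ORDER.indexOf(from);
}

// A timeline is valid when it starts at "detected" and only moves forward.
function isValidTimeline(stages: IncidentStage[]): boolean {
  let previous: IncidentStage | null = null;
  for (const stage of stages) {
    if (previous === null) {
      if (stage !== "detected") return false;
    } else if (!canTransition(previous, stage)) {
      return false;
    }
    previous = stage;
  }
  return previous !== null;
}
```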

Automated Detection

DRD runs anomaly detection across more than 50 metrics to detect incidents automatically. Mean time to detect (MTTD) is under 30 seconds for major outages and under 5 minutes for degraded performance.
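This page does not say which detection method DRD uses. As an illustration of the general idea only, the sketch below flags a metric sample whose z-score against a recent baseline exceeds a threshold. The function names, the baseline window, and the threshold of 3 are all assumptions, not DRD's actual algorithm.

```typescript
// Arithmetic mean of a non-empty list.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation of a non-empty list.
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))));
}

// Illustrative z-score check (an assumption, not DRD's detector):
// returns true when `value` lies more than `threshold` standard
// deviations away from the baseline mean.
function isAnomalous(
  baseline: number[],
  value: number,
  threshold: number = 3,
): boolean {
  if (baseline.length < 2) return false; // too little data to judge
  const m = mean(baseline);
  const sd = stdDev(baseline);
  if (sd === 0) return value !== m; // flat baseline: any change stands out
  return Math.abs(value - m) / sd > threshold;
}

// Usage with made-up p99 latency samples in milliseconds.
const recentP99 = [100, 102, 98, 101, 99];
const spike = isAnomalous(recentP99, 500); // far outside the baseline
const normal = isAnomalous(recentP99, 101); // within normal variation
```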

system-health.ts

import { DRD } from '@drd/sdk';

const drd = new DRD({ apiKey: process.env.DRD_API_KEY });

// Check current health
const health = await drd.systemHealth.getCurrent();
health.forEach(check => {
  console.log(check.component, check.status, check.responseTimeMs);
});

// Create an incident
await drd.systemHealth.createIncident({
  title: 'Elevated API latency',
  severity: 'medium',
  description: 'p99 latency > 500ms on /api/guard endpoint',
  affectedComponents: ['api', 'cache'],
});

Health Check Endpoints

GET /v1/health

curl https://api.drd.io/v1/health

# Response (no auth required)
{
  "ok": true,
  "data": {
    "status": "operational",
    "version": "2026.02.14",
    "components": {
      "api_gateway": { "status": "operational", "latencyMs": 12 },
      "trust_engine": { "status": "operational", "latencyMs": 8 },
      "policy_engine": { "status": "operational", "latencyMs": 5 },
      "event_store": { "status": "operational", "latencyMs": 3 },
      "content_protection": { "status": "operational", "latencyMs": 15 },
      "webhook_delivery": { "status": "operational", "latencyMs": 22 }
    },
    "uptime": {
      "last24h": 100.0,
      "last7d": 99.99,
      "last30d": 99.98
    },
    "checkedAt": "2026-02-14T12:00:00Z"
  }
}
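The response above can be typed and summarized on the client. In this sketch the interfaces mirror the fields shown in the sample response, while `unhealthyComponents` and `downtimeMinutes` are hypothetical helpers. The `degraded` entry in the sample object is made up for the demonstration and does not appear in the response above.

```typescript
interface ComponentHealth {
  status: string;
  latencyMs: number;
}

interface HealthResponse {
  ok: boolean;
  data: {
    status: string;
    version: string;
    components: Record<string, ComponentHealth>;
    uptime: { last24h: number; last7d: number; last30d: number };
    checkedAt: string;
  };
}

// Names of all components whose status is not "operational".
function unhealthyComponents(resp: HealthResponse): string[] {
  const components = resp.data.components;
  return Object.keys(components).filter((name) => {
    const component = components[name];
    return component !== undefined && component.status !== "operational";
  });
}

// Convert an uptime percentage over a window into minutes of downtime.
function downtimeMinutes(uptimePercent: number, windowDays: number): number {
  return (1 - uptimePercent / 100) * windowDays * 24 * 60;
}

// Sample data: the webhook_delivery values are hypothetical.
const sample: HealthResponse = {
  ok: true,
  data: {
    status: "degraded",
    version: "2026.02.14",
    components: {
      api_gateway: { status: "operational", latencyMs: 12 },
      event_store: { status: "operational", latencyMs: 3 },
      webhook_delivery: { status: "degraded", latencyMs: 480 },
    },
    uptime: { last24h: 100.0, last7d: 99.99, last30d: 99.98 },
    checkedAt: "2026-02-14T12:00:00Z",
  },
};

const failing = unhealthyComponents(sample);
// 99.98% over 30 days is roughly 8.6 minutes of downtime.
const downtime30d = downtimeMinutes(sample.data.uptime.last30d, 30);
```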

GET /v1/health/metrics

curl "https://api.drd.io/v1/health/metrics?window=24h&resolution=hourly" \
  -H "Authorization: Bearer drd_ws_sk_live_Abc123..."

# Response
{
  "ok": true,
  "data": {
    "window": "24h",
    "resolution": "hourly",
    "metrics": {
      "requestsTotal": 8542100,
      "errorRate": 0.002,
      "p50LatencyMs": 14,
      "p95LatencyMs": 45,
      "p99LatencyMs": 120,
      "points": [
        { "time": "2026-02-13T12:00:00Z", "requests": 356000, "p50": 13, "p99": 110 },
        { "time": "2026-02-13T13:00:00Z", "requests": 362000, "p50": 14, "p99": 115 }
      ]
    }
  }
}
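The `points` array lends itself to simple derived figures. The sketch below assumes that `errorRate` is a fraction (the SDK example on this page multiplies it by 100 to print a percentage) and that an `hourly` bucket covers 3600 seconds. The helper names are hypothetical.

```typescript
interface MetricPoint {
  time: string;
  requests: number;
  p50: number;
  p99: number;
}

// The point with the highest p99 latency, or undefined for an empty list.
function peakP99(points: MetricPoint[]): MetricPoint | undefined {
  let peak: MetricPoint | undefined;
  for (const point of points) {
    if (peak === undefined || point.p99 > peak.p99) peak = point;
  }
  return peak;
}

// Average request rate inside one bucket.
function requestsPerSecond(point: MetricPoint, bucketSeconds: number): number {
  return point.requests / bucketSeconds;
}

// Approximate number of failed requests, treating errorRate as a fraction.
function estimatedErrors(requestsTotal: number, errorRate: number): number {
  return Math.round(requestsTotal * errorRate);
}

// The two points from the sample response above.
const samplePoints: MetricPoint[] = [
  { time: "2026-02-13T12:00:00Z", requests: 356000, p50: 13, p99: 110 },
  { time: "2026-02-13T13:00:00Z", requests: 362000, p50: 14, p99: 115 },
];

const worstHour = peakP99(samplePoints);
const failedRequests = estimatedErrors(8542100, 0.002); // about 17,000
```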

GET /v1/health/incidents

curl "https://api.drd.io/v1/health/incidents?status=resolved&limit=5" \
  -H "Authorization: Bearer drd_ws_sk_live_Abc123..."

# Response
{
  "ok": true,
  "data": [
    {
      "id": "inc_01JM7XBN4RTYP",
      "title": "Elevated API Gateway Latency",
      "status": "resolved",
      "severity": "minor",
      "affectedComponents": ["api_gateway"],
      "detectedAt": "2026-02-10T14:22:00Z",
      "resolvedAt": "2026-02-10T14:45:00Z",
      "durationMinutes": 23,
      "timeline": [
        { "status": "detected", "at": "2026-02-10T14:22:00Z", "message": "p99 latency exceeded 500ms" },
        { "status": "investigating", "at": "2026-02-10T14:25:00Z", "message": "Team investigating" },
        { "status": "resolved", "at": "2026-02-10T14:45:00Z", "message": "Cache layer restored" }
      ]
    }
  ]
}
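The `timeline` entries make it possible to see how long an incident spent in each stage. This sketch assumes the timestamps are ISO 8601 strings in UTC, as in the sample response; `minutesBetween` and `stageDurations` are hypothetical helpers.

```typescript
interface TimelineEntry {
  status: string;
  at: string;
  message: string;
}

// Whole or fractional minutes between two ISO 8601 timestamps.
function minutesBetween(startIso: string, endIso: string): number {
  return (Date.parse(endIso) - Date.parse(startIso)) / 60000;
}

// Time spent in each stage, measured up to the next timeline entry.
// The final entry has no successor and is therefore not included.
function stageDurations(
  timeline: TimelineEntry[],
): { status: string; minutes: number }[] {
  const durations: { status: string; minutes: number }[] = [];
  let previous: TimelineEntry | null = null;
  for (const entry of timeline) {
    if (previous !== null) {
      durations.push({
        status: previous.status,
        minutes: minutesBetween(previous.at, entry.at),
      });
    }
    previous = entry;
  }
  return durations;
}

// Timeline copied from the sample incident above.
const sampleTimeline: TimelineEntry[] = [
  { status: "detected", at: "2026-02-10T14:22:00Z", message: "p99 latency exceeded 500ms" },
  { status: "investigating", at: "2026-02-10T14:25:00Z", message: "Team investigating" },
  { status: "resolved", at: "2026-02-10T14:45:00Z", message: "Cache layer restored" },
];

// 14:22 to 14:45 is 23 minutes, matching durationMinutes in the response.
const total = minutesBetween("2026-02-10T14:22:00Z", "2026-02-10T14:45:00Z");
const perStage = stageDurations(sampleTimeline);
```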

SDK Example

health-sdk.ts

import { DRDClient } from "@drd-io/sdk";

const drd = new DRDClient({
  apiKey: process.env.DRD_API_KEY!,
  workspace: process.env.DRD_WORKSPACE!,
});

// Quick health check
const health = await drd.health.check();
console.log(`Status: ${health.status}`);
console.log(`Uptime (30d): ${health.uptime.last30d}%`);

// Check individual components
for (const [name, component] of Object.entries(health.components)) {
  console.log(`${name}: ${component.status} (${component.latencyMs}ms)`);
}

// Get performance metrics
const metrics = await drd.health.metrics({ window: "24h", resolution: "hourly" });
console.log(`Total requests (24h): ${metrics.requestsTotal}`);
console.log(`Error rate: ${(metrics.errorRate * 100).toFixed(2)}%`);
console.log(`p99 latency: ${metrics.p99LatencyMs}ms`);

// Subscribe to incident updates (webhook)
await drd.health.subscribe({
  events: ["incident.created", "incident.updated", "incident.resolved"],
  url: "https://your-app.com/webhooks/drd-health",
});
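On the receiving side, the webhook endpoint needs to validate and route incoming events. This page does not document the webhook payload, so the `HealthWebhookPayload` shape below is purely an assumption. Only the three event names come from the `subscribe` call above. `parseHealthEvent` and `describeEvent` are hypothetical helpers that could be called from any HTTP framework's request handler.

```typescript
type HealthEventName =
  | "incident.created"
  | "incident.updated"
  | "incident.resolved";

// Hypothetical payload shape; check the real webhook schema before use.
interface HealthWebhookPayload {
  event: HealthEventName;
  incident: { id: string; title: string; status: string };
}

const KNOWN_EVENTS: string[] = [
  "incident.created",
  "incident.updated",
  "incident.resolved",
];

// Parse a raw request body; returns null for anything malformed or unknown.
function parseHealthEvent(rawBody: string): HealthWebhookPayload | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const candidate = parsed as { event?: unknown; incident?: unknown };
  if (typeof candidate.event !== "string") return null;
  if (KNOWN_EVENTS.indexOf(candidate.event) === -1) return null;
  const incident = candidate.incident as
    | { id?: unknown; title?: unknown; status?: unknown }
    | null
    | undefined;
  if (
    !incident ||
    typeof incident.id !== "string" ||
    typeof incident.title !== "string" ||
    typeof incident.status !== "string"
  ) {
    return null;
  }
  return {
    event: candidate.event as HealthEventName,
    incident: { id: incident.id, title: incident.title, status: incident.status },
  };
}

// Turn an event into a one-line message, for example for a chat alert.
function describeEvent(payload: HealthWebhookPayload): string {
  const label = payload.incident.id + ": " + payload.incident.title;
  switch (payload.event) {
    case "incident.created":
      return "OPENED " + label;
    case "incident.updated":
      return "UPDATED " + label + " (" + payload.incident.status + ")";
    case "incident.resolved":
      return "RESOLVED " + label;
  }
}
```

Rejecting unknown event names up front keeps the handler safe if new event types are added later.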

Next Steps

Monitoring

Agent-level monitoring

Dashboard

Health widgets for dashboards

Webhooks

Alert on health changes

Agent Health

Agent-level health metrics and error tracking
