# 13 -- Production Checklist

Source: https://mcp-hangar.io/docs/cookbook/13-production-checklist

---
> Before you go live, walk through this list.

## Security

- [ ] TLS termination configured (reverse proxy or load balancer)
- [ ] `auth.enabled: true` and `auth.allow_anonymous: false`
- [ ] API keys created for each service principal
- [ ] RBAC roles assigned following the principle of least privilege
- [ ] Tool access policies set for sensitive tools
- [ ] Secrets use environment variable interpolation (`${VAR}`), not plain text in config
- [ ] Docker MCP servers use `read_only: true` and `network: none` where possible
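
The secrets item above can be spot-checked with a small script. The following is a minimal sketch, not part of MCP-Hangar: it assumes a flat `key: value` YAML layout, and the list of secret-looking key names (`password`, `secret`, `token`, `api_key`) is a hypothetical heuristic you should adapt to your own config.

```python
import re

# Hypothetical heuristic: key names that probably hold secrets.
SECRET_KEY = re.compile(r"(password|secret|token|api_key)", re.IGNORECASE)
# A value counts as interpolated only if it is exactly one ${VAR} reference.
INTERPOLATED = re.compile(r"^\$\{[A-Za-z_][A-Za-z0-9_]*\}$")


def find_plaintext_secrets(config_text: str) -> list[tuple[int, str]]:
    """Return (line_number, key) for secret-looking keys with literal values."""
    findings = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        # Naive comment stripping; good enough for a spot check.
        stripped = line.split("#", 1)[0].strip()
        if ":" not in stripped:
            continue
        key, value = stripped.split(":", 1)
        key = key.strip().lstrip("- ").strip()
        value = value.strip().strip("'\"")
        if not value or not SECRET_KEY.search(key):
            continue  # nested mapping, empty value, or not a secret-like key
        if not INTERPOLATED.match(value):
            findings.append((lineno, key))
    return findings
```

Run it over the text of your config file and treat any finding as a prompt for manual review; a line-based scan like this can miss secrets stored under unusual key names.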

## Reliability

- [ ] Health checks enabled on all MCP servers (`health_check_interval_s`)
- [ ] Circuit breaker thresholds tuned (`max_consecutive_failures`)
- [ ] MCP server groups configured for critical MCP servers (at least 2 members)
- [ ] `min_healthy` set to match your SLA requirements
- [ ] Idle TTL set appropriately (300s for subprocess, 600s for containers)
- [ ] Rate limiting enabled to prevent overload
- [ ] Event store configured (`event_store.driver: sqlite`)

## Observability

- [ ] Prometheus scraping `/metrics` endpoint
- [ ] Grafana dashboards imported from `monitoring/grafana/`
- [ ] Alertmanager rules configured for:
  - MCP server state transitions to DEAD
  - Circuit breaker OPEN events
  - Health check failure rate above threshold
  - Tool call error rate above threshold
- [ ] Structured JSON logging enabled (`MCP_JSON_LOGS=true`)
- [ ] Log level set to `INFO` for production (`MCP_LOG_LEVEL=INFO`)
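
Before wiring up alert rules, it can help to confirm that the `/metrics` output actually contains the series your rules refer to. The sketch below parses the Prometheus text exposition format and reports which required metric names are absent. The metric names in any call you make are your own assumption; this page does not list the names Hangar exports.

```python
def metric_names(exposition_text: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition_text: str, required: list[str]) -> list[str]:
    """Return the required metric names that do not appear in the output."""
    return sorted(set(required) - metric_names(exposition_text))
```

Feed it the body of a `/metrics` response together with the names used in your alert expressions; an empty result means every referenced series is present.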

## Configuration

- [ ] Config file reviewed manually for correctness (there is no `validate` subcommand)
- [ ] Hot-reload tested via `mcp-hangar add` API (no SIGHUP handler exists)
- [ ] Environment-specific configs separated (dev/staging/prod)

## Deployment

- [ ] Running behind a reverse proxy (nginx, Caddy, Envoy)
- [ ] Health probe endpoints exposed for orchestrator (`/health/live`, `/health/ready`, `/health/startup`)
- [ ] Graceful shutdown configured (SIGTERM handling)
- [ ] Resource limits set (memory, CPU) for container deployments
- [ ] Persistent volume for event store SQLite database
- [ ] Docker image pinned to specific version tag, not `latest`
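
A quick way to confirm the probe endpoints are reachable through your proxy is to request each one and look at the status code. This is a minimal sketch using only the standard library. It assumes that a healthy probe answers with a 2xx status, which this page does not state, and the base URL in the usage note is a placeholder.

```python
import urllib.error
import urllib.request

PROBE_PATHS = ("/health/live", "/health/ready", "/health/startup")


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status for a GET, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout


def check_probes(base_url: str, fetch=http_status) -> dict[str, bool]:
    """Map each probe path to True when it answers with a 2xx status."""
    base = base_url.rstrip("/")
    return {path: 200 <= fetch(base + path) < 300 for path in PROBE_PATHS}
```

For example, `check_probes("https://hangar.example.internal")` returns a dictionary keyed by probe path. The `fetch` parameter exists so the check can be exercised with a stub instead of a live server.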

## Kubernetes (if applicable)

> The MCP-Hangar Operator is an external component shipped from
> [hangar-operator](https://github.com/mcp-hangar/hangar-operator).
> See [Recipe 11](11-discovery-kubernetes.md#prerequisites) for install instructions.

- [ ] MCP-Hangar Operator installed (see [Recipe 11 prerequisites](11-discovery-kubernetes.md#prerequisites))
- [ ] CRDs applied (`MCPServer`, `MCPServerGroup`, `MCPDiscoverySource`)
- [ ] Kubernetes RBAC configured for the operator service account
- [ ] Network policies restrict communication between MCP servers
- [ ] Resource requests and limits in Helm values
- [ ] PodDisruptionBudget for Hangar deployment

## Testing

- [ ] Failover tested: kill a primary MCP server, verify backup takes over
- [ ] Cold start tested: invoke a tool on a cold MCP server, verify latency is acceptable
- [ ] Rate limit tested: flood API, verify 429 responses
- [ ] Auth tested: invalid key returns 401, insufficient role returns 403
- [ ] Config reload tested: edit config.yaml, verify changes apply
- [ ] Recovery tested: kill all MCP servers, verify they reinitialize
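
The auth and rate-limit checks above come down to comparing observed HTTP status codes against expected ones. The sketch below shows that comparison only. How you issue the requests (endpoint, header carrying the API key) depends on your deployment and is left out; the scenario names are hypothetical labels.

```python
from collections import Counter

# Expected status codes taken from the checklist items above.
EXPECTED_AUTH = {"invalid_key": 401, "insufficient_role": 403}


def check_auth(observed: dict[str, int]) -> list[str]:
    """Return readable mismatches between observed and expected status codes."""
    problems = []
    for scenario, expected in EXPECTED_AUTH.items():
        actual = observed.get(scenario)
        if actual != expected:
            problems.append(f"{scenario}: expected {expected}, got {actual}")
    return problems


def summarize_flood(statuses: list[int]) -> dict[str, object]:
    """Summarize a burst of requests; at least one 429 means limiting kicked in."""
    counts = Counter(statuses)
    return {
        "total": len(statuses),
        "throttled": counts[429],
        "rate_limited": counts[429] > 0,
    }
```

Record the status code from each test request, pass the results in, and treat a non-empty list from `check_auth` or `rate_limited: False` after a flood as a failed checklist item.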

## Runbook

- [ ] Incident response documented
- [ ] MCP server restart procedure documented
- [ ] Config rollback procedure documented
- [ ] Contact list for MCP server owners maintained
