Observability Guide

Name: MCP Hangar
Author: MCP Hangar

This guide covers MCP Hangar's observability features: metrics, tracing, logging, and health checks.

Quick Start
Monitoring Stack
Metrics
Grafana Dashboards
Alerting
Tracing
Langfuse Integration
Logging
Health Checks
SLIs/SLOs
Troubleshooting
Best Practices

Quick Start

Prerequisites

# Core package
pip install mcp-hangar

# For full observability support
pip install "mcp-hangar[observability]"

Start Monitoring Stack

The monitoring stack is in monitoring/ and includes Prometheus, Grafana, and Alertmanager:

# Using Docker Compose
cd monitoring
docker compose up -d

# Using Podman
cd monitoring
podman compose up -d

Access dashboards:

Service	URL	Credentials
Grafana	http://localhost:3000	admin / admin
Prometheus	http://localhost:9090	-
Alertmanager	http://localhost:9093	-

Start MCP Hangar with Metrics

# HTTP mode (exposes /metrics endpoint)
mcp-hangar serve --http --port 8000

# With custom config
MCP_CONFIG=config.yaml mcp-hangar serve --http --port 8000

Verify metrics are exposed:

curl http://localhost:8000/metrics | grep mcp_hangar
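To script the same check (for example in a smoke test), the exposition text can be filtered in Python. This is a minimal sketch using only the standard library; the helper names fetch_metrics and hangar_samples and the sample payload are illustrative and not part of MCP Hangar.

```python
from urllib.request import urlopen


def fetch_metrics(url="http://localhost:8000/metrics", timeout=5.0):
    """Download the Prometheus text exposition from a running instance."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def hangar_samples(exposition_text, prefix="mcp_hangar_"):
    """Return the sample lines whose metric name starts with the prefix."""
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith(prefix):
            samples.append(line)
    return samples


# Illustrative payload; against a live server use fetch_metrics() instead.
SAMPLE = """\
# HELP mcp_hangar_tool_calls_total Total tool invocations
# TYPE mcp_hangar_tool_calls_total counter
mcp_hangar_tool_calls_total{mcp_server="math",tool="add",status="success"} 42
process_cpu_seconds_total 1.5
"""

found = hangar_samples(SAMPLE)
print(found)
```

An empty result against a live endpoint suggests that Hangar is not running in HTTP mode or that the wrong port is being queried.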

Monitoring Stack

Architecture

+----------------+     scrape      +------------+
|  MCP Hangar    |---------------->| Prometheus |
|  :8000/metrics |                 |   :9090    |
+----------------+                 +-----+------+
                                         |
                                         | query
                                         v
                                   +------------+
                                   |  Grafana   |
                                   |   :3000    |
                                   +------------+

+----------------+     alerts      +-------------+
|  Prometheus    |---------------->| Alertmanager|
|  alert rules   |                 |    :9093    |
+----------------+                 +-------------+

Configuration Files

File	Purpose
`monitoring/docker-compose.yaml`	Container orchestration
`monitoring/prometheus/prometheus.yaml`	Scrape configuration
`monitoring/prometheus/alerts.yaml`	Alert rules
`monitoring/alertmanager/alertmanager.yaml`	Notification routing
`monitoring/grafana/provisioning/`	Dashboard/datasource provisioning
`monitoring/grafana/dashboards/`	Pre-built dashboard JSON files

Prometheus Configuration

The default configuration scrapes MCP Hangar every 10 seconds:

# monitoring/prometheus/prometheus.yaml
scrape_configs:
  - job_name: 'mcp-hangar'
    static_configs:
      - targets: ['host.docker.internal:8000']
        labels:
          service: 'mcp-hangar'
          tier: 'application'
    metrics_path: /metrics
    scrape_interval: 10s
    scrape_timeout: 5s

For Kubernetes deployments, use service discovery:

scrape_configs:
  - job_name: 'mcp-hangar'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: mcp-hangar
        action: keep

Metrics

MCP Hangar exports Prometheus metrics at /metrics. All metrics use the mcp_hangar_ prefix.

Currently Exported Metrics

Tool Invocations

Metric	Type	Labels	Description
`mcp_hangar_tool_calls_total`	Counter	mcp_server, tool, status	Total tool invocations
`mcp_hangar_tool_call_duration_seconds`	Histogram	mcp_server, tool	Invocation latency (buckets: 0.01-30s)
`mcp_hangar_tool_call_errors_total`	Counter	mcp_server, tool, error_type	Failed invocations by error type

Example queries:

# Tool call rate by mcp_server
sum(rate(mcp_hangar_tool_calls_total[5m])) by (mcp_server)

# P95 latency by tool
histogram_quantile(0.95, sum(rate(mcp_hangar_tool_call_duration_seconds_bucket[5m])) by (le, tool))

# Error rate
sum(rate(mcp_hangar_tool_call_errors_total[5m])) / sum(rate(mcp_hangar_tool_calls_total[5m]))
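The P95 query relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea in plain Python to show why the result depends on bucket boundaries. It is a simplified approximation, not Prometheus's implementation, and the bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The bucket list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls past the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for tool call durations (seconds).
BUCKETS = [(0.1, 50), (0.5, 80), (1.0, 95), (5.0, 100), (math.inf, 100)]

p90 = histogram_quantile(0.90, BUCKETS)
print(round(p90, 3))
```

Because the estimate is interpolated, it can never be more precise than the bucket layout: a quantile that falls in a wide bucket carries a correspondingly wide error.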

Batch Invocations

Metric	Type	Labels	Description
`mcp_hangar_batch_calls_total`	Counter	result	Batch invocations (success/failure)
`mcp_hangar_batch_duration_seconds`	Histogram	-	Batch execution time
`mcp_hangar_batch_size`	Histogram	-	Number of calls per batch
`mcp_hangar_batch_cancellations_total`	Counter	-	Cancelled batches
`mcp_hangar_batch_circuit_breaker_rejections_total`	Counter	-	Circuit breaker rejections
`mcp_hangar_batch_concurrency`	Gauge	-	Current parallel executions

Example queries:

# Batch success rate
sum(rate(mcp_hangar_batch_calls_total{result="success"}[5m]))
/ sum(rate(mcp_hangar_batch_calls_total[5m]))

# Average batch size
rate(mcp_hangar_batch_size_sum[5m]) / rate(mcp_hangar_batch_size_count[5m])

Health Checks

Metric	Type	Labels	Description
`mcp_hangar_health_checks_total`	Counter	mcp_server, result	Health check executions
`mcp_hangar_health_check_duration_seconds`	Histogram	mcp_server	Health check latency
`mcp_hangar_health_check_consecutive_failures`	Gauge	mcp_server	Current consecutive failure count

Example queries:

# Unhealthy mcp_servers (>2 consecutive failures)
mcp_hangar_health_check_consecutive_failures > 2

# Health check success rate
sum(rate(mcp_hangar_health_checks_total{result="healthy"}[5m])) by (mcp_server)
/ sum(rate(mcp_hangar_health_checks_total[5m])) by (mcp_server)

MCP Server Lifecycle

Metric	Type	Labels	Description
`mcp_hangar_mcp_server_state`	Gauge	mcp_server	Current state (0=cold, 1=initializing, 2=ready, 3=degraded, 4=dead)
`mcp_hangar_mcp_server_up`	Gauge	mcp_server	1 if MCP server is reachable
`mcp_hangar_mcp_server_starts_total`	Counter	mcp_server	MCP server start attempts
`mcp_hangar_mcp_server_initialized`	Gauge	mcp_server	1 if MCP server has been initialized
`mcp_hangar_mcp_server_cold_start_seconds`	Histogram	mcp_server	Cold start latency
`mcp_hangar_mcp_server_cold_start_in_progress`	Gauge	mcp_server	1 if cold start is in progress
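When consuming mcp_hangar_mcp_server_state outside Grafana, it helps to map the numeric gauge value back to a state name. The sketch below encodes the mapping from the table; the McpServerState enum and describe_state helper are hypothetical names for illustration and are not imported from the mcp_hangar package.

```python
from enum import IntEnum


class McpServerState(IntEnum):
    """Numeric values as documented for mcp_hangar_mcp_server_state."""

    COLD = 0
    INITIALIZING = 1
    READY = 2
    DEGRADED = 3
    DEAD = 4


def describe_state(value):
    """Translate a gauge sample (number or numeric string) to a state name."""
    return McpServerState(int(float(value))).name.lower()


# Prometheus returns sample values as strings in its JSON API.
print(describe_state("2"))
print(describe_state(3.0))
```

An unknown value raises ValueError, which is usually preferable to silently mislabeling a state if the mapping changes in a later release.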

Discovery

Metric	Type	Labels	Description
`mcp_hangar_discovery_mcp_servers`	Gauge	source	Discovered MCP servers per source
`mcp_hangar_discovery_registrations_total`	Counter	source	New registrations
`mcp_hangar_discovery_errors_total`	Counter	source	Errors by source
`mcp_hangar_discovery_cycle_duration_seconds`	Histogram	source	Discovery cycle duration

HTTP Transport

Metric	Type	Labels	Description
`mcp_hangar_http_requests_total`	Counter	mcp_server, method, status_code	HTTP requests to remote MCP servers
`mcp_hangar_http_request_duration_seconds`	Histogram	method	HTTP request latency
`mcp_hangar_http_connection_pool_size`	Gauge	mcp_server	Current HTTP connection pool size

Rate Limiting

Metric	Type	Labels	Description
`mcp_hangar_rate_limit_hits_total`	Counter	principal	Rate limit rejections

GC (Garbage Collection)

Metric	Type	Labels	Description
`mcp_hangar_gc_cycles_total`	Counter	-	GC cycle executions
`mcp_hangar_gc_cycle_duration_seconds`	Histogram	-	GC cycle duration

Grafana Dashboards

Pre-built dashboards are provisioned automatically from monitoring/grafana/dashboards/:

Overview Dashboard

File: overview.json URL: http://localhost:3000/d/mcp-hangar-overview

Provides high-level system health:

Request rate and error rate trends
Latency percentiles (P50, P95, P99)
MCP Server health status
Batch invocation success/failure rates
Health check results
GC cycle performance

MCP Server Details Dashboard

File: mcp-server-details.json URL: http://localhost:3000/d/mcp-hangar-mcp-server-details

Deep dive into individual MCP servers:

Tool call breakdown by tool name
Per-tool latency histograms
Error distribution by type
Health check history
Consecutive failure tracking

Alerts Dashboard

File: alerts.json URL: http://localhost:3000/d/mcp-hangar-alerts

Alert monitoring and trends:

Active alerts by severity
Alert condition trends (error rate, latency, health)
Historical alert timeline

Governance Dashboard

File: governance.json URL: http://localhost:3000/d/mcp-hangar-governance

MCP Hangar 1.4.0 adds a governance dashboard for tenant and policy operations:

Cost attribution by MCP server, tool, and cost model
Capability violations, behavioral deviations, tool schema drifts, detection rule matches, and enforcement actions
Tool access denials, filtered tools, and active tool-access policies
Batch in-flight calls, concurrency queueing, P95 wait time, and circuit-breaker state

Importing Dashboards Manually

If you are not using provisioning:

Open Grafana at http://localhost:3000
Go to Dashboards > Import
Upload a JSON file from monitoring/grafana/dashboards/
Select the Prometheus data source
Click Import

Alerting

Alert Configuration

Alert rules are defined in monitoring/prometheus/alerts.yaml and organized by severity:

Critical Alerts (Page On-Call)

Alert	Condition	For	Description
`MCPHangarNotResponding`	`up{job="mcp-hangar"} == 0`	1m	Service unreachable
`MCPHangarHighErrorRate`	Error rate > 10%	2m	Significant failures
`MCPHangarBatchHighFailureRate`	Batch failure > 20%	3m	Batch operations failing
`MCPHangarCircuitBreakerTripped`	CB rejections > 10/5m	2m	MCP Server isolated
`MCPHangarProviderUnhealthy`	Consecutive failures > 5	2m	MCP Server critically unhealthy
`MCPHangarAllProvidersDown`	All MCP servers down (with servers configured)	1m	Total outage
`MCPHangarCriticalDetectionMatch`	Critical detection rule match	0m	Security: critical behavioral detection

Warning Alerts (Investigate)

Alert	Condition	For	Description
`MCPHangarHighConsecutiveFailures`	Consecutive failures > 2	2m	Health check issues
`MCPHangarHealthCheckSlow`	P95 health check > 5s	5m	Slow health checks
`MCPHangarHighLatencyP95`	P95 latency > 3s	5m	Performance degradation
`MCPHangarHighLatencyP99`	P99 latency > 5s	5m	Tail latency issues
`MCPHangarHighLatencyByTool`	P95 per-tool > 5s	5m	Specific tool slow
`MCPHangarFrequentColdStarts`	Start rate > 0.1/s	10m	Consider increasing idle_ttl
`MCPHangarBatchSlowExecution`	P95 batch > 30s	5m	Slow batch processing
`MCPHangarBatchHighCancellationRate`	Cancellation > 10%	5m	Batches timing out
`MCPHangarBatchSizeTooLarge`	P95 size > 50	5m	Consider smaller batches
`MCPHangarGCSlowCycles`	P95 GC > 0.5s	5m	GC performance issue
`MCPHangarHighMemoryUsage`	Memory > 2GB	10m	Memory pressure
`MCPHangarHighCPUUsage`	CPU > 80%	10m	CPU saturation
`MCPHangarProviderDegraded`	MCP server state = DEGRADED	5m	MCP Server degraded
`MCPHangarRemoteProviderUnreachable`	Connection-refused errors > 10/5m	5m	Remote MCP server unreachable
`MCPHangarDiscoverySourceUnhealthy`	No healthy discovery sources	5m	Discovery sources down
`MCPHangarHighRateLimitRejections`	Rejected rate-limit hits > 1/s	5m	Clients being throttled
`MCPHangarCapabilityViolations`	Capability violations > 0/5m	5m	Security: capability breach
`MCPHangarConcurrencyQueueBuildup`	Concurrency queue building > 1/5m	5m	Backpressure / saturation
`MCPHangarEnforcementActionsActive`	Enforcement actions firing	5m	Security: enforcement active

Governance and Availability Alert Groups

1.4.0 adds two dedicated Prometheus groups in monitoring/prometheus/alerts.yaml:

mcp-hangar-governance -- security, policy/enforcement, and concurrency saturation signals. (Cost is tracked on the governance dashboard but is not alerted.)
mcp-hangar-availability -- MCP server state, discovery health, remote transport errors, and runtime rate limiting.

Use these groups when routing alerts to different teams; for example, security teams can subscribe to governance alerts while platform on-call owns availability and transport alerts.

Info Alerts (Tracking)

Alert	Condition	Description
`MCPHangarProviderStarted`	Any MCP server start	MCP Server lifecycle event
`MCPHangarHighToolCallVolume`	Rate > 100/s	High traffic notification

Alertmanager Configuration

Configure notification routing in monitoring/alertmanager/alertmanager.yaml:

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<your-service-key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<your-slack-webhook-url>'
        channel: '#mcp-hangar-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
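The routing tree above can be read as "first matching child route wins, otherwise fall back to the top-level receiver". The sketch below models only that rule in Python for the three receivers in the example. It is a simplified illustration, not Alertmanager's implementation: it ignores nested routes, regex matchers, the continue flag, and grouping.

```python
# Mirrors the example routes: (label matchers, receiver), evaluated in order.
CHILD_ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack"),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(alert_labels):
    """Return the receiver of the first child route whose matchers all match."""
    for matchers, receiver in CHILD_ROUTES:
        if all(alert_labels.get(key) == value for key, value in matchers.items()):
            return receiver
    # No child route matched, so the top-level receiver handles the alert.
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "MCPHangarNotResponding", "severity": "critical"}))
print(pick_receiver({"alertname": "MCPHangarHighToolCallVolume", "severity": "info"}))
```

Under this configuration, info-level alerts reach the default webhook because no child route matches their severity.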

Testing Alerts

Verify alert rules are loaded:

# Check Prometheus rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'

# Check for firing alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
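The same check can be automated by parsing the JSON returned by /api/v1/alerts. The sketch below filters an already-decoded response for firing alerts; the firing_alerts helper and the sample payload are illustrative, and the payload contains only the fields the helper reads.

```python
def firing_alerts(payload):
    """Return sorted (alertname, severity) pairs for alerts in state 'firing'."""
    alerts = payload.get("data", {}).get("alerts", [])
    firing = []
    for alert in alerts:
        if alert.get("state") != "firing":
            continue  # skip pending alerts that have not met their "for" duration
        labels = alert.get("labels", {})
        firing.append((labels.get("alertname"), labels.get("severity")))
    return sorted(firing)


# Trimmed example of a decoded /api/v1/alerts response.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "MCPHangarHighLatencyP95", "severity": "warning"},
                "state": "firing",
            },
            {
                "labels": {"alertname": "MCPHangarHighErrorRate", "severity": "critical"},
                "state": "pending",
            },
        ]
    },
}

print(firing_alerts(SAMPLE_RESPONSE))
```

Against a live Prometheus, the payload could be obtained with urllib.request.urlopen and json.load before being passed to the helper.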

Tracing

OpenTelemetry Integration

MCP Hangar supports distributed tracing via OpenTelemetry. Every tool invocation produces an OTEL span carrying MCP governance attributes (mcp.server.id, mcp.tool.name, mcp.tool.status, enforcement context, and identity context when available).

For the full MCP attribute taxonomy, partner backend recipes (OTEL Collector, OpenLIT, Langfuse, Grafana), and reference docker-compose setups, see: OpenTelemetry Integrations.

from mcp_hangar.observability import init_tracing, trace_span

# Initialize once at startup
init_tracing(
    service_name="mcp-hangar",
    otlp_endpoint="http://localhost:4317",
)

# Create spans for operations
with trace_span("process_request", {"request.id": req_id}) as span:
    span.add_event("checkpoint_reached")
    result = do_work()

MCP Governance Attributes on Spans

TracedMcpServerService automatically creates an OTEL span for each tool invocation with standard MCP governance attributes via set_governance_attributes():

from mcp_hangar.observability.conventions import McpServer, MCP, set_governance_attributes

# set_governance_attributes(span, ...) sets all applicable attributes in one call.
# None values are omitted -- no empty strings pollute OTLP backends.
set_governance_attributes(
    span,
    mcp_server_id="math",
    tool_name="add",
    user_id="alice",           # optional
    session_id="sess-42",      # optional
    policy_result="allow",     # optional
    enforcement_action=None,   # omitted from span
)

OTLP Audit Export

Security-relevant domain events (tool invocations, MCP server state transitions) are automatically exported as OTLP log records when OTEL_EXPORTER_OTLP_ENDPOINT is set. This is handled by OTLPAuditExporter and OTLPAuditEventHandler -- no additional configuration needed.

Events exported:

ToolInvocationCompleted / ToolInvocationFailed -- with MCP server, tool, status, duration, caller identity, cost attribution
McpServerStateChanged -- with MCP server, from_state, to_state

Caller identity attributes (mcp.caller.type, mcp.caller.id, mcp.caller.roles) are automatically propagated from the event's identity_context when available.

Cost attributes (mcp.cost.cents, mcp.cost.model, mcp.cost.input_tokens, mcp.cost.output_tokens) are included when cost attribution is configured.

Compliance Export Formats (Enterprise)

Enterprise deployments can export audit events in SIEM-compatible formats alongside OTLP. Available exporters (in src/mcp_hangar/compliance/):

Format	Class	Use Case
CEF	`CEFExporter`	ArcSight, QRadar, Splunk via CEF
JSON-lines	`JSONLinesExporter`	Splunk HEC, Elasticsearch, custom pipelines
LEEF	`LEEFExporter`	IBM QRadar native format
Syslog (RFC 5424)	`SyslogExporter`	Any syslog-compatible SIEM

All exporters implement the IAuditExporter protocol and output to file, callback, or stderr (for container log collection). Configure via the compliance bootstrap.

Environment Variables

Variable	Default	Description
`MCP_TRACING_ENABLED`	`true`	Enable/disable tracing
`OTEL_EXPORTER_OTLP_ENDPOINT`	`http://localhost:4317`	OTLP collector endpoint (also activates OTLP audit export)
`OTEL_SERVICE_NAME`	`mcp-hangar`	Service name in traces

Trace Context Propagation

W3C TraceContext is automatically propagated across agent -> Hangar -> MCP server boundaries:

Inbound: BatchExecutor extracts traceparent from call metadata, creating child spans linked to the agent's root trace.
Outbound: HttpClient injects traceparent into outbound HTTP headers when calling remote MCP servers.
Stdio: Not supported (JSON-RPC over stdin/stdout has no header mechanism).

Manual propagation is also available:

from mcp_hangar.observability import inject_trace_context, extract_trace_context

# Inject into outgoing requests
headers = {}
inject_trace_context(headers)

# Extract from incoming requests
context = extract_trace_context(request_headers)
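For debugging propagation issues it can help to inspect the traceparent header directly. The sketch below parses and builds a version 00 header as defined by W3C Trace Context (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags). It is independent of MCP Hangar's own helpers; parse_traceparent and build_traceparent are hypothetical names used only here.

```python
import re

# version - trace-id - parent-id - trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def build_traceparent(trace_id, parent_id, sampled=True):
    """Format a version 00 traceparent header value."""
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed)
```

If the trace ID seen by the remote MCP server differs from the one the agent sent, the header was likely dropped or regenerated at one of the hops.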

Langfuse Integration

MCP Hangar integrates with Langfuse for LLM-specific observability.

Configuration

export MCP_LANGFUSE_ENABLED=true
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_HOST=https://cloud.langfuse.com

Or via config.yaml:

observability:
  langfuse:
    enabled: true
    public_key: ${LANGFUSE_PUBLIC_KEY}
    secret_key: ${LANGFUSE_SECRET_KEY}
    host: https://cloud.langfuse.com
    sample_rate: 1.0

Trace Propagation

from mcp_hangar.application.services import TracedMcpServerService

# traced_service is assumed to be an existing TracedMcpServerService instance.
result = traced_service.invoke_tool(
    mcp_server_id="math",
    tool_name="add",
    arguments={"a": 1, "b": 2},
    trace_id="your-langfuse-trace-id",
    user_id="user-123",
    session_id="session-456",
)

See ADR-007 for architectural details.

Logging

Structured Logging

MCP Hangar uses structlog for structured JSON logging:

{
  "timestamp": "2026-02-03T10:30:00.123Z",
  "level": "info",
  "event": "tool_invoked",
  "mcp_server": "math",
  "tool": "add",
  "duration_ms": 150,
  "service": "mcp-hangar"
}
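Because each log line is a self-contained JSON object, ad hoc analysis needs nothing more than a JSON parser. The sketch below scans log lines for slow tool invocations using the field names from the example record. The slow_tool_calls helper and the 100 ms threshold are illustrative assumptions.

```python
import json


def slow_tool_calls(log_lines, threshold_ms=100):
    """Return (mcp_server, tool, duration_ms) for slow 'tool_invoked' events."""
    slow = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines such as startup banners
        if not isinstance(record, dict):
            continue
        if record.get("event") != "tool_invoked":
            continue
        duration = record.get("duration_ms", 0)
        if duration > threshold_ms:
            slow.append((record.get("mcp_server"), record.get("tool"), duration))
    return slow


# Illustrative log lines shaped like the record shown above.
LINES = [
    '{"event": "tool_invoked", "mcp_server": "math", "tool": "add", "duration_ms": 150}',
    '{"event": "tool_invoked", "mcp_server": "math", "tool": "sub", "duration_ms": 20}',
    "plain text line that is not JSON",
]

print(slow_tool_calls(LINES))
```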

Configuration

logging:
  level: INFO          # DEBUG, INFO, WARNING, ERROR
  json_format: true    # JSON output for log aggregation

Environment variable:

MCP_LOG_LEVEL=DEBUG mcp-hangar serve --http

Log Correlation

Include trace IDs for correlation with distributed traces:

from mcp_hangar.observability import get_current_trace_id
from mcp_hangar.logging_config import get_logger

logger = get_logger(__name__)
logger.info("processing", trace_id=get_current_trace_id())

Health Checks

HTTP Endpoints

Endpoint	Purpose	Use Case
`/health/live`	Liveness	Container restart decisions
`/health/ready`	Readiness	Traffic routing
`/health/startup`	Startup	Initial boot gate

Response Format

{
  "status": "healthy",
  "checks": [
    {
      "name": "mcp_servers",
      "status": "healthy",
      "duration_ms": 1.2
    }
  ],
  "version": "0.6.3",
  "uptime_seconds": 3600.5
}
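A deployment script or external monitor can interpret this response by checking the top-level status and listing any failing checks. The sketch below assumes that any status other than "healthy" signals a problem; the "unhealthy" value and the helper names are assumptions, since only the healthy case is shown above.

```python
def failing_checks(body):
    """Return the names of checks whose status is not 'healthy'."""
    return [
        check.get("name", "<unnamed>")
        for check in body.get("checks", [])
        if check.get("status") != "healthy"
    ]


def is_healthy(body):
    """True when the overall status is healthy and no individual check fails."""
    return body.get("status") == "healthy" and not failing_checks(body)


# The documented response, followed by a hypothetical failing variant.
HEALTHY = {
    "status": "healthy",
    "checks": [{"name": "mcp_servers", "status": "healthy", "duration_ms": 1.2}],
    "uptime_seconds": 3600.5,
}
FAILING = {
    "status": "unhealthy",  # assumed value for illustration
    "checks": [{"name": "mcp_servers", "status": "unhealthy", "duration_ms": 5000.0}],
}

print(is_healthy(HEALTHY), failing_checks(FAILING))
```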

Kubernetes Configuration

livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5

SLIs/SLOs

Service Level Indicators

SLI	Metric	Measurement
Availability	Service up	`up{job="mcp-hangar"}`
Latency	Tool call duration	P95 < 3s
Error Rate	Failed invocations	Error rate < 1%
Batch Success	Batch completion	Success rate > 95%

Recommended SLOs

SLI	Target	Window
Availability	99.9%	30 days
Latency (P95)	< 3s	5 minutes
Error Rate	< 1%	5 minutes
Batch Success	> 95%	5 minutes

PromQL Queries

# Availability (service up ratio over 30d)
avg_over_time(up{job="mcp-hangar"}[30d])

# Error budget remaining
1 - (
  sum(increase(mcp_hangar_tool_call_errors_total[30d]))
  / sum(increase(mcp_hangar_tool_calls_total[30d]))
) / 0.01

# P95 latency
histogram_quantile(0.95,
  sum(rate(mcp_hangar_tool_call_duration_seconds_bucket[5m])) by (le)
)

# Batch success rate
sum(rate(mcp_hangar_batch_calls_total{result="success"}[5m]))
/ sum(rate(mcp_hangar_batch_calls_total[5m]))
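The error budget query divides the observed error ratio by the allowed ratio (1%) and subtracts the result from 1, so 1.0 means an untouched budget and 0 or less means it is exhausted. The arithmetic can be checked with a short sketch; the function names and the request counts are made up for illustration.

```python
def error_budget_remaining(errors, total, allowed_error_rate=0.01):
    """Fraction of the error budget left; negative when the SLO is breached."""
    if total == 0:
        return 1.0  # no traffic means no budget was consumed
    return 1 - (errors / total) / allowed_error_rate


def allowed_downtime_minutes(availability_target, window_days):
    """Downtime permitted by an availability SLO over the given window."""
    return (1 - availability_target) * window_days * 24 * 60


# 250 failed calls out of 100,000 is a 0.25% error rate, a quarter of the budget.
remaining = error_budget_remaining(errors=250, total=100_000)
print(round(remaining, 4))

# A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
print(round(allowed_downtime_minutes(0.999, 30), 1))
```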

Troubleshooting

Metrics Not Visible

Verify endpoint:

curl http://localhost:8000/metrics | head -20

Check Prometheus targets at http://localhost:9090/targets
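Verify network connectivity between Prometheus and Hangar. The default scrape target host.docker.internal resolves out of the box on Docker Desktop (macOS/Windows); on Linux it typically needs an extra_hosts entry mapping host.docker.internal to host-gateway.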

Alerts Not Firing

Check alert rules loaded:

curl http://localhost:9090/api/v1/rules | jq '.data.groups[].name'

Verify metrics exist for alert expressions

Check Alertmanager connectivity:

curl http://localhost:9093/api/v2/status

High Consecutive Failures

If MCPHangarHighConsecutiveFailures fires:

Check MCP server logs for errors
Verify MCP server command/configuration
Restart the MCP server, either by restarting Hangar or by invoking one of its tools (the first tool call triggers a cold start), then use the status command to confirm the result:
```
mcp-hangar status
```

MCP Server Start Errors

Common patterns and fixes:

Error	Cause	Fix
`ModuleNotFoundError`	Missing dependency	`pip install <package>`
`FileNotFoundError`	Wrong path	Check command in config
`PermissionError`	Not executable	`chmod +x <script>`
Exit code 137	OOM killed	Increase memory limits

Best Practices

Metrics

Monitor the right things - Focus on user-facing SLIs
Set appropriate retention - 15 days for metrics, 7 days for traces
Avoid high cardinality - Don't use unbounded values as labels
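One way to spot high-cardinality metrics early is to count how many series each metric name contributes to the /metrics output. The sketch below does this over exposition text with the standard library; series_per_metric is a hypothetical helper, and the sample payload is illustrative.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count sample lines (one per label combination) for each metric name."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


SAMPLE = """\
# TYPE mcp_hangar_tool_calls_total counter
mcp_hangar_tool_calls_total{mcp_server="math",tool="add",status="success"} 42
mcp_hangar_tool_calls_total{mcp_server="math",tool="add",status="error"} 3
mcp_hangar_tool_calls_total{mcp_server="math",tool="sub",status="success"} 7
mcp_hangar_gc_cycles_total 12
"""

counts = series_per_metric(SAMPLE)
print(counts.most_common(1))
```

A metric whose series count grows with traffic rather than with the number of configured servers and tools is a likely sign of an unbounded label value.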

Alerting

Create runbooks - Document response procedures
Start conservative - Tune thresholds based on baseline
Test regularly - Verify notification channels work
Use severity correctly - Critical = page, Warning = ticket

Dashboards

Layer information - Overview -> Details -> Debug
Include time selectors - Allow drilling into incidents
Add annotations - Mark deployments and incidents

Production Readiness Checklist

Prometheus scraping MCP Hangar metrics
Grafana dashboards imported and working
Alertmanager configured with notification routes
Critical alerts tested (e.g., stop service, verify page)
Runbooks created for each alert
Log aggregation configured (ELK, Loki, etc.)
Tracing enabled and traces visible in Jaeger/Langfuse
