Monitoring & Operations Guide

Production monitoring, alerting, and operational procedures for MeshOptixIQ.

Health Check Endpoints

Shallow Health Check

Fast health check that verifies the API server is responding:

GET /health

Response: 200 OK
{
  "status": "ok",
  "version": "1.2.0",
  "timestamp": "2026-02-16T10:30:00Z"
}

Use for load balancer health checks (fast response, no database query).
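
A load-balancer-style probe around this endpoint can be a short curl wrapper with a tight timeout. A minimal sketch; the `MESHQ_URL` variable and default URL are assumptions, adjust to your deployment:

```shell
#!/bin/sh
# Shallow health probe: report healthy only on HTTP 200 within 2 seconds.
shallow_check() {
  url="${1:-http://localhost:8000}/health"
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url" 2>/dev/null)
  [ -z "$code" ] && code=000   # curl missing or hard failure
  if [ "$code" = "200" ]; then
    echo "healthy"
  else
    echo "unhealthy (HTTP $code)"
  fi
}

shallow_check "${MESHQ_URL:-http://localhost:8000}"
```

Keep the timeout well below the load balancer's probe interval so a hung backend is marked down promptly.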

Deep Health Check (Readiness)

Comprehensive health check including database connectivity:

GET /health/ready

Response: 200 OK
{
  "status": "ok",
  "checks": {
    "database": "connected",
    "license": "valid"
  },
  "timestamp": "2026-02-16T10:30:00Z"
}

Use for Kubernetes readiness probes and operational monitoring.
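
In Kubernetes, the two endpoints map naturally onto readiness and liveness probes. A sketch of the container stanza; the port matches the examples elsewhere in this guide, everything else is an assumption to adapt:

```yaml
# Deep check gates traffic; shallow check restarts hung containers.
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3   # matches the "3 consecutive failures" alert below
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
```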

License Status

Returns plan tier, expiry date, and days remaining. No authentication is required, so it is safe to call from monitoring scripts and ops dashboards.

GET /health/license

Response: 200 OK
{
  "plan": "pro",
  "expires": "2026-12-31",
  "days_remaining": 310,
  "demo_mode": false
}

Use for Prometheus alerting (meshoptixiq_license_days_remaining), ops dashboards, and automated renewal reminders. days_remaining is null when no expiry date is set (perpetual license).
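
If you only have the `expires` date (for example, from a cached response), the days remaining can be recomputed locally. A sketch using GNU `date` (BSD/macOS `date` has different flags); the optional second argument is a reference date for repeatable testing:

```shell
# Days from a reference date (default: today, UTC) until a YYYY-MM-DD expiry.
days_until() {
  expiry_s=$(date -u -d "$1" +%s)
  ref_s=$(date -u -d "${2:-$(date -u +%Y-%m-%d)}" +%s)
  echo $(( (expiry_s - ref_s) / 86400 ))
}

days_until 2026-12-31
```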

Key Metrics to Monitor

API Performance

  • Request latency: p50, p95, p99
  • Error rate: 4xx and 5xx responses
  • Request throughput: Requests/second
  • Concurrent connections: Active sessions
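
For ad-hoc checks outside your metrics stack, a nearest-rank percentile over raw latency samples can be computed with `sort` and `awk`. A minimal sketch, reading one numeric sample per line on stdin:

```shell
# Nearest-rank percentile: `percentile 95` prints the p95 of the input samples.
percentile() {
  sort -n | awk -v p="$1" '
    { a[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((NR * p + 99) / 100)   # ceil(NR * p / 100)
      if (idx < 1) idx = 1
      print a[idx]
    }'
}
```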

License Health

  • Validation time: License check latency
  • Grace period: Activation count
  • Device usage: N/M devices registered
  • Expiration: Days until expiry

Database Performance

  • Query time: Average query duration
  • Connection pool: Active/idle connections
  • Storage: Database size and growth rate
  • Memory: Heap usage (Neo4j)

Discovery Operations

  • Collection success rate: % devices collected
  • Collection duration: Time per run
  • Parser failures: Parse error count
  • SSH timeouts: Connection failure rate

Alert Definitions

Recommended alert thresholds based on baseline metrics:

Critical Alerts (Immediate Response)

Alert                  Metric              Threshold                Window
API Error Rate High    5xx errors          > 5%                     5 minutes
API Latency Critical   p99 response time   > 2000ms                 5 minutes
Health Check Failing   /health/ready       3 consecutive failures   3 minutes
Database Unavailable   Connection status   Disconnected             1 minute

Warning Alerts (Investigation Required)

Alert                     Metric               Threshold   Window
API Latency Elevated      p95 response time    > 1000ms    10 minutes
License Validation Slow   License check time   > 1000ms    5 minutes
Grace Period Active       Grace period rate    > 10%       1 hour
Collection Failure Rate   Failed devices       > 20%       Per run
License Expiring Soon     Days until expiry    < 30 days   Daily

CloudWatch Integration

Setup CloudWatch Alarms

Configure CloudWatch alarms for production monitoring:

#!/bin/bash
# setup-cloudwatch-alarms.sh

SNS_TOPIC_ARN="arn:aws:sns:us-east-1:ACCOUNT:meshoptixiq-alerts"

# API Error Rate Alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "MeshOptixIQ-API-HighErrorRate" \
  --alarm-description "5xx error rate exceeds 5%" \
  --metric-name 5XXError \
  --namespace AWS/ApiGateway \
  --statistic Average \
  --period 300 \
  --threshold 0.05 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions "$SNS_TOPIC_ARN"

# API Latency Alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "MeshOptixIQ-API-HighLatency" \
  --alarm-description "p99 latency exceeds 2000ms" \
  --metric-name Latency \
  --namespace AWS/ApiGateway \
  --statistic p99 \
  --period 300 \
  --threshold 2000 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions "$SNS_TOPIC_ARN"

echo "CloudWatch alarms configured successfully"

Create CloudWatch Dashboard

aws cloudwatch put-dashboard \
  --dashboard-name MeshOptixIQ \
  --dashboard-body file://dashboard.json

# dashboard.json structure:
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ApiGateway", "Count", { "stat": "Sum" }],
          [".", "5XXError", { "stat": "Average" }],
          [".", "Latency", { "stat": "Average" }]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "API Metrics"
      }
    }
  ]
}

Prometheus & Grafana

Prometheus Metrics Endpoint

Install the [metrics] extra to enable GET /metrics (no authentication required, conventional Prometheus scrape path):

pip install 'meshoptixiq-network-discovery[metrics]'

The endpoint exposes the following gauges and counters:

Metric                               Type      Description
meshoptixiq_license_days_remaining   Gauge     Days until license expiry, labelled by plan. -1 = no expiry date; -2 = demo mode.
meshoptixiq_http_requests_total      Counter   Total HTTP requests, labelled by method, path, status. Query paths are normalised (e.g. /queries/{name}/execute).
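
A scrape of the endpoint returns one line per series, e.g. `meshoptixiq_license_days_remaining{plan="pro"} 310`. Extracting the gauge value for a quick shell check is a one-line awk filter; a sketch (the metric name is from the table above, the usage URL is illustrative):

```shell
# Pull the license gauge value out of Prometheus exposition text on stdin.
license_days() {
  awk '/^meshoptixiq_license_days_remaining/ { print $NF }'
}

# Usage against a live endpoint:
#   curl -s http://api.meshoptixiq.internal:8000/metrics | license_days
```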

Prometheus Configuration

Scrape metrics from MeshOptixIQ API:

# prometheus.yml
scrape_configs:
  - job_name: 'meshoptixiq'
    static_configs:
      - targets: ['api.meshoptixiq.internal:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s

Alertmanager — License Expiry Rule

Add this rule to your Prometheus alerting configuration to get notified before your license expires:

groups:
  - name: meshoptixiq
    rules:
      - alert: LicenseExpiringSoon
        expr: meshoptixiq_license_days_remaining < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "MeshOptixIQ license expires in {{ $value }} days"
          description: "Renew at https://meshoptixiq.com/pricing or contact sales@meshoptixiq.com"

Key Prometheus Queries

# Request rate
rate(meshoptixiq_http_requests_total[5m])

# Error rate (share of requests returning 5xx)
sum(rate(meshoptixiq_http_requests_total{status=~"5.."}[5m]))
  / sum(rate(meshoptixiq_http_requests_total[5m]))

# License days remaining (alert when < 30)
meshoptixiq_license_days_remaining

# Query execution rate
rate(meshoptixiq_http_requests_total{path="/queries/{name}/execute"}[5m])

Log Management

Structured Logging

MeshOptixIQ logs in JSON format for easy parsing:

{
  "timestamp": "2026-02-16T10:30:00Z",
  "level": "INFO",
  "logger": "network_discovery.api",
  "message": "Query executed successfully",
  "context": {
    "query_name": "device_neighbors",
    "duration_ms": 145,
    "request_id": "req_abc123",
    "user": "api_key_xyz"
  }
}
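
Outside CloudWatch, the same JSON log stream can be filtered directly on the host. A sketch that flags slow queries without requiring jq; the `duration_ms` field is from the example log entry above:

```shell
# Print log lines whose context.duration_ms exceeds a threshold (default 1000).
slow_queries() {
  threshold="${1:-1000}"
  while IFS= read -r line; do
    ms=$(printf '%s' "$line" | sed -n 's/.*"duration_ms": *\([0-9][0-9]*\).*/\1/p')
    [ -n "$ms" ] && [ "$ms" -gt "$threshold" ] && printf '%s\n' "$line"
  done
}

# Usage:
#   slow_queries 1000 < /var/log/meshoptixiq/api.log
```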

CloudWatch Logs Insights

Example queries for log analysis:

# Find slow queries (> 1 second)
fields @timestamp, context.query_name, context.duration_ms
| filter context.duration_ms > 1000
| sort context.duration_ms desc

# Count errors by type
fields @timestamp, level, message
| filter level = "ERROR"
| stats count() by message

# License validation failures
fields @timestamp, message, context
| filter message like /license.*fail/
| sort @timestamp desc

Daily Operations Checklist

Morning Checks (5 minutes)

  • Confirm GET /health/ready returns 200 with all checks passing
  • Review overnight alerts and the error-rate dashboard
  • Check meshoptixiq_license_days_remaining (renew when < 30)
  • Confirm last night's backup completed

Weekly Tasks (30 minutes)

  • Review p95/p99 latency trends against the alert thresholds
  • Check database size and growth rate
  • Review collection success rate and parser failure counts
  • Spot-check a backup restore in a scratch environment

Backup Procedures

Automated Backup Script

#!/bin/bash
# /opt/meshoptixiq/backup.sh
# Run via cron: 0 2 * * * /opt/meshoptixiq/backup.sh

BACKUP_DIR="/backups/meshoptixiq"
DATE=$(date +%Y%m%d)
RETENTION_DAYS=30

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup Neo4j
docker exec neo4j neo4j-admin database dump neo4j \
  --to-path=/tmp
docker cp neo4j:/tmp/neo4j.dump "$BACKUP_DIR/neo4j-${DATE}.dump"

# Compress
gzip "$BACKUP_DIR/neo4j-${DATE}.dump"

# Upload to S3 (optional)
aws s3 cp "$BACKUP_DIR/neo4j-${DATE}.dump.gz" \
  s3://backups/meshoptixiq/

# Clean old backups
find "$BACKUP_DIR" -name "neo4j-*.dump.gz" \
  -mtime +${RETENTION_DAYS} -delete

echo "Backup completed: neo4j-${DATE}.dump.gz"
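
Backups are only useful if they restore. A cheap first check is gzip integrity before the S3 upload; a sketch that could be appended to the script above:

```shell
# Verify a compressed dump is readable before trusting or uploading it.
verify_backup() {
  if gzip -t "$1" 2>/dev/null; then
    echo "ok: $1"
  else
    echo "CORRUPT: $1" >&2
    return 1
  fi
}
```

This catches truncated or partially written archives; it does not prove the dump loads, so still test a full restore periodically.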

Restore Procedure

# 1. Stop services
docker-compose down

# 2. Restore from backup (Neo4j 5: load reads <database>.dump from --from-path)
gunzip neo4j-20260216.dump.gz
mkdir -p restore && mv neo4j-20260216.dump restore/neo4j.dump
docker run --rm \
  -v $(pwd)/restore:/restore \
  -v neo4j-data:/data \
  neo4j:5.15 \
  neo4j-admin database load neo4j --from-path=/restore --overwrite-destination

# 3. Restart services
docker-compose up -d

# 4. Verify
curl http://localhost:8000/health/ready

Incident Response

Severity Levels

Severity   Response Time       Examples
Critical   5 minutes           API completely down, database unavailable
High       30 minutes          High error rate, degraded performance
Medium     4 hours             Single device collection failing
Low        Next business day   Documentation updates, minor bugs

Response Workflow

  1. Acknowledge: Acknowledge alert within SLA (5 min for critical)
  2. Assess: Check dashboard, review logs, determine impact
  3. Communicate: Update status page, notify affected users if needed
  4. Mitigate: Execute appropriate runbook procedure
  5. Resolve: Verify resolution, clear alert
  6. Document: Create incident report with root cause and prevention steps

Support Escalation

For issues requiring vendor support, include the following when contacting the support team:

  • License key (first 10 characters only)
  • MeshOptixIQ version (meshq version)
  • Deployment type (Docker, Kubernetes, bare-metal)
  • Error messages from logs
  • Output from health check endpoints
  • Recent changes to configuration or infrastructure