Monitoring & Operations Guide

Production monitoring, alerting, and operational procedures for MeshOptixIQ.

Health Check Endpoints

Shallow Health Check

Fast health check that verifies the API server is responding:

GET /health

Response: 200 OK
{
  "status": "ok",
  "version": "1.2.0",
  "timestamp": "2026-02-16T10:30:00Z"
}

Use for load balancer health checks (fast response, no database query).
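
A load-balancer-style probe around this endpoint can be a short curl wrapper with a tight timeout. A minimal sketch; the `MESHQ_URL` variable and default URL are assumptions, adjust to your deployment:

```shell
#!/bin/sh
# Shallow health probe: report healthy only on HTTP 200 within 2 seconds.
shallow_check() {
  url="${1:-http://localhost:8000}/health"
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url" 2>/dev/null)
  [ -z "$code" ] && code=000   # curl missing or hard failure
  if [ "$code" = "200" ]; then
    echo "healthy"
  else
    echo "unhealthy (HTTP $code)"
  fi
}

shallow_check "${MESHQ_URL:-http://localhost:8000}"
```

Keep the timeout well below the load balancer's probe interval so a hung backend is marked down promptly.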

Deep Health Check (Readiness)

Comprehensive health check including database connectivity:

GET /health/ready

Response: 200 OK
{
  "status": "ok",
  "checks": {
    "database": "connected",
    "license": "valid"
  },
  "timestamp": "2026-02-16T10:30:00Z"
}

Use for Kubernetes readiness probes and operational monitoring.
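
In Kubernetes, the two endpoints map naturally onto readiness and liveness probes. A sketch of the container stanza; the port matches the examples elsewhere in this guide, everything else is an assumption to adapt:

```yaml
# Deep check gates traffic; shallow check restarts hung containers.
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3   # matches the "3 consecutive failures" alert below
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
```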

License Status

Returns plan tier, expiry date, and days remaining. No authentication is required, so it is safe to call from monitoring scripts and ops dashboards.

GET /health/license

Response: 200 OK
{
  "plan": "pro",
  "expires": "2026-12-31",
  "days_remaining": 310,
  "demo_mode": false
}

Use for Prometheus alerting (meshoptixiq_license_days_remaining), ops dashboards, and automated renewal reminders. days_remaining is null when no expiry date is set (perpetual license).
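
If you only have the `expires` date (for example, from a cached response), the days remaining can be recomputed locally. A sketch using GNU `date` (BSD/macOS `date` has different flags); the optional second argument is a reference date for repeatable testing:

```shell
# Days from a reference date (default: today, UTC) until a YYYY-MM-DD expiry.
days_until() {
  expiry_s=$(date -u -d "$1" +%s)
  ref_s=$(date -u -d "${2:-$(date -u +%Y-%m-%d)}" +%s)
  echo $(( (expiry_s - ref_s) / 86400 ))
}

days_until 2026-12-31
```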

Key Metrics to Monitor

API Performance

  • Request latency: p50, p95, p99
  • Error rate: 4xx and 5xx responses
  • Request throughput: Requests/second
  • Concurrent connections: Active sessions
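
For ad-hoc checks outside your metrics stack, a nearest-rank percentile over raw latency samples can be computed with `sort` and `awk`. A minimal sketch, reading one numeric sample per line on stdin:

```shell
# Nearest-rank percentile: `percentile 95` prints the p95 of the input samples.
percentile() {
  sort -n | awk -v p="$1" '
    { a[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((NR * p + 99) / 100)   # ceil(NR * p / 100)
      if (idx < 1) idx = 1
      print a[idx]
    }'
}
```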

License Health

  • Validation time: License check latency
  • Grace period: Activation count
  • Device usage: N/M devices registered
  • Expiration: Days until expiry

Database Performance

  • Query time: Average query duration
  • Connection pool: Active/idle connections
  • Storage: Database size and growth rate
  • Memory: Heap usage (Neo4j)

Discovery Operations

  • Collection success rate: % devices collected
  • Collection duration: Time per run
  • Parser failures: Parse error count
  • SSH timeouts: Connection failure rate

Alert Definitions

Recommended alert thresholds based on baseline metrics:

Critical Alerts (Immediate Response)

Alert                  Metric              Threshold                Window
API Error Rate High    5xx errors          > 5%                     5 minutes
API Latency Critical   p99 response time   > 2000ms                 5 minutes
Health Check Failing   /health/ready       3 consecutive failures   3 minutes
Database Unavailable   Connection status   Disconnected             1 minute

Warning Alerts (Investigation Required)

Alert                     Metric               Threshold   Window
API Latency Elevated      p95 response time    > 1000ms    10 minutes
License Validation Slow   License check time   > 1000ms    5 minutes
Grace Period Active       Grace period rate    > 10%       1 hour
Collection Failure Rate   Failed devices       > 20%       Per run
License Expiring Soon     Days until expiry    < 30 days   Daily

CloudWatch Integration

Setup CloudWatch Alarms

Configure CloudWatch alarms for production monitoring:

#!/bin/bash
# setup-cloudwatch-alarms.sh

SNS_TOPIC_ARN="arn:aws:sns:us-east-1:ACCOUNT:meshoptixiq-alerts"

# API Error Rate Alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "MeshOptixIQ-API-HighErrorRate" \
  --alarm-description "5xx error rate exceeds 5%" \
  --metric-name 5XXError \
  --namespace AWS/ApiGateway \
  --statistic Average \
  --period 300 \
  --threshold 0.05 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions "$SNS_TOPIC_ARN"

# API Latency Alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "MeshOptixIQ-API-HighLatency" \
  --alarm-description "p99 latency exceeds 2000ms" \
  --metric-name Latency \
  --namespace AWS/ApiGateway \
  --statistic p99 \
  --period 300 \
  --threshold 2000 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions "$SNS_TOPIC_ARN"

echo "CloudWatch alarms configured successfully"

Create CloudWatch Dashboard

aws cloudwatch put-dashboard \
  --dashboard-name MeshOptixIQ \
  --dashboard-body file://dashboard.json

# dashboard.json structure:
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ApiGateway", "Count", { "stat": "Sum" }],
          [".", "5XXError", { "stat": "Average" }],
          [".", "Latency", { "stat": "Average" }]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "API Metrics"
      }
    }
  ]
}

Prometheus & Grafana

Prometheus Metrics Endpoint

Install the [metrics] extra to enable GET /metrics (no authentication required, conventional Prometheus scrape path):

pip install 'meshoptixiq-network-discovery[metrics]'

The endpoint exposes the following gauges and counters:

Metric                               Type      Description
meshoptixiq_license_days_remaining   Gauge     Days until license expiry, labelled by plan. -1 = no expiry date; -2 = demo mode.
meshoptixiq_http_requests_total      Counter   Total HTTP requests, labelled by method, path, status. Query paths are normalised (e.g. /queries/{name}/execute).
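
A scrape of the endpoint returns one line per series, e.g. `meshoptixiq_license_days_remaining{plan="pro"} 310`. Extracting the gauge value for a quick shell check is a one-line awk filter; a sketch (the metric name is from the table above, the usage URL is illustrative):

```shell
# Pull the license gauge value out of Prometheus exposition text on stdin.
license_days() {
  awk '/^meshoptixiq_license_days_remaining/ { print $NF }'
}

# Usage against a live endpoint:
#   curl -s http://api.meshoptixiq.internal:8000/metrics | license_days
```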

Prometheus Configuration

Scrape metrics from MeshOptixIQ API:

# prometheus.yml
scrape_configs:
  - job_name: 'meshoptixiq'
    static_configs:
      - targets: ['api.meshoptixiq.internal:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s

Alertmanager — License Expiry Rule

Add this rule to your Prometheus alerting configuration to get notified before your license expires:

groups:
  - name: meshoptixiq
    rules:
      - alert: LicenseExpiringSoon
        expr: meshoptixiq_license_days_remaining < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "MeshOptixIQ license expires in {{ $value }} days"
          description: "Renew at https://meshoptixiq.com/pricing or contact sales@meshoptixiq.com"

Key Prometheus Queries

# Request rate
rate(meshoptixiq_http_requests_total[5m])

# Error rate (share of requests returning 5xx)
sum(rate(meshoptixiq_http_requests_total{status=~"5.."}[5m]))
  / sum(rate(meshoptixiq_http_requests_total[5m]))

# License days remaining (alert when < 30)
meshoptixiq_license_days_remaining

# Query execution rate
rate(meshoptixiq_http_requests_total{path="/queries/{name}/execute"}[5m])

Log Management

Structured Logging

MeshOptixIQ logs in JSON format for easy parsing:

{
  "timestamp": "2026-02-16T10:30:00Z",
  "level": "INFO",
  "logger": "network_discovery.api",
  "message": "Query executed successfully",
  "context": {
    "query_name": "device_neighbors",
    "duration_ms": 145,
    "request_id": "req_abc123",
    "user": "api_key_xyz"
  }
}
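
Outside CloudWatch, the same JSON log stream can be filtered directly on the host. A sketch that flags slow queries without requiring jq; the `duration_ms` field is from the example log entry above:

```shell
# Print log lines whose context.duration_ms exceeds a threshold (default 1000).
slow_queries() {
  threshold="${1:-1000}"
  while IFS= read -r line; do
    ms=$(printf '%s' "$line" | sed -n 's/.*"duration_ms": *\([0-9][0-9]*\).*/\1/p')
    [ -n "$ms" ] && [ "$ms" -gt "$threshold" ] && printf '%s\n' "$line"
  done
}

# Usage:
#   slow_queries 1000 < /var/log/meshoptixiq/api.log
```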

CloudWatch Logs Insights

Example queries for log analysis:

# Find slow queries (> 1 second)
fields @timestamp, context.query_name, context.duration_ms
| filter context.duration_ms > 1000
| sort context.duration_ms desc

# Count errors by type
fields @timestamp, level, message
| filter level = "ERROR"
| stats count() by message

# License validation failures
fields @timestamp, message, context
| filter message like /license.*fail/
| sort @timestamp desc

Daily Operations Checklist

Morning Checks (5 minutes)

  • Confirm GET /health/ready returns 200 with all checks passing
  • Review overnight alerts and the error-rate dashboard
  • Check meshoptixiq_license_days_remaining (renew when < 30)
  • Confirm last night's backup completed

Weekly Tasks (30 minutes)

  • Review p95/p99 latency trends against the alert thresholds
  • Check database size and growth rate
  • Review collection success rate and parser failure counts
  • Spot-check a backup restore in a scratch environment

Backup Procedures

Automated Backup Script

#!/bin/bash
# /opt/meshoptixiq/backup.sh
# Run via cron: 0 2 * * * /opt/meshoptixiq/backup.sh

BACKUP_DIR="/backups/meshoptixiq"
DATE=$(date +%Y%m%d)
RETENTION_DAYS=30

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup Neo4j
docker exec neo4j neo4j-admin database dump neo4j \
  --to-path=/tmp
docker cp neo4j:/tmp/neo4j.dump "$BACKUP_DIR/neo4j-${DATE}.dump"

# Compress
gzip "$BACKUP_DIR/neo4j-${DATE}.dump"

# Upload to S3 (optional)
aws s3 cp "$BACKUP_DIR/neo4j-${DATE}.dump.gz" \
  s3://backups/meshoptixiq/

# Clean old backups
find "$BACKUP_DIR" -name "neo4j-*.dump.gz" \
  -mtime +${RETENTION_DAYS} -delete

echo "Backup completed: neo4j-${DATE}.dump.gz"
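
Backups are only useful if they restore. A cheap first check is gzip integrity before the S3 upload; a sketch that could be appended to the script above:

```shell
# Verify a compressed dump is readable before trusting or uploading it.
verify_backup() {
  if gzip -t "$1" 2>/dev/null; then
    echo "ok: $1"
  else
    echo "CORRUPT: $1" >&2
    return 1
  fi
}
```

This catches truncated or partially written archives; it does not prove the dump loads, so still test a full restore periodically.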

Restore Procedure

# 1. Stop services
docker-compose down

# 2. Restore from backup (Neo4j 5: load reads <database>.dump from --from-path)
gunzip neo4j-20260216.dump.gz
mkdir -p restore && mv neo4j-20260216.dump restore/neo4j.dump
docker run --rm \
  -v $(pwd)/restore:/restore \
  -v neo4j-data:/data \
  neo4j:5.15 \
  neo4j-admin database load neo4j --from-path=/restore --overwrite-destination

# 3. Restart services
docker-compose up -d

# 4. Verify
curl http://localhost:8000/health/ready

Incident Response

Severity Levels

Severity   Response Time       Examples
Critical   5 minutes           API completely down, database unavailable
High       30 minutes          High error rate, degraded performance
Medium     4 hours             Single device collection failing
Low        Next business day   Documentation updates, minor bugs

Response Workflow

  1. Acknowledge: Acknowledge alert within SLA (5 min for critical)
  2. Assess: Check dashboard, review logs, determine impact
  3. Communicate: Update status page, notify affected users if needed
  4. Mitigate: Execute appropriate runbook procedure
  5. Resolve: Verify resolution, clear alert
  6. Document: Create incident report with root cause and prevention steps

Support Escalation

For issues requiring vendor support, include the following when contacting the support team:

  • License key (first 10 characters only)
  • MeshOptixIQ version (meshq version)
  • Deployment type (Docker, Kubernetes, bare-metal)
  • Error messages from logs
  • Output from health check endpoints
  • Recent changes to configuration or infrastructure