Production monitoring, alerting, and operational procedures for MeshOptixIQ.
Health Check Endpoints
Shallow Health Check
Fast health check that verifies the API server is responding:
GET /health
Response: 200 OK
{
"status": "ok",
"version": "1.2.0",
"timestamp": "2026-02-16T10:30:00Z"
}
Use for load balancer health checks (fast response, no database query)
Deep Health Check (Readiness)
Comprehensive health check including database connectivity:
GET /health/ready
Response: 200 OK
{
"status": "ok",
"checks": {
"database": "connected",
"license": "valid"
},
"timestamp": "2026-02-16T10:30:00Z"
}
Use for Kubernetes readiness probes and operational monitoring
License Status
Returns plan tier, expiry date, and days remaining. No authentication required — safe to call from monitoring scripts and ops dashboards.
GET /health/license
Response: 200 OK
{
"plan": "pro",
"expires": "2026-12-31",
"days_remaining": 310,
"demo_mode": false
}
Use for Prometheus alerting (meshoptixiq_license_days_remaining), ops dashboards, and automated renewal reminders. days_remaining is null when no expiry date is set (perpetual license).
Key Metrics to Monitor
API Performance
Request latency: p50, p95, p99
Error rate: 4xx and 5xx responses
Request throughput: Requests/second
Concurrent connections: Active sessions
License Health
Validation time: License check latency
Grace period: Activation count
Device usage: N/M devices registered
Expiration: Days until expiry
Database Performance
Query time: Average query duration
Connection pool: Active/idle connections
Storage: Database size and growth rate
Memory: Heap usage (Neo4j)
Discovery Operations
Collection success rate: % devices collected
Collection duration: Time per run
Parser failures: Parse error count
SSH timeouts: Connection failure rate
Alert Definitions
Recommended alert thresholds based on baseline metrics:
Critical Alerts (Immediate Response)
Alert
Metric
Threshold
Window
API Error Rate High
5xx errors
> 5%
5 minutes
API Latency Critical
p99 response time
> 2000ms
5 minutes
Health Check Failing
/health/ready
3 consecutive failures
3 minutes
Database Unavailable
Connection status
Disconnected
1 minute
Warning Alerts (Investigation Required)
Alert
Metric
Threshold
Window
API Latency Elevated
p95 response time
> 1000ms
10 minutes
License Validation Slow
License check time
> 1000ms
5 minutes
Grace Period Active
Grace period rate
> 10%
1 hour
Collection Failure Rate
Failed devices
> 20%
Per run
License Expiring Soon
Days until expiry
< 30 days
Daily
CloudWatch Integration
Setup CloudWatch Alarms
Configure CloudWatch alarms for production monitoring: