Health Monitoring
CHAD monitors the health of your detection infrastructure, including OpenSearch connectivity, log flow, and query performance.
Health Dashboard
Navigate to Health in the sidebar to see:
- System health - Overall platform status
- Index pattern health - Per-log-source monitoring
- OpenSearch status - Connection and cluster health
- Background tasks - Scheduler status
Health Indicators
| Status | Icon | Meaning |
|---|---|---|
| Healthy | 🟢 | All systems operational |
| Warning | 🟡 | Degraded but functional |
| Critical | 🔴 | Requires immediate attention |
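The sketch below models the three indicator levels as an ordered enum. It also shows one plausible way they could roll up into the overall system status. It assumes, for illustration only, that the platform reports the worst status of any component. `HealthStatus` and `overall_status` are hypothetical names, not part of CHAD's API.

```python
from enum import IntEnum
from typing import Mapping


class HealthStatus(IntEnum):
    """Hypothetical model of the indicator levels, ordered by severity."""
    HEALTHY = 0   # 🟢 All systems operational
    WARNING = 1   # 🟡 Degraded but functional
    CRITICAL = 2  # 🔴 Requires immediate attention


def overall_status(components: Mapping[str, HealthStatus]) -> HealthStatus:
    """Assumed roll-up rule: the platform is only as healthy as its worst component."""
    if not components:
        return HealthStatus.HEALTHY
    return max(components.values())


status = overall_status({
    "opensearch": HealthStatus.HEALTHY,
    "scheduler": HealthStatus.WARNING,
    "database": HealthStatus.HEALTHY,
})
print(status.name)  # WARNING
```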
System Health Checks
CHAD automatically monitors:
OpenSearch Connection
- Connectivity - Can reach the cluster
- Authentication - Credentials valid
- Cluster health - OpenSearch cluster status
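OpenSearch's `GET _cluster/health` endpoint reports a cluster `status` of `green`, `yellow`, or `red`. The sketch below shows how a single probe could cover all three checks in the list above: connectivity, authentication, and cluster health. The mapping from cluster colors to CHAD indicator levels is an assumption. `classify_opensearch_probe` is a hypothetical helper, not CHAD code.

```python
import json
from typing import Optional, Tuple

# Assumed mapping from the OpenSearch cluster "status" field to indicator levels.
CLUSTER_STATUS_TO_INDICATOR = {"green": "healthy", "yellow": "warning", "red": "critical"}


def classify_opensearch_probe(http_status: Optional[int], body: Optional[str]) -> Tuple[str, str]:
    """Classify one probe of GET /_cluster/health as (indicator, reason).

    http_status is None when the cluster could not be reached at all.
    """
    if http_status is None:
        return "critical", "connectivity: cluster unreachable"
    if http_status in (401, 403):
        return "critical", "authentication: credentials rejected"
    if http_status != 200 or body is None:
        return "critical", f"unexpected response: HTTP {http_status}"
    cluster_status = json.loads(body).get("status", "unknown")
    # Any status other than green/yellow/red is treated as critical (assumption).
    indicator = CLUSTER_STATUS_TO_INDICATOR.get(cluster_status, "critical")
    return indicator, f"cluster health: {cluster_status}"


print(classify_opensearch_probe(200, '{"cluster_name": "chad", "status": "yellow"}'))
# ('warning', 'cluster health: yellow')
```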
Background Tasks
- Scheduler - APScheduler running
- Health checks - Periodic health tasks executing
- SigmaHQ sync - Repository sync (if enabled)
Database
- PostgreSQL connectivity - Database accessible
- Migration status - Schema up to date
Index Pattern Health
Each index pattern has dedicated monitoring.
No Data Alert
Alerts when no logs have been received for longer than the threshold:
| Setting | Description | Default |
|---|---|---|
| Threshold | Minutes without data | 15 |
| Severity | Alert severity level | Warning |
Suggested thresholds by log type:
- High-volume logs: 5 minutes
- Low-volume logs: 60 minutes
- Batch logs: 24 hours (or disable)
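The no-data rule amounts to comparing the time since the last received event against the threshold. The sketch below assumes the check fires only when the gap strictly exceeds the threshold. It also assumes that a threshold of `None` stands for a disabled check, as suggested for batch logs. `no_data_alert` is a hypothetical function.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def no_data_alert(
    last_event_at: Optional[datetime],
    now: datetime,
    threshold_minutes: Optional[int] = 15,
) -> bool:
    """Return True when an index pattern has received no data for longer than its threshold.

    threshold_minutes=None models a disabled check, e.g. for batch sources.
    """
    if threshold_minutes is None:
        return False  # check disabled
    if last_event_at is None:
        return True  # assumption: a source that has never sent data counts as silent
    return now - last_event_at > timedelta(minutes=threshold_minutes)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = now - timedelta(minutes=20)
print(no_data_alert(last_seen, now))                          # True: 20 min > 15 min default
print(no_data_alert(last_seen, now, threshold_minutes=60))    # False: low-volume threshold
print(no_data_alert(last_seen, now, threshold_minutes=None))  # False: check disabled
```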
Error Rate
Alerts when the error rate exceeds the threshold:
| Setting | Description | Default |
|---|---|---|
| Threshold | Error percentage | 5% |
| Window | Measurement window | 5 minutes |
Errors counted toward the rate:
- Query failures
- Indexing errors
- Timeout errors
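An error percentage over a measurement window can be computed with a sliding window of recent operations. The sketch below assumes every operation is recorded in timestamp order. It counts query failures, indexing errors, and timeouts alike as errors. It alerts only when the rate is strictly above the threshold. `ErrorRateMonitor` is a hypothetical class, not CHAD's implementation.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateMonitor:
    """Sliding-window error rate: errors / total operations in the last window_seconds."""

    def __init__(self, threshold_pct: float = 5.0, window_seconds: float = 300.0):
        self.threshold_pct = threshold_pct    # default 5%
        self.window_seconds = window_seconds  # default 5 minutes
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)

    def record(self, timestamp: float, is_error: bool) -> None:
        """Record one operation (query or indexing request)."""
        self._events.append((timestamp, is_error))

    def error_rate_pct(self, now: float) -> float:
        # Drop operations that have fallen out of the measurement window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return 100.0 * errors / len(self._events)

    def should_alert(self, now: float) -> bool:
        return self.error_rate_pct(now) > self.threshold_pct


monitor = ErrorRateMonitor()
for second in range(100):
    monitor.record(float(second), is_error=second < 6)  # 6 failures out of 100 operations
print(monitor.error_rate_pct(now=100.0))  # 6.0
print(monitor.should_alert(now=100.0))    # True
```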
Detection Latency
Alerts when detection queries are slow:
| Level | Setting | Default |
|---|---|---|
| Warning | Query latency | 500 ms |
| Critical | Query latency | 2000 ms |
High latency can signal:
- Delayed alerts
- An overloaded percolator
- Cluster resource issues
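Mapping a measured latency onto the two levels in the table is a simple threshold comparison. This sketch assumes each threshold is inclusive (a latency of exactly 500 ms counts as Warning). `classify_latency` is a hypothetical helper.

```python
def classify_latency(latency_ms: float, warning_ms: float = 500, critical_ms: float = 2000) -> str:
    """Map a measured query latency to an indicator level using the default thresholds."""
    if latency_ms >= critical_ms:
        return "critical"
    if latency_ms >= warning_ms:
        return "warning"
    return "healthy"


for sample in (120, 750, 2400):
    print(sample, classify_latency(sample))
# 120 healthy
# 750 warning
# 2400 critical
```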
Configuring Thresholds
Per-Index Settings
- Go to Settings > Index Patterns
- Open the index pattern
- Click Health Settings
- Configure thresholds
- Save
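The per-index settings can be thought of as one record of thresholds per index pattern. The sketch below is hypothetical: `IndexHealthSettings` and the two presets are illustrative names, not CHAD's configuration schema. Its defaults come from the tables on this page, and the presets follow the suggested no-data thresholds for high-volume and batch logs.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class IndexHealthSettings:
    """Hypothetical per-index health thresholds, defaulting to the documented values."""
    no_data_minutes: Optional[int] = 15  # None = no-data alert disabled
    no_data_severity: str = "warning"
    error_rate_pct: float = 5.0
    error_window_minutes: int = 5
    latency_warning_ms: int = 500
    latency_critical_ms: int = 2000

    def __post_init__(self) -> None:
        # Guard against a configuration where Warning would never trigger before Critical.
        if self.latency_warning_ms >= self.latency_critical_ms:
            raise ValueError("latency_warning_ms must be below latency_critical_ms")


DEFAULTS = IndexHealthSettings()
HIGH_VOLUME = replace(DEFAULTS, no_data_minutes=5)  # suggested for high-volume logs
BATCH = replace(DEFAULTS, no_data_minutes=24 * 60)  # or no_data_minutes=None to disable
print(BATCH.no_data_minutes)  # 1440
```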
Example: Critical Logs
For security-critical log sources:
Example: Batch Logs
For periodic or batch data:
Alert Suppression
CHAD uses escalation-based suppression to prevent alert storms:
| Alert # | Suppression |
|---|---|
| 1st | Fire immediately |
| 2nd | 15-minute suppression |
| 3rd | 1-hour suppression |
| 4th+ | 4-hour suppression |
When the condition clears:
- Suppression resets
- Next alert fires immediately
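The escalation table can be read as a minimum gap that must pass since the previous alert before the next one is sent. The sketch below uses that reading, which is an interpretation and not confirmed by the table. It also assumes a `reset()` call when the condition clears, so that the next alert fires immediately. `EscalatingSuppressor` is a hypothetical class.

```python
from typing import Optional

# Minimum gap (seconds) since the previous alert before alert N may fire:
# 1st: immediately, 2nd: 15 minutes, 3rd: 1 hour, 4th and later: 4 hours.
ESCALATION_SECONDS = [0, 15 * 60, 60 * 60, 4 * 60 * 60]


class EscalatingSuppressor:
    def __init__(self) -> None:
        self.fired = 0
        self.last_fired_at: Optional[float] = None

    def should_fire(self, now: float) -> bool:
        """Decide whether an alert raised at `now` (seconds) should be sent."""
        if self.last_fired_at is None:
            allowed = True
        else:
            step = min(self.fired, len(ESCALATION_SECONDS) - 1)
            allowed = now - self.last_fired_at >= ESCALATION_SECONDS[step]
        if allowed:
            self.fired += 1
            self.last_fired_at = now
        return allowed

    def reset(self) -> None:
        """Call when the condition clears: suppression resets."""
        self.fired = 0
        self.last_fired_at = None


s = EscalatingSuppressor()
print(s.should_fire(0))    # True: 1st alert fires immediately
print(s.should_fire(600))  # False: 2nd alert within 15 minutes
print(s.should_fire(900))  # True: 15 minutes have passed
```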
Health Notifications
Health alerts can trigger:
- Dashboard indicator - Always visible
- Webhook notifications - Same as security alerts
- Email - If configured
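The delivery rules above can be summarized as a small routing function. This sketch assumes webhooks and email are each simply on or off. `health_alert_channels` and its parameters are hypothetical.

```python
from typing import List


def health_alert_channels(webhooks_configured: bool, email_configured: bool) -> List[str]:
    """Return the channels a health alert would be delivered to."""
    channels = ["dashboard"]  # the dashboard indicator is always updated
    if webhooks_configured:
        channels.append("webhook")  # same webhook targets as security alerts
    if email_configured:
        channels.append("email")
    return channels


print(health_alert_channels(webhooks_configured=True, email_configured=False))
# ['dashboard', 'webhook']
```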
Troubleshooting Health Issues
No Data Alerts
Cause: Logs stopped flowing.
Investigation:
- Check log shipper (Fluentd/Logstash) status
- Verify network connectivity
- Check OpenSearch ingestion
- Review source system health
Resolution:
- Restart log shipper
- Check source system
- Verify index exists in OpenSearch
High Error Rates
Cause: Query or indexing failures.
Investigation:
- Check OpenSearch cluster health
- Review error messages in logs
- Check disk space
- Review recent configuration changes
Resolution:
- Scale OpenSearch cluster
- Fix configuration issues
- Clear disk space
- Roll back changes
High Latency
Cause: Slow queries or an overloaded cluster.
Investigation:
- Check OpenSearch node resources
- Review percolator count
- Check for expensive rules
- Monitor cluster metrics
Resolution:
- Add OpenSearch nodes
- Optimize expensive rules
- Increase resources
- Review index settings
OpenSearch Disconnected
Cause: Network or authentication issue.
Investigation:
- Test network connectivity
- Check OpenSearch is running
- Verify credentials
- Check SSL certificates
Resolution:
- Fix network issues
- Restart OpenSearch
- Update credentials
- Renew certificates
Health Metrics History
CHAD stores health metrics so you can review trends over time. To view them:
- Go to Health
- Click an index pattern
- View historical graphs
The graphs show:
- Data volume over time
- Error rate trends
- Latency percentiles
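Latency percentiles summarize a window of query timings. The page does not say how CHAD computes them, so the sketch below uses the common nearest-rank method as a stand-in. `latency_percentiles` is a hypothetical function.

```python
import math
from typing import Dict, Sequence, Tuple


def latency_percentiles(
    samples_ms: Sequence[float], percentiles: Tuple[int, ...] = (50, 95, 99)
) -> Dict[int, float]:
    """Nearest-rank percentiles over a window of latency samples (milliseconds)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    result = {}
    for p in percentiles:
        rank = max(1, math.ceil(p * n / 100))  # 1-based nearest rank
        result[p] = ordered[rank - 1]
    return result


window = [float(ms) for ms in range(1, 101)]  # 1 ms .. 100 ms
print(latency_percentiles(window))  # {50: 50.0, 95: 95.0, 99: 99.0}
```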
Best Practices
Set appropriate thresholds
Match thresholds to your log volume and SLAs.
Monitor critical logs closely
Security logs need stricter thresholds than debug logs.
Review health daily
Check the health dashboard as part of your routine.
Investigate trends
Gradually increasing latency may indicate growing problems.
Test alerting
Verify health notifications reach your team.