Health Monitoring

CHAD monitors the health of your detection infrastructure. This includes OpenSearch connectivity, log flow, and query performance.

Health Dashboard

Navigate to Health in the sidebar to see:
  • System health - Overall platform status
  • Index pattern health - Per-log-source monitoring
  • OpenSearch status - Connection and cluster health
  • Background tasks - Scheduler status

Health Indicators

Status | Icon | Meaning
Healthy | 🟢 | All systems operational
Warning | 🟡 | Degraded but functional
Critical | 🔴 | Requires immediate attention

System Health Checks

CHAD automatically monitors:

OpenSearch Connection

  • Connectivity - Can reach the cluster
  • Authentication - Credentials valid
  • Cluster health - OpenSearch cluster status
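
As a rough sketch, the cluster status could be folded into the indicators above like this. The green/yellow/red values are what the OpenSearch cluster health API reports in its `status` field. The mapping onto Healthy/Warning/Critical and the function name `indicator_for_cluster` are assumptions for illustration, not CHAD's documented behavior.

```python
from typing import Optional

# Assumed mapping from the OpenSearch `status` field to CHAD's indicators.
CLUSTER_TO_INDICATOR = {
    "green": "Healthy",
    "yellow": "Warning",
    "red": "Critical",
}


def indicator_for_cluster(status: Optional[str]) -> str:
    """Return the indicator for a cluster status.

    A missing or unrecognized status is treated as Critical, on the
    assumption that an unreadable cluster needs attention.
    """
    if status is None:
        return "Critical"
    return CLUSTER_TO_INDICATOR.get(status.lower(), "Critical")
```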

Background Tasks

  • Scheduler - APScheduler running
  • Health checks - Periodic health tasks executing
  • SigmaHQ sync - Repository sync (if enabled)

Database

  • PostgreSQL connectivity - Database accessible
  • Migration status - Schema up to date

Index Pattern Health

Each index pattern has dedicated monitoring:

No Data Alert

Alert when no logs are received:
Setting | Description | Default
Threshold | Minutes without data | 15
Severity | Alert severity level | Warning
Configure per index pattern based on expected log volume:
  • High-volume logs: 5 minutes
  • Low-volume logs: 60 minutes
  • Batch logs: 24 hours (or disable)
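
The no-data check amounts to comparing the age of the newest log against the threshold. A minimal sketch, assuming the alert triggers once the gap is strictly longer than the threshold; the function name `no_data_alert` is hypothetical.

```python
from datetime import datetime, timedelta


def no_data_alert(last_event: datetime, now: datetime,
                  threshold_minutes: int = 15) -> bool:
    """Return True when no log has arrived for longer than the threshold.

    The 15-minute default matches the table above; whether CHAD uses a
    strict or inclusive comparison is an assumption here.
    """
    return now - last_event > timedelta(minutes=threshold_minutes)
```

With the default threshold, a source whose last log is 16 minutes old would alert, while one that is 10 minutes old would not.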

Error Rate

Alert when errors exceed the threshold:
Setting | Description | Default
Threshold | Error percentage | 5%
Window | Measurement window | 5 minutes
Errors include:
  • Query failures
  • Indexing errors
  • Timeout errors
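
For illustration, the error-rate check can be sketched as a percentage over the operations counted in one measurement window. The function names are hypothetical, and treating an empty window as 0% is an assumption.

```python
def error_rate_percent(errors: int, total: int) -> float:
    """Percentage of failed operations in one measurement window."""
    if total == 0:
        # Assumption: a window with no operations counts as 0% errors.
        return 0.0
    return 100.0 * errors / total


def error_rate_alert(errors: int, total: int,
                     threshold_percent: float = 5.0) -> bool:
    """Return True when the error rate exceeds the threshold (5% by default)."""
    return error_rate_percent(errors, total) > threshold_percent
```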

Detection Latency

Alert when detection is slow:
Level | Setting | Default
Warning | Query latency | 500 ms
Critical | Query latency | 2000 ms
High latency delays alerts and usually points to:
  • An overloaded percolator
  • Cluster resource issues
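
The two latency levels can be read as a simple classification. A sketch under the assumption that a latency equal to a threshold already counts as that level; `latency_level` is a hypothetical name.

```python
def latency_level(latency_ms: float, warning_ms: float = 500,
                  critical_ms: float = 2000) -> str:
    """Classify a query latency against the warning and critical thresholds."""
    if latency_ms >= critical_ms:
        return "Critical"
    if latency_ms >= warning_ms:
        return "Warning"
    return "Healthy"
```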

Configuring Thresholds

Per-Index Settings

  1. Go to Settings > Index Patterns
  2. Open the index pattern
  3. Click Health Settings
  4. Configure thresholds
  5. Save

Example: Critical Logs

For security-critical log sources:
No Data Alert: 5 minutes
Error Rate: 1%
Latency Warning: 200 ms
Latency Critical: 500 ms

Example: Batch Logs

For periodic or batch data:
No Data Alert: 60 minutes
Error Rate: 10%
Latency Warning: 2000 ms
Latency Critical: 5000 ms
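
The defaults and the two example profiles can be captured side by side for comparison. This is only a sketch: the class `HealthThresholds` and its field names are hypothetical and do not reflect CHAD's actual configuration schema. The values come from the tables and examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthThresholds:
    """Per-index-pattern health thresholds (hypothetical structure)."""
    no_data_minutes: Optional[int] = 15   # None stands for a disabled alert
    error_rate_percent: float = 5.0
    latency_warning_ms: int = 500
    latency_critical_ms: int = 2000

    def __post_init__(self) -> None:
        # A warning threshold at or above the critical one would never fire.
        if self.latency_warning_ms >= self.latency_critical_ms:
            raise ValueError("warning latency must be below critical latency")


DEFAULTS = HealthThresholds()
CRITICAL_LOGS = HealthThresholds(no_data_minutes=5, error_rate_percent=1.0,
                                 latency_warning_ms=200, latency_critical_ms=500)
BATCH_LOGS = HealthThresholds(no_data_minutes=60, error_rate_percent=10.0,
                              latency_warning_ms=2000, latency_critical_ms=5000)
```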

Alert Suppression

CHAD uses escalation-based suppression to prevent alert storms:
Alert # | Suppression
1st | Fire immediately
2nd | 15-minute suppression
3rd | 1-hour suppression
4th+ | 4-hour suppression
When the condition clears:
  • Suppression resets
  • Next alert fires immediately
This prevents inbox flooding during outages while ensuring visibility.
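
One possible reading of the escalation table, sketched as a small state machine: the first alert fires at once, the second waits 15 minutes after the first, the third waits 1 hour after the second, and every later alert waits 4 hours after the previous one. That interpretation and the class name `EscalatingSuppressor` are assumptions; CHAD's actual implementation may differ.

```python
from datetime import datetime, timedelta
from typing import Optional

# Wait before the 2nd, 3rd, and 4th+ alerts, taken from the table above.
SUPPRESSION_STEPS = [timedelta(minutes=15), timedelta(hours=1), timedelta(hours=4)]


class EscalatingSuppressor:
    """Tracks one alert condition and decides when to notify."""

    def __init__(self) -> None:
        self.fired = 0                              # alerts sent so far
        self.last_fired: Optional[datetime] = None  # time of the last alert

    def should_fire(self, now: datetime) -> bool:
        """Call while the condition is active; True means send an alert."""
        if self.fired > 0 and self.last_fired is not None:
            # Pick the wait for the next alert, capped at the last step.
            step = min(self.fired - 1, len(SUPPRESSION_STEPS) - 1)
            if now - self.last_fired < SUPPRESSION_STEPS[step]:
                return False
        self.fired += 1
        self.last_fired = now
        return True

    def clear(self) -> None:
        """Call when the condition clears; the next alert fires immediately."""
        self.fired = 0
        self.last_fired = None
```

During a long outage this yields alerts at 0 minutes, 15 minutes, 1 hour 15 minutes, and then every 4 hours, until `clear()` resets the sequence.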

Health Notifications

Health alerts can trigger:
  • Dashboard indicator - Always visible
  • Webhook notifications - Same as security alerts
  • Email - If configured
Configure in Settings > Notifications.

Troubleshooting Health Issues

No Data Alerts

Cause: Logs stopped flowing
Investigation:
  1. Check log shipper (Fluentd/Logstash) status
  2. Verify network connectivity
  3. Check OpenSearch ingestion
  4. Review source system health
Resolution:
  1. Restart log shipper
  2. Check source system
  3. Verify index exists in OpenSearch

High Error Rates

Cause: Query or indexing failures
Investigation:
  1. Check OpenSearch cluster health
  2. Review error messages in logs
  3. Check disk space
  4. Review recent configuration changes
Resolution:
  1. Scale OpenSearch cluster
  2. Fix configuration issues
  3. Clear disk space
  4. Roll back changes

High Latency

Cause: Slow queries or overloaded cluster
Investigation:
  1. Check OpenSearch node resources
  2. Review percolator count
  3. Check for expensive rules
  4. Monitor cluster metrics
Resolution:
  1. Add OpenSearch nodes
  2. Optimize expensive rules
  3. Increase resources
  4. Review index settings

OpenSearch Disconnected

Cause: Network or authentication issue
Investigation:
  1. Test network connectivity
  2. Check OpenSearch is running
  3. Verify credentials
  4. Check SSL certificates
Resolution:
  1. Fix network issues
  2. Restart OpenSearch
  3. Update credentials
  4. Renew certificates
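
When checking SSL certificates, it helps to know how long a certificate has left. A minimal sketch that takes the `notAfter` string in the format Python's `ssl.SSLSocket.getpeercert()` returns and computes the remaining days. The function name `days_until_expiry` is hypothetical, and fetching the certificate from your OpenSearch endpoint is left out.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days until a certificate's notAfter time; negative once expired.

    `not_after` looks like "Jan  5 09:34:43 2018 GMT".
    `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400
```

A negative result means the certificate has already expired and needs renewal.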

Health Metrics History

CHAD stores health metrics for trending:
  1. Go to Health
  2. Click an index pattern
  3. View historical graphs
Metrics include:
  • Data volume over time
  • Error rate trends
  • Latency percentiles
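
For readers unfamiliar with latency percentiles, here is a sketch of how they can be derived from raw samples using Python's standard library. Which percentiles CHAD actually graphs is not stated above; p50, p95, and p99 are assumed here as common choices.

```python
from statistics import quantiles
from typing import Dict, Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50, p95, and p99 of the samples (needs at least two samples)."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

High percentiles such as p99 show the slowest queries, which an average would hide.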

Best Practices

  • Match thresholds to your log volume and SLAs.
  • Security logs need stricter thresholds than debug logs.
  • Check the health dashboard as part of your routine.
  • Verify health notifications reach your team.