Health Monitoring

CHAD monitors the health of your detection infrastructure. This includes OpenSearch connectivity, log flow, and query performance.

Health Dashboard

Navigate to Health in the sidebar to see:

System health - Overall platform status
Index pattern health - Per-log-source monitoring
OpenSearch status - Connection and cluster health
Background tasks - Scheduler status

Health Indicators

Status	Icon	Meaning
Healthy	🟢	All systems operational
Warning	🟡	Degraded but functional
Critical	🔴	Requires immediate attention
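The three statuses can be modelled as an ordered scale. The sketch below assumes the overall system status is simply the worst status of any component; that roll-up rule, and the names `HealthStatus` and `overall_status`, are illustrative and not taken from CHAD itself.

```python
from enum import IntEnum


class HealthStatus(IntEnum):
    """Statuses from the table above, ordered from best to worst."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


# Icons as listed in the Health Indicators table.
ICONS = {
    HealthStatus.HEALTHY: "🟢",
    HealthStatus.WARNING: "🟡",
    HealthStatus.CRITICAL: "🔴",
}


def overall_status(component_statuses):
    """Return the worst component status (assumed roll-up rule).

    An empty input is treated as Healthy.
    """
    return max(component_statuses, default=HealthStatus.HEALTHY)
```

For example, `overall_status([HealthStatus.HEALTHY, HealthStatus.WARNING])` yields `HealthStatus.WARNING`.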

System Health Checks

CHAD automatically monitors:

OpenSearch Connection

Connectivity - Can reach the cluster
Authentication - Credentials valid
Cluster health - OpenSearch cluster status
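OpenSearch reports cluster status as `green`, `yellow`, or `red` in the `status` field of its `_cluster/health` response. A minimal sketch of how that could map onto the indicators above follows; the mapping and the function name `classify_cluster_health` are assumptions, and treating a missing or unknown status as Critical is a conservative choice made for this example.

```python
def classify_cluster_health(response):
    """Map a parsed _cluster/health response body (a dict) to a status label.

    Assumed mapping: green -> Healthy, yellow -> Warning, red -> Critical.
    Anything else (including a missing field) is treated as Critical.
    """
    mapping = {"green": "Healthy", "yellow": "Warning", "red": "Critical"}
    status = str(response.get("status", "")).lower()
    return mapping.get(status, "Critical")
```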

Background Tasks

Scheduler - APScheduler running
Health checks - Periodic health tasks executing
SigmaHQ sync - Repository sync (if enabled)

Database

PostgreSQL connectivity - Database accessible
Migration status - Schema up to date

Index Pattern Health

Each index pattern has dedicated monitoring:

No Data Alert

Alert when no logs are received:

Setting	Description	Default
Threshold	Minutes without data	15
Severity	Alert severity level	Warning

Configure per index pattern based on expected log volume:

High-volume logs: 5 minutes
Low-volume logs: 60 minutes
Batch logs: 24 hours (1,440 minutes), or disable
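The no-data check amounts to comparing the age of the newest log event against the threshold. The sketch below assumes timezone-aware timestamps and uses `None` to represent a disabled check; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def no_data_alert(last_event_at, threshold_minutes=15, now=None):
    """Return True when no data has arrived for longer than the threshold.

    threshold_minutes=None means the check is disabled (an assumption
    made for this example).
    """
    if threshold_minutes is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now - last_event_at > timedelta(minutes=threshold_minutes)
```

With the default of 15 minutes, an index whose newest event is 20 minutes old would alert, while one that is 10 minutes old would not.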

Error Rate

Alert when errors exceed threshold:

Setting	Description	Default
Threshold	Error percentage	5%
Window	Measurement window	5 minutes

Errors include:

Query failures
Indexing errors
Timeout errors
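As a rough illustration of the calculation, the error rate is the share of failed operations inside the measurement window. The sketch assumes each operation is recorded as a `(timestamp, succeeded)` pair and that "exceed" means strictly greater than the threshold; both are assumptions, as are the function names.

```python
from datetime import datetime, timedelta, timezone


def error_rate_in_window(events, now, window_minutes=5):
    """Return the error percentage over the most recent window.

    events: iterable of (timestamp, succeeded) pairs.
    Returns 0.0 when the window contains no events.
    """
    cutoff = now - timedelta(minutes=window_minutes)
    recent = [ok for ts, ok in events if ts >= cutoff]
    if not recent:
        return 0.0
    failures = sum(1 for ok in recent if not ok)
    return 100.0 * failures / len(recent)


def error_rate_alert(rate_percent, threshold_percent=5.0):
    """Alert only when the rate is strictly above the threshold."""
    return rate_percent > threshold_percent
```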

Detection Latency

Alert when detection is slow:

Level	Setting	Default
Warning	Query latency	500 ms
Critical	Query latency	2000 ms

High latency typically indicates one or more of the following:

Alerts are delayed
The percolator is overloaded
The cluster is short on resources
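The two latency levels can be read as a simple classification of a measured query latency. In the sketch below the defaults come from the table above; whether a value exactly at a threshold counts as crossing it is not stated, so the at-or-above comparison is an assumption, as is the function name.

```python
def classify_latency(latency_ms, warning_ms=500, critical_ms=2000):
    """Classify a query latency against Warning and Critical thresholds.

    Values at or above a threshold are treated as crossing it (assumed).
    """
    if latency_ms >= critical_ms:
        return "Critical"
    if latency_ms >= warning_ms:
        return "Warning"
    return "Healthy"
```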

Configuring Thresholds

Per-Index Settings

Go to Settings > Index Patterns
Open the index pattern
Click Health Settings
Configure thresholds
Save

Example: Critical Logs

For security-critical log sources:

No Data Alert: 5 minutes
Error Rate: 1%
Latency Warning: 200 ms
Latency Critical: 500 ms

Example: Batch Logs

For periodic or batch data:

No Data Alert: 60 minutes
Error Rate: 10%
Latency Warning: 2000 ms
Latency Critical: 5000 ms
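The two example profiles can be written down as data, which makes it easy to sanity-check them before entering the values in the UI. This is only a sketch: the `HealthThresholds` class and its field names are hypothetical and do not reflect CHAD's actual configuration schema. The defaults are the ones listed earlier on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthThresholds:
    """Per-index health thresholds (hypothetical structure)."""
    no_data_minutes: int = 15
    error_rate_percent: float = 5.0
    latency_warning_ms: int = 500
    latency_critical_ms: int = 2000

    def __post_init__(self):
        # A warning level at or above the critical level would never fire.
        if self.latency_warning_ms >= self.latency_critical_ms:
            raise ValueError("latency warning must be below latency critical")
        if not 0 <= self.error_rate_percent <= 100:
            raise ValueError("error rate must be a percentage between 0 and 100")


# Values from the "Critical Logs" and "Batch Logs" examples above.
CRITICAL_LOGS = HealthThresholds(5, 1.0, 200, 500)
BATCH_LOGS = HealthThresholds(60, 10.0, 2000, 5000)
```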

Alert Suppression

CHAD uses escalation-based suppression to prevent alert storms:

Alert #	Suppression
1st	Fire immediately
2nd	15-minute suppression
3rd	1-hour suppression
4th+	4-hour suppression

When the condition clears:

Suppression resets
Next alert fires immediately

This prevents inbox flooding during outages while ensuring visibility.
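One way to read the escalation table is that each alert after the first fires only once the listed interval has passed since the previous alert. The sketch below implements that reading; the interpretation, the `AlertSuppressor` class, and its method names are assumptions rather than CHAD's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Assumed waits before the 2nd, 3rd, and 4th-and-later alerts.
ESCALATION = [timedelta(minutes=15), timedelta(hours=1), timedelta(hours=4)]


class AlertSuppressor:
    """Tracks escalation state for one ongoing health condition."""

    def __init__(self):
        self.fired = 0             # alerts sent since the condition began
        self.last_fired_at = None  # time of the most recent alert

    def should_fire(self, now):
        """Return True (and record the alert) if an alert may fire now."""
        if self.fired == 0:
            wait = timedelta(0)    # first alert fires immediately
        else:
            wait = ESCALATION[min(self.fired - 1, len(ESCALATION) - 1)]
        if self.last_fired_at is None or now - self.last_fired_at >= wait:
            self.fired += 1
            self.last_fired_at = now
            return True
        return False

    def clear(self):
        """Reset suppression when the condition clears."""
        self.fired = 0
        self.last_fired_at = None
```

Under this reading, a condition that stays unhealthy produces alerts at 0 minutes, 15 minutes, 1 hour 15 minutes, and then every 4 hours until `clear()` is called.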

Health Notifications

Health alerts can trigger:

Dashboard indicator - Always visible
Webhook notifications - Same as security alerts
Email - If configured

Configure in Settings > Notifications.

Troubleshooting Health Issues

No Data Alerts

Cause: Logs stopped flowing

Investigation:

Check log shipper (Fluentd/Logstash) status
Verify network connectivity
Check OpenSearch ingestion
Review source system health

Resolution:

Restart log shipper
Check source system
Verify index exists in OpenSearch

High Error Rates

Cause: Query or indexing failures

Investigation:

Check OpenSearch cluster health
Review error messages in logs
Check disk space
Review recent configuration changes

Resolution:

Scale OpenSearch cluster
Fix configuration issues
Clear disk space
Roll back changes

High Latency

Cause: Slow queries or overloaded cluster

Investigation:

Check OpenSearch node resources
Review percolator count
Check for expensive rules
Monitor cluster metrics

Resolution:

Add OpenSearch nodes
Optimize expensive rules
Increase resources
Review index settings

OpenSearch Disconnected

Cause: Network or authentication issue

Investigation:

Test network connectivity
Check OpenSearch is running
Verify credentials
Check SSL certificates

Resolution:

Fix network issues
Restart OpenSearch
Update credentials
Renew certificates

Health Metrics History

CHAD stores health metrics for trending:

Go to Health
Click an index pattern
View historical graphs

Metrics include:

Data volume over time
Error rate trends
Latency percentiles
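For readers unfamiliar with latency percentiles, the sketch below shows one way to compute p50, p95, and p99 from raw samples with Python's standard library. It illustrates the metric only; how CHAD computes or stores its percentiles is not described here, and the function name is hypothetical.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 for a list of latency samples.

    quantiles(..., n=100) returns the 99 cut points between percentiles,
    so index 49 is the 50th percentile, 94 the 95th, and 98 the 99th.
    Provide at least two samples.
    """
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tracking p95 and p99 alongside the median helps reveal slow outlier queries that an average would hide.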

Best Practices

Set appropriate thresholds

Match thresholds to your log volume and SLAs.

Monitor critical logs closely

Security logs need stricter thresholds than debug logs.

Review health daily

Check the health dashboard as part of your routine.

Investigate trends

Gradually increasing latency may indicate growing problems.

Test alerting

Verify health notifications reach your team.

Next Steps

Index Patterns - Configure per-index thresholds
Notifications - Set up health notifications
