# Device Health Monitoring System

The Device Health Monitoring System automatically monitors all active and approved devices for heartbeat activity and sends alerts when a device stays offline past a configurable threshold (30 minutes by default).

## Features

### Automatic Health Monitoring
- **Continuous Monitoring**: Checks device health every 5 minutes
- **Offline Detection**: Devices are considered offline after 30 minutes without a heartbeat
- **Recovery Detection**: Automatically detects when offline devices come back online
- **Alert Integration**: Uses the existing alert system for SMS/email/webhook notifications

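
The offline and recovery detection described above can be sketched as a comparison between each device's last heartbeat and the threshold. This is a minimal illustration, not the service's actual code: the function name, the dictionary shape, and the choice to treat exactly 30 minutes as offline are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Default taken from the feature list; the real service may load it from configuration.
OFFLINE_THRESHOLD = timedelta(minutes=30)


def classify_devices(last_heartbeats, previously_offline, now, threshold=OFFLINE_THRESHOLD):
    """Return (newly_offline, recovered) sets of device ids.

    last_heartbeats: dict of device_id -> datetime of last heartbeat (hypothetical shape).
    previously_offline: set of device ids flagged offline by the previous check.
    """
    currently_offline = {
        device_id
        for device_id, last_seen in last_heartbeats.items()
        if now - last_seen >= threshold  # assumption: exactly 30 minutes counts as offline
    }
    newly_offline = currently_offline - previously_offline
    # Only devices that are still known and reporting again count as recovered.
    recovered = {
        device_id
        for device_id in previously_offline
        if device_id in last_heartbeats and device_id not in currently_offline
    }
    return newly_offline, recovered


now = datetime(2025, 9, 7, 15, 15, tzinfo=timezone.utc)
heartbeats = {
    1941875381: now - timedelta(minutes=2),   # healthy
    1941875382: now - timedelta(minutes=45),  # silent for longer than the threshold
    1941875383: now - timedelta(minutes=1),   # was offline, now reporting again
}
newly_offline, recovered = classify_devices(heartbeats, {1941875383}, now)
print(sorted(newly_offline), sorted(recovered))
```

Tracking the previously offline set is what allows one offline alert and one recovery message per incident instead of a repeat alert on every check.
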
### Alert Capabilities
- **SMS Alerts**: Send SMS notifications when devices go offline or recover
- **Email Alerts**: Send email notifications (when configured)
- **Webhook Integration**: Send webhook notifications for external systems
- **Recovery Notifications**: Automatic "all clear" messages when devices recover

### Configuration
- **Customizable Thresholds**: Configure offline detection timeouts
- **Alert Rules**: Use the existing alert rule system to configure recipients
- **Channel Selection**: Choose SMS, email, webhook, or multiple channels
- **Device-Specific Rules**: Create rules for specific devices or all devices

## Setup

### 1. Alert Rule Configuration

Create alert rules for device offline monitoring using the web interface or API:

```json
{
  "name": "Device Offline Alert",
  "description": "Alert when security devices go offline",
  "conditions": {
    "device_offline": true,
    "device_ids": [1941875381, 1941875382]
  },
  "alert_channels": ["sms", "email"],
  "sms_phone_number": "+46701234567",
  "email": "admin@company.com",
  "is_active": true,
  "priority": "high"
}
```

The `device_ids` field is optional and restricts the rule to specific devices.

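
When rules are created through the API rather than the web interface, it can help to assemble the payload in code. The sketch below only builds the JSON body using the field names from the example above. The helper name and the client-side checks are assumptions, and the rule-creation endpoint is not covered here, so no request is sent.

```python
import json

ALLOWED_CHANNELS = {"sms", "email", "webhook"}


def build_offline_rule(name, channels, device_ids=None, phone=None, email=None, priority="high"):
    """Assemble an alert rule dict with the fields shown in the JSON example.

    The validation below is an illustrative client-side check, not the server's rules.
    """
    unknown = set(channels) - ALLOWED_CHANNELS
    if unknown:
        raise ValueError(f"unsupported channels: {sorted(unknown)}")
    if "sms" in channels and not phone:
        raise ValueError("the sms channel needs a phone number")
    if "email" in channels and not email:
        raise ValueError("the email channel needs an address")

    conditions = {"device_offline": True}
    if device_ids:
        # Optional: restrict the rule to specific devices.
        conditions["device_ids"] = list(device_ids)

    rule = {
        "name": name,
        "conditions": conditions,
        "alert_channels": list(channels),
        "is_active": True,
        "priority": priority,
    }
    if phone:
        rule["sms_phone_number"] = phone
    if email:
        rule["email"] = email
    return rule


rule = build_offline_rule(
    "Device Offline Alert",
    ["sms", "email"],
    device_ids=[1941875381, 1941875382],
    phone="+46701234567",
    email="admin@company.com",
)
print(json.dumps(rule, indent=2))
```
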
### 2. Service Configuration

The service starts automatically with the server and can be configured with environment variables:

- **Check Interval**: How often to check device health (default: 5 minutes)
- **Offline Threshold**: How long a device can go without a heartbeat before it is considered offline (default: 30 minutes)

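
As a rough illustration of how such settings might be read, the sketch below loads both values with the documented defaults. The variable names `DEVICE_HEALTH_CHECK_INTERVAL_MINUTES` and `DEVICE_HEALTH_OFFLINE_THRESHOLD_MINUTES` are hypothetical; check the server configuration for the names the service actually uses.

```python
import os


def read_minutes(env, name, default):
    """Read a positive integer number of minutes from env, falling back to default."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be a positive number of minutes, got {raw!r}")
    return value


def load_settings(env=os.environ):
    # Both variable names are placeholders chosen for this example.
    return {
        "check_interval_minutes": read_minutes(env, "DEVICE_HEALTH_CHECK_INTERVAL_MINUTES", 5),
        "offline_threshold_minutes": read_minutes(env, "DEVICE_HEALTH_OFFLINE_THRESHOLD_MINUTES", 30),
    }


print(load_settings({}))
print(load_settings({"DEVICE_HEALTH_OFFLINE_THRESHOLD_MINUTES": "60"}))
```
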
### 3. SMS Configuration

For SMS alerts, configure Twilio credentials:

```bash
TWILIO_ACCOUNT_SID=your_account_sid
TWILIO_AUTH_TOKEN=your_auth_token
TWILIO_PHONE_NUMBER=your_twilio_phone
```

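
A quick way to catch a missing credential before relying on SMS alerts is to check that all three variables are set. This is a small standalone check, assuming the server reads the variables from its process environment; the helper name is made up for the example.

```python
import os

REQUIRED_TWILIO_VARS = ("TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN", "TWILIO_PHONE_NUMBER")


def missing_twilio_settings(env=os.environ):
    """Return the names of Twilio variables that are unset or blank."""
    return [name for name in REQUIRED_TWILIO_VARS if not env.get(name, "").strip()]


missing = missing_twilio_settings()
if missing:
    print("Missing Twilio settings:", ", ".join(missing))
else:
    print("All Twilio settings are present.")
```
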
## API Endpoints

### Get Service Status
```
GET /api/device-health/status
```

Returns the current status of the device health monitoring service:

```json
{
  "success": true,
  "data": {
    "isRunning": true,
    "checkIntervalMinutes": 5,
    "offlineThresholdMinutes": 30,
    "offlineDevicesCount": 1,
    "offlineDevices": [
      {
        "deviceId": 1941875383,
        "deviceName": "Guard Tower 3",
        "offlineSince": "2025-09-07T10:00:00Z",
        "alertSent": true
      }
    ]
  }
}
```

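
A status response like the one above can be reduced to a short per-device summary. The sketch below parses the sample payload directly instead of calling the server; the function name is an assumption, and the field names are taken from the example response.

```python
import json
from datetime import datetime, timezone

SAMPLE_RESPONSE = """
{
  "success": true,
  "data": {
    "isRunning": true,
    "checkIntervalMinutes": 5,
    "offlineThresholdMinutes": 30,
    "offlineDevicesCount": 1,
    "offlineDevices": [
      {"deviceId": 1941875383, "deviceName": "Guard Tower 3",
       "offlineSince": "2025-09-07T10:00:00Z", "alertSent": true}
    ]
  }
}
"""


def summarize_offline(payload, now):
    """Return one human-readable line per offline device."""
    lines = []
    for device in payload["data"]["offlineDevices"]:
        # Replace the trailing Z so fromisoformat also works on Python versions before 3.11.
        since = datetime.fromisoformat(device["offlineSince"].replace("Z", "+00:00"))
        minutes = int((now - since).total_seconds() // 60)
        lines.append(
            f'{device["deviceName"]} (id {device["deviceId"]}): '
            f'offline for {minutes} min, alert sent: {device["alertSent"]}'
        )
    return lines


payload = json.loads(SAMPLE_RESPONSE)
summary = summarize_offline(payload, datetime(2025, 9, 7, 10, 45, tzinfo=timezone.utc))
for line in summary:
    print(line)
```
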
### Trigger Manual Health Check
```
POST /api/device-health/check
```

Forces an immediate health check of all devices.

### Start/Stop Service
```
POST /api/device-health/start
POST /api/device-health/stop
```

Start or stop the health monitoring service, which normally runs automatically.

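
The four endpoints above can be wrapped in a small client using only the standard library. This is a sketch under several assumptions: the base URL `http://localhost:8000` is a placeholder, the POST endpoints are assumed to return JSON like the status endpoint does, and any authentication the server requires is not shown.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical; replace with your server address

# HTTP method and path for each action, as listed in this section.
ENDPOINTS = {
    "status": ("GET", "/api/device-health/status"),
    "check": ("POST", "/api/device-health/check"),
    "start": ("POST", "/api/device-health/start"),
    "stop": ("POST", "/api/device-health/stop"),
}


def build_request(action, base_url=BASE_URL):
    """Create the urllib Request for one of the documented actions."""
    method, path = ENDPOINTS[action]
    return urllib.request.Request(base_url.rstrip("/") + path, method=method)


def call(action, base_url=BASE_URL, timeout=10):
    """Send the request and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(build_request(action, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("check")
print(request.get_method(), request.full_url)
```

With a running server, `call("status")["data"]["isRunning"]` would report whether the service is active, and `call("check")` would trigger an immediate check.
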
## Alert Messages

### Offline Alert
```
🚨 DEVICE OFFLINE ALERT 🚨

📍 LOCATION: Stockholm Castle
🔧 DEVICE: Guard Tower 1
⏰ OFFLINE FOR: 45 minutes
📅 LAST SEEN: 2025-09-07 14:30:00

❌ Device has stopped sending heartbeats.
🔧 Check device power, network connection, or physical access.

⚠️ Security monitoring may be compromised in this area.
```

### Recovery Alert
```
✅ DEVICE RECOVERED ✅

📍 LOCATION: Stockholm Castle
🔧 DEVICE: Guard Tower 1
⏰ RECOVERED AT: 2025-09-07 15:15:00

✅ Device is now sending heartbeats again.
🛡️ Security monitoring restored for this area.
```

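
To show how the fields in these messages relate to each other, the sketch below renders both templates from a device's last-seen and recovery times. The function names are hypothetical and the real service may build its messages differently; the point is that the "OFFLINE FOR" value is the gap between the last heartbeat and the time of the check.

```python
from datetime import datetime


def format_offline_alert(location, device_name, last_seen, now):
    """Render an offline alert in the layout of the Offline Alert example."""
    minutes_offline = int((now - last_seen).total_seconds() // 60)
    return "\n".join([
        "🚨 DEVICE OFFLINE ALERT 🚨",
        "",
        f"📍 LOCATION: {location}",
        f"🔧 DEVICE: {device_name}",
        f"⏰ OFFLINE FOR: {minutes_offline} minutes",
        f"📅 LAST SEEN: {last_seen:%Y-%m-%d %H:%M:%S}",
        "",
        "❌ Device has stopped sending heartbeats.",
        "🔧 Check device power, network connection, or physical access.",
        "",
        "⚠️ Security monitoring may be compromised in this area.",
    ])


def format_recovery_alert(location, device_name, recovered_at):
    """Render a recovery alert in the layout of the Recovery Alert example."""
    return "\n".join([
        "✅ DEVICE RECOVERED ✅",
        "",
        f"📍 LOCATION: {location}",
        f"🔧 DEVICE: {device_name}",
        f"⏰ RECOVERED AT: {recovered_at:%Y-%m-%d %H:%M:%S}",
        "",
        "✅ Device is now sending heartbeats again.",
        "🛡️ Security monitoring restored for this area.",
    ])


offline_message = format_offline_alert(
    "Stockholm Castle", "Guard Tower 1",
    last_seen=datetime(2025, 9, 7, 14, 30), now=datetime(2025, 9, 7, 15, 15),
)
recovery_message = format_recovery_alert(
    "Stockholm Castle", "Guard Tower 1", datetime(2025, 9, 7, 15, 15),
)
print(offline_message)
print()
print(recovery_message)
```
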
## Testing

Use the provided test script to verify the system is working:

```bash
python3 test_device_health.py
```

This will:
- Check the device health service status
- List all devices and their current health status
- Show configured alert rules for device offline monitoring
- Trigger a manual health check

## Integration with Existing Systems

Device health monitoring builds on the following existing components:

1. **Existing Alert System**: Uses the same alert rules, channels, and logging
2. **Device Management**: Works with the existing device approval and activation system
3. **Heartbeat System**: Uses the existing heartbeat infrastructure
4. **Dashboard**: Device status is already displayed in the device list

## Troubleshooting

### No Alerts Received
1. Check if device offline alert rules are configured and active
2. Verify SMS/email credentials are properly configured
3. Check device health service status via API
4. Ensure devices are marked as active and approved

### False Positives
1. Adjust the offline threshold if devices have irregular heartbeat patterns
2. Check network connectivity between devices and server
3. Verify heartbeat intervals are properly configured for each device

### Service Not Running
1. Check server logs for startup errors
2. Verify database connectivity
3. Restart the server to reinitialize the service

## Monitoring and Logs

- Service status is logged to the console with timestamps
- Alert sending is logged with recipient and status information
- Manual health checks can be triggered via API for testing
- The service shuts down gracefully when the server restarts