# Device Health Monitoring System

The Device Health Monitoring System checks every active, approved device for heartbeat activity and sends an alert when a device stops reporting for longer than the offline threshold (30 minutes by default).

## Features

### Automatic Health Monitoring
- **Continuous Monitoring**: Checks device health every 5 minutes
- **Offline Detection**: Devices are considered offline after 30 minutes without heartbeat
- **Recovery Detection**: Automatically detects when offline devices come back online
- **Alert Integration**: Uses the existing alert system for SMS/email/webhook notifications
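
The offline rule above amounts to comparing the age of the last heartbeat against the threshold. A minimal sketch of that comparison follows; the function name `is_offline`, the treatment of devices that have never reported, and the strict `>` at the boundary are assumptions for illustration, not the service's actual code.

```python
from datetime import datetime, timedelta

OFFLINE_THRESHOLD = timedelta(minutes=30)  # documented default


def is_offline(last_heartbeat, now, threshold=OFFLINE_THRESHOLD):
    """Return True when the last heartbeat is older than the threshold."""
    if last_heartbeat is None:
        # Assumption: a device that has never reported counts as offline.
        return True
    # Assumption: a heartbeat exactly at the threshold still counts as online.
    return now - last_heartbeat > threshold


# Usage: a device last seen 45 minutes ago exceeds the 30-minute threshold.
now = datetime(2025, 9, 7, 15, 15)
stale = is_offline(datetime(2025, 9, 7, 14, 30), now)
```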

### Alert Capabilities
- **SMS Alerts**: Send SMS notifications when devices go offline or recover
- **Email Alerts**: Send email notifications (when configured)
- **Webhook Integration**: Send webhook notifications for external systems
- **Recovery Notifications**: Automatic "all clear" messages when devices recover
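
Sending one offline alert and one recovery message per outage requires remembering which devices are already known to be offline. The sketch below shows one way to track that state between checks; `OfflineTracker` is a hypothetical helper, and the real service may store this differently.

```python
class OfflineTracker:
    """Remember offline devices so each transition is reported once."""

    def __init__(self):
        self._offline = set()

    def update(self, device_id, offline):
        """Return 'offline', 'recovered', or None when nothing changed."""
        was_offline = device_id in self._offline
        if offline and not was_offline:
            self._offline.add(device_id)
            return "offline"  # first check that sees the device down
        if not offline and was_offline:
            self._offline.discard(device_id)
            return "recovered"  # first check that sees it back
        return None  # state unchanged, so no notification


tracker = OfflineTracker()
events = [tracker.update(1941875383, state) for state in (True, True, False)]
```

Only the first and last checks in the usage example produce an event, so repeated checks during a single outage do not generate repeated alerts.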

### Configuration
- **Customizable Thresholds**: Configure offline detection timeouts
- **Alert Rules**: Use existing alert rule system to configure recipients
- **Channel Selection**: Choose SMS, email, webhook, or multiple channels
- **Device-Specific Rules**: Create rules for specific devices or all devices

## Setup

### 1. Alert Rule Configuration

Create alert rules for device offline monitoring using the web interface or API. The `device_ids` condition is optional; include it to limit the rule to specific devices:

```json
{
  "name": "Device Offline Alert",
  "description": "Alert when security devices go offline",
  "conditions": {
    "device_offline": true,
    "device_ids": [1941875381, 1941875382]
  },
  "alert_channels": ["sms", "email"],
  "sms_phone_number": "+46701234567",
  "email": "admin@company.com",
  "is_active": true,
  "priority": "high"
}
```
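
To illustrate how a rule of this shape could be matched against a device, here is a small sketch. The function `rule_applies` is hypothetical, and treating a missing or empty `device_ids` list as "all devices" is an assumption based on the rule format above.

```python
def rule_applies(rule, device_id):
    """Return True if an offline alert rule covers the given device."""
    if not rule.get("is_active", False):
        return False
    conditions = rule.get("conditions", {})
    if not conditions.get("device_offline"):
        return False
    device_ids = conditions.get("device_ids")
    # Assumption: no device_ids means the rule covers every device.
    return not device_ids or device_id in device_ids


rule = {
    "conditions": {"device_offline": True, "device_ids": [1941875381, 1941875382]},
    "is_active": True,
}
covered = rule_applies(rule, 1941875381)
```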

### 2. Service Configuration

The service automatically starts with the server and can be configured with environment variables:

- **Check Interval**: How often to check device health (default: 5 minutes)
- **Offline Threshold**: How long without heartbeat before considering offline (default: 30 minutes)
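
This document does not list the environment variable names, so the sketch below uses placeholder names (`DEVICE_HEALTH_CHECK_INTERVAL_MINUTES` and `DEVICE_HEALTH_OFFLINE_THRESHOLD_MINUTES`) that are purely hypothetical. It only shows the general pattern of reading both settings with the documented defaults; check the server configuration for the real names.

```python
import os


def load_health_config(env=None):
    """Read both settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names; the real ones may differ.
        "check_interval_minutes": int(
            env.get("DEVICE_HEALTH_CHECK_INTERVAL_MINUTES", 5)
        ),
        "offline_threshold_minutes": int(
            env.get("DEVICE_HEALTH_OFFLINE_THRESHOLD_MINUTES", 30)
        ),
    }


defaults = load_health_config({})
```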

### 3. SMS Configuration

For SMS alerts, configure Twilio credentials:

```bash
TWILIO_ACCOUNT_SID=your_account_sid
TWILIO_AUTH_TOKEN=your_auth_token
TWILIO_PHONE_NUMBER=your_twilio_phone
```
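
A quick way to catch a missing credential before relying on SMS alerts is to check that all three variables are set. The helper below is a sketch for that purpose; `missing_twilio_vars` is a hypothetical name and is not part of the server.

```python
import os

REQUIRED_TWILIO_VARS = (
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "TWILIO_PHONE_NUMBER",
)


def missing_twilio_vars(env=None):
    """Return the names of Twilio variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_TWILIO_VARS if not env.get(name)]


# Usage: an environment with only the account SID set.
missing = missing_twilio_vars({"TWILIO_ACCOUNT_SID": "your_account_sid"})
```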

## API Endpoints

### Get Service Status
```
GET /api/device-health/status
```

Returns the current status of the device health monitoring service:

```json
{
  "success": true,
  "data": {
    "isRunning": true,
    "checkIntervalMinutes": 5,
    "offlineThresholdMinutes": 30,
    "offlineDevicesCount": 1,
    "offlineDevices": [
      {
        "deviceId": 1941875383,
        "deviceName": "Guard Tower 3",
        "offlineSince": "2025-09-07T10:00:00Z",
        "alertSent": true
      }
    ]
  }
}
```
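
As an example of consuming this response, the sketch below turns the `offlineDevices` list into one summary line per device. It assumes the response has already been parsed into a dictionary and that `offlineSince` is always a UTC timestamp ending in `Z`, as in the sample; `summarize_offline` is a hypothetical helper.

```python
from datetime import datetime


def summarize_offline(status, now):
    """Return one line per offline device with its downtime in minutes."""
    lines = []
    for device in status["data"]["offlineDevices"]:
        # Replace the trailing "Z" so older Python versions can parse it.
        since = datetime.fromisoformat(device["offlineSince"].replace("Z", "+00:00"))
        minutes = int((now - since).total_seconds() // 60)
        lines.append(
            f'{device["deviceName"]} ({device["deviceId"]}): offline for {minutes} min'
        )
    return lines


status = {
    "success": True,
    "data": {
        "offlineDevices": [
            {
                "deviceId": 1941875383,
                "deviceName": "Guard Tower 3",
                "offlineSince": "2025-09-07T10:00:00Z",
                "alertSent": True,
            }
        ]
    },
}
summary = summarize_offline(status, datetime.fromisoformat("2025-09-07T10:45:00+00:00"))
```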

### Trigger Manual Health Check
```
POST /api/device-health/check
```

Forces an immediate health check of all devices.

### Start/Stop Service
```
POST /api/device-health/start
POST /api/device-health/stop
```

Control the health monitoring service (normally runs automatically).
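
The control endpoints (`check`, `start`, `stop`) can be called from a script using only the standard library. In the sketch below, the base URL `http://localhost:3000` is a placeholder, authentication is not covered, and treating the response body as JSON is an assumption based on the status endpoint.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # hypothetical; use your server's address


def build_control_request(action, base_url=BASE_URL):
    """Build a POST request for /api/device-health/<action>."""
    if action not in {"check", "start", "stop"}:
        raise ValueError(f"unsupported action: {action}")
    url = f"{base_url}/api/device-health/{action}"
    return urllib.request.Request(url, method="POST")


def send(request):
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


request = build_control_request("check")
```

Calling `send(request)` against a running server performs the manual health check.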

## Alert Messages

### Offline Alert
```
🚨 DEVICE OFFLINE ALERT 🚨

📍 LOCATION: Stockholm Castle
🔧 DEVICE: Guard Tower 1
⏰ OFFLINE FOR: 45 minutes
📅 LAST SEEN: 2025-09-07 14:30:00

❌ Device has stopped sending heartbeats.
🔧 Check device power, network connection, or physical access.

⚠️ Security monitoring may be compromised in this area.
```
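
A message in this layout can be produced from the device's location, name, and last heartbeat time. The following sketch reproduces the template above; `format_offline_alert` is a hypothetical function, and the server may build its messages differently.

```python
from datetime import datetime


def format_offline_alert(location, device_name, last_seen, now):
    """Render an offline alert in the layout shown above."""
    minutes_offline = int((now - last_seen).total_seconds() // 60)
    return "\n".join([
        "🚨 DEVICE OFFLINE ALERT 🚨",
        "",
        f"📍 LOCATION: {location}",
        f"🔧 DEVICE: {device_name}",
        f"⏰ OFFLINE FOR: {minutes_offline} minutes",
        f"📅 LAST SEEN: {last_seen:%Y-%m-%d %H:%M:%S}",
        "",
        "❌ Device has stopped sending heartbeats.",
        "🔧 Check device power, network connection, or physical access.",
        "",
        "⚠️ Security monitoring may be compromised in this area.",
    ])


message = format_offline_alert(
    "Stockholm Castle",
    "Guard Tower 1",
    last_seen=datetime(2025, 9, 7, 14, 30),
    now=datetime(2025, 9, 7, 15, 15),
)
```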

### Recovery Alert
```
✅ DEVICE RECOVERED ✅

📍 LOCATION: Stockholm Castle
🔧 DEVICE: Guard Tower 1
⏰ RECOVERED AT: 2025-09-07 15:15:00

✅ Device is now sending heartbeats again.
🛡️ Security monitoring restored for this area.
```

## Testing

Use the provided test script to verify the system is working:

```bash
python3 test_device_health.py
```

This will:
- Check the device health service status
- List all devices and their current health status
- Show configured alert rules for device offline monitoring
- Trigger a manual health check

## Integration with Existing Systems

Device health monitoring builds on the following existing components:

1. **Existing Alert System**: Uses the same alert rules, channels, and logging
2. **Device Management**: Works with the existing device approval and activation system
3. **Heartbeat System**: Uses the existing heartbeat infrastructure
4. **Dashboard**: Device status is already displayed in the device list

## Troubleshooting

### No Alerts Received
1. Check if device offline alert rules are configured and active
2. Verify SMS/email credentials are properly configured
3. Check device health service status via API
4. Ensure devices are marked as active and approved

### False Positives
1. Adjust the offline threshold if devices have irregular heartbeat patterns
2. Check network connectivity between devices and server
3. Verify heartbeat intervals are properly configured for each device
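
When deciding whether the threshold is too tight for a device, it can help to look at the longest gap between its recent heartbeats. The sketch below computes that gap from a list of timestamps; how those timestamps are obtained is outside this document, and `max_heartbeat_gap` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def max_heartbeat_gap(timestamps):
    """Return the longest interval between consecutive heartbeats."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=timedelta(0))


heartbeats = [
    datetime(2025, 9, 7, 10, 0),
    datetime(2025, 9, 7, 10, 5),
    datetime(2025, 9, 7, 10, 40),
]
# A 35-minute gap would trip the default 30-minute threshold.
too_tight = max_heartbeat_gap(heartbeats) > timedelta(minutes=30)
```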

### Service Not Running
1. Check server logs for startup errors
2. Verify database connectivity
3. Restart the server to reinitialize the service

## Monitoring and Logs

- Service status is logged to console with timestamps
- Alert sending is logged with recipient and status information
- Manual health checks can be triggered via API for testing
- The service shuts down gracefully when the server restarts