# Container Health & Metrics Endpoint Implementation

## How It Works - Different Approaches Explained

### 🎯 **Current Implementation: Multi-Layered Detection**

The system I just implemented uses a **fallback chain** approach - NO agents required! Here's how:
#### **Method 1: Built-in Health Endpoints (Recommended)**

```javascript
// Add to your existing Express.js containers
const express = require('express');
const app = express();

// Simple addition to existing code - no agent needed!
app.get('/health/metrics', (req, res) => {
  const memUsage = process.memoryUsage();
  res.json({
    container: process.env.CONTAINER_NAME || 'backend',
    cpu: getCurrentCPU(), // helper you must supply - Node has no built-in for this
    memory: {
      usage: `${Math.round(memUsage.heapUsed / 1024 / 1024)}MB`,
      percentage: `${Math.round((memUsage.heapUsed / memUsage.heapTotal) * 100)}%`
    },
    uptime: process.uptime(),
    health: 'healthy'
  });
});
```

**✅ Pros**: Direct from container, accurate, real-time

**❌ Cons**: Requires code changes in each container
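The `getCurrentCPU()` helper is not part of Node.js. One hedged sketch, using `process.cpuUsage()` deltas sampled between calls (the name and the percent-of-one-core convention are assumptions, not part of the implemented system):

```javascript
// Sketch of a getCurrentCPU() helper: computes the percentage of one CPU
// core this process consumed since the previous call, from cpuUsage deltas.
let lastCpu = process.cpuUsage();
let lastTime = process.hrtime.bigint();

function getCurrentCPU() {
  const cpu = process.cpuUsage(lastCpu);           // user+system microseconds since last sample
  const now = process.hrtime.bigint();
  const elapsedUs = Number(now - lastTime) / 1000; // wall-clock nanoseconds -> microseconds
  lastCpu = process.cpuUsage();
  lastTime = now;
  if (elapsedUs === 0) return '0%';
  const percent = ((cpu.user + cpu.system) / elapsedUs) * 100;
  return `${Math.min(100, Math.round(percent))}%`;
}
```

Note the first call reports usage averaged over the whole process lifetime; subsequent calls report usage since the previous call.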
#### **Method 2: Docker Stats API (Current Fallback)**

```javascript
// From management container - queries Docker daemon
const { stdout } = await execAsync('docker stats --no-stream --format "table {{.Container}}\\t{{.CPUPerc}}\\t{{.MemUsage}}"');
```

**✅ Pros**: Works with ANY container, no code changes needed

**❌ Cons**: Requires Docker daemon access (e.g. a mounted `/var/run/docker.sock`)
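The captured `stdout` still has to be parsed. A minimal sketch (`parseStats` is a hypothetical helper name; note that with the `table` directive Docker may align columns with spaces rather than tabs, so the splitter accepts both):

```javascript
// Parse `docker stats` table output into { name: { cpu, memory } } objects.
// Columns may be tab- or space-aligned depending on the format directive.
function parseStats(stdout) {
  const lines = stdout.trim().split('\n');
  const result = {};
  for (const line of lines.slice(1)) {                 // skip the header row
    const [name, cpu, mem] = line.trim().split(/\t|\s{2,}/);
    if (!name || !mem) continue;
    result[name] = { cpu, memory: mem.split(' / ')[0] }; // "85.2MiB / 1.944GiB" -> "85.2MiB"
  }
  return result;
}

// Example with captured output:
const sample = 'CONTAINER\tCPU %\tMEM USAGE / LIMIT\n' +
               'backend\t0.15%\t85.2MiB / 1.944GiB\n';
console.log(parseStats(sample));
// { backend: { cpu: '0.15%', memory: '85.2MiB' } }
```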
#### **Method 3: Docker Compose Status**

```javascript
// Queries docker-compose for container states
// (JSON output needs a recent Compose release; the v2 CLI is `docker compose ps`)
const { stdout } = await execAsync('docker-compose ps --format json');
```

**✅ Pros**: Basic status info, works everywhere

**❌ Cons**: Limited metrics, just status/health
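Depending on the Compose release, `--format json` emits either a single JSON array or one JSON object per line (NDJSON), so a tolerant parser helps. A sketch (`parseComposePs` is a hypothetical name):

```javascript
// Tolerant parser for `docker compose ps --format json` output:
// newer Compose releases print one JSON object per line (NDJSON),
// older ones print a single JSON array.
function parseComposePs(stdout) {
  const text = stdout.trim();
  if (!text) return [];
  if (text.startsWith('[')) return JSON.parse(text);      // JSON array variant
  return text.split('\n').map(line => JSON.parse(line));  // NDJSON variant
}

const ndjson = '{"Name":"app","State":"running"}\n{"Name":"db","State":"exited"}';
console.log(parseComposePs(ndjson).map(s => `${s.Name}=${s.State}`));
// [ 'app=running', 'db=exited' ]
```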
---
## 🤖 **Alternative: Agent-Based Approaches**

### **Option A: Sidecar Container Pattern**

```yaml
# docker-compose.yml
services:
  app:
    image: my-app:latest

  metrics-agent:
    image: metrics-agent:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    environment:
      - TARGET_CONTAINER=app
```

**How it works**: Deploy a metrics agent container alongside each service

**✅ Pros**: No code changes, detailed system metrics

**❌ Cons**: Extra containers, more complex deployment
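A sidecar like the hypothetical `metrics-agent` above typically works by reading the host's `/proc` files mounted read-only. A minimal sketch of the parsing step for `/host/proc/meminfo` (the field names are the standard Linux ones; in the agent the text would come from `fs.readFileSync('/host/proc/meminfo', 'utf8')`):

```javascript
// Parse Linux /proc/meminfo text into used-memory figures.
function parseMeminfo(text) {
  const kb = {};
  for (const line of text.split('\n')) {
    const m = line.match(/^(\w+):\s+(\d+)\s*kB/); // e.g. "MemTotal:  2048000 kB"
    if (m) kb[m[1]] = Number(m[2]);
  }
  const usedKb = kb.MemTotal - kb.MemAvailable;
  return {
    totalMb: Math.round(kb.MemTotal / 1024),
    usedMb: Math.round(usedKb / 1024),
    usedPercent: Math.round((usedKb / kb.MemTotal) * 100)
  };
}

const sample = 'MemTotal:       2048000 kB\nMemAvailable:   1024000 kB\n';
console.log(parseMeminfo(sample));
// { totalMb: 2000, usedMb: 1000, usedPercent: 50 }
```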
### **Option B: In-Container Agent Process**

```dockerfile
# Add to existing Dockerfile
FROM node:18
COPY . /app
WORKDIR /app
COPY metrics-agent /usr/local/bin/
RUN chmod +x /usr/local/bin/metrics-agent

# Start both app and agent (note: the backgrounded agent is not
# supervised - a process manager is more robust in production)
CMD ["sh", "-c", "metrics-agent & npm start"]
```

**How it works**: Runs a metrics collection process inside each container

**✅ Pros**: Single container, detailed metrics

**❌ Cons**: Modifies container, uses more resources
### **Option C: External Monitoring Tools**

#### **Prometheus + Node Exporter**

```yaml
services:
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
```
#### **cAdvisor (Container Advisor)**

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```

---
## 🔧 **Recommended Implementation Strategy**

### **Phase 1: Docker Stats (Current)**
- ✅ **Already implemented**
- Works immediately with existing containers
- No code changes required
- Provides CPU, Memory, Network, Disk I/O

### **Phase 2: Add Health Endpoints**
```javascript
// Add 3 lines to each container's main file
const { createHealthEndpoint } = require('./utils/health-endpoint');
createHealthEndpoint(app); // app is your Express instance
```
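The `./utils/health-endpoint` module is project-specific and not shown in this document; a hedged sketch of what such a helper might contain, mirroring the Method 1 endpoint (the `options.getCPU` injection point is an assumption):

```javascript
// Hypothetical ./utils/health-endpoint module: registers a /health/metrics
// route on any Express-style app (anything with a .get(path, handler) method).
function createHealthEndpoint(app, options = {}) {
  const getCPU = options.getCPU || (() => 'n/a'); // inject your CPU sampler here
  app.get('/health/metrics', (req, res) => {
    const mem = process.memoryUsage();
    res.json({
      container: process.env.CONTAINER_NAME || 'unknown',
      cpu: getCPU(),
      memory: {
        usage: `${Math.round(mem.heapUsed / 1024 / 1024)}MB`,
        percentage: `${Math.round((mem.heapUsed / mem.heapTotal) * 100)}%`
      },
      uptime: process.uptime(),
      health: 'healthy'
    });
  });
}

module.exports = { createHealthEndpoint };
```

Taking the app object as a parameter keeps the helper framework-agnostic and easy to unit-test with a stub.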
### **Phase 3: Enhanced Monitoring (Optional)**
- Add Prometheus metrics
- Implement custom business metrics
- Add alerting and dashboards

---
## 🎯 **Current System Architecture**

```
Management Container
        ↓
1. Try HTTP health endpoints (app containers)
        ↓ (if fails)
2. Query Docker daemon (all containers)
        ↓ (if fails)
3. Check docker-compose status
        ↓ (if fails)
4. Scan system processes
```

**No agents required!** The management container does all the work:

1. **Health Endpoints**: Makes HTTP calls to containers that support it
2. **Docker Stats**: Queries Docker daemon for ALL container metrics
3. **Process Detection**: Scans system for running services
4. **Smart Fallback**: Always tries to get SOME information
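The fallback chain can be sketched as a loop over collector functions where the first success wins (the collector names in the usage comment are hypothetical placeholders for the four steps):

```javascript
// Fallback chain: try each collector in order and return the first result.
async function collectMetrics(collectors) {
  const errors = [];
  for (const collect of collectors) {
    try {
      const metrics = await collect();
      if (metrics) return metrics;                        // first successful source wins
    } catch (err) {
      errors.push(`${collect.name}: ${err.message}`);     // remember why each step failed
    }
  }
  return { health: 'unknown', errors };                   // graceful degradation: always return something
}

// Usage sketch:
// const result = await collectMetrics([
//   queryHealthEndpoints, queryDockerStats, queryComposeStatus, scanProcesses,
// ]);
```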
---
## 🚀 **Why This Approach is Great**

### **For Existing Systems**
- **Zero downtime**: Works immediately
- **No refactoring**: Containers don't need changes
- **Comprehensive**: Sees ALL containers (yours + infrastructure)

### **For Future Development**
- **Gradual enhancement**: Add health endpoints when convenient
- **Flexible**: Can switch to any monitoring approach later
- **Standards compliant**: Uses Docker APIs and HTTP standards

### **Production Ready**
- **Reliable fallbacks**: Always gets some data
- **Error handling**: Graceful degradation
- **Performance**: Lightweight HTTP calls
- **Security**: No privileged containers needed
## 📊 **Other Collection Methods**

The sections below continue the list of container-side collection methods; the health endpoints described above count as method 1.

### 2. Prometheus-style Metrics Scraping
```javascript
// In management.js (Node 18+ ships the global fetch used here)
const scrapePrometheusMetrics = async (containerUrl) => {
  try {
    const response = await fetch(`${containerUrl}/metrics`);
    const metricsText = await response.text();

    // Parse the Prometheus text format. This assumes the container exposes
    // these gauge names; many real exporters publish cumulative counters
    // (e.g. CPU seconds) that need rate calculations instead.
    const metrics = {};
    metricsText.split('\n').forEach(line => {
      if (line.startsWith('container_cpu_usage')) {
        metrics.cpu = line.split(' ')[1] + '%';
      }
      if (line.startsWith('container_memory_usage_bytes')) {
        const bytes = Number(line.split(' ')[1]); // Number() also handles scientific notation
        metrics.memory = Math.round(bytes / 1024 / 1024) + 'MB';
      }
    });

    return metrics;
  } catch (error) {
    return { error: 'Prometheus metrics unavailable' };
  }
};
```
### 3. Socket.IO Real-time Metrics Broadcasting

```javascript
// Each container broadcasts its metrics via Socket.IO
const io = require('socket.io-client');
const socket = io('http://management-backend:3000');

setInterval(() => {
  const metrics = {
    container: process.env.CONTAINER_NAME,
    cpu: getCurrentCPU(),
    memory: getCurrentMemory(),
    timestamp: Date.now()
  };

  socket.emit('container_metrics', metrics);
}, 10000); // Every 10 seconds

// The management backend collects these. Here `io` is a socket.io Server
// instance (a separate process); events arrive per connection:
io.on('connection', (socket) => {
  socket.on('container_metrics', (metrics) => {
    containerMetricsCache[metrics.container] = metrics;
  });
});
```
### 4. Log File Tailing Approach

```javascript
// Parse container logs for metrics
const tailContainerLogs = async (containerName) => {
  try {
    const { stdout } = await execAsync(`docker logs --tail 50 ${containerName} | grep "METRICS:"`);
    const logLines = stdout.split('\n').filter(line => line.includes('METRICS:'));

    if (logLines.length > 0) {
      const lastMetric = logLines[logLines.length - 1];
      const metricsJson = lastMetric.split('METRICS:')[1];
      return JSON.parse(metricsJson.trim());
    }
    return { error: 'No metrics found in logs' };
  } catch (error) {
    return { error: 'Log metrics unavailable' };
  }
};

// Containers log metrics in a structured format
console.log(`METRICS: ${JSON.stringify({
  cpu: getCurrentCPU(),
  memory: getCurrentMemory(),
  timestamp: new Date().toISOString()
})}`);
```
### 5. Shared Volume Metrics Files

```javascript
const fs = require('fs');
const path = require('path');

// Each container writes metrics to a shared volume
const writeMetricsToFile = () => {
  const metrics = {
    container: process.env.CONTAINER_NAME,
    cpu: getCurrentCPU(),
    memory: getCurrentMemory(),
    timestamp: Date.now()
  };

  fs.writeFileSync(`/shared/metrics/${process.env.CONTAINER_NAME}.json`, JSON.stringify(metrics));
};

// Management reads from the shared volume
const readSharedMetrics = () => {
  const metricsDir = '/shared/metrics';
  const files = fs.readdirSync(metricsDir);

  return files.reduce((acc, file) => {
    if (file.endsWith('.json')) {
      const metrics = JSON.parse(fs.readFileSync(path.join(metricsDir, file), 'utf8'));
      acc[file.replace('.json', '')] = metrics;
    }
    return acc;
  }, {});
};
```
### 6. Database-based Metrics Collection

```javascript
// Containers insert metrics into a shared database
// (MySQL-style placeholders and date functions assumed)
const recordMetrics = async () => {
  await db.query(`
    INSERT INTO container_metrics (container_name, cpu_usage, memory_usage, timestamp)
    VALUES (?, ?, ?, ?)
  `, [process.env.CONTAINER_NAME, getCurrentCPU(), getCurrentMemory(), new Date()]);
};

// Management queries the latest metrics
const getLatestMetrics = async () => {
  const result = await db.query(`
    SELECT container_name, cpu_usage, memory_usage, timestamp
    FROM container_metrics
    WHERE timestamp > NOW() - INTERVAL 1 MINUTE
    ORDER BY timestamp DESC
  `);

  // Rows arrive newest-first, so keep only the first row seen per container
  return result.reduce((acc, row) => {
    if (!acc[row.container_name]) {
      acc[row.container_name] = {
        cpu: row.cpu_usage,
        memory: row.memory_usage,
        lastUpdate: row.timestamp
      };
    }
    return acc;
  }, {});
};
```
## Implementation Priority

1. **Health Endpoints** - Most reliable, direct communication
2. **Socket.IO Broadcasting** - Real-time, low overhead
3. **Prometheus Metrics** - Industry standard, rich data
4. **Shared Volume Files** - Simple, filesystem-based
5. **Log Tailing** - Works with existing logging
6. **Database Collection** - Persistent, queryable history
## Benefits

- **Fallback Chain**: Multiple methods ensure metrics are always available
- **Self-Reporting**: Containers know their own state best
- **Real-time**: Direct communication provides immediate updates
- **Standardized**: Each method can provide consistent metric format
- **Resilient**: If one method fails, others still work