# Container Health & Metrics Endpoint Implementation

## How It Works - Different Approaches Explained

### 🎯 **Current Implementation: Multi-Layered Detection**

The system I just implemented uses a **fallback chain** approach - NO agents required! Here's how:
#### **Method 1: Built-in Health Endpoints (Recommended)**

```javascript
// Add to your existing Express.js containers
const express = require('express');
const app = express();

// Simple addition to existing code - no agent needed!
app.get('/health/metrics', (req, res) => {
  const memUsage = process.memoryUsage();
  res.json({
    container: process.env.CONTAINER_NAME || 'backend',
    cpu: getCurrentCPU(), // helper you must supply - Node has no built-in for this
    memory: {
      usage: `${Math.round(memUsage.heapUsed / 1024 / 1024)}MB`,
      percentage: `${Math.round((memUsage.heapUsed / memUsage.heapTotal) * 100)}%`
    },
    uptime: process.uptime(),
    health: 'healthy'
  });
});
```

**✅ Pros**: Direct from container, accurate, real-time

**❌ Cons**: Requires code changes in each container
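The `getCurrentCPU()` helper is not part of Node.js. One hedged sketch, using `process.cpuUsage()` deltas sampled between calls (the name and the percent-of-one-core convention are assumptions, not part of the implemented system):

```javascript
// Sketch of a getCurrentCPU() helper: computes the percentage of one CPU
// core this process consumed since the previous call, from cpuUsage deltas.
let lastCpu = process.cpuUsage();
let lastTime = process.hrtime.bigint();

function getCurrentCPU() {
  const cpu = process.cpuUsage(lastCpu);           // user+system microseconds since last sample
  const now = process.hrtime.bigint();
  const elapsedUs = Number(now - lastTime) / 1000; // wall-clock nanoseconds -> microseconds
  lastCpu = process.cpuUsage();
  lastTime = now;
  if (elapsedUs === 0) return '0%';
  const percent = ((cpu.user + cpu.system) / elapsedUs) * 100;
  return `${Math.min(100, Math.round(percent))}%`;
}
```

Note the first call reports usage averaged over the whole process lifetime; subsequent calls report usage since the previous call.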
#### **Method 2: Docker Stats API (Current Fallback)**

```javascript
// From management container - queries Docker daemon
const { stdout } = await execAsync('docker stats --no-stream --format "table {{.Container}}\\t{{.CPUPerc}}\\t{{.MemUsage}}"');
```

**✅ Pros**: Works with ANY container, no code changes needed

**❌ Cons**: Requires Docker daemon access (e.g. a mounted `/var/run/docker.sock`)
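The captured `stdout` still has to be parsed. A minimal sketch (`parseStats` is a hypothetical helper name; note that with the `table` directive Docker may align columns with spaces rather than tabs, so the splitter accepts both):

```javascript
// Parse `docker stats` table output into { name: { cpu, memory } } objects.
// Columns may be tab- or space-aligned depending on the format directive.
function parseStats(stdout) {
  const lines = stdout.trim().split('\n');
  const result = {};
  for (const line of lines.slice(1)) {                 // skip the header row
    const [name, cpu, mem] = line.trim().split(/\t|\s{2,}/);
    if (!name || !mem) continue;
    result[name] = { cpu, memory: mem.split(' / ')[0] }; // "85.2MiB / 1.944GiB" -> "85.2MiB"
  }
  return result;
}

// Example with captured output:
const sample = 'CONTAINER\tCPU %\tMEM USAGE / LIMIT\n' +
               'backend\t0.15%\t85.2MiB / 1.944GiB\n';
console.log(parseStats(sample));
// { backend: { cpu: '0.15%', memory: '85.2MiB' } }
```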
#### **Method 3: Docker Compose Status**

```javascript
// Queries docker-compose for container states
// (JSON output needs a recent Compose release; the v2 CLI is `docker compose ps`)
const { stdout } = await execAsync('docker-compose ps --format json');
```

**✅ Pros**: Basic status info, works everywhere

**❌ Cons**: Limited metrics, just status/health
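Depending on the Compose release, `--format json` emits either a single JSON array or one JSON object per line (NDJSON), so a tolerant parser helps. A sketch (`parseComposePs` is a hypothetical name):

```javascript
// Tolerant parser for `docker compose ps --format json` output:
// newer Compose releases print one JSON object per line (NDJSON),
// older ones print a single JSON array.
function parseComposePs(stdout) {
  const text = stdout.trim();
  if (!text) return [];
  if (text.startsWith('[')) return JSON.parse(text);      // JSON array variant
  return text.split('\n').map(line => JSON.parse(line));  // NDJSON variant
}

const ndjson = '{"Name":"app","State":"running"}\n{"Name":"db","State":"exited"}';
console.log(parseComposePs(ndjson).map(s => `${s.Name}=${s.State}`));
// [ 'app=running', 'db=exited' ]
```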
---
## 🤖 **Alternative: Agent-Based Approaches**

### **Option A: Sidecar Container Pattern**

```yaml
# docker-compose.yml
services:
  app:
    image: my-app:latest

  metrics-agent:
    image: metrics-agent:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    environment:
      - TARGET_CONTAINER=app
```

**How it works**: Deploy a metrics agent container alongside each service

**✅ Pros**: No code changes, detailed system metrics

**❌ Cons**: Extra containers, more complex deployment
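A sidecar like the hypothetical `metrics-agent` above typically works by reading the host's `/proc` files mounted read-only. A minimal sketch of the parsing step for `/host/proc/meminfo` (the field names are the standard Linux ones; in the agent the text would come from `fs.readFileSync('/host/proc/meminfo', 'utf8')`):

```javascript
// Parse Linux /proc/meminfo text into used-memory figures.
function parseMeminfo(text) {
  const kb = {};
  for (const line of text.split('\n')) {
    const m = line.match(/^(\w+):\s+(\d+)\s*kB/); // e.g. "MemTotal:  2048000 kB"
    if (m) kb[m[1]] = Number(m[2]);
  }
  const usedKb = kb.MemTotal - kb.MemAvailable;
  return {
    totalMb: Math.round(kb.MemTotal / 1024),
    usedMb: Math.round(usedKb / 1024),
    usedPercent: Math.round((usedKb / kb.MemTotal) * 100)
  };
}

const sample = 'MemTotal:       2048000 kB\nMemAvailable:   1024000 kB\n';
console.log(parseMeminfo(sample));
// { totalMb: 2000, usedMb: 1000, usedPercent: 50 }
```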
### **Option B: In-Container Agent Process**

```dockerfile
# Add to existing Dockerfile
FROM node:18
COPY . /app
WORKDIR /app
COPY metrics-agent /usr/local/bin/
RUN chmod +x /usr/local/bin/metrics-agent

# Start both app and agent (note: the backgrounded agent is not
# supervised - a process manager is more robust in production)
CMD ["sh", "-c", "metrics-agent & npm start"]
```

**How it works**: Runs a metrics collection process inside each container

**✅ Pros**: Single container, detailed metrics

**❌ Cons**: Modifies container, uses more resources
### **Option C: External Monitoring Tools**

#### **Prometheus + Node Exporter**

```yaml
services:
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
```
#### **cAdvisor (Container Advisor)**

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```

---
## 🔧 **Recommended Implementation Strategy**

### **Phase 1: Docker Stats (Current)**
- ✅ **Already implemented**
- Works immediately with existing containers
- No code changes required
- Provides CPU, Memory, Network, Disk I/O

### **Phase 2: Add Health Endpoints**
```javascript
// Add 3 lines to each container's main file
const { createHealthEndpoint } = require('./utils/health-endpoint');
createHealthEndpoint(app); // app is your Express instance
```
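The `./utils/health-endpoint` module is project-specific and not shown in this document; a hedged sketch of what such a helper might contain, mirroring the Method 1 endpoint (the `options.getCPU` injection point is an assumption):

```javascript
// Hypothetical ./utils/health-endpoint module: registers a /health/metrics
// route on any Express-style app (anything with a .get(path, handler) method).
function createHealthEndpoint(app, options = {}) {
  const getCPU = options.getCPU || (() => 'n/a'); // inject your CPU sampler here
  app.get('/health/metrics', (req, res) => {
    const mem = process.memoryUsage();
    res.json({
      container: process.env.CONTAINER_NAME || 'unknown',
      cpu: getCPU(),
      memory: {
        usage: `${Math.round(mem.heapUsed / 1024 / 1024)}MB`,
        percentage: `${Math.round((mem.heapUsed / mem.heapTotal) * 100)}%`
      },
      uptime: process.uptime(),
      health: 'healthy'
    });
  });
}

module.exports = { createHealthEndpoint };
```

Taking the app object as a parameter keeps the helper framework-agnostic and easy to unit-test with a stub.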
### **Phase 3: Enhanced Monitoring (Optional)**
- Add Prometheus metrics
- Implement custom business metrics
- Add alerting and dashboards

---
## 🎯 **Current System Architecture**

```
Management Container
        ↓
1. Try HTTP health endpoints (app containers)
        ↓ (if fails)
2. Query Docker daemon (all containers)
        ↓ (if fails)
3. Check docker-compose status
        ↓ (if fails)
4. Scan system processes
```

**No agents required!** The management container does all the work:

1. **Health Endpoints**: Makes HTTP calls to containers that support it
2. **Docker Stats**: Queries Docker daemon for ALL container metrics
3. **Process Detection**: Scans system for running services
4. **Smart Fallback**: Always tries to get SOME information
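The fallback chain can be sketched as a loop over collector functions where the first success wins (the collector names in the usage comment are hypothetical placeholders for the four steps):

```javascript
// Fallback chain: try each collector in order and return the first result.
async function collectMetrics(collectors) {
  const errors = [];
  for (const collect of collectors) {
    try {
      const metrics = await collect();
      if (metrics) return metrics;                        // first successful source wins
    } catch (err) {
      errors.push(`${collect.name}: ${err.message}`);     // remember why each step failed
    }
  }
  return { health: 'unknown', errors };                   // graceful degradation: always return something
}

// Usage sketch:
// const result = await collectMetrics([
//   queryHealthEndpoints, queryDockerStats, queryComposeStatus, scanProcesses,
// ]);
```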
---
## 🚀 **Why This Approach is Great**

### **For Existing Systems**
- **Zero downtime**: Works immediately
- **No refactoring**: Containers don't need changes
- **Comprehensive**: Sees ALL containers (yours + infrastructure)

### **For Future Development**
- **Gradual enhancement**: Add health endpoints when convenient
- **Flexible**: Can switch to any monitoring approach later
- **Standards compliant**: Uses Docker APIs and HTTP standards

### **Production Ready**
- **Reliable fallbacks**: Always gets some data
- **Error handling**: Graceful degradation
- **Performance**: Lightweight HTTP calls
- **Security**: No privileged containers needed
## 📊 **Other Collection Methods**

The sections below continue the list of container-side collection methods; the health endpoints described above count as method 1.

### 2. Prometheus-style Metrics Scraping
```javascript
// In management.js (Node 18+ ships the global fetch used here)
const scrapePrometheusMetrics = async (containerUrl) => {
  try {
    const response = await fetch(`${containerUrl}/metrics`);
    const metricsText = await response.text();

    // Parse the Prometheus text format. This assumes the container exposes
    // these gauge names; many real exporters publish cumulative counters
    // (e.g. CPU seconds) that need rate calculations instead.
    const metrics = {};
    metricsText.split('\n').forEach(line => {
      if (line.startsWith('container_cpu_usage')) {
        metrics.cpu = line.split(' ')[1] + '%';
      }
      if (line.startsWith('container_memory_usage_bytes')) {
        const bytes = Number(line.split(' ')[1]); // Number() also handles scientific notation
        metrics.memory = Math.round(bytes / 1024 / 1024) + 'MB';
      }
    });

    return metrics;
  } catch (error) {
    return { error: 'Prometheus metrics unavailable' };
  }
};
```
### 3. Socket.IO Real-time Metrics Broadcasting

```javascript
// Each container broadcasts its metrics via Socket.IO
const io = require('socket.io-client');
const socket = io('http://management-backend:3000');

setInterval(() => {
  const metrics = {
    container: process.env.CONTAINER_NAME,
    cpu: getCurrentCPU(),
    memory: getCurrentMemory(),
    timestamp: Date.now()
  };

  socket.emit('container_metrics', metrics);
}, 10000); // Every 10 seconds

// The management backend collects these. Here `io` is a socket.io Server
// instance (a separate process); events arrive per connection:
io.on('connection', (socket) => {
  socket.on('container_metrics', (metrics) => {
    containerMetricsCache[metrics.container] = metrics;
  });
});
```
### 4. Log File Tailing Approach

```javascript
// Parse container logs for metrics
const tailContainerLogs = async (containerName) => {
  try {
    const { stdout } = await execAsync(`docker logs --tail 50 ${containerName} | grep "METRICS:"`);
    const logLines = stdout.split('\n').filter(line => line.includes('METRICS:'));

    if (logLines.length > 0) {
      const lastMetric = logLines[logLines.length - 1];
      const metricsJson = lastMetric.split('METRICS:')[1];
      return JSON.parse(metricsJson.trim());
    }
    return { error: 'No metrics found in logs' };
  } catch (error) {
    return { error: 'Log metrics unavailable' };
  }
};

// Containers log metrics in a structured format
console.log(`METRICS: ${JSON.stringify({
  cpu: getCurrentCPU(),
  memory: getCurrentMemory(),
  timestamp: new Date().toISOString()
})}`);
```
### 5. Shared Volume Metrics Files

```javascript
const fs = require('fs');
const path = require('path');

// Each container writes metrics to a shared volume
const writeMetricsToFile = () => {
  const metrics = {
    container: process.env.CONTAINER_NAME,
    cpu: getCurrentCPU(),
    memory: getCurrentMemory(),
    timestamp: Date.now()
  };

  fs.writeFileSync(`/shared/metrics/${process.env.CONTAINER_NAME}.json`, JSON.stringify(metrics));
};

// Management reads from the shared volume
const readSharedMetrics = () => {
  const metricsDir = '/shared/metrics';
  const files = fs.readdirSync(metricsDir);

  return files.reduce((acc, file) => {
    if (file.endsWith('.json')) {
      const metrics = JSON.parse(fs.readFileSync(path.join(metricsDir, file), 'utf8'));
      acc[file.replace('.json', '')] = metrics;
    }
    return acc;
  }, {});
};
```
### 6. Database-based Metrics Collection

```javascript
// Containers insert metrics into a shared database
// (MySQL-style placeholders and date functions assumed)
const recordMetrics = async () => {
  await db.query(`
    INSERT INTO container_metrics (container_name, cpu_usage, memory_usage, timestamp)
    VALUES (?, ?, ?, ?)
  `, [process.env.CONTAINER_NAME, getCurrentCPU(), getCurrentMemory(), new Date()]);
};

// Management queries the latest metrics
const getLatestMetrics = async () => {
  const result = await db.query(`
    SELECT container_name, cpu_usage, memory_usage, timestamp
    FROM container_metrics
    WHERE timestamp > NOW() - INTERVAL 1 MINUTE
    ORDER BY timestamp DESC
  `);

  // Rows arrive newest-first, so keep only the first row seen per container
  return result.reduce((acc, row) => {
    if (!acc[row.container_name]) {
      acc[row.container_name] = {
        cpu: row.cpu_usage,
        memory: row.memory_usage,
        lastUpdate: row.timestamp
      };
    }
    return acc;
  }, {});
};
```
## Implementation Priority

1. **Health Endpoints** - Most reliable, direct communication
2. **Socket.IO Broadcasting** - Real-time, low overhead
3. **Prometheus Metrics** - Industry standard, rich data
4. **Shared Volume Files** - Simple, filesystem-based
5. **Log Tailing** - Works with existing logging
6. **Database Collection** - Persistent, queryable history
## Benefits

- **Fallback Chain**: Multiple methods ensure metrics are always available
- **Self-Reporting**: Containers know their own state best
- **Real-time**: Direct communication provides immediate updates
- **Standardized**: Each method can provide consistent metric format
- **Resilient**: If one method fails, others still work