Container Health & Metrics Endpoint Implementation
How It Works - Different Approaches Explained
🎯 Current Implementation: Multi-Layered Detection
The system I just implemented uses a fallback chain approach - NO agents required! Here's how:
Method 1: Built-in Health Endpoints (Recommended)
```javascript
// Add to your existing Express.js containers
const express = require('express');
const app = express();

// Simple addition to existing code - no agent needed!
app.get('/health/metrics', (req, res) => {
  const memUsage = process.memoryUsage();
  res.json({
    container: process.env.CONTAINER_NAME || 'backend',
    cpu: getCurrentCPU(), // project helper, defined elsewhere
    memory: {
      usage: `${Math.round(memUsage.heapUsed / 1024 / 1024)}MB`,
      percentage: `${Math.round((memUsage.heapUsed / memUsage.heapTotal) * 100)}%`
    },
    uptime: process.uptime(),
    health: 'healthy'
  });
});
```
✅ Pros: Direct from container, accurate, real-time
❌ Cons: Requires code changes in each container
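The handler above calls a `getCurrentCPU()` helper that isn't shown. One possible (hypothetical) implementation approximates a CPU percentage by sampling `process.cpuUsage()` between calls:

```javascript
// Hypothetical sketch of getCurrentCPU(): compares CPU time consumed by
// this process against wall-clock time elapsed since the previous sample.
let lastCpu = process.cpuUsage();
let lastTime = process.hrtime.bigint();

function getCurrentCPU() {
  const now = process.hrtime.bigint();
  const cpu = process.cpuUsage();
  // Elapsed wall-clock time in microseconds
  const elapsedUs = Number(now - lastTime) / 1000;
  // CPU time (user + system) consumed since the last sample, in microseconds
  const usedUs = (cpu.user - lastCpu.user) + (cpu.system - lastCpu.system);
  lastCpu = cpu;
  lastTime = now;
  if (elapsedUs <= 0) return '0%';
  return `${Math.min(100, Math.round((usedUs / elapsedUs) * 100))}%`;
}
```

Note this only measures the Node process itself, not the whole container; cgroup-level figures would come from Docker stats (Method 2).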
Method 2: Docker Stats API (Current Fallback)
```javascript
// From management container - queries the Docker daemon
const { stdout } = await execAsync('docker stats --no-stream --format "table {{.Container}}\\t{{.CPUPerc}}\\t{{.MemUsage}}"');
```
✅ Pros: Works with ANY container, no code changes needed
❌ Cons: Requires Docker daemon access
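The `docker stats` call returns a tab-separated table, so the management container still has to parse it. A minimal sketch, assuming the format string shown above:

```javascript
// Sketch: parse the tab-separated `docker stats` table into a lookup
// of per-container metrics keyed by container name.
function parseDockerStats(stdout) {
  const lines = stdout.trim().split('\n');
  const metrics = {};
  // First line is the table header ("CONTAINER  CPU %  MEM USAGE / LIMIT")
  for (const line of lines.slice(1)) {
    const [container, cpu, mem] = line.split('\t');
    if (!container) continue;
    metrics[container] = { cpu, memory: mem };
  }
  return metrics;
}
```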
Method 3: Docker Compose Status
```javascript
// Queries docker-compose for container states
const { stdout } = await execAsync('docker-compose ps --format json');
```
✅ Pros: Basic status info, works everywhere
❌ Cons: Limited metrics, just status/health
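One wrinkle worth handling: depending on the Compose version, `ps --format json` emits either a single JSON array or one JSON object per line (NDJSON). A sketch that normalizes both shapes:

```javascript
// Sketch: normalize `docker-compose ps --format json` output into a
// uniform [{ name, state }] list, accepting both array and NDJSON shapes.
function parseComposeStatus(stdout) {
  const text = stdout.trim();
  if (!text) return [];
  let entries;
  if (text.startsWith('[')) {
    // Single JSON array
    entries = JSON.parse(text);
  } else {
    // One JSON object per line (NDJSON)
    entries = text.split('\n').map(line => JSON.parse(line));
  }
  return entries.map(e => ({ name: e.Name, state: e.State }));
}
```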
🤖 Alternative: Agent-Based Approaches
Option A: Sidecar Container Pattern
```yaml
# docker-compose.yml
services:
  app:
    image: my-app:latest
  metrics-agent:
    image: metrics-agent:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    environment:
      - TARGET_CONTAINER=app
```
How it works: Deploy a metrics agent container alongside each service
✅ Pros: No code changes, detailed system metrics
❌ Cons: Extra containers, more complex deployment
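The sidecar's core job is reading the host interfaces mounted at `/host/proc`. As a hypothetical sketch, here is how such an agent might turn `/host/proc/meminfo` into usable memory figures:

```javascript
// Hypothetical sidecar logic: parse the host's /proc/meminfo (mounted
// read-only at /host/proc/meminfo) into total/used memory in MB.
function parseMeminfo(text) {
  const fields = {};
  for (const line of text.split('\n')) {
    // Lines look like "MemTotal:       2048000 kB"
    const match = line.match(/^(\w+):\s+(\d+) kB/);
    if (match) fields[match[1]] = parseInt(match[2], 10);
  }
  const totalMB = Math.round(fields.MemTotal / 1024);
  const availableMB = Math.round(fields.MemAvailable / 1024);
  return { totalMB, usedMB: totalMB - availableMB };
}
```

In a real agent this would be wrapped in `fs.readFileSync('/host/proc/meminfo', 'utf8')` on a timer and reported back to the management container.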
Option B: In-Container Agent Process
```dockerfile
# Add to existing Dockerfile
FROM node:18
COPY . /app
COPY metrics-agent /usr/local/bin/
RUN chmod +x /usr/local/bin/metrics-agent

# Start both app and agent
CMD ["sh", "-c", "metrics-agent & npm start"]
```
How it works: Runs a metrics collection process inside each container
✅ Pros: Single container, detailed metrics
❌ Cons: Modifies container, uses more resources
Option C: External Monitoring Tools
Prometheus + Node Exporter
```yaml
services:
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
```
cAdvisor (Container Advisor)
```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```
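To make use of either exporter, Prometheus itself needs scrape targets pointing at them. A minimal hypothetical `prometheus.yml` fragment (job names and service hostnames are illustrative, assuming all three run on one Compose network):

```yaml
# Hypothetical prometheus.yml pairing Prometheus with the exporters above
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
```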
🔧 Recommended Implementation Strategy
Phase 1: Docker Stats (Current)
- ✅ Already implemented
- Works immediately with existing containers
- No code changes required
- Provides CPU, Memory, Network, Disk I/O
Phase 2: Add Health Endpoints
```javascript
// Add 3 lines to each container's main file
const { createHealthEndpoint } = require('./utils/health-endpoint');
createHealthEndpoint(app); // app is your Express instance
```
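The `./utils/health-endpoint` module referenced above isn't shown; one possible shape for it, sketched here, keeps the payload builder separate from the Express wiring so it can be tested without a running server:

```javascript
// Hypothetical utils/health-endpoint module. buildHealthPayload() is pure
// so it can be unit-tested; createHealthEndpoint() just wires the route.
function buildHealthPayload() {
  const mem = process.memoryUsage();
  return {
    container: process.env.CONTAINER_NAME || 'unknown',
    memory: {
      usage: `${Math.round(mem.heapUsed / 1024 / 1024)}MB`,
      percentage: `${Math.round((mem.heapUsed / mem.heapTotal) * 100)}%`
    },
    uptime: process.uptime(),
    health: 'healthy'
  };
}

function createHealthEndpoint(app) {
  // `app` is assumed to be an Express instance
  app.get('/health/metrics', (req, res) => res.json(buildHealthPayload()));
}

module.exports = { createHealthEndpoint, buildHealthPayload };
```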
Phase 3: Enhanced Monitoring (Optional)
- Add Prometheus metrics
- Implement custom business metrics
- Add alerting and dashboards
🎯 Current System Architecture
```
Management Container
        ↓
1. Try HTTP health endpoints (app containers)
        ↓ (if fails)
2. Query Docker daemon (all containers)
        ↓ (if fails)
3. Check docker-compose status
        ↓ (if fails)
4. Scan system processes
```
No agents required! The management container does all the work:
- Health Endpoints: Makes HTTP calls to containers that support it
- Docker Stats: Queries Docker daemon for ALL container metrics
- Process Detection: Scans system for running services
- Smart Fallback: Always tries to get SOME information
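The fallback chain described above can be sketched as a single function that tries each collector in order and reports which one succeeded (the collector names and functions here are illustrative, not the actual implementation):

```javascript
// Sketch of the smart-fallback loop: run each [name, asyncFn] collector
// in priority order and return the first non-empty result.
async function collectMetrics(collectors) {
  for (const [method, fn] of collectors) {
    try {
      const data = await fn();
      if (data) return { method, data };
    } catch (err) {
      // This collector failed; fall through to the next one
    }
  }
  return { method: 'none', data: null };
}
```

Because every collector is wrapped in its own try/catch, a failing health endpoint can never prevent the Docker daemon query from running.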
🚀 Why This Approach is Great
For Existing Systems
- Zero downtime: Works immediately
- No refactoring: Containers don't need changes
- Comprehensive: Sees ALL containers (yours + infrastructure)
For Future Development
- Gradual enhancement: Add health endpoints when convenient
- Flexible: Can switch to any monitoring approach later
- Standards compliant: Uses Docker APIs and HTTP standards
Production Ready
- Reliable fallbacks: Always gets some data
- Error handling: Graceful degradation
- Performance: Lightweight HTTP calls
- Security: No privileged containers needed
2. Prometheus-style Metrics Scraping
```javascript
// In management.js
const scrapePrometheusMetrics = async (containerUrl) => {
  try {
    const response = await fetch(`${containerUrl}/metrics`);
    const metricsText = await response.text();

    // Parse Prometheus text format ("metric_name value" per line)
    const metrics = {};
    metricsText.split('\n').forEach(line => {
      if (line.startsWith('container_cpu_usage')) {
        metrics.cpu = line.split(' ')[1] + '%';
      }
      if (line.startsWith('container_memory_usage_bytes')) {
        const bytes = parseInt(line.split(' ')[1], 10);
        metrics.memory = Math.round(bytes / 1024 / 1024) + 'MB';
      }
    });
    return metrics;
  } catch (error) {
    return { error: 'Prometheus metrics unavailable' };
  }
};
```
3. Socket.IO Real-time Metrics Broadcasting
```javascript
// Each container broadcasts its metrics via Socket.IO
const io = require('socket.io-client');
const socket = io('http://management-backend:3000');

setInterval(() => {
  const metrics = {
    container: process.env.CONTAINER_NAME,
    cpu: getCurrentCPU(),
    memory: getCurrentMemory(),
    timestamp: Date.now()
  };
  socket.emit('container_metrics', metrics);
}, 10000); // Every 10 seconds

// Management backend collects these. Here `io` is the Socket.IO *server*
// instance; custom events are received per connection, not on `io` itself.
io.on('connection', (socket) => {
  socket.on('container_metrics', (metrics) => {
    containerMetricsCache[metrics.container] = metrics;
  });
});
```
4. Log File Tailing Approach
```javascript
// Parse container logs for metrics
const tailContainerLogs = async (containerName) => {
  try {
    const { stdout } = await execAsync(`docker logs --tail 50 ${containerName} | grep "METRICS:"`);
    const logLines = stdout.split('\n').filter(line => line.includes('METRICS:'));
    if (logLines.length > 0) {
      const lastMetric = logLines[logLines.length - 1];
      const metricsJson = lastMetric.split('METRICS:')[1];
      return JSON.parse(metricsJson);
    }
    return { error: 'No metrics found in logs' };
  } catch (error) {
    return { error: 'Log metrics unavailable' };
  }
};
```
```javascript
// Containers log metrics in structured format
console.log(`METRICS: ${JSON.stringify({
  cpu: getCurrentCPU(),
  memory: getCurrentMemory(),
  timestamp: new Date().toISOString()
})}`);
```
5. Shared Volume Metrics Files
```javascript
const fs = require('fs');
const path = require('path');

// Each container writes metrics to a shared volume
const writeMetricsToFile = () => {
  const metrics = {
    container: process.env.CONTAINER_NAME,
    cpu: getCurrentCPU(),
    memory: getCurrentMemory(),
    timestamp: Date.now()
  };
  fs.writeFileSync(`/shared/metrics/${process.env.CONTAINER_NAME}.json`, JSON.stringify(metrics));
};

// Management reads from the shared volume
const readSharedMetrics = () => {
  const metricsDir = '/shared/metrics';
  const files = fs.readdirSync(metricsDir);
  return files.reduce((acc, file) => {
    if (file.endsWith('.json')) {
      const metrics = JSON.parse(fs.readFileSync(path.join(metricsDir, file), 'utf8'));
      acc[file.replace('.json', '')] = metrics;
    }
    return acc;
  }, {});
};
```
6. Database-based Metrics Collection
```javascript
// Containers insert metrics into a shared database
const recordMetrics = async () => {
  await db.query(`
    INSERT INTO container_metrics (container_name, cpu_usage, memory_usage, timestamp)
    VALUES (?, ?, ?, ?)
  `, [process.env.CONTAINER_NAME, getCurrentCPU(), getCurrentMemory(), new Date()]);
};

// Management queries latest metrics
const getLatestMetrics = async () => {
  const result = await db.query(`
    SELECT container_name, cpu_usage, memory_usage, timestamp
    FROM container_metrics
    WHERE timestamp > NOW() - INTERVAL 1 MINUTE
    ORDER BY timestamp DESC
  `);
  return result.reduce((acc, row) => {
    acc[row.container_name] = {
      cpu: row.cpu_usage,
      memory: row.memory_usage,
      lastUpdate: row.timestamp
    };
    return acc;
  }, {});
};
```
Implementation Priority
1. Health Endpoints - Most reliable, direct communication
2. Socket.IO Broadcasting - Real-time, low overhead
3. Prometheus Metrics - Industry standard, rich data
4. Shared Volume Files - Simple, filesystem-based
5. Log Tailing - Works with existing logging
6. Database Collection - Persistent, queryable history
Benefits
- Fallback Chain: Multiple methods ensure metrics are always available
- Self-Reporting: Containers know their own state best
- Real-time: Direct communication provides immediate updates
- Standardized: Each method can provide consistent metric format
- Resilient: If one method fails, others still work