DevOps & Observability
How I containerise, deploy, monitor, and recover services. Covers Docker, Nginx reverse proxy, structured logging, metrics collection, alerting, backup strategy, and incident response playbooks.
Deployment Architecture
All services run on a single Hetzner CX23 VPS behind Nginx. Each service is managed by systemd. Static files are served directly by Nginx, while API services are reverse-proxied on internal ports.
- Nginx reverse proxy: handles TLS via Let's Encrypt, serves static files directly, and proxies API routes to internal Gunicorn workers. Security headers are set at this layer.
- systemd: each service (price API, uptime monitor, chat server) has its own `.service` unit file, with automatic restart on failure and restart rate limiting.
- CI/CD: a push to main triggers lint, test, and an SSH deploy. The deploy step runs `git pull` and `systemctl restart`.
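As a sketch, a unit file with restart rate limiting might look like the following (the service name, user, and paths are illustrative, not the actual repo layout):

```ini
[Unit]
Description=Price API
After=network.target
# Rate limiting: stop retrying if the service fails 5 times within 60 seconds
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
User=appuser
WorkingDirectory=/srv/price-api
ExecStart=/srv/price-api/.venv/bin/gunicorn --bind 127.0.0.1:8001 'app:create_app()'
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target
```

With `Restart=on-failure`, systemd restarts the process on a non-zero exit, while the `StartLimit*` settings prevent a crash loop from hammering the machine.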
Docker Containerization
The Weather ML App runs fully containerised with a multi-stage Dockerfile. Other services are being migrated to Docker Compose for consistent local and production environments.
```dockerfile
# Stage 1: install dependencies
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: production image
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12 /usr/local/lib/python3.12
# Also copy installed entry points so the gunicorn executable is available
COPY --from=builder /usr/local/bin /usr/local/bin
COPY . .

# Non-root user for security
RUN adduser --disabled-password --no-create-home appuser
USER appuser
EXPOSE 8000
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:create_app()"]
```
- Non-root user: process runs as `appuser`, not root
- Multi-stage build: build tools excluded from the final image
- No secrets in image: environment variables injected at runtime
- Minimal base: `python:3.12-slim` reduces attack surface
- Service isolation: each API in its own container
- Volume mounts: SQLite databases persist outside containers
- Health checks: `curl` to `/health` every 30s
- Restart policy: `unless-stopped` for production
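Taken together, these choices map onto a Compose service roughly like this sketch (the service name, port, and paths are illustrative):

```yaml
services:
  price-api:
    build: .
    restart: unless-stopped
    ports:
      - "127.0.0.1:8001:8000"
    environment:
      - API_KEY=${API_KEY}   # injected at runtime, never baked into the image
    volumes:
      - ./data:/app/data     # SQLite database persists outside the container
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```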
Structured Logging
All services emit JSON-formatted logs with consistent fields: timestamp, level, service name, request ID, and message. This makes logs machine-parseable and easy to search.
```python
import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "price-api",
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "module": record.module,
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
{
"timestamp": "2026-03-04T10:22:01.443Z",
"level": "INFO",
"service": "price-api",
"message": "Quote fetched: CSCO @ 62.45",
"request_id": "req_8f3a2b1c",
"module": "quotes"
}
Monitoring & Metrics
The Uptime Monitor polls all services every 60 seconds. Server Pulse exposes real-time CPU, RAM, disk, and network metrics. Both feed into a dashboard I can check from any browser.
Server Pulse reads psutil data every 5 seconds: CPU %, memory usage,
disk I/O, and active connections. Data streams to the frontend via a polling API.
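The real collector uses psutil; as a stdlib-only approximation, a snapshot of comparable system metrics can be sampled like this (field names are illustrative, and `os.getloadavg` is Unix-only):

```python
import os
import shutil
import time

def sample_metrics():
    """Collect a point-in-time system snapshot (stdlib-only approximation)."""
    disk = shutil.disk_usage("/")
    load_1m, _, _ = os.getloadavg()  # 1-minute load average, Unix-only
    return {
        "timestamp": time.time(),
        "load_1m": load_1m,
        "cpu_count": os.cpu_count(),
        "disk_used_pct": round(disk.used / disk.total * 100, 1),
    }

snapshot = sample_metrics()
```

A loop that calls this every 5 seconds and pushes the dict to the frontend would mirror the polling design described above.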
Each request gets a unique request_id injected in middleware.
The ID flows through all log entries and error responses for end-to-end tracing.
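A minimal sketch of that pattern uses a context variable plus a logging filter; the middleware hook names depend on the web framework, so `begin_request` here is a hypothetical entry point:

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's ID for the duration of that request
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIDFilter(logging.Filter):
    """Copy the current request ID onto every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

def begin_request():
    """Middleware entry point (hypothetical name): mint and stash an ID."""
    rid = f"req_{uuid.uuid4().hex[:8]}"
    request_id_var.set(rid)
    return rid

logging.getLogger("app").addFilter(RequestIDFilter())
```

Because the filter runs on every record, any log line emitted while handling the request carries the same `request_id` without threading it through function arguments.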
A cron job checks uptime data every 5 minutes. If any endpoint fails 3 consecutive checks, it sends an email alert with the last known response time and status.
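The three-strikes rule can be sketched as a small check over the most recent results (the data source and email transport are simplified away here):

```python
def should_alert(results, threshold=3):
    """Alert when the last `threshold` checks all failed.

    `results` is a list of booleans, oldest first: True = check passed.
    """
    if len(results) < threshold:
        return False
    return not any(results[-threshold:])

# A flapping endpoint does not alert; three straight failures do.
assert should_alert([True, False, True, False]) is False
assert should_alert([True, False, False, False]) is True
```

Requiring consecutive failures filters out one-off network blips while still catching real outages within a few polling cycles.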
Incident Response & Recovery
When things break, I follow a simple runbook: detect, diagnose, fix, document. Recovery procedures are scripted and tested.
- Detect: uptime monitor or alert fires
- Assess: check `journalctl`, Nginx logs, system resources
- Mitigate: restart service, roll back deploy, or scale resources
- Fix: identify root cause, push fix, verify in staging
- Document: write a brief post-mortem with timeline and action items
- SQLite databases: daily backup via the `.backup` command to `/backups/`
- JSON data files: versioned in git, plus a daily cron copy to a separate directory
- Server config: Nginx and systemd files tracked in the repo
- Hetzner snapshots: weekly automated VM snapshot
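SQLite's online backup API is also exposed in Python's standard library, so the daily job can be a short script along these lines (the paths and naming scheme are illustrative):

```python
import sqlite3
from datetime import date
from pathlib import Path

def backup_sqlite(src_path, backup_dir):
    """Copy a live SQLite database safely via the online backup API."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest_path = backup_dir / f"{Path(src_path).stem}-{date.today()}.db"
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # consistent snapshot even while the app is writing
    dest.close()
    src.close()
    return dest_path
```

Unlike a plain file copy, `Connection.backup()` takes a consistent snapshot even if the application writes during the backup.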
Full recovery from a new Hetzner VPS takes under 30 minutes using the setup script
(deploy/setup.sh). The script installs dependencies, configures Nginx,
sets up systemd services, restores databases from backup, and runs smoke tests.
The deploy process is idempotent: running it twice produces the same result.