Jamie Software Lab

DevOps & Observability

How I containerise, deploy, monitor, and recover services. Covers Docker, Nginx reverse proxy, structured logging, metrics collection, alerting, backup strategy, and incident response playbooks.

Server: Hetzner CX23 (Debian)
Reverse proxy: Nginx + Let's Encrypt
CI/CD: GitHub Actions
Uptime target: 99.5%

Deployment Architecture

All services run on a single Hetzner CX23 VPS behind Nginx. Each service is managed by systemd. Static files are served directly by Nginx, while API services are reverse-proxied on internal ports.

Production Architecture: Client (browser / API) → Nginx (TLS termination) → Gunicorn (Flask / FastAPI) → SQLite / JSON (data layer)
01 Nginx

Handles TLS via Let's Encrypt, serves static files directly, and proxies API routes to internal Gunicorn workers. Security headers are set at this layer.
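A server block for this setup could look roughly like the sketch below. The domain, certificate paths, and upstream port are placeholders, not the actual production config.

```nginx
# Hypothetical server block — domain, paths, and port are placeholders
server {
    listen 443 ssl;
    server_name example.jamielab.dev;

    ssl_certificate     /etc/letsencrypt/live/example.jamielab.dev/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.jamielab.dev/privkey.pem;

    # Security headers set at the proxy layer
    add_header X-Content-Type-Options nosniff always;
    add_header X-Frame-Options DENY always;
    add_header Strict-Transport-Security "max-age=31536000" always;

    # Static files served directly by Nginx
    location /static/ {
        root /srv/app;
    }

    # API routes proxied to an internal Gunicorn worker
    location /api/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```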

02 Systemd

Each service (price API, uptime monitor, chat server) has its own .service unit file, with automatic restart on failure and rate limiting.
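A unit file for one of these services could be shaped like the sketch below; the service name, paths, and restart limits are placeholders rather than the real units.

```ini
# Hypothetical unit file for the price API — names and paths are placeholders
[Unit]
Description=Price API (Gunicorn)
After=network.target
# Rate-limit restarts: at most 5 attempts within 10 minutes
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
User=appuser
WorkingDirectory=/srv/price-api
ExecStart=/srv/price-api/.venv/bin/gunicorn --bind 127.0.0.1:8000 app:create_app()
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```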

03 GitHub Actions

A push to main triggers lint, tests, and an SSH deploy step that runs git pull and systemctl restart.
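The workflow could be laid out as below. The lint/test commands, secrets, repo path, and service name are placeholders, and appleboy/ssh-action is just one common choice for the SSH step.

```yaml
# Hypothetical workflow — commands, secrets, and paths are placeholders
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint and test
        run: |
          pip install -r requirements.txt
          pytest
      - name: Deploy over SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SSH_HOST }}
          username: ${{ secrets.SSH_USER }}
          key: ${{ secrets.SSH_KEY }}
          script: |
            cd /srv/price-api
            git pull
            sudo systemctl restart price-api
```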

Docker Containerization

The Weather ML App runs fully containerised with a multi-stage Dockerfile. Other services are being migrated to Docker Compose for consistent local and production environments.

dockerfile : Multi-stage build
# Stage 1: Install dependencies
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: Production image
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12 /usr/local/lib/python3.12
# Console scripts (gunicorn) land in /usr/local/bin, not site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY . .

# Non-root user for security
RUN adduser --disabled-password --gecos "" --no-create-home appuser
USER appuser

EXPOSE 8000
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:create_app()"]
🐳
Container Security
  • Non-root user : process runs as appuser, not root
  • Multi-stage build : build tools excluded from final image
  • No secrets in image : environment variables injected at runtime
  • Minimal base : python:3.12-slim reduces attack surface
📦
Docker Compose Setup
  • Service isolation : each API in its own container
  • Volume mounts : SQLite databases persist outside containers
  • Health checks : curl to /health every 30s
  • Restart policy : unless-stopped for production
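Those four bullets map onto a compose file roughly like this sketch; the service and volume names are placeholders.

```yaml
# Hypothetical compose file — service and volume names are placeholders
services:
  price-api:
    build: .
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - price-data:/app/data   # SQLite database persists outside the container
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

volumes:
  price-data:
```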

Structured Logging

All services emit JSON-formatted logs with consistent fields: timestamp, level, service name, request ID, and message. This makes logs machine-parseable and easy to search.

python : Structured logging setup
import logging, json
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "price-api",
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "module": record.module,
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
json : Example log output
{
  "timestamp": "2026-03-04T10:22:01.443Z",
  "level": "INFO",
  "service": "price-api",
  "message": "Quote fetched: CSCO @ 62.45",
  "request_id": "req_8f3a2b1c",
  "module": "quotes"
}
Log Aggregation Pipeline: Application (JSON to stdout) → journald (systemd capture) → log files (/var/log/) → dashboard (grep + jq queries) → alerts (cron + email)

Monitoring & Metrics

The Uptime Monitor polls all services every 60 seconds. Server Pulse exposes real-time CPU, RAM, disk, and network metrics. Both feed into a dashboard I can check from any browser.

Poll interval: 60s
History window: 24h
Monitored endpoints: 7
Uptime target: 99.5%
📊
Metrics Collection

Server Pulse reads psutil data every 5 seconds: CPU %, memory usage, disk I/O, and active connections. The frontend fetches the data through a polling API.
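The polling shape can be sketched without psutil, using only the standard library; Server Pulse itself uses psutil for richer data (per-CPU %, network I/O), so treat the field names and the stdlib substitution here as illustrative assumptions.

```python
import os
import shutil
import time

def sample_metrics():
    """Collect a minimal metrics snapshot using only the stdlib.

    A dependency-free stand-in for the psutil calls Server Pulse uses;
    it shows the polling shape, not the real field set.
    """
    load1, load5, load15 = os.getloadavg()   # 1/5/15-minute load averages
    disk = shutil.disk_usage("/")            # total/used/free in bytes
    return {
        "ts": time.time(),
        "load_1m": load1,
        "disk_used_pct": round(disk.used / disk.total * 100, 1),
    }

def poll(interval=5, samples=2):
    """Poll metrics every `interval` seconds, returning a bounded history."""
    history = []
    for _ in range(samples):
        history.append(sample_metrics())
        time.sleep(interval)
    return history
```

In the real service the history would be capped (e.g. a `collections.deque` with `maxlen`) so the 24-hour window doesn't grow unbounded.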

🔎
Tracing

Each request gets a unique request_id injected in middleware. The ID flows through all log entries and error responses for end-to-end tracing.
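One common way to implement that pattern is a `contextvars` variable set by the middleware and read by a logging filter; the sketch below is a hypothetical shape, not the actual middleware.

```python
import contextvars
import logging
import uuid

# Holds the current request's ID for the duration of the request
request_id_var = contextvars.ContextVar("request_id", default=None)

class RequestIDFilter(logging.Filter):
    """Attach the current request ID to every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

def handle_request(handler):
    """Middleware shape: assign a unique ID, then run the handler.

    Any log call inside `handler` picks up the ID via the filter,
    so the ID flows through all log entries for that request.
    """
    request_id_var.set(f"req_{uuid.uuid4().hex[:8]}")
    return handler()

logger = logging.getLogger("traced")
logger.addFilter(RequestIDFilter())
```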

🔔
Alerting

A cron job checks uptime data every 5 minutes. If any endpoint fails 3 consecutive checks, it sends an email alert with the last known response time and status.
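The 3-consecutive-failures rule reduces to a small predicate over the recent check history; a minimal sketch (the function name is hypothetical):

```python
def check_consecutive_failures(history, threshold=3):
    """Return True if the most recent `threshold` checks all failed.

    `history` is a list of booleans, oldest first (True = check passed).
    Requiring several consecutive failures avoids paging on a single
    transient timeout.
    """
    if len(history) < threshold:
        return False
    return not any(history[-threshold:])
```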

Incident Response & Recovery

When things break, I follow a simple runbook: detect, diagnose, fix, document. Recovery procedures are scripted and tested.

🔥
Incident Response Steps
  1. Detect : uptime monitor or alert fires
  2. Assess : check journalctl, Nginx logs, system resources
  3. Mitigate : restart service, roll back deploy, or scale resources
  4. Fix : identify root cause, push fix, verify in staging
  5. Document : write a brief post-mortem with timeline and action items
💾
Backup Strategy
  • SQLite databases : daily backup via .backup command to /backups/
  • JSON data files : versioned in git, daily cron to a separate directory
  • Server config : Nginx and systemd files tracked in the repo
  • Hetzner snapshots : weekly automated VM snapshot
🔄
Disaster Recovery

Full recovery from a new Hetzner VPS takes under 30 minutes using the setup script (deploy/setup.sh). The script installs dependencies, configures Nginx, sets up systemd services, restores databases from backup, and runs smoke tests. The deploy process is idempotent : running it twice produces the same result.