DevOps & Observability
How I containerise, deploy, monitor, and recover services. Covers Docker, Nginx reverse proxy, structured logging, metrics collection, alerting, backup strategy, and incident response playbooks.
Deployment Architecture
All services run on a single Hetzner CX23 VPS behind Nginx. Each service is managed by systemd. Static files are served directly by Nginx, while API services are reverse-proxied on internal ports.
- Nginx reverse proxy: handles TLS via Let's Encrypt, serves static files directly, and proxies API routes to internal Gunicorn workers. Security headers are set at this layer.
- systemd: each service (price API, uptime monitor, chat server) has its own `.service` unit file, with automatic restart on failure and restart rate limiting.
- CI/CD: a push to main triggers lint, test, and an SSH deploy. The deploy step runs `git pull` and `systemctl restart`.
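As a sketch, a unit file with restart rate limiting might look like the following (the service name, user, and paths are illustrative, not the actual repo layout):

```ini
[Unit]
Description=Price API
After=network.target
# Rate limiting: stop retrying if the service fails 5 times within 60 seconds
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
User=appuser
WorkingDirectory=/srv/price-api
ExecStart=/srv/price-api/.venv/bin/gunicorn --bind 127.0.0.1:8001 'app:create_app()'
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target
```

With `Restart=on-failure`, systemd restarts the process on a non-zero exit, while the `StartLimit*` settings prevent a crash loop from hammering the machine.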
Docker Containerization
The Weather ML App runs fully containerised with a multi-stage Dockerfile. Other services are being migrated to Docker Compose for consistent local and production environments.
```dockerfile
# Stage 1: install dependencies
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: production image
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12 /usr/local/lib/python3.12
# Also copy installed entry points so the gunicorn executable is available
COPY --from=builder /usr/local/bin /usr/local/bin
COPY . .

# Non-root user for security
RUN adduser --disabled-password --no-create-home appuser
USER appuser
EXPOSE 8000
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:create_app()"]
```
- Non-root user: process runs as `appuser`, not root
- Multi-stage build: build tools excluded from the final image
- No secrets in image: environment variables injected at runtime
- Minimal base: `python:3.12-slim` reduces attack surface
- Service isolation: each API in its own container
- Volume mounts: SQLite databases persist outside containers
- Health checks: `curl` to `/health` every 30s
- Restart policy: `unless-stopped` for production
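Taken together, these choices map onto a Compose service roughly like this sketch (the service name, port, and paths are illustrative):

```yaml
services:
  price-api:
    build: .
    restart: unless-stopped
    ports:
      - "127.0.0.1:8001:8000"
    environment:
      - API_KEY=${API_KEY}   # injected at runtime, never baked into the image
    volumes:
      - ./data:/app/data     # SQLite database persists outside the container
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```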
Structured Logging
All services emit JSON-formatted logs with consistent fields: timestamp, level, service name, request ID, and message. This makes logs machine-parseable and easy to search.
```python
import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "price-api",
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "module": record.module,
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
{
"timestamp": "2026-03-04T10:22:01.443Z",
"level": "INFO",
"service": "price-api",
"message": "Quote fetched: CSCO @ 62.45",
"request_id": "req_8f3a2b1c",
"module": "quotes"
}
Monitoring & Metrics
The Uptime Monitor polls all services every 60 seconds. Server Pulse exposes real-time CPU, RAM, disk, and network metrics. Both feed into a dashboard I can check from any browser.
Server Pulse reads psutil data every 5 seconds: CPU %, memory usage,
disk I/O, and active connections. Data streams to the frontend via a polling API.
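The real collector uses psutil; as a stdlib-only approximation, a snapshot of comparable system metrics can be sampled like this (field names are illustrative, and `os.getloadavg` is Unix-only):

```python
import os
import shutil
import time

def sample_metrics():
    """Collect a point-in-time system snapshot (stdlib-only approximation)."""
    disk = shutil.disk_usage("/")
    load_1m, _, _ = os.getloadavg()  # 1-minute load average, Unix-only
    return {
        "timestamp": time.time(),
        "load_1m": load_1m,
        "cpu_count": os.cpu_count(),
        "disk_used_pct": round(disk.used / disk.total * 100, 1),
    }

snapshot = sample_metrics()
```

A loop that calls this every 5 seconds and pushes the dict to the frontend would mirror the polling design described above.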
Each request gets a unique request_id injected in middleware.
The ID flows through all log entries and error responses for end-to-end tracing.
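A minimal sketch of that pattern uses a context variable plus a logging filter; the middleware hook names depend on the web framework, so `begin_request` here is a hypothetical entry point:

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's ID for the duration of that request
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIDFilter(logging.Filter):
    """Copy the current request ID onto every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

def begin_request():
    """Middleware entry point (hypothetical name): mint and stash an ID."""
    rid = f"req_{uuid.uuid4().hex[:8]}"
    request_id_var.set(rid)
    return rid

logging.getLogger("app").addFilter(RequestIDFilter())
```

Because the filter runs on every record, any log line emitted while handling the request carries the same `request_id` without threading it through function arguments.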
A cron job checks uptime data every 5 minutes. If any endpoint fails 3 consecutive checks, it sends an email alert with the last known response time and status.
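The three-strikes rule can be sketched as a small check over the most recent results (the data source and email transport are simplified away here):

```python
def should_alert(results, threshold=3):
    """Alert when the last `threshold` checks all failed.

    `results` is a list of booleans, oldest first: True = check passed.
    """
    if len(results) < threshold:
        return False
    return not any(results[-threshold:])

# A flapping endpoint does not alert; three straight failures do.
assert should_alert([True, False, True, False]) is False
assert should_alert([True, False, False, False]) is True
```

Requiring consecutive failures filters out one-off network blips while still catching real outages within a few polling cycles.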
Incident Response & Recovery
When things break, I follow a simple runbook: detect, diagnose, fix, document. Recovery procedures are scripted and tested.
- Detect: uptime monitor or alert fires
- Assess: check `journalctl`, Nginx logs, system resources
- Mitigate: restart service, roll back deploy, or scale resources
- Fix: identify root cause, push fix, verify in staging
- Document: write a brief post-mortem with timeline and action items
- SQLite databases: daily backup via the `.backup` command to `/backups/`
- JSON data files: versioned in git, plus a daily cron copy to a separate directory
- Server config: Nginx and systemd files tracked in the repo
- Hetzner snapshots: weekly automated VM snapshot
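SQLite's online backup API is also exposed in Python's standard library, so the daily job can be a short script along these lines (the paths and naming scheme are illustrative):

```python
import sqlite3
from datetime import date
from pathlib import Path

def backup_sqlite(src_path, backup_dir):
    """Copy a live SQLite database safely via the online backup API."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest_path = backup_dir / f"{Path(src_path).stem}-{date.today()}.db"
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # consistent snapshot even while the app is writing
    dest.close()
    src.close()
    return dest_path
```

Unlike a plain file copy, `Connection.backup()` takes a consistent snapshot even if the application writes during the backup.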
Full recovery from a new Hetzner VPS takes under 30 minutes using the setup script
(deploy/setup.sh). The script installs dependencies, configures Nginx,
sets up systemd services, restores databases from backup, and runs smoke tests.
The deploy process is idempotent: running it twice produces the same result.