crewli/dev-docs/GLITCHTIP.md
bert.hausmans 932788c643 docs: glitchtip runbook + setup + RFC §3.1 dev amendment
Operational docs for the GlitchTip stack landed in the previous two
commits.

- dev-docs/GLITCHTIP.md: new runbook covering local dev, project
  provisioning + DSN-to-vault flow, production deploy on
  monitoring.hausdesign.nl (DNS, DirectAdmin Let's Encrypt, Apache
  reverse proxy with WS upgrade), backup install + restore drill,
  smoke tests, troubleshooting.
- dev-docs/SETUP.md: services table now includes GlitchTip; new
  docker/glitchtip/.env subsection points at the runbook.
- dev-docs/RFC-WS-7-OBSERVABILITY.md §3.1: amended to record that the
  same compose file drives local dev (Mailpit at bm_mailpit:1025), so
  prod and dev cannot drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 08:15:27 +02:00


GlitchTip — operations runbook

Self-hosted error tracking for Crewli. GlitchTip implements the Sentry event protocol; the official Sentry SDKs (sentry-laravel, @sentry/vue, @sentry/cli) work against it without modification.

Reference: RFC-WS-7-OBSERVABILITY.md.

This file documents how to run the stack — locally and on the production monitoring host. PR-2 (backend SDK) and PR-3 (frontend SDK) consume DSNs provisioned via the steps below.


1. Overview

| Service | Image | Role |
|---|---|---|
| glitchtip-web | glitchtip/glitchtip:6.1.6 | Django web UI + ingest API |
| glitchtip-worker | glitchtip/glitchtip:6.1.6 | Celery worker + beat (event processing, alerts, partition maintenance) |
| glitchtip-postgres | postgres:16-alpine | Primary datastore |
| glitchtip-redis | valkey/valkey:7-alpine | Celery broker + cache |

The same docker-compose.glitchtip.yml runs both locally (merged with docker-compose.yml) and on the production host (standalone). Container names are identical in both environments to avoid configuration drift.


2. Local development

# Once
cp docker/glitchtip/.env.example docker/glitchtip/.env

# Boot the full stack (MySQL, Redis, Mailpit, GlitchTip)
make services

# First boot takes ~60s while migrations run. Tail progress:
make services-glitchtip-status

Web UI: http://localhost:8200. Outbound mail goes to Mailpit (http://localhost:8025).

Create the first admin user:

docker exec -it glitchtip-web ./manage.py createsuperuser

Stop the stack with make services-stop. Volumes (glitchtip_postgres_data, glitchtip_redis_data, glitchtip_uploads) survive a stop. Wipe them with docker compose -f docker-compose.yml -f docker-compose.glitchtip.yml down -v. Never run the wipe on production.


3. Project provisioning

Once the web UI is reachable and the superuser exists:

  1. Sign in at /.
  2. Create an Organization called Crewli.
  3. Create two projects:
    • crewli-api — platform: PHP / Laravel (the backend reports through sentry-laravel), alert rules: default.
    • crewli-app — platform: JavaScript / Vue, alert rules: default.
  4. For each project, copy the auto-generated DSN from Settings → Client Keys (DSN).
  5. Store both DSNs in 1Password under Crewli / GlitchTip / DSNs:
    • SENTRY_DSN_BACKEND: the crewli-api DSN
    • SENTRY_DSN_FRONTEND: the crewli-app DSN

PR-2 wires SENTRY_DSN_BACKEND into api/.env.example; PR-3 wires SENTRY_DSN_FRONTEND into apps/app/.env.example. Empty DSN = SDK no-op (verified for both sentry-laravel and @sentry/vue), so dev environments without a DSN are silent.
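Before pasting a DSN into 1Password or an env file, a quick shape check can catch copy-and-paste damage. The sketch below assumes the modern Sentry DSN layout, scheme://public_key@host/project_id, and treats an empty value as valid because an empty DSN simply disables the SDK. The function name dsn_looks_valid is hypothetical and not part of the Crewli repo; legacy DSNs that embed a secret key (public:secret@host) are rejected by this pattern.

```shell
# Hypothetical helper: succeed if $1 is empty or shaped like
#   http(s)://<public_key>@<host>[:port]/<numeric project id>
dsn_looks_valid() {
  # An empty DSN is acceptable: the SDK becomes a no-op.
  [ -z "$1" ] && return 0
  printf '%s\n' "$1" \
    | grep -Eq '^https?://[A-Za-z0-9]+@[A-Za-z0-9.-]+(:[0-9]+)?/[0-9]+$'
}

# Usage (the DSN value here is a made-up placeholder):
#   dsn_looks_valid "https://abc123@monitoring.hausdesign.nl/1" && echo "looks fine"
```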


4. Production deployment

GlitchTip runs on a separate host (monitoring.hausdesign.nl) and is not deployed via the Crewli deploy.sh pipeline.

4.1 Prerequisites

  • Docker + Docker Compose v2 on the monitoring host.
  • DirectAdmin with the Let's Encrypt module enabled.
  • DNS A-record monitoring.hausdesign.nl pointing at the host IP.

4.2 Place the stack

sudo install -d -o crewli -g crewli /opt/glitchtip
sudo install -d -o crewli -g crewli /opt/glitchtip/docker/glitchtip

# Copy compose file + env example to the host (e.g. via scp or git checkout).
# /opt/glitchtip/docker-compose.glitchtip.yml
# /opt/glitchtip/docker/glitchtip/.env.example

4.3 Configure .env

cd /opt/glitchtip
cp docker/glitchtip/.env.example docker/glitchtip/.env
chmod 0600 docker/glitchtip/.env

Fill in the production values (header of .env.example lists the checklist):

SECRET_KEY=<python -c "import secrets; print(secrets.token_urlsafe(50))">
DATABASE_URL=postgres://postgres:<STRONG>@glitchtip-postgres:5432/glitchtip
POSTGRES_PASSWORD=<STRONG>           # MUST match the password in DATABASE_URL
GLITCHTIP_DOMAIN=https://monitoring.hausdesign.nl
DEFAULT_FROM_EMAIL=glitchtip@hausdesign.nl
EMAIL_URL=smtp+tls://USER:PASSWORD@HOST:PORT

Source the <STRONG> password from the 1Password vault.
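Because the password appears twice in the file, a pre-flight comparison before the first boot can save a failed start. The following is a minimal sketch, not part of the repo: env_value and env_passwords_match are hypothetical names, and the parsing assumes unquoted values, an optional trailing " # comment", and a password without an unencoded "@".

```shell
# Print the last assignment of key $2 in env file $1,
# minus any trailing " # comment" (assumes unquoted values).
env_value() {
  sed -n "s/^$2=//p" "$1" | sed 's/[[:space:]][[:space:]]*#.*$//' | tail -n 1
}

# Succeed only if the password embedded in DATABASE_URL equals POSTGRES_PASSWORD.
# The body runs in a subshell so its variables do not leak into the caller.
env_passwords_match() (
  db_url=$(env_value "$1" DATABASE_URL)
  pg_pass=$(env_value "$1" POSTGRES_PASSWORD)
  url_pass=${db_url#*://*:}    # drop "scheme://user:"
  url_pass=${url_pass%@*}      # drop "@host:port/dbname"
  [ -n "$pg_pass" ] && [ "$url_pass" = "$pg_pass" ]
)

# Usage:
#   env_passwords_match docker/glitchtip/.env || echo "password mismatch" >&2
```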

4.4 DNS + TLS

  1. Create the A-record for monitoring.hausdesign.nl in DNS.
  2. In DirectAdmin: add the subdomain, then enable Let's Encrypt (Domain Setup → SSL Certificates → "Free & automatic certificate from Let's Encrypt"). Wait for the cert to issue.

4.5 Apache reverse proxy

DirectAdmin generates the vhost. Add a custom config (DirectAdmin → Custom HTTPD Configurations) for the monitoring.hausdesign.nl HTTPS vhost:

ProxyPreserveHost On
ProxyRequests Off
ProxyPass        / http://127.0.0.1:8200/
ProxyPassReverse / http://127.0.0.1:8200/

# WebSocket upgrade — GlitchTip uses WS for live event streaming.
RewriteEngine On
RewriteCond %{HTTP:Upgrade} websocket [NC]
RewriteCond %{HTTP:Connection} upgrade [NC]
RewriteRule ^/?(.*) "ws://127.0.0.1:8200/$1" [P,L]

Reload Apache.
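The directives above depend on Apache's proxy and rewrite modules, and the ws:// target of the RewriteRule needs mod_proxy_wstunnel. A small sketch for checking this before the reload follows; missing_apache_modules is a hypothetical helper, and whether the module list comes from apachectl -M or httpd -M on a DirectAdmin build is an assumption to verify on the host.

```shell
# Print every required module that does not appear in the module list in $1.
# $1 is expected to hold the output of `apachectl -M` (or `httpd -M`).
# The body runs in a subshell so the loop variable stays local.
missing_apache_modules() (
  for mod in proxy_module proxy_http_module proxy_wstunnel_module rewrite_module; do
    printf '%s\n' "$1" | grep -qw "$mod" || printf '%s\n' "$mod"
  done
)

# Usage:
#   missing_apache_modules "$(apachectl -M 2>/dev/null)"
# No output means all four modules are loaded.
```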

4.6 First boot

cd /opt/glitchtip
docker compose -f docker-compose.glitchtip.yml up -d

# Wait for healthchecks (~60s).
docker compose -f docker-compose.glitchtip.yml ps

# Create the admin user.
docker exec -it glitchtip-web ./manage.py createsuperuser

Open https://monitoring.hausdesign.nl, sign in, and enable 2FA on the account immediately (acceptance criterion 1). Profile → Security → Two-Factor Authentication.

Then provision the two projects (§3) and capture DSNs into 1Password.


5. Backup & restore

5.1 Daily backup

scripts/glitchtip-backup.sh runs pg_dump --format=custom, streams it through gzip, writes to ./backups/glitchtip/glitchtip-<ts>.dump.gz with 0600 permissions, and prunes dumps older than 30 days.

Install the cron entry on the production host:

# /etc/cron.d/glitchtip-backup
0 3 * * * crewli /opt/crewli/scripts/glitchtip-backup.sh >> /var/log/glitchtip-backup.log 2>&1

(Replace /opt/crewli with wherever the Crewli repo checkout lives on the monitoring host. The script is portable — only the docker exec target container needs to exist.)

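The script exits non-zero on dump failure. Cron's MAILTO only fires when a job writes output, and the entry above redirects everything to the log file, so watch /var/log/glitchtip-backup.log (or drop the redirect) if failures should reach a mailbox.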
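The real scripts/glitchtip-backup.sh is not reproduced in this runbook. The sketch below only illustrates the behaviour described in this section (custom-format dump, gzip, 0600 files, 30-day pruning). BACKUP_DIR, RETENTION_DAYS, and all function names are assumptions, as is the use of bash for pipefail, which makes a pg_dump failure fail the whole pipeline instead of being masked by gzip.

```shell
#!/usr/bin/env bash
# Illustrative sketch, not the repo's script.
BACKUP_DIR="${BACKUP_DIR:-./backups/glitchtip}"
RETENTION_DAYS="${RETENTION_DAYS:-30}"

# Build a timestamped target path, e.g. .../glitchtip-20260506-030000.dump.gz
backup_path() {
  printf '%s/glitchtip-%s.dump.gz\n' "$BACKUP_DIR" "$(date +%Y%m%d-%H%M%S)"
}

# Delete dumps last modified more than RETENTION_DAYS full days ago.
prune_old_dumps() {
  find "$BACKUP_DIR" -maxdepth 1 -type f -name 'glitchtip-*.dump.gz' \
    -mtime +"$RETENTION_DAYS" -delete
}

# Dump, compress, publish, prune. Runs in a subshell so set/umask stay local.
run_backup() (
  set -euo pipefail
  umask 077                          # new files are created with mode 0600
  mkdir -p "$BACKUP_DIR"
  out=$(backup_path)
  # Write to a .partial name first so a failed run never leaves a
  # file that looks like a finished dump.
  docker exec glitchtip-postgres pg_dump -U postgres --format=custom glitchtip \
    | gzip > "$out.partial"
  mv "$out.partial" "$out"
  prune_old_dumps
)

# A cron wrapper would end with a plain call:
#   run_backup
```

Call run_backup as a plain statement rather than inside an if or an || chain, because the shell ignores set -e in those contexts.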

5.2 Restore drill

# Pick the dump to restore from.
DUMP=./backups/glitchtip/glitchtip-20260506-030000.dump.gz

# Stream the restore into the postgres container.
gunzip < "$DUMP" \
  | docker exec -i glitchtip-postgres pg_restore \
      -U postgres -d glitchtip --clean --if-exists

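--clean --if-exists drops each object contained in the dump before recreating it, so those objects return to their dump-time state. Objects created after the dump was taken are not removed; for an exact copy, restore into a freshly created database. Run the restore after a docker compose stop glitchtip-web glitchtip-worker to avoid concurrent writes.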

Bert should drill the restore at least once after the production stack is live (acceptance criterion 11).
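As part of the drill, it may be worth confirming that the chosen dump is intact before stopping any containers. A minimal sketch, with the hypothetical helper name dump_is_readable: it only checks that the file is non-empty and that the gzip stream decompresses cleanly, not that the PostgreSQL archive inside is complete.

```shell
# Succeed if $1 exists, is non-empty, and passes gzip's integrity test.
dump_is_readable() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Usage:
#   dump_is_readable "$DUMP" || { echo "refusing to restore: $DUMP" >&2; exit 1; }
```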


6. Monitoring the monitor

Quick smoke tests:

# API responds with JSON (not 502).
curl -sS http://localhost:8200/api/0/

# Worker reporting in (look for "celery@... ready").
docker compose -f docker-compose.yml -f docker-compose.glitchtip.yml \
  logs --tail=50 glitchtip-worker

# All services healthy.
docker compose -f docker-compose.yml -f docker-compose.glitchtip.yml ps

In production, replace localhost:8200 with https://monitoring.hausdesign.nl. Email-alerting is configured in PR-4; until then alerts surface only in the GlitchTip web UI (Issues view).
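The curl check can be turned into something a script or cron job can act on. The sketch below is one way to do that; http_status and status_means_up are hypothetical names. It treats any status from 200 up to 499 as "the application is answering", on the assumption that an auth error from the API still proves the web container is up, while 5xx responses and connection failures (curl reports 000) count as down.

```shell
# Print only the HTTP status code for URL $1 ("000" if the request fails).
http_status() {
  curl -sS --max-time 10 -o /dev/null -w '%{http_code}' "$1"
}

# Succeed if status code $1 shows the application itself responded.
status_means_up() {
  [ "$1" -ge 200 ] && [ "$1" -lt 500 ]
}

# Usage:
#   status_means_up "$(http_status http://localhost:8200/api/0/)" \
#     || echo "GlitchTip API is not answering" >&2
```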


7. Troubleshooting

Web container unhealthy on first boot

Migrations take ~60s on a fresh volume. The healthcheck start_period is set accordingly. If the container is still unhealthy after two minutes, tail logs:

docker logs glitchtip-web

Most common cause: DATABASE_URL password ≠ POSTGRES_PASSWORD. The postgres container creates the user with the password it sees, GlitchTip authenticates with the password embedded in the URL — they MUST match.

Worker idle / events stuck in queue

Check that REDIS_URL resolves and the worker is connected:

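# docker logs replays the container's stderr on stderr; merge it so grep sees Celery's log lines.
docker logs glitchtip-worker 2>&1 | grep -E "ready|connected|error"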

Volume permission errors on Linux hosts

postgres:16-alpine runs as UID 70 internally. If /var/lib/postgresql/data is bind-mounted from the host with mismatched ownership, postgres refuses to start. The default named volume avoids this — only relevant if you later switch to a host bind-mount.

Right-to-erasure (Art. 17)

Currently manual. Locate events for a user ULID via the web UI search, delete via the UI or directly on the postgres container. An automated erasure script is on the BACKLOG (per RFC §4).


8. References