crewli/dev-docs/GLITCHTIP.md
bert.hausmans 932788c643 docs: glitchtip runbook + setup + RFC §3.1 dev amendment
Operational docs for the GlitchTip stack landed in the previous two
commits.

- dev-docs/GLITCHTIP.md: new runbook covering local dev, project
  provisioning + DSN-to-vault flow, production deploy on
  monitoring.hausdesign.nl (DNS, DirectAdmin Let's Encrypt, Apache
  reverse proxy with WS upgrade), backup install + restore drill,
  smoke tests, troubleshooting.
- dev-docs/SETUP.md: services table now includes GlitchTip; new
  docker/glitchtip/.env subsection points at the runbook.
- dev-docs/RFC-WS-7-OBSERVABILITY.md §3.1: amended to record that the
  same compose file drives local dev (Mailpit at bm_mailpit:1025), so
  prod and dev cannot drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 08:15:27 +02:00


# GlitchTip — operations runbook
Self-hosted error tracking for Crewli. GlitchTip implements the Sentry
event protocol; the official Sentry SDKs (`sentry-laravel`, `@sentry/vue`,
`@sentry/cli`) work against it without modification.

Reference: [`RFC-WS-7-OBSERVABILITY.md`](./RFC-WS-7-OBSERVABILITY.md).

This file documents how to run the stack, both locally and on the production
monitoring host. PR-2 (backend SDK) and PR-3 (frontend SDK) consume DSNs
provisioned via the steps below.

---

## 1. Overview
| Service | Image | Role |
|---------|-------|------|
| `glitchtip-web` | `glitchtip/glitchtip:6.1.6` | Django web UI + ingest API |
| `glitchtip-worker` | `glitchtip/glitchtip:6.1.6` | Celery worker + beat (event processing, alerts, partition maintenance) |
| `glitchtip-postgres` | `postgres:16-alpine` | Primary datastore |
| `glitchtip-redis` | `valkey/valkey:7-alpine` | Celery broker + cache |

The same `docker-compose.glitchtip.yml` runs both locally (merged with
`docker-compose.yml`) and on the production host (standalone). Container
names are identical in both environments to avoid configuration drift.

---

## 2. Local development
```bash
# Once
cp docker/glitchtip/.env.example docker/glitchtip/.env
# Boot the full stack (MySQL, Redis, Mailpit, GlitchTip)
make services
# First boot takes ~60s while migrations run. Tail progress:
make services-glitchtip-status
```
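Because the first boot takes about a minute, a script that depends on GlitchTip may want to block until the web container is healthy. The following is a minimal sketch, not part of the repository: `wait_until` and `web_is_healthy` are hypothetical helper names, and the check assumes the `glitchtip-web` container defines a Docker healthcheck.

```bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds or the timeout elapses.
wait_until() {
  local timeout="$1"; shift
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep 1
  done
}

# Hypothetical check: succeeds once Docker reports the container as healthy.
web_is_healthy() {
  [ "$(docker inspect --format '{{.State.Health.Status}}' glitchtip-web 2>/dev/null)" = "healthy" ]
}

# Usage (allow two minutes):
#   wait_until 120 web_is_healthy && echo "GlitchTip is up"
```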
Web UI: <http://localhost:8200>. Outbound mail goes to Mailpit
(`http://localhost:8025`).
Create the first admin user:
```bash
docker exec -it glitchtip-web ./manage.py createsuperuser
```
Stop the stack with `make services-stop`. Volumes (`glitchtip_postgres_data`,
`glitchtip_redis_data`, `glitchtip_uploads`) survive a stop. To wipe them, run
`docker compose -f docker-compose.yml -f docker-compose.glitchtip.yml down -v`;
**never do this on production**.

---

## 3. Project provisioning
Once the web UI is reachable and the superuser exists:
1. Sign in at `/`.
2. Create an Organization called **Crewli**.
3. Create two projects:
   - **`crewli-api`** (platform: Python / Django; alert rules: default).
   - **`crewli-app`** (platform: JavaScript / Vue; alert rules: default).
4. For each project, copy the auto-generated DSN from
   *Settings → Client Keys (DSN)*.
5. Store both DSNs in 1Password under `Crewli / GlitchTip / DSNs`:
   - `SENTRY_DSN_BACKEND` → `crewli-api` DSN
   - `SENTRY_DSN_FRONTEND` → `crewli-app` DSN
PR-2 wires `SENTRY_DSN_BACKEND` into `api/.env.example`; PR-3 wires
`SENTRY_DSN_FRONTEND` into `apps/app/.env.example`. Empty DSN = SDK no-op
(verified for both `sentry-laravel` and `@sentry/vue`), so dev environments
without a DSN are silent.
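Since an empty DSN silently disables the SDK, a mistyped DSN is easy to miss. As a rough sketch (the function name `classify_dsn` is hypothetical, and the pattern assumes the usual Sentry-style shape `scheme://PUBLIC_KEY@HOST/PROJECT_ID` with a numeric project ID), a value can be sanity-checked before it goes into the vault or an `.env` file:

```bash
# classify_dsn DSN -> prints "disabled", "ok", or "malformed".
classify_dsn() {
  local dsn="$1"
  # Assumed shape: scheme://PUBLIC_KEY[:SECRET]@HOST[/PATH]/PROJECT_ID
  local re='^https?://[^:@/]+(:[^@/]+)?@[^/@]+(/[^/]+)*/[0-9]+$'
  if [ -z "$dsn" ]; then
    echo "disabled"      # empty DSN: the SDK is a no-op
  elif [[ "$dsn" =~ $re ]]; then
    echo "ok"
  else
    echo "malformed"
  fi
}

# Usage:
#   classify_dsn "$SENTRY_DSN_BACKEND"
```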
---
## 4. Production deployment
GlitchTip runs on a separate host (`monitoring.hausdesign.nl`) and is **not**
deployed via the Crewli `deploy.sh` pipeline.
### 4.1 Prerequisites
- Docker + Docker Compose v2 on the monitoring host.
- DirectAdmin with the Let's Encrypt module enabled.
- DNS A-record `monitoring.hausdesign.nl` pointing at the host IP.
### 4.2 Place the stack
```bash
sudo install -d -o crewli -g crewli /opt/glitchtip
sudo install -d -o crewli -g crewli /opt/glitchtip/docker/glitchtip
# Copy compose file + env example to the host (e.g. via scp or git checkout).
# /opt/glitchtip/docker-compose.glitchtip.yml
# /opt/glitchtip/docker/glitchtip/.env.example
```
### 4.3 Configure `.env`
```bash
cd /opt/glitchtip
cp docker/glitchtip/.env.example docker/glitchtip/.env
chmod 0600 docker/glitchtip/.env
```
Fill in the production values (header of `.env.example` lists the
checklist):
```env
SECRET_KEY=<python -c "import secrets; print(secrets.token_urlsafe(50))">
DATABASE_URL=postgres://postgres:<STRONG>@glitchtip-postgres:5432/glitchtip
POSTGRES_PASSWORD=<STRONG> # MUST match the password in DATABASE_URL
GLITCHTIP_DOMAIN=https://monitoring.hausdesign.nl
DEFAULT_FROM_EMAIL=glitchtip@hausdesign.nl
EMAIL_URL=smtp+tls://USER:PASSWORD@HOST:PORT
```
Source the `<STRONG>` password from the 1Password vault.
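The two password fields are easy to get out of sync. The sketch below, which is an illustration rather than a script from the repository, compares them before first boot. The helper names `env_value` and `db_password_matches` are hypothetical, and the parsing assumes plain unquoted `KEY=value` lines and a password without `:`, `@`, or `/` characters.

```bash
# env_value KEY FILE: print the value of KEY, dropping any trailing " # comment".
env_value() {
  local key="$1" file="$2"
  sed -n "s/^${key}=//p" "$file" | head -n 1 | sed 's/[[:space:]]\{1,\}#.*$//'
}

# db_password_matches FILE: succeed if the password embedded in DATABASE_URL
# equals POSTGRES_PASSWORD.
db_password_matches() {
  local file="$1" url url_pw pg_pw
  url="$(env_value DATABASE_URL "$file")"
  pg_pw="$(env_value POSTGRES_PASSWORD "$file")"
  url_pw="${url#*://}"     # strip the scheme
  url_pw="${url_pw%%@*}"   # keep "user:password"
  url_pw="${url_pw#*:}"    # drop "user:"
  [ -n "$pg_pw" ] && [ "$url_pw" = "$pg_pw" ]
}

# Usage:
#   db_password_matches docker/glitchtip/.env || echo "password mismatch" >&2
```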
### 4.4 DNS + TLS
1. Create the A-record for `monitoring.hausdesign.nl` in DNS.
2. In DirectAdmin: add the subdomain, then enable Let's Encrypt
(Domain Setup → SSL Certificates → "Free & automatic certificate from
Let's Encrypt"). Wait for the cert to issue.
### 4.5 Apache reverse proxy
DirectAdmin generates the vhost. Add a custom config (DirectAdmin →
Custom HTTPD Configurations) for the `monitoring.hausdesign.nl` HTTPS
vhost:
```apache
ProxyPreserveHost On
ProxyRequests Off
ProxyPass / http://127.0.0.1:8200/
ProxyPassReverse / http://127.0.0.1:8200/
# WebSocket upgrade — GlitchTip uses WS for live event streaming.
RewriteEngine On
RewriteCond %{HTTP:Upgrade} websocket [NC]
RewriteCond %{HTTP:Connection} upgrade [NC]
RewriteRule ^/?(.*) "ws://127.0.0.1:8200/$1" [P,L]
```
Reload Apache.
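The directives above rely on several Apache modules: `mod_proxy` and `mod_proxy_http` for `ProxyPass`, `mod_rewrite` for the rewrite rules, and `mod_proxy_wstunnel` for the `ws://` target. A small sketch for checking them follows; `missing_apache_modules` is a hypothetical helper, and it assumes `apachectl -M` is available on the host to list loaded modules.

```bash
# missing_apache_modules: reads `apachectl -M` style output on stdin and prints
# each required module that is absent. Prints nothing when all are loaded.
missing_apache_modules() {
  local loaded module
  loaded="$(cat)"
  for module in proxy_module proxy_http_module proxy_wstunnel_module rewrite_module; do
    if ! grep -qw -- "$module" <<<"$loaded"; then
      echo "$module"
    fi
  done
}

# Usage on the monitoring host:
#   apachectl -M 2>/dev/null | missing_apache_modules
```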
### 4.6 First boot
```bash
cd /opt/glitchtip
docker compose -f docker-compose.glitchtip.yml up -d
# Wait for healthchecks (~60s).
docker compose -f docker-compose.glitchtip.yml ps
# Create the admin user.
docker exec -it glitchtip-web ./manage.py createsuperuser
```
Open <https://monitoring.hausdesign.nl>, sign in, and **enable 2FA** on
the account immediately (acceptance criterion 1) via Profile → Security →
Two-Factor Authentication.

Then provision the two projects (§3) and store the DSNs in 1Password.

---

## 5. Backup & restore
### 5.1 Daily backup
`scripts/glitchtip-backup.sh` runs `pg_dump --format=custom`, streams it
through gzip, writes to `./backups/glitchtip/glitchtip-<ts>.dump.gz` with
`0600` permissions, and prunes dumps older than 30 days.
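For orientation, the behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `scripts/glitchtip-backup.sh`: the function names and the `BACKUP_DIR` / `RETENTION_DAYS` variables are hypothetical, and the database user and name are taken from the `DATABASE_URL` example in §4.3.

```bash
BACKUP_DIR="${BACKUP_DIR:-./backups/glitchtip}"
RETENTION_DAYS="${RETENTION_DAYS:-30}"

# dump_database: write a gzip-compressed custom-format dump with 0600 permissions.
dump_database() {
  local out
  out="$BACKUP_DIR/glitchtip-$(date +%Y%m%d-%H%M%S).dump.gz"
  mkdir -p "$BACKUP_DIR"
  (
    set -o pipefail   # fail when pg_dump fails, not only when gzip fails
    umask 077         # files created by the redirect get mode 0600
    docker exec glitchtip-postgres pg_dump -U postgres --format=custom glitchtip \
      | gzip > "$out"
  ) || { rm -f "$out"; return 1; }
  echo "$out"
}

# prune_old_dumps: delete dumps last modified more than RETENTION_DAYS days ago.
prune_old_dumps() {
  find "$BACKUP_DIR" -type f -name 'glitchtip-*.dump.gz' \
    -mtime +"$RETENTION_DAYS" -delete
}

# Usage:
#   dump_database && prune_old_dumps
```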
Install the cron entry on the production host:
```cron
# /etc/cron.d/glitchtip-backup
0 3 * * * crewli /opt/crewli/scripts/glitchtip-backup.sh >> /var/log/glitchtip-backup.log 2>&1
```
(Replace `/opt/crewli` with wherever the Crewli repo checkout lives on
the monitoring host. The script is portable — only the `docker exec`
target container needs to exist.)
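The script exits non-zero on dump failure. Cron's `MAILTO` only mails
output that a job writes, and the entry above redirects all output to the
log file, so check `/var/log/glitchtip-backup.log` (or drop the redirect)
to catch silent regressions.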
### 5.2 Restore drill
```bash
# Pick the dump to restore from.
DUMP=./backups/glitchtip/glitchtip-20260506-030000.dump.gz
# Stream the restore into the postgres container.
gunzip < "$DUMP" \
  | docker exec -i glitchtip-postgres pg_restore \
      -U postgres -d glitchtip --clean --if-exists
```
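Before streaming a dump into the database, it can help to confirm that the file is intact. The sketch below uses hypothetical helper names and relies on two assumptions: the timestamp in the file name sorts lexically, and a `pg_dump --format=custom` archive begins with the bytes `PGDMP`.

```bash
# latest_dump DIR: print the newest dump, based on the timestamp in the name.
latest_dump() {
  local dir="$1"
  find "$dir" -type f -name 'glitchtip-*.dump.gz' | sort | tail -n 1
}

# is_custom_dump FILE: succeed if FILE is an intact gzip stream that wraps a
# pg_dump custom-format archive.
is_custom_dump() {
  local file="$1"
  gzip -t "$file" 2>/dev/null || return 1
  [ "$(gunzip -c "$file" | head -c 5)" = "PGDMP" ]
}

# Usage:
#   DUMP="$(latest_dump ./backups/glitchtip)"
#   is_custom_dump "$DUMP" || echo "refusing to restore from $DUMP" >&2
```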
`--clean --if-exists` drops each object contained in the dump before
recreating it, so the restored objects match their state at dump time.
Before restoring, run `docker compose stop glitchtip-web glitchtip-worker`
(with the same `-f` flags used to start the stack) to avoid concurrent
writes, and start both services again afterwards.

Bert should drill the restore at least once after the production stack
is live (acceptance criterion 11).

---

## 6. Monitoring the monitor
Quick smoke tests:
```bash
# API responds with JSON (not 502).
curl -sS http://localhost:8200/api/0/
# Worker reporting in (look for "celery@... ready").
docker compose -f docker-compose.yml -f docker-compose.glitchtip.yml \
  logs --tail=50 glitchtip-worker
# All services healthy.
docker compose -f docker-compose.yml -f docker-compose.glitchtip.yml ps
```
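For unattended checks, the first smoke test can be reduced to a status code. This is a hedged sketch with hypothetical helper names; the mapping from codes to verdicts is an assumption about how a failing reverse proxy typically responds, not documented GlitchTip behaviour.

```bash
# http_status URL: print the HTTP status code, or 000 if the request never completed.
http_status() {
  curl -sS -o /dev/null --max-time 10 -w '%{http_code}' "$1" 2>/dev/null || true
}

# describe_status CODE: map a status code to a short verdict.
describe_status() {
  case "$1" in
    000)         echo "unreachable" ;;
    502|503|504) echo "upstream down" ;;
    5??)         echo "server error" ;;
    *)           echo "responding" ;;
  esac
}

# Usage:
#   describe_status "$(http_status http://localhost:8200/api/0/)"
```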
In production, replace `http://localhost:8200` with
`https://monitoring.hausdesign.nl` and pass only
`-f docker-compose.glitchtip.yml` to `docker compose`.

Email alerting is configured in PR-4; until then alerts surface only in
the GlitchTip web UI (Issues view).

---

## 7. Troubleshooting
### Web container unhealthy on first boot
Migrations take ~60s on a fresh volume. The healthcheck `start_period`
is set accordingly. If the container is still unhealthy after two
minutes, tail logs:
```bash
docker logs glitchtip-web
```
Most common cause: the password in `DATABASE_URL` differs from
`POSTGRES_PASSWORD`. The postgres container creates the user with the
password it sees when it first initializes the data volume, while GlitchTip
authenticates with the password embedded in the URL, so the two MUST match.
### Worker idle / events stuck in queue
Check that `REDIS_URL` resolves and the worker is connected:
```bash
docker logs glitchtip-worker | grep -E "ready|connected|error"
```
### Volume permission errors on Linux hosts
`postgres:16-alpine` runs as UID 70 internally. If `/var/lib/postgresql/data`
is bind-mounted from the host with mismatched ownership, postgres refuses
to start. The default named volume avoids this — only relevant if you
later switch to a host bind-mount.
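If a bind-mount is introduced later, ownership can be checked up front. A minimal sketch, assuming GNU `stat` on a Linux host; `owned_by_uid` and the directory in the usage line are hypothetical.

```bash
# owned_by_uid PATH UID: succeed if PATH is owned by the given numeric UID.
owned_by_uid() {
  [ "$(stat -c '%u' "$1")" = "$2" ]   # GNU stat syntax
}

# Usage with a hypothetical bind-mount directory:
#   owned_by_uid /srv/glitchtip/pgdata 70 || sudo chown -R 70 /srv/glitchtip/pgdata
```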
### Right-to-erasure (Art. 17)
Currently manual. Locate the events for a user ULID via the web UI search,
then delete them via the UI or directly in the postgres container. An
automated erasure script is on the BACKLOG (per RFC §4).

---

## 8. References
- RFC: [`RFC-WS-7-OBSERVABILITY.md`](./RFC-WS-7-OBSERVABILITY.md)
- GlitchTip docs: <https://glitchtip.com/documentation>
- GlitchTip self-hosting: <https://glitchtip.com/documentation/install>